A Plain Walkthrough of Transformer Models
February 16, 2026
This work is licensed under CC BY-NC-SA 4.0.
Front Matter

What’s Special About This Book?
Honestly, nothing. It’s just another book on transformers—and there are many excellent ones out there.
So why write another one? I’ve noticed that much of the confusion learners have about the core of transformers—the “attention” mechanism—traces back to the name itself.
Most courses treat “attention” as more than just a name: they build analogies on top of it (e.g., a word “attends to” other words) and explain transformer models without a consistent level of abstraction (e.g., the subject of “attends” alternates between words and models; the “self” in self-attention refers to the whole input rather than a single word; and so on). If you’re already feeling confused, this is likely why.
In this book, I keep analogies to a minimum. I treat “attention” as just a name, nothing more, and avoid the layered analogies that complicate rather than clarify. Hopefully, this approach will work for some learners.
What Is This Book About?
This book is still a work in progress. Here’s what I have planned:
Part 1: The One Mechanism
How text becomes numbers (Chapter 1), how tokens exchange context through relevance-weighted averaging (Chapter 2), and the full mathematical walkthrough of attention (Chapter 3). (A tiny preview of that averaging idea appears right after this outline.)
Part 2: The Twofold Training
How these models acquire their capabilities. Pre-training on vast text corpora (Chapter 4) and fine-tuning for specific tasks (Chapter 5).
Part 3: The Three Wizards
The major architectural families. Encoders (Chapter 6), decoders (Chapter 7), and everything in between (Chapter 8).
Part 4: The Appendices
Practical matters. Computational considerations (Chapter 9), evaluation methods (Chapter 10), and modern methods (Chapter 11).
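If you’d like to see the core idea of Part 1 concretely before diving in, here is a minimal sketch in Python with NumPy. The language, the library, and every number in it are my choices for illustration only, not the book’s notation: three made-up token vectors, dot products as relevance scores, a softmax to turn scores into weights, and a weighted average as each token’s updated vector.

```python
import numpy as np

# A toy "sentence" of 3 tokens, each already mapped to a 4-dimensional vector.
# (These numbers are invented; Chapter 1 covers how text actually becomes vectors.)
tokens = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])

# Relevance of every token to every other token, measured by dot products.
scores = tokens @ tokens.T                      # shape (3, 3)

# Turn each row of scores into weights that sum to 1 (a softmax).
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

# Each token's new vector is a relevance-weighted average of all token vectors.
updated = weights @ tokens                      # shape (3, 4)

print(updated.round(2))
```

Don’t worry if this feels opaque right now; Chapters 2 and 3 build it up step by step.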
Prerequisites
You’ll benefit from a basic grasp of ML concepts: familiarity with neural networks, an understanding that “training” means adjusting a model’s parameters so its outputs improve, and the knowledge that matrices are just rectangular grids of numbers. If you come across terms that puzzle you, like “softmax” or “gradient”, don’t hesitate to look them up.