A Plain Walkthrough of Transformer Models
February 16, 2026
This work is licensed under CC BY-NC-SA 4.0.
Front Matter

What’s Special About This Book?
Honestly, nothing. It’s just another book on transformers—and there are many excellent ones out there.
So why write another one? I’ve noticed that much of the confusion learners have about the core of transformers—the “attention” mechanism—traces back to the name itself.
Most courses treat “attention” as more than just a name: they build analogies on top of it (e.g., a word “attends to” other words) and explain transformer models without a consistent level of abstraction (e.g., the subject of “attends” alternates between words and models; the “self” in self-attention refers to the whole input rather than a single word; and so on). If you’re already feeling confused, this is likely why.
In this book, I keep analogies to a minimum. I treat “attention” as just a name, nothing more, and avoid the layered analogies that complicate rather than clarify. Hopefully, this approach will work for some learners.
What Is This Book About?
This book is still a work in progress. Here’s what I have planned:
Part 1: The One Mechanism
How text becomes numbers (Chapter 1), how tokens exchange context through relevance-weighted averaging (Chapter 2), and the full mathematical walkthrough of attention (Chapter 3). (A tiny preview of that averaging idea appears right after this outline.)
Part 2: The Twofold Training
How these models acquire their capabilities. Pre-training on vast text corpora (Chapter 4) and fine-tuning for specific tasks (Chapter 5).
Part 3: The Three Wizards
The major architectural families. Encoders (Chapter 6), decoders (Chapter 7), and everything in between (Chapter 8).
Part 4: The Appendices
Practical matters. Computational considerations (Chapter 9), evaluation methods (Chapter 10), and modern methods (Chapter 11).
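If you’d like to see the core idea of Part 1 concretely before diving in, here is a minimal sketch in Python with NumPy. The language, the library, and every number in it are my choices for illustration only, not the book’s notation: three made-up token vectors, dot products as relevance scores, a softmax to turn scores into weights, and a weighted average as each token’s updated vector.

```python
import numpy as np

# A toy "sentence" of 3 tokens, each already mapped to a 4-dimensional vector.
# (These numbers are invented; Chapter 1 covers how text actually becomes vectors.)
tokens = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])

# Relevance of every token to every other token, measured by dot products.
scores = tokens @ tokens.T                      # shape (3, 3)

# Turn each row of scores into weights that sum to 1 (a softmax).
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

# Each token's new vector is a relevance-weighted average of all token vectors.
updated = weights @ tokens                      # shape (3, 4)

print(updated.round(2))
```

Don’t worry if this feels opaque right now; Chapters 2 and 3 build it up step by step.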
Prerequisites
You’ll benefit from a basic grasp of ML concepts: familiarity with neural networks, an understanding that “training” means adjusting a model’s parameters so its outputs improve, and the knowledge that matrices are just rectangular grids of numbers. If you come across terms that puzzle you, like “softmax” or “gradient”, don’t hesitate to look them up.