Attention Is All You Need
The 2017 paper that replaced recurrence with pure attention and laid the groundwork for GPT, BERT, and the rest. If you took an ML class years ago and still remember dot products, softmax, and gradient descent, this rebuilds the Transformer from those parts, one interactive figure at a time. The running example is the sentence "the cat sat because it was sleepy." Rusty on dot products, matrices, or softmax? Warm up with the math refresher first.
Rung 1 · Tokens become vectors
A word is a point in space
Before anything else, each token is mapped to a vector, called an embedding. Geometry encodes meaning: related words land near each other, and direction carries relationships. Real embeddings live in hundreds of dimensions; here we use 2D so you can see it. Notice that "it" sits near "cat": that proximity is what attention will build on to resolve what "it" refers to.
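Here is a minimal sketch of that lookup in plain Python. The 2D coordinates are hand-picked for illustration (real embeddings are learned), and the names `EMBEDDINGS`, `embed`, and `nearest` are made up for this example.

```python
import math

# Hypothetical 2D embeddings, hand-picked for illustration (not learned values).
EMBEDDINGS = {
    "the":     (-1.5, -0.5),
    "cat":     (2.0, 1.0),
    "sat":     (-0.5, 2.0),
    "because": (-2.0, 0.5),
    "it":      (1.8, 1.2),
    "was":     (-1.0, -1.5),
    "sleepy":  (1.0, -1.0),
}

def embed(tokens):
    """Look up one vector per token."""
    return [EMBEDDINGS[t] for t in tokens]

def nearest(token):
    """Return the other token whose embedding is closest (Euclidean distance)."""
    others = (t for t in EMBEDDINGS if t != token)
    return min(others, key=lambda t: math.dist(EMBEDDINGS[token], EMBEDDINGS[t]))

sentence = "the cat sat because it was sleepy".split()
vectors = embed(sentence)   # seven 2D points, one per token
closest_to_it = nearest("it")
print(closest_to_it)
```

With these toy coordinates the nearest neighbour of "it" is "cat", which is the geometric fact the rest of the walkthrough leans on.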
Rung 2 · Three roles per token
Query, Key, Value
Self-attention gives every token three projections of its embedding, via learned matrices W_Q, W_K, W_V: a Query ("what am I looking for?"), a Key ("what do I offer?"), and a Value ("what I'll hand over if you pick me"). Pick a token and see its three vectors.
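The three projections are just three matrix multiplications of the same embedding. This sketch assumes the row-vector convention (x times W) and uses made-up 2×2 matrices; in a trained model W_Q, W_K, and W_V are learned.

```python
def vecmat(x, W):
    """Multiply row vector x by matrix W (given as a list of rows)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

# Hypothetical projection matrices, chosen only so the arithmetic is easy to follow.
W_Q = [[1.0, 0.0], [0.0, 1.0]]    # identity: the query equals the embedding
W_K = [[0.5, -0.5], [0.5, 0.5]]
W_V = [[0.0, 1.0], [1.0, 0.0]]    # swaps the two coordinates

def project(x):
    """Return the Query, Key, and Value vectors for one token embedding."""
    return {
        "query": vecmat(x, W_Q),
        "key": vecmat(x, W_K),
        "value": vecmat(x, W_V),
    }

cat = [2.0, 1.0]          # toy embedding for "cat"
roles = project(cat)
print(roles)
```

The same token ends up with three different vectors, one per role, all derived from one embedding.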
Rung 3 · How relevant is each token?
Attention score = query · key
To decide how much one token should pay attention to another, take the dot product of the first's Query with the second's Key. Big dot product = aligned = relevant. Choose the querying token; the bars are its raw scores against every key. With "it" as the query, "cat" scores highest — exactly the link we want.
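As a rough sketch, scoring is one dot product per key. The key vectors and the query for "it" below are invented for illustration, picked so that "cat" comes out on top as in the figure.

```python
TOKENS = ["the", "cat", "sat", "because", "it", "was", "sleepy"]

# Hypothetical key vectors, one per token (not learned values).
KEYS = {
    "the":     [-1.0, -0.5],
    "cat":     [2.0, 1.0],
    "sat":     [-0.5, 1.5],
    "because": [-1.5, 0.5],
    "it":      [1.0, 0.5],
    "was":     [-1.0, -1.0],
    "sleepy":  [1.0, -0.5],
}
QUERY_IT = [1.5, 1.0]  # hypothetical query vector for "it"

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def raw_scores(query):
    """Score one query against every key."""
    return {t: dot(query, KEYS[t]) for t in TOKENS}

scores = raw_scores(QUERY_IT)
best = max(scores, key=scores.get)
print(best, scores[best])
```

These are raw, unnormalized numbers: they can be negative and they do not sum to anything in particular.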
Rung 4 · Turn scores into weights
Scale by √dₖ, then softmax
Raw scores aren't a distribution. Divide by √dₖ (the square root of the key dimension) so the numbers don't blow up as dimensions grow. Without that, large scores push softmax into its saturated, near one-hot regime, where gradients vanish. Then apply softmax to get attention weights that are positive and sum to 1. Drag dₖ: a small divisor sharpens the weights toward a near one-hot pick; a large divisor flattens them toward "attend to everything equally."
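The sketch below mimics the slider. It treats d_k as a free knob, which is an assumption made for illustration: in a real model d_k is fixed by the length of the key vectors. The raw scores are made-up numbers.

```python
import math

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, d_k):
    """Scale raw scores by sqrt(d_k), then normalize with softmax."""
    scale = math.sqrt(d_k)
    return softmax([s / scale for s in scores])

# Hypothetical raw scores for one query against seven keys.
scores = [-2.0, 4.0, 0.75, -1.75, 2.0, -2.5, 1.0]

sharp = attention_weights(scores, d_k=1)    # small divisor: close to one-hot
flat = attention_weights(scores, d_k=64)    # large divisor: close to uniform
print(max(sharp), max(flat))
```

Both results are valid distributions. Only how peaked they are changes with the divisor.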
Rung 5 · Build the new representation
Output = weighted sum of Values
The token's updated vector is the attention-weighted average of everyone's Value vectors. Relevant tokens contribute more. So "it" literally absorbs a large slice of "cat"'s value — its representation now carries the thing it refers to. That blended vector (gold) is the output of attention for this token.
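In code the blend is a weighted sum, one output coordinate at a time. The weights and Value vectors here are hypothetical, chosen so that "cat" dominates the row for "it".

```python
def weighted_sum(weights, values):
    """Blend Value vectors using attention weights that sum to 1."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical attention weights for the query "it" over the seven tokens.
weights = [0.05, 0.60, 0.05, 0.05, 0.15, 0.05, 0.05]

# Hypothetical 2D Value vectors, in sentence order.
values = [
    [-1.0, 0.0],   # the
    [2.0, 1.0],    # cat
    [0.0, 2.0],    # sat
    [-1.0, 1.0],   # because
    [1.0, 1.0],    # it
    [-1.0, -1.0],  # was
    [1.0, -1.0],   # sleepy
]

output = weighted_sum(weights, values)
print(output)  # approximately [1.25, 0.8]
```

The result sits between the Value vectors but is pulled most strongly toward the one for "cat", because that token holds 60% of the weight.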
Rung 6 · Do it for everyone at once
The attention matrix
Run that for every token in parallel and you get an N×N matrix: row i is how token i distributes its attention across all tokens (each row sums to 1). This single picture — who attends to whom — is the heart of the Transformer. Hover a row to read it. The "it" row lights up on "cat."
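A small sketch of the whole matrix follows. To keep it short, the same hand-picked 2D vectors serve as both queries and keys (as if W_Q and W_K were the identity). That is a simplification for illustration, not how a trained model behaves.

```python
import math

TOKENS = ["the", "cat", "sat", "because", "it", "was", "sleepy"]

# Hypothetical 2D vectors, used here as both queries and keys.
X = [
    [-1.5, -0.5],  # the
    [2.0, 1.0],    # cat
    [-0.5, 2.0],   # sat
    [-2.0, 0.5],   # because
    [1.8, 1.2],    # it
    [-1.0, -1.5],  # was
    [1.0, -1.0],   # sleepy
]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(Q, K):
    """Row i holds token i's attention weights over all tokens."""
    scale = math.sqrt(len(K[0]))  # sqrt(d_k)
    return [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K])
        for q_row in Q
    ]

A = attention_matrix(X, X)
it_row = A[TOKENS.index("it")]
top = TOKENS[it_row.index(max(it_row))]
print(top)
```

Every row is computed independently of the others, which is why the whole matrix can be produced in parallel.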
Rung 7 · Many relationships at once
Multi-head attention
A single attention pattern gives each token only one set of weights, so it struggles to capture several kinds of relationship at once. So the Transformer runs several heads in parallel, each with its own W_Q/W_K/W_V, then concatenates and projects their outputs. One head might learn to track meaning/coreference, another local adjacency, another links to the main verb. Flip between heads:
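The mechanics can be sketched like this. The weights are random rather than learned, so the heads here do not specialize in anything; the point is only the shape of the computation: project per head, attend, concatenate, project again. The sizes (d_model = 4, two heads) and the output matrix name `W_O` are choices made for this example.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    scale = math.sqrt(len(K[0]))
    weights = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K])
        for q_row in Q
    ]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """Run each head on its own projections, concatenate, then project."""
    outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the heads' outputs token by token.
    concat = [[v for head in outputs for v in head[i]] for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n_tokens, d_model, n_heads = 7, 4, 2
d_k = d_model // n_heads  # each head works in a smaller subspace

X = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(n_heads)]
W_O = random_matrix(n_heads * d_k, d_model, rng)

out = multi_head_attention(X, heads, W_O)
print(len(out), len(out[0]))
```

Splitting d_model across the heads keeps the total cost close to that of one full-width head, while letting each head form its own attention pattern.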
Rung 8 · Where did word order go?
Positional encoding
Attention treats the input as a set: shuffle the tokens and each token still gets exactly the same output, just in the shuffled order. That's a problem: "cat sat" ≠ "sat cat". The fix is to add a positional encoding to each embedding, a vector built from sines and cosines of many frequencies, so every position gets a unique, smoothly varying fingerprint the model can read. Each row below is one position's encoding across dimensions.
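The sinusoidal scheme from the paper puts a sine in the even dimensions and a cosine in the odd ones, with wavelengths that grow geometrically. A minimal sketch, where the function names and the small sizes (7 positions, 8 dimensions) are choices made for this example:

```python
import math

def positional_encoding(n_positions, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings, encodings):
    """Element-wise sum of each embedding with its position's encoding."""
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, encodings)]

pe = positional_encoding(n_positions=7, d_model=8)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

Because the encoding is added to the embedding rather than concatenated, the model's width stays the same, and the same token at two different positions now produces two different input vectors.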
Rung 9 · The whole block
Putting it together
Stack it up: embed the tokens, add positional encodings, then apply a block (multi-head self-attention, a residual add & layer-norm, a position-wise feed-forward network, another add & norm) and stack that block several times, six in the original paper. No recurrence, so every position is computed in parallel and any token can reach any other in one step. That parallelism + one-hop reach is a large part of why it scaled to today's models.
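Here is a compact sketch of one such block, stacked twice. It simplifies in several labeled ways: a single attention head stands in for multi-head attention, biases and the learned gain and shift of layer norm are left out, and all weights are random instead of trained. The parameter names (`W_1`, `W_2`, `W_O`) and sizes are assumptions for the example.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, p):
    """Single-head scaled dot-product attention (simplified stand-in for multi-head)."""
    Q, K, V = matmul(X, p["W_Q"]), matmul(X, p["W_K"]), matmul(X, p["W_V"])
    scale = math.sqrt(len(K[0]))
    weights = [softmax([sum(q * k for q, k in zip(qr, kr)) / scale for kr in K]) for qr in Q]
    return matmul(matmul(weights, V), p["W_O"])

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean, unit variance (no learned gain or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(X, p):
    """Position-wise two-layer network with ReLU, applied to each token separately."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(X, p["W_1"])]
    return matmul(hidden, p["W_2"])

def add_and_norm(X, sublayer_out):
    """Residual connection followed by layer norm."""
    return [layer_norm([a + b for a, b in zip(x, s)]) for x, s in zip(X, sublayer_out)]

def encoder_block(X, p):
    X = add_and_norm(X, self_attention(X, p))
    return add_and_norm(X, feed_forward(X, p))

def random_params(d_model, d_ff, rng):
    shapes = {"W_Q": (d_model, d_model), "W_K": (d_model, d_model),
              "W_V": (d_model, d_model), "W_O": (d_model, d_model),
              "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}
    return {name: [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
            for name, (r, c) in shapes.items()}

rng = random.Random(0)
d_model, d_ff, n_layers = 4, 8, 2
X = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(7)]  # stand-in inputs
layers = [random_params(d_model, d_ff, rng) for _ in range(n_layers)]  # one set per layer
for p in layers:
    X = encoder_block(X, p)
print(len(X), len(X[0]))
```

The block maps seven vectors of width d_model to seven vectors of the same width, which is what makes it stackable: the output of one layer is a valid input to the next.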
That's the paper, ground up: tokens → vectors, three roles, dot-product relevance, scale + softmax, a weighted blend of values, an N×N matrix of those weights, many heads, plus positions, all stacked into a block. Everything since is mostly more of this.