
Attention Is All You Need

The 2017 paper that replaced recurrence with pure attention and gave us GPT, BERT, and the rest. If you took an ML class years ago and still remember dot products, softmax, and gradient descent, this page rebuilds the Transformer from those parts, one interactive figure at a time. The running example is the sentence "the cat sat because it was sleepy." Rusty on dot products, matrices, or softmax? Warm up with the math refresher first.

Rung 1 · Tokens become vectors

A word is a point in space

Before anything else, each token is mapped to a vector — an embedding. Geometry encodes meaning: related words land near each other, and direction carries relationships. Real embeddings live in hundreds of dimensions; here we use 2D so you can see it. Notice "it" sits near "cat" — that proximity is what will let attention resolve what "it" refers to.
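To make the geometry concrete, here is a minimal sketch of an embedding table for the running sentence. The 2D coordinates are made up by hand for illustration (they are not taken from the figure or from any trained model), and `distance` and `nearest` are hypothetical helper names.

```python
import math

# Hypothetical 2D embeddings, hand-picked for illustration only.
embeddings = {
    "the":     (0.1, 0.9),
    "cat":     (0.9, 0.8),
    "sat":     (0.5, 0.1),
    "because": (0.1, 0.2),
    "it":      (0.8, 0.7),
    "was":     (0.2, 0.4),
    "sleepy":  (0.7, 0.3),
}

def distance(a, b):
    """Euclidean distance between the embeddings of tokens a and b."""
    return math.dist(embeddings[a], embeddings[b])

def nearest(token):
    """Return the other token whose embedding lies closest to `token`."""
    others = [t for t in embeddings if t != token]
    return min(others, key=lambda t: distance(token, t))

print(nearest("it"))  # with these made-up coordinates, "it" lands next to "cat"
```

In a real model the table is learned during training and has hundreds of columns per token, but the lookup itself is this simple.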

Rung 2 · Three roles per token

Query, Key, Value

Self-attention gives every token three projections of its embedding, via learned matrices W_Q, W_K, W_V: a Query ("what am I looking for?"), a Key ("what do I offer?"), and a Value ("what I'll hand over if you pick me"). Pick a token and see its three vectors.
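As a rough sketch, the three projections are just three matrix multiplications of the same embedding. The 2×2 matrices and the embedding for "it" below are made-up values chosen so the arithmetic is easy to follow; in a trained model W_Q, W_K, and W_V are learned. The helper `project` is a hypothetical name.

```python
def project(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [
        sum(x[i] * W[i][j] for i in range(len(x)))
        for j in range(len(W[0]))
    ]

# Hypothetical projection matrices (learned in a real model).
W_Q = [[1.0, 0.0], [0.0, 1.0]]   # identity: query equals the embedding
W_K = [[0.0, 1.0], [1.0, 0.0]]   # swaps the two coordinates
W_V = [[2.0, 0.0], [0.0, 2.0]]   # doubles the embedding

x_it = [0.8, 0.7]                # made-up embedding for "it"

query = project(x_it, W_Q)
key = project(x_it, W_K)
value = project(x_it, W_V)
print(query, key, value)
```

The point to take away is that one embedding yields three different vectors, each shaped by its own matrix.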

Rung 3 · How relevant is each token?

Attention score = query · key

To decide how much one token should pay attention to another, take the dot product of the first's Query with the second's Key. Big dot product = aligned = relevant. Choose the querying token; the bars are its raw scores against every key. With "it" as the query, "cat" scores highest — exactly the link we want.
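The scoring step can be sketched in a few lines. The query for "it" and the key vectors below are hypothetical numbers, picked so that "cat" comes out on top as in the figure; they are not the figure's actual values.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 2D key vectors for each token in the running sentence.
keys = {
    "the":     [0.1, 0.2],
    "cat":     [1.0, 0.9],
    "sat":     [0.3, -0.4],
    "because": [-0.2, 0.1],
    "it":      [0.5, 0.4],
    "was":     [0.0, 0.2],
    "sleepy":  [0.6, 0.2],
}

query_it = [1.2, 0.8]  # hypothetical query vector for "it"

# One raw score per key: how well does this key align with the query?
scores = {token: dot(query_it, key) for token, key in keys.items()}
best = max(scores, key=scores.get)

for token, score in scores.items():
    print(f"{token:8s} {score:+.2f}")
print("highest:", best)
```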

Rung 4 · Turn scores into weights

Scale by √dₖ, then softmax

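Raw scores aren't a distribution. Divide by √d_k (the key dimension) so the scores don't grow with dimension: dot products of d_k-dimensional vectors have a spread that grows like √d_k, and large scores push softmax into its saturated regime, where one weight is nearly 1 and gradients vanish. Then apply softmax to get attention weights that are positive and sum to 1. Drag d_k: a small d_k divides the scores by little, so softmax sharpens toward a near one-hot pick; a large d_k shrinks the scores, flattening the weights toward "attend to everything equally."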
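Here is a small sketch of the scale-then-softmax step, mimicking the slider by trying two values of d_k on the same raw scores. The scores are hypothetical, and `attention_weights` is an illustrative helper name.

```python
import math

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, d_k):
    """Divide raw scores by sqrt(d_k), then normalize with softmax."""
    scale = math.sqrt(d_k)
    return softmax([s / scale for s in scores])

# Hypothetical raw scores for one query against seven keys.
raw = [0.28, 1.92, 0.04, -0.16, 0.92, 0.16, 0.88]

sharp = attention_weights(raw, d_k=1)    # little scaling: more peaked
flat = attention_weights(raw, d_k=64)    # heavy scaling: closer to uniform

print([round(w, 3) for w in sharp])
print([round(w, 3) for w in flat])
```

Both outputs are valid distributions; only how concentrated they are changes with d_k.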


Rung 5 · Build the new representation

Output = weighted sum of Values

The token's updated vector is the attention-weighted average of everyone's Value vectors. Relevant tokens contribute more. So "it" literally absorbs a large slice of "cat"'s value — its representation now carries the thing it refers to. That blended vector (gold) is the output of attention for this token.
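The blend itself is a plain weighted sum. In this sketch the attention weights and Value vectors are made-up numbers (the weights are chosen to sum to 1, with most of the mass on "cat"), and `weighted_sum` is a hypothetical helper name.

```python
def weighted_sum(weights, values):
    """Combine value vectors, scaling each one by its attention weight."""
    dim = len(values[0])
    return [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(dim)
    ]

# Order: the, cat, sat, because, it, was, sleepy (hypothetical numbers).
weights = [0.05, 0.60, 0.05, 0.05, 0.15, 0.05, 0.05]
values = [
    [0.0, 1.0],   # the
    [1.0, 0.8],   # cat
    [0.4, 0.0],   # sat
    [0.0, 0.2],   # because
    [0.6, 0.6],   # it
    [0.2, 0.4],   # was
    [0.8, 0.2],   # sleepy
]

output_it = weighted_sum(weights, values)
print([round(x, 3) for x in output_it])
```

Because the weights are positive and sum to 1, the output always lies somewhere inside the cloud of Value vectors, pulled toward the heavily weighted ones.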

Rung 6 · Do it for everyone at once

The attention matrix

Run that for every token in parallel and you get an N×N matrix: row i is how token i distributes its attention across all tokens (each row sums to 1). This single picture — who attends to whom — is the heart of the Transformer. Hover a row to read it. The "it" row lights up on "cat."
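Putting the last few rungs together, a minimal sketch of the full matrix for a toy 3-token input might look like this. The Q, K, and V rows are hypothetical values, and the function names are illustrative; a real implementation would use a tensor library rather than nested lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(Q, K):
    """Row i holds token i's attention weights over all tokens."""
    scale = math.sqrt(len(K[0]))  # sqrt(d_k)
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        for q in Q
    ]

def apply_attention(A, V):
    """Each output row is the weighted sum of V's rows, using one row of A."""
    return [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in A
    ]

# Hypothetical 3-token example with 2D queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

A = attention_matrix(Q, K)
out = apply_attention(A, V)
for row in A:
    print([round(w, 3) for w in row])
```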


Rung 7 · Many relationships at once

Multi-head attention

A single attention pattern tends to capture one kind of relationship at a time. So the Transformer runs several heads in parallel, each with its own W_Q/W_K/W_V, then concatenates their outputs and applies one more learned projection. For example, one head might learn to track meaning/coreference, another local adjacency, and another links to the main verb. Flip between heads:
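A bare-bones sketch of multi-head attention follows, assuming 2 heads over a 4-dimensional input. The weights are random placeholders from a seeded generator (a trained model would have learned them), and `W_O` is the name used here for the final output projection.

```python
import math
import random

def matmul(A, B):
    """Matrix product of nested lists A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    scale = math.sqrt(len(K[0]))
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        for q in Q
    ]
    return matmul(weights, V)

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the heads' outputs token by token, then project.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)  # seeded so the placeholder weights are repeatable

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d_model, d_head, num_heads = 4, 2, 2
X = rand_matrix(3, d_model)  # 3 tokens, hypothetical embeddings
heads = [tuple(rand_matrix(d_model, d_head) for _ in range(3))
         for _ in range(num_heads)]
W_O = rand_matrix(num_heads * d_head, d_model)

out = multi_head(X, heads, W_O)
print(len(out), "tokens x", len(out[0]), "dims")
```

Each head works in a smaller space (here 2 dimensions), so running two of them costs about the same as one full-width head.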

Rung 8 · Where did word order go?

Positional encoding

Self-attention on its own is blind to order: shuffle the tokens and each token gets the same output as before, just in its shuffled slot. That's a problem: "cat sat" ≠ "sat cat". The fix is to add a positional encoding to each embedding: a vector built from sines and cosines of many frequencies, so every position gets a unique, smoothly varying fingerprint the model can read. Each row below is one position's encoding across dimensions.

Rows = positions, columns = embedding dimensions. Low dims wiggle fast, high dims slow — together they pin down position.
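The sinusoidal scheme from the paper can be sketched as below: even dimensions use a sine and odd dimensions the matching cosine, with wavelengths growing geometrically up to a base of 10000. The function name and the small sizes (7 positions, 8 dimensions) are choices made for this example, and the sketch assumes `d_model` is even.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings.

    Assumes d_model is even: dimensions come in (sin, cos) pairs.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            # Frequency drops as the dimension index i grows.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # dimension i
            row.append(math.cos(angle))  # dimension i + 1
        table.append(row)
    return table

pe = positional_encoding(num_positions=7, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(x, 2) for x in row])
```

Adding `pe[pos]` element-wise to the embedding of the token at position `pos` is what gives each slot its fingerprint.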

Rung 9 · The whole block

Putting it together

Stack it up: embed the tokens, add positional encodings, then repeat a block (multi-head self-attention, a residual add & layer-norm, a position-wise feed-forward network, another add & norm) N times. Here N is the number of layers, 6 in the paper, not the sequence length from Rung 6. No recurrence, so every position is computed in parallel and any token can reach any other in one step. That parallelism plus one-hop reach is a large part of why it scaled to today's models.
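To show how the pieces connect, here is a stripped-down sketch of the repeated block. It takes several shortcuts that should be read as assumptions: attention is single-head with Q = K = V = X (no learned projections), layer norm has no learned gain or bias, the feed-forward network has no bias terms, and all weights are seeded random placeholders. The input `X` stands in for embeddings that already include positional encodings.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Simplified single-head attention with Q = K = V = X."""
    scale = math.sqrt(len(X[0]))
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in X])
        for q in X
    ]
    return matmul(weights, X)

def layer_norm(x, eps=1e-5):
    """Normalize one token's vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(X, sublayer_out):
    """Residual connection followed by layer norm, applied per token."""
    return [layer_norm([a + b for a, b in zip(x, s)])
            for x, s in zip(X, sublayer_out)]

def feed_forward(X, W1, W2):
    """Position-wise network: linear, ReLU, linear (biases omitted)."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(X, W1)]
    return matmul(hidden, W2)

def block(X, W1, W2):
    X = add_and_norm(X, self_attention(X))
    return add_and_norm(X, feed_forward(X, W1, W2))

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d_model, d_ff, num_layers = 4, 8, 6
X = rand_matrix(3, d_model)  # 3 tokens, hypothetical inputs
layers = [(rand_matrix(d_model, d_ff), rand_matrix(d_ff, d_model))
          for _ in range(num_layers)]

for W1, W2 in layers:
    X = block(X, W1, W2)
print(len(X), "tokens x", len(X[0]), "dims")
```

The shape going in matches the shape coming out, which is what lets the same block be stacked as many times as you like.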

That's the paper, ground up: tokens → vectors, three roles, dot-product relevance, scale + softmax, a weighted blend of values, an N×N matrix of it, many heads, plus positions — stacked into a block. Everything since is mostly more of this.