The Math Behind the Models
You learned this and then didn't think about it for years — dot products, matrix shapes, softmax, ReLU. This puts the intuition back, one draggable figure at a time, and ends right where the Attention walkthrough begins. Nothing here is harder than a class you've already taken.
1 · The atom
The dot product measures alignment
A vector is just a list of numbers; here, an arrow. The dot product is the one operation everything else is built from. Drag the arrows: pointing the same way makes it large and positive, perpendicular makes it zero, opposite makes it negative. That's "how much do these two agree?", and it's the same dot product that produces the attention score in the other walkthrough.
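If you'd rather poke at numbers than arrows, here is a minimal sketch of the same idea in plain Python. The helper names (`dot`, `cosine`) and the example vectors are illustrative assumptions, not values taken from the figure.

```python
import math

def dot(a, b):
    """Sum of elementwise products; both vectors must have the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by both lengths: alignment with magnitude removed."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

v = [2.0, 1.0]                    # hypothetical example vector
same = dot(v, [4.0, 2.0])         # same direction -> 10.0
perp = dot(v, [-1.0, 2.0])        # perpendicular  -> 0.0
opposite = dot(v, [-2.0, -1.0])   # opposite       -> -5.0
```

The raw dot product grows with the length of the arrows as well as their alignment. Dividing by both lengths, as `cosine` does, leaves only the "agreement" part, which ranges from -1 to 1.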
2 · A matrix is a verb, not a table
A matrix transforms space
The trick that makes matrices click: a matrix is a function that bends space in straight lines. Its columns say where the two basis arrows land. Drag the four numbers and watch the grid rotate, stretch, and shear. The determinant is how much area got scaled (negative = flipped).
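The "columns are where the basis arrows land" idea can be checked by hand. What follows is a rough sketch for the 2×2 case only; `apply`, `det`, and the three example matrices are names and values made up for illustration.

```python
def apply(m, v):
    """Apply a 2x2 matrix (a list of two rows) to a 2D vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def det(m):
    """Signed area scale factor of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

shear = [[1.0, 1.0],
         [0.0, 1.0]]
stretch = [[2.0, 0.0],
           [0.0, 3.0]]
swap = [[0.0, 1.0],
        [1.0, 0.0]]   # swaps the two axes, which flips orientation

e1_lands = apply(shear, [1.0, 0.0])   # first column of shear:  [1.0, 0.0]
e2_lands = apply(shear, [0.0, 1.0])   # second column of shear: [1.0, 1.0]

areas = [det(shear), det(stretch), det(swap)]   # [1.0, 6.0, -1.0]
```

The shear slides the grid sideways without changing area (determinant 1), the stretch multiplies area by 6, and the swap keeps area but flips the plane over, which is what the negative sign records.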
3 · Why the dimensions line up
Matrix multiplication, decoded
You remember the rule, (m×n)(n×p) = (m×p), inner dims must match, but here's why: every output cell is a dot product of a row from the left with a column from the right. That row and that column must be the same length (the shared n) for the dot product to exist. Step through the cells and watch each one get built.
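Here is the cell-by-cell construction written out as a deliberately slow sketch, so each output cell is visibly one row-times-column dot product. The function name `matmul` and the sample matrices are assumptions for illustration; real code would use a library routine instead.

```python
def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix, one dot product per cell."""
    m, n = len(a), len(a[0])
    if len(b) != n:
        raise ValueError("inner dimensions must match")
    p = len(b[0])
    out = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            row = a[i]                            # length n
            col = [b[k][j] for k in range(n)]     # length n
            out[i][j] = sum(r * c for r, c in zip(row, col))
    return out

a = [[1, 2, 3],
     [4, 5, 6]]            # 2 x 3
b = [[7, 8],
     [9, 10],
     [11, 12]]             # 3 x 2
c = matmul(a, b)           # 2 x 2: [[58, 64], [139, 154]]
```

The top-left cell is 1·7 + 2·9 + 3·11 = 58: the first row of `a` dotted with the first column of `b`. Try `matmul(a, a)` and the shape check fails, because a row of length 3 has no column of length 3 to pair with.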
4 · The bend
Activation functions add the curve
Stack matrices and you still only get one big matrix — a flat, linear map. The activation function is the nonlinear squash between layers that lets a network learn curves. Drag the input and compare the classics: ReLU = max(0, x) (cheap, the default), sigmoid squashes to (0, 1), tanh to (−1, 1).
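To see both halves of that claim in numbers, here is a small sketch using one-dimensional "layers" (single weights instead of matrices). The weights `w1` and `w2` and the helper names are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

w1, w2 = 3.0, -2.0   # hypothetical weights for two stacked 1D layers

def linear_stack(x):
    """Two linear layers with nothing between them: same as (w2 * w1) * x."""
    return w2 * (w1 * x)

def relu_stack(x):
    """The same two layers with a ReLU between them."""
    return w2 * relu(w1 * x)

collapsed = [linear_stack(x) == (w2 * w1) * x for x in (-1.0, 0.5, 2.0)]
bent = (relu_stack(1.0), relu_stack(-1.0))   # -6.0 on one side, zero on the other
```

Without the activation, the stack is indistinguishable from a single weight of -6. With ReLU in the middle, negative inputs are zeroed out, so the output is no longer a straight line through the origin.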
5 · Put them together
A neuron is a dot product + a squash
One neuron computes activation(w · x + b): take the dot product of the input x with a weight vector w, add a bias b, and run the result through an activation. The weights and bias define a line (the decision boundary); the activation colors which side is "on." Drag the weight arrow to tilt the boundary and the bias to shift it.
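A single neuron fits in a few lines. This sketch assumes a sigmoid activation and made-up weights, so the boundary here is the line x1 + x2 = 1; the figure may use a different activation or different values.

```python
import math

def neuron(x, w, b):
    """Dot product of input and weights, plus bias, squashed by a sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

w = [1.0, 1.0]   # hypothetical weights: the boundary is perpendicular to this arrow
b = -1.0         # hypothetical bias: shifts the boundary away from the origin

on_side = neuron([1.0, 1.0], w, b)       # z = 1  -> above 0.5
off_side = neuron([0.0, 0.0], w, b)      # z = -1 -> below 0.5
on_boundary = neuron([0.5, 0.5], w, b)   # z = 0  -> exactly 0.5
```

Points where w · x + b is zero sit on the boundary and come out at 0.5. Changing `w` rotates that line; changing `b` slides it without rotating.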
6 · From scores to a decision
Softmax turns scores into probabilities
Finally, softmax: exponentiate a vector of scores and normalize so they sum to 1 — a "soft argmax." Drag the bars to set the raw scores; the bottom row is the probability the model assigns each option. The temperature sharpens (cold → confident, one winner) or flattens (hot → unsure). These are precisely the attention weights — so now you're ready for Attention Is All You Need.
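The whole recipe, temperature included, is short enough to write out. In this sketch the `temperature` parameter divides the scores before exponentiating, and the maximum is subtracted first; that subtraction is a common numerical-stability trick added here, not something the text above describes, and it does not change the result. The scores are made up.

```python
import math

def softmax(scores, temperature=1.0):
    """Exponentiate scores / temperature and normalize so they sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    top = max(scaled)                            # subtract the max to avoid overflow
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                     # hypothetical raw scores
default = softmax(scores)                    # temperature 1
cold = softmax(scores, temperature=0.1)      # sharper: nearly one winner
hot = softmax(scores, temperature=10.0)      # flatter: close to uniform
```

The ranking of the options never changes with temperature; only how much of the probability the top option takes. As the temperature approaches zero the output approaches a hard argmax, which is where the "soft argmax" nickname comes from.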
That's the toolkit: a dot product measures agreement, a matrix bends space, multiplying them is rows-times-columns of dot products, activations add the curves, a neuron is a dot product plus a squash, and softmax turns scores into a decision. Next: see them all working at once in Attention, or watch one learn in Gradient Descent and Neural Boundary.