Notes
Longer things I've built while working through a topic: runnable notebooks, structured write-ups, book-length guides. Meant to be worked through or dipped into, not read start to finish. Looking for dated essays instead? See the blog.
How a Transformer Learns Addition
A decoder-only transformer built component by component, trained on three-digit addition. Every piece (token embeddings, self-attention, MLPs, layer norm, residuals) is the same as in GPT-4, just smaller and inspectable. 17,000 parameters, 50,000 training examples, roughly a weekend of reading.
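The components listed above fit together in a standard way; a minimal sketch of that kind of model, assuming PyTorch, might look like the following. Every name, size, and the digit tokenization here are illustrative guesses, not code from the notebook.

```python
# A minimal decoder-only transformer sketch, assuming PyTorch.
# Sizes, names, and the tokenization are illustrative, not the notebook's.
import torch
import torch.nn as nn

class Block(nn.Module):
    """One decoder block: pre-norm self-attention and MLP, each with a residual."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True marks positions a token may NOT attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                 # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around the MLP
        return x

class TinyDecoder(nn.Module):
    """Token + position embeddings -> decoder blocks -> final norm -> logits."""
    def __init__(self, vocab_size=13, d_model=32, n_heads=4, n_layers=2, max_len=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))   # next-token logits at every position

# Hypothetical vocabulary: digits 0-9, then '+' = 10, '=' = 11, end = 12.
model = TinyDecoder()
print(sum(p.numel() for p in model.parameters()))  # a few tens of thousands
tokens = torch.tensor([[1, 2, 3, 10, 4, 5, 6, 11]])  # "123+456="
logits = model(tokens)                               # shape (1, 8, 13)
```

One liberty taken for brevity: nn.MultiheadAttention stands in for hand-written attention, whereas a component-by-component build like the note's would write that math out explicitly. Training would pair the per-position logits with cross-entropy over the answer tokens.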