7  Connections

Takeaway

The gap between this 17,000-parameter addition model and GPT-4 is scale, not architecture. Multi-head attention, LoRA, and billion-parameter LLMs all use exactly what you’ve built — just more of it.

7.1 What you built

Token Embedding + Position Embedding
    -> [LayerNorm -> Self-Attention -> Residual -> LayerNorm -> FFN -> Residual] x N
    -> LayerNorm -> Linear Head
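The stack above can be sketched in a few dozen lines of PyTorch. This is an illustrative reimplementation, not the book's exact code: class names, the use of `nn.MultiheadAttention`, and the default dimensions are assumptions.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm decoder block: LN -> self-attention -> residual, LN -> FFN -> residual."""
    def __init__(self, d_model=32, n_heads=1, d_ff=64):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True marks positions each token may NOT attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                          # residual around attention
        x = x + self.ffn(self.ln2(x))      # residual around FFN
        return x

class TinyDecoder(nn.Module):
    """Token + position embeddings -> N blocks -> LayerNorm -> linear head."""
    def __init__(self, vocab=14, ctx=13, d_model=32, n_layers=2, d_ff=64):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(ctx, d_model)
        self.blocks = nn.ModuleList(Block(d_model, 1, d_ff) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab, bias=False)

    def forward(self, idx):
        x = self.tok(idx) + self.pos(torch.arange(idx.size(1), device=idx.device))
        for b in self.blocks:
            x = b(x)
        return self.head(self.ln_f(x))     # logits over the vocabulary
```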

This is the decoder-only transformer architecture. GPT-2, GPT-4, Llama, Claude — all of them use this exact stack. The differences:

                  This model   GPT-2 Small   Llama 3 8B
Parameters        17,760       124M          8B
Layers            2            12            32
d_model           32           768           4,096
Attention heads   1            12            32
d_ff              64           3,072         14,336
Vocabulary        14           50,257        128,256
Context length    13           1,024         8,192

The scaling is dramatic but the components are identical. Our model is deliberately oversized for its task — we chose d_model=32 and two layers so every component is large enough to inspect. AdderBoard shows that 67 trained parameters suffice for the harder 10-digit version. We trade efficiency for interpretability: bigger matrices mean clearer attention patterns and more readable embedding structure.
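The table's parameter counts can be roughly reproduced from the dimensions alone. The helper below is a sketch that ignores biases and assumes a tied output head, so its totals won't match our model's 17,760 exactly; the point is that the same formula spans both scales.

```python
def approx_params(vocab, ctx, d_model, n_layers, d_ff):
    """Rough decoder-only parameter count (no biases, tied output head)."""
    embeddings = vocab * d_model + ctx * d_model   # token + position embeddings
    per_layer = (
        4 * d_model * d_model   # Q, K, V, O projections
        + 2 * d_model * d_ff    # FFN up- and down-projection
        + 4 * d_model           # two LayerNorms (gain and bias each)
    )
    return embeddings + n_layers * per_layer + 2 * d_model  # + final LayerNorm

print(f"{approx_params(14, 13, 32, 2, 64):,}")              # → 17,568
print(f"{approx_params(50_257, 1_024, 768, 12, 3_072):,}")  # → 124,356,864
```

The GPT-2 Small estimate lands on ~124M because GPT-2 really does tie its output head to the token embedding; the small gap for our model comes from bias and tying choices.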

7.2 Multi-head attention

We used single-head attention: one set of Q, K, V projections. Multi-head attention runs \(h\) parallel attention heads, each with dimension \(d_k = d_{model} / h\):

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O\]

Each head can learn a different attention pattern. In our addition model, one head might learn column alignment while another learns carry propagation. With a single head, both functions must be packed into one attention matrix.

For our tiny model, one head is sufficient (the two layers can split the work). At scale, multi-head attention is essential.
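A minimal sketch of the split-project-concat pattern, written from the formula above. The class name and the decision to fold all heads into single `d_model x d_model` projections are conventional choices, not the book's implementation.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """h parallel attention heads, each of dimension d_k = d_model // h."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        # Reshape so each head attends independently in its own d_k slice.
        def split(t):
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)  # (B, h, T, d_k)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5               # (B, h, T, T)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))                 # causal mask
        out = torch.softmax(scores, dim=-1) @ v                          # (B, h, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, -1)                      # Concat(head_1..head_h)
        return self.W_o(out)
```

With `n_heads=1` this reduces exactly to the single-head attention we built; the only new machinery is the reshape.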

7.3 The 36-parameter hand-coded solution

Alex Litz showed you can set every weight in a 1-layer, single-head transformer by hand to solve 10-digit addition with only 36 parameters. Hand-coded solutions have since reached as few as 12 parameters; the smallest trained model to reach 100% accuracy uses just 67. This is a constructive proof that the architecture can represent addition — and that the algorithm it needs is remarkably compact.

The hand-coded weights implement:

  • Attention: Each output position attends to the two input positions in its column (the ones output looks at both ones digits, etc.)
  • FFN: Computes the sum and carry for each column
  • Output head: Maps the computed sum back to digit tokens

Our trained model discovers a similar solution through gradient descent, though not an identical one: SGD finds its own representation, which may be more distributed.

A notable pattern from AdderBoard: every top solution — hand-coded and trained — uses a single layer. This seems to contradict Chapter 5, where our one-layer ablation showed degraded accuracy. The resolution: one layer is sufficient for the task but harder to train with our budget. The second layer makes learning easier (the two layers can divide labor between column alignment and carry propagation), not strictly necessary. Given enough training epochs and the right tricks (curriculum learning, multi-stage fine-tuning), a single layer finds the algorithm.

7.4 Low-rank factorization and LoRA

One trained AdderBoard solution solves the task with only 311 parameters by discovering that the weight matrices can be factored into low-rank products:

\[W = AB \quad \text{where } A \in \mathbb{R}^{d \times r}, B \in \mathbb{R}^{r \times d}, r \ll d\]

This is exactly the mathematical idea behind LoRA (Low-Rank Adaptation), the most popular method for fine-tuning large language models:

  • Start with a pretrained weight matrix \(W_0\)
  • Add a trainable low-rank update: \(W = W_0 + BA\)
  • Only train \(B\) and \(A\) (much fewer parameters than full \(W\))
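The three steps above fit in a small module. This is a sketch of the LoRA idea under stated assumptions: the class name, the zero-initialization of \(B\), and the `alpha / r` scaling are conventional LoRA choices, not part of our addition model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained W0 plus a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # W0 stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))         # up-projection, zero-init
        self.scale = alpha / r                               # update starts at exactly zero

    def forward(self, x):
        # W0 x + scale * B A x, computed without materializing the full update.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Wrapping a `d x d` linear layer this way trains `r * 2d` parameters instead of `d^2`; for large models with small `r`, that is a reduction of several orders of magnitude.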

The connection: our addition model’s weight matrices, after training, have effective rank much lower than their nominal dimension. The model naturally discovers low-rank structure because the task doesn’t require the full expressivity of the weight space.

7.5 Grokking

If your training curves showed a sudden accuracy jump after a long plateau, you observed grokking — a phenomenon where the model memorizes the training data first, then later discovers the generalizable algorithm.

This is relevant beyond toy tasks:

  • For synthetic agents (your research): A model that produces fluent outputs may have memorized patterns without learning the underlying structure. The fidelity/effectiveness distinction maps to memorization/generalization.
  • For LLM evaluation: High training loss doesn’t mean the model hasn’t learned; it may be in the pre-grokking phase. Conversely, low loss doesn’t mean it’s generalized.

7.6 What this means for your work

If you’re using transformers as components in statistical inference (synthetic students, conversation analysis):

  1. The architecture constrains the hypothesis class. The set of functions a transformer can represent depends on depth, width, and attention structure. For tasks with clear algorithmic structure (like addition), small models suffice.

  2. Representation choices are statistical choices. Reversing the output digits wasn’t a “trick” — it was choosing a parameterization that aligns with the causal structure of the problem. Same principle applies to how you encode conversation turns, student responses, or any sequential data.

  3. Attention patterns are interpretable alignment. When you use a transformer to model conversations, the attention weights tell you which previous utterances the model considers relevant for predicting the next one. This is directly inspectable, as we showed.

  4. Ablations quantify component contributions. The same approach we used here (remove a component, measure the effect) applies to understanding what a fine-tuned model learned vs what was already in the base model.

7.7 Next steps

  • Inspect weight matrices directly. model.blocks[0].attn.W_Q.weight is a 32x32 matrix. What structure does it have? Is it low-rank?
  • Try 5-digit or 10-digit addition. Does the model need more layers? How does parameter count scale?
  • Implement multi-head attention. Split d_model into parallel heads. Does it help for 3-digit addition? When does it become necessary?
  • Compare to the hand-coded solution. Load Alex Litz’s 36-parameter weights and verify they produce correct addition. Compare the attention patterns to your trained model.
  • Try curriculum learning. Start with 1-digit, graduate to 3-digit. The AdderBoard solutions use this for 10-digit addition.
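For the first item, a short SVD check is enough to answer the rank question. The helper below is a sketch; the attribute path in the comment is the one quoted in the bullet above, and the 1% threshold is an arbitrary cutoff for "effectively zero" singular values.

```python
import torch

def effective_rank(W: torch.Tensor, threshold=0.01):
    """Count singular values above threshold * largest: a proxy for effective rank."""
    s = torch.linalg.svdvals(W)        # singular values, descending
    return int((s > threshold * s[0]).sum())

# For a trained model (attribute path from the bullet above):
# W_Q = model.blocks[0].attn.W_Q.weight.detach()
# print(effective_rank(W_Q), "of", min(W_Q.shape))
```

If the effective rank comes out well below 32, the trained matrix has the same low-rank structure that the 311-parameter solution exploits explicitly.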

7.8 What to take away

Five lessons from building a transformer, each of which generalizes beyond addition:

  1. The architecture is simple. Five components, each with a clear role: embeddings map tokens to vectors, positions encode order, attention gathers information across positions, FFNs compute on the gathered information, and residuals keep the signal stable. Everything in GPT-4 is a scaled version of what you built. When someone describes a new transformer variant, you can now ask: which of these five components did they change, and why?

  2. Representation choices are statistical choices. How you encode the input constrains what the model can learn. Reversing the output digits transformed a nearly impossible problem into a trivial one — same model, same data, same optimizer. This is the same principle as choosing a parameterization in any statistical model: the representation defines the effective hypothesis class. When you use transformers in practice, tokenization and input formatting are the highest-leverage decisions.

  3. Attention is learnable feature selection. The model decides which context matters for each prediction, and this decision is directly inspectable via attention maps. You verified that the addition model attends to the correct columns, and that the reversal model attends to the mirror positions. This transparency is rare in deep learning and valuable: when a model’s attention doesn’t match the expected algorithm, that’s a signal that it may be solving the task through memorization rather than computation.

  4. Understanding means predicting failure. If you can predict what breaks when you remove a component, you understand what it does. No positional embeddings \(\to\) can’t distinguish columns. No causal mask \(\to\) train/test mismatch. One layer \(\to\) limited carry propagation. Each prediction is a test of your mental model. Apply the same approach to any model you use: ablate components, predict the failure mode, check.

  5. The architecture is a hypothesis class. It defines the set of possible functions; data plus optimization select one. The same transformer architecture produces column-alignment attention for addition and anti-diagonal attention for reversal. The architecture doesn’t “know” about addition — it provides a space of learnable functions flexible enough to include the addition algorithm. Understanding this distinction is essential for knowing what a trained model has actually learned vs. what it could potentially learn.

When you use a transformer in practice — for generating synthetic data, analyzing conversations, or any other task — the same principles apply. Check the attention patterns. Ablate components. Verify that the model learned the right algorithm, not just the right outputs. The tools you built in this book are the tools of mechanistic understanding.