1  The Problem

Takeaway

How you represent the problem constrains what the model can learn. Reversing the output digits — ones first — is the difference between 1% and 95%+ accuracy. Same model, same optimizer, same data.

1.1 Setup

Every chapter starts by loading _common.py, which contains all shared infrastructure: the tokenizer, dataset generators, the model definition, and training utilities. We also fix the random seed so results are reproducible — the same seed produces the same initial weights and the same training trajectory.

from _common import *
torch.manual_seed(42)
print(f"Device: {DEVICE}")
print(f"PyTorch {torch.__version__}")
Device: mps
PyTorch 2.7.1

DEVICE tells us whether we’re running on CPU, GPU (CUDA), or Apple Silicon (MPS). The model is small enough that CPU works fine, but GPU speeds up training in later chapters.

1.2 What we’re solving

We’re teaching a neural network to add numbers. Not by giving it an add function, but by showing it thousands of examples like 123+456=9750 (yes, 9750; more on that shortly) and letting it discover the algorithm.

Why addition? Because it’s an algorithm with known structure. We can predict exactly what the model should learn — column alignment, carry propagation — and verify whether it did. It’s complex enough to require the key transformer components (attention for routing information between positions, FFN for computing sums) but simple enough that 17,000 parameters suffice on a laptop. And it’s a real benchmark: AdderBoard tracks the smallest transformers that can add 10-digit numbers, with trained solutions as small as 67 parameters reaching 100% accuracy. We use 3-digit addition and a deliberately larger model so we can see every step clearly.

The model sees tokens (integers), not characters. Step one: build the mapping.

print(f"Vocabulary ({VOCAB_SIZE} tokens):")
for token, idx in sorted(VOCAB.items(), key=lambda x: x[1]):
    print(f"  '{token}' -> {idx}")
Vocabulary (14 tokens):
  '0' -> 0
  '1' -> 1
  '2' -> 2
  '3' -> 3
  '4' -> 4
  '5' -> 5
  '6' -> 6
  '7' -> 7
  '8' -> 8
  '9' -> 9
  '+' -> 10
  '=' -> 11
  '<PAD>' -> 12
  '<EOS>' -> 13

14 tokens: the digits 0–9, the symbols + and =, a padding token <PAD>, and an end-of-sequence marker <EOS>. That’s the entire language.

<EOS> tells the model when the answer is complete — without it, the model wouldn’t know when to stop generating digits. <PAD> is reserved for padding sequences to equal length in a batch, though our sequences are all the same length (13 tokens: aaa+bbb=ssss<EOS>), so it’s never actually used here. It’s standard practice to include it in the vocabulary even when not needed.
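The exact construction inside _common.py isn’t shown here, but a vocabulary matching the printed table can be built in a few lines. This is a sketch; the names (`VOCAB`, `VOCAB_INV`, `VOCAB_SIZE`) mirror those used above, though the real module may construct them differently.

```python
# Sketch of a vocabulary like the one _common.py exposes.
# Order matters: digits first, then the special symbols.
DIGITS = [str(d) for d in range(10)]
SPECIALS = ["+", "=", "<PAD>", "<EOS>"]

VOCAB = {tok: i for i, tok in enumerate(DIGITS + SPECIALS)}
VOCAB_INV = {i: tok for tok, i in VOCAB.items()}  # id -> token, for decoding
VOCAB_SIZE = len(VOCAB)

print(VOCAB_SIZE)        # 14
print(VOCAB["+"])        # 10
print(VOCAB_INV[13])     # <EOS>
```

A pleasant side effect of putting digits first: each digit character is its own token id, so '7' encodes to 7.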

1.3 Encoding addition

inp, tgt = encode_addition(123, 456)
print(f"123 + 456 = 579:")
print(f"  Input:  {inp} -> '{decode_tokens(inp)}'")
print(f"  Target: {tgt} -> '{decode_tokens(tgt)}'")
123 + 456 = 579:
  Input:  [1, 2, 3, 10, 4, 5, 6, 11] -> '123+456='
  Target: [9, 7, 5, 0, 13] -> '9750<EOS>'

Notice the target: 9750, not 0579. The sum is reversed — ones digit first.

This is the single most important design choice in the entire project. Here’s why.

The model generates the answer one token at a time, left to right. Each digit it produces becomes part of the context for predicting the next digit. This makes the output order critical:

MSD-first (thousands first): To predict the leading digit, the model must determine whether there’s a carry cascading all the way from the ones column. It has to solve the entire problem before emitting the first token.

LSD-first (ones first): The model predicts the ones digit first (just add the ones columns, mod 10). Then it predicts the tens digit, already knowing whether there was a carry from the ones. Each digit only depends on the columns already processed. This is exactly how grade-school addition works — right to left.

Every top solution on AdderBoard uses reversed output. We tried MSD-first; the model got stuck at 1% accuracy.
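The LSD-first procedure the model has to discover can be written out directly. This sketch is not part of _common.py; it just makes the carry dependence explicit — each output digit is final as soon as its own column and the incoming carry are known, with no lookahead.

```python
def add_lsd_first(a: int, b: int, n_digits: int = 3) -> list[int]:
    """Add two n-digit numbers column by column, ones first,
    emitting each sum digit as soon as its carry is resolved."""
    a_digits = [int(d) for d in f"{a:0{n_digits}d}"[::-1]]  # ones first
    b_digits = [int(d) for d in f"{b:0{n_digits}d}"[::-1]]
    out, carry = [], 0
    for da, db in zip(a_digits, b_digits):
        s = da + db + carry
        out.append(s % 10)   # this digit never changes again
        carry = s // 10
    out.append(carry)        # possible overflow digit (0 if none)
    return out               # LSD-first, matching the model's target order

print(add_lsd_first(123, 456))  # [9, 7, 5, 0] -- the reversed sum 9750
print(add_lsd_first(999, 1))    # [0, 0, 0, 1] -- carries cascade left, one step at a time
```

Run the loop MSD-first instead and you’d need to resolve every carry before emitting anything — which is exactly the burden MSD-first output places on the model.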

Connection: Representation as parameterization

Reversing the output isn’t a “trick” — it’s choosing a parameterization that aligns with the causal structure of the problem. In statistics, this is the same principle as choosing a link function in GLMs or a prior that matches the generative process. The model’s job is easier when the representation reflects the structure of the problem.

Grade-school addition works right to left because each column’s result depends only on the column itself and a carry from the previous (rightward) column. MSD-first output forces the model to predict the last thing computed (the leading digit, which depends on all carries) as the first output token. LSD-first output aligns the generation order with the computational order. The representation respects the causal graph.
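For reference, here is a hedged reimplementation of encode_addition consistent with the example shown earlier. The real version lives in _common.py and may differ in details; the token ids (10 for +, 11 for =, 13 for <EOS>) are taken from the vocabulary table above.

```python
PLUS, EQ, EOS = 10, 11, 13  # ids from the vocabulary table

def encode_addition(a: int, b: int) -> tuple[list[int], list[int]]:
    """Sketch: encode 'aaa+bbb=' and the reversed, zero-padded sum."""
    def digits(n: int, width: int) -> list[int]:
        return [int(d) for d in f"{n:0{width}d}"]

    inp = digits(a, 3) + [PLUS] + digits(b, 3) + [EQ]
    tgt = digits(a + b, 4)[::-1] + [EOS]   # pad to 4 digits, then reverse
    return inp, tgt

print(encode_addition(123, 456))
# ([1, 2, 3, 10, 4, 5, 6, 11], [9, 7, 5, 0, 13])
```

The zero-padding matters: every target has exactly four sum digits, so the model never has to decide how long the answer is — only what each digit is and when to emit <EOS>.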

Let’s lay out all 13 token positions to make the reversed encoding explicit.

# The full training sequence, laid out:
full = inp + tgt
print("Full sequence the model sees during training:")
print(f"  Tokens: {full}")
print(f"  Decoded: '{decode_tokens(full)}'")
print()
for i, t in enumerate(full):
    role = ""
    if i < 3: role = f"operand a, {'hundreds' if i==0 else 'tens' if i==1 else 'ones'}"
    elif i == 3: role = "operator"
    elif i < 7: role = f"operand b, {'hundreds' if i==4 else 'tens' if i==5 else 'ones'}"
    elif i == 7: role = "equals"
    elif i < 12: role = f"sum, {'ones' if i==8 else 'tens' if i==9 else 'hundreds' if i==10 else 'thousands'}"
    elif i == 12: role = "end of sequence"
    print(f"  pos {i:2d}: '{VOCAB_INV[t]}' ({t:2d})  {role}")
Full sequence the model sees during training:
  Tokens: [1, 2, 3, 10, 4, 5, 6, 11, 9, 7, 5, 0, 13]
  Decoded: '123+456=9750<EOS>'

  pos  0: '1' ( 1)  operand a, hundreds
  pos  1: '2' ( 2)  operand a, tens
  pos  2: '3' ( 3)  operand a, ones
  pos  3: '+' (10)  operator
  pos  4: '4' ( 4)  operand b, hundreds
  pos  5: '5' ( 5)  operand b, tens
  pos  6: '6' ( 6)  operand b, ones
  pos  7: '=' (11)  equals
  pos  8: '9' ( 9)  sum, ones
  pos  9: '7' ( 7)  sum, tens
  pos 10: '5' ( 5)  sum, hundreds
  pos 11: '0' ( 0)  sum, thousands
  pos 12: '<EOS>' (13)  end of sequence

Notice the causal structure: each output position depends only on positions to its left. Position 8 (ones) can be computed from the inputs alone; position 9 (tens) can use position 8’s result to handle carries.
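During training, a sequence like this is typically split into a shifted (input, target) pair for next-token prediction. The sketch below assumes that standard setup; the actual training utilities in _common.py may batch or mask differently.

```python
# Standard next-token pair for the sequence '123+456=9750<EOS>'.
full = [1, 2, 3, 10, 4, 5, 6, 11, 9, 7, 5, 0, 13]

x = full[:-1]  # model input: positions 0..11
y = full[1:]   # targets: the token at x[i] must predict y[i]

# Only the answer positions usually count toward the loss:
# from '=' (position 7) onward, x[i] predicts the sum digits and <EOS>.
answer_targets = y[7:]
print(answer_targets)  # [9, 7, 5, 0, 13]
```

Masking the loss to the answer positions keeps the model from wasting capacity on predicting the (random) operand digits.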

1.4 The second task: reversal

We’ll also train a model on sequence reversal — same vocabulary, same architecture, but a completely different computation. This gives us a contrast: addition needs column-alignment attention, reversal needs mirror-pattern attention.

seq = [1, 2, 3, 4, 5]
inp_r, tgt_r = encode_reversal(seq)
print(f"Reverse {seq}:")
print(f"  Input:  '{decode_tokens(inp_r)}'")
print(f"  Target: '{decode_tokens(tgt_r)}'")
Reverse [1, 2, 3, 4, 5]:
  Input:  '12345='
  Target: '54321<EOS>'

Same tokens, different task. The architecture is identical — only the training data changes what the attention patterns look like.
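A hedged reimplementation of encode_reversal, consistent with the output shown above (the real version is in _common.py). It relies on digits being their own token ids, plus the = and <EOS> ids from the vocabulary table.

```python
EQ, EOS = 11, 13  # ids for '=' and '<EOS>' from the vocabulary table

def encode_reversal(seq: list[int]) -> tuple[list[int], list[int]]:
    """Sketch: encode 'ddddd=' and its mirror image as the target."""
    inp = list(seq) + [EQ]            # digit tokens are their own ids
    tgt = list(reversed(seq)) + [EOS]
    return inp, tgt

inp_r, tgt_r = encode_reversal([1, 2, 3, 4, 5])
print(inp_r)  # [1, 2, 3, 4, 5, 11]
print(tgt_r)  # [5, 4, 3, 2, 1, 13]
```

Note the contrast with addition: each target token here copies exactly one input position (a pure routing problem for attention), while each sum digit combines two input digits plus a carry.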

Check your understanding
  1. Why does reversing the output help? (Hint: which digit can the model predict without seeing any other output digits? Which digit requires knowledge of all other output digits?)

  2. If we used a standard feed-forward network (input \(\to\) hidden layers \(\to\) output, no attention) instead of a transformer, what information would it lack? What structural assumption does a feed-forward network make about the relationship between input positions?

  3. The reversal task and addition task use the same vocabulary. What’s fundamentally different about the computation required? Think about whether each output token depends on one input position or many.