Blog

Short-form posts on LLM evaluation methodology, measurement in education, and whatever I'm currently thinking through. Looking for longer, structured write-ups instead? See notes.

Apr 15, 2026 · transformers
Why is my favorite LLM getting better?
I evaluate language models for a living. I built a transformer small enough to see through (three-digit addition, 17,000 parameters) to sharpen my intuitions about what my evaluation methods are actually measuring. Three observations from the build.