Mike Ion
I build statistical methods for human interaction data at scale: the infrastructure that organizes it, the features that make it analyzable, and the evaluation frameworks that tell us when the AI systems on top are doing what we asked them to.
Current research
- Evaluation of LLM output distributions. Per-prompt similarity metrics (cosine, BERTScore) can systematically misrank instruction-tuned models: instruction tuning shifts the center of the output distribution while preserving its spread, and cosine tracks the spread and misses the shift. Direct implications for training-data filtering pipelines.
- Temporal structure in multi-party conversation. Lightweight single-pass features (burst coefficient, cluster density, response acceleration, interaction momentum, timing consistency) recover six conversational archetypes on MathMentorDB, with resolution rates ranging from above 90% to under 30%. Earlier Bayesian hierarchical Hawkes analysis presented at JSM 2025.
- Discourse classification with LLMs. A 32-move pedagogical taxonomy labeled by three production LLMs across thousands of MathMentorDB messages. Inter-model agreement is high for coarse categories but diverges on fine-grained moves; discourse-move sequences predict whether conversations resolve or get abandoned.
- Conversational data infrastructure. MathMentorDB: 200,332 structured math-tutoring conversations reconstructed from 5.5M raw Discord messages via a disentanglement pipeline combining message timing, participant co-occurrence, and topic coherence. Submitted to LREC-COLING 2026 for public release.
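The burst coefficient in the feature list above corresponds, in its standard form, to Goh–Barabási burstiness of inter-message gaps. A minimal sketch (the function name and toy timestamps are mine, not taken from the paper):

```python
import numpy as np

def burst_coefficient(timestamps):
    """Goh-Barabasi burstiness of inter-message gaps:
    B = (sigma - mu) / (sigma + mu), in [-1, 1].
    -1 = perfectly regular, 0 = Poisson-like, +1 = highly bursty."""
    gaps = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    mu, sigma = gaps.mean(), gaps.std()
    return float((sigma - mu) / (sigma + mu))

# A perfectly regular schedule scores -1; clustered bursts score above 0.
regular = np.arange(0, 600, 60)  # one message a minute, in seconds
bursty = np.concatenate([np.arange(5.0), 300 + np.arange(5.0)])
```

The single-pass nature of features like this is what makes them cheap at MathMentorDB scale: each is one reduction over a conversation's timestamp vector.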
Selected papers
Across 8,290 paired human and model continuations, cosine similarity ranks instruction-tuned models as most human-like in 5 of 6 genres, while kernel MMD² ranks those same models as least human-like. A 67-feature linguistic decomposition explains the disagreement: instruction tuning shifts the distribution's center without changing its spread. Per-prompt similarity is insensitive to the shift; population-level distance responds to both.
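The mechanism is easy to reproduce on synthetic embeddings: give "model" vectors the same per-prompt structure and spread as "human" ones, then add a constant mean shift. Per-pair cosine barely moves; a kernel MMD² estimate responds sharply. All dimensions, scales, and the RBF bandwidth below are invented for illustration; this is a sketch of the phenomenon, not the paper's 8,290-pair setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 32
prompts = rng.normal(size=(n, d))                  # shared per-prompt structure
human = prompts + 0.3 * rng.normal(size=(n, d))
matched = prompts + 0.3 * rng.normal(size=(n, d))  # same mean, same spread
shifted = matched + 0.3                            # constant mean shift only

def mean_cosine(a, b):
    """Average per-prompt cosine similarity between paired rows."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

def mmd2(x, y, gamma=1 / 64):
    """Biased RBF-kernel MMD^2 between two samples."""
    def gram(p, q):
        d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean())

# Cosine stays high for both model sets; MMD^2 jumps for the shifted one.
```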
A propensity-score MSE framework for assessing whether LLM-generated conversations match the distribution of real ones. L1-regularized classifiers on interpretable feature subsets reveal that surface and cognitive fidelity are coupled: improving one without attending to the other can worsen overall fidelity. The framework turns a scalar quality score into a diagnostic for iterative prompt refinement.
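A minimal version of the detectability score can be sketched directly: fit a logistic "real vs. synthetic" classifier, then measure how far its propensities stray from 0.5 (zero means indistinguishable). This toy uses plain gradient descent, L2 rather than L1 regularization, and invented Gaussian features; the paper's interpretable-feature-subset version is more involved.

```python
import numpy as np

def propensity_mse(real, synth, l2=1e-2, steps=500, lr=0.1):
    """Fit a logistic real-vs-synthetic classifier by gradient descent,
    then return the mean squared deviation of its propensities from 0.5.
    0 = indistinguishable distributions; larger = more detectable."""
    X = np.vstack([real, synth])
    X = np.hstack([X, np.ones((len(X), 1))])      # bias column
    y = np.r_[np.ones(len(real)), np.zeros(len(synth))]
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return float(((p - 0.5) ** 2).mean())
```

Used as a diagnostic: score a batch of generated conversations, inspect which features carry the classifier's weight, revise the prompt, and re-score.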
A hierarchical Bayesian Hidden Markov Model fit to 2,437 math tutoring conversations (51k+ messages) from MathMentorDB, using LLM-classified discourse moves. Four latent pedagogical states emerge unsupervised (Problem Introduction, Exploration, Lecturing, Working). Resolved and unresolved sessions differ less in what tutors say than in whether students are actively attempting inference — replicating impasse-driven learning findings at two orders of magnitude greater scale than prior manually-annotated studies.
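For intuition, the decoding half of such a model fits in a few lines: given transition and emission tables over four latent states and a handful of discourse-move types, Viterbi recovers the most likely state path for a message sequence. Every number below is invented for illustration; the paper estimates the real tables hierarchically with Bayesian inference rather than fixing them by hand.

```python
import numpy as np

states = ["Problem Introduction", "Exploration", "Lecturing", "Working"]
moves = ["question", "explanation", "attempt", "confirmation"]

pi = np.array([0.7, 0.1, 0.1, 0.1])          # start-state probs (invented)
A = np.array([[0.2, 0.5, 0.2, 0.1],          # transition matrix (invented)
              [0.1, 0.5, 0.2, 0.2],
              [0.1, 0.2, 0.6, 0.1],
              [0.1, 0.2, 0.1, 0.6]])
B = np.array([[0.7, 0.1, 0.1, 0.1],          # emission probs (invented)
              [0.3, 0.2, 0.4, 0.1],
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.6, 0.2]])

def viterbi(obs, pi, A, B):
    """Most likely latent-state path for an observed move sequence."""
    T, S = len(obs), len(pi)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)   # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(np.argmax(logd))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# question, explanation, explanation, attempt, attempt, confirmation
seq = [0, 1, 1, 2, 2, 3]
```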
Latest writing
- Why is my favorite LLM getting better? I evaluate language models for a living. To sharpen my intuitions about what my evaluation methods actually measure, I built a transformer small enough to see through: three-digit addition, 17,000 parameters. Three observations from the build.
News
- Apr 2026: Measuring Simulation Fidelity via Statistical Detectability (with Kevyn Collins-Thompson) accepted to ACM Learning@Scale 2026 (Seoul, July 2026) — a propensity-score MSE diagnostic for whether LLM-generated tutoring conversations match the distribution of real ones, showing that surface and cognitive fidelity are coupled.
- Apr 2026: From Moves to Pathways (with Michael Light and Kevyn Collins-Thompson) accepted to AIED 2026 (Late-Breaking Results poster track) — a hierarchical Bayesian HMM over 2,437 math tutoring conversations recovers four latent pedagogical states unsupervised; resolved and unresolved sessions differ less in what tutors say than in whether students actively attempt inference.
- Mar 2026: Talk at Infinicon (San Luis Obispo): a walkthrough of getting LLMs to return structured outputs — typical prompting approaches for extracting specific fields and categories, then fine-tuning an open-weight model on a free Colab instance and comparing the two. [notebook]
- Oct 2025: Preprint Chip-Firing and the Sandpile Group of the R10 Matroid (with Alex McDonough) posted on arXiv and submitted to Galois Journal of Algebra — a description of chip-firing on R10 using complex numbers, with representatives for the 162 equivalence classes of its sandpile group.
- Sep 2025: Started as Lecturer in the Statistics Department at California Polytechnic State University, San Luis Obispo.
- Aug 2025: Talk at JSM 2025 (Nashville): Bayesian Hierarchical Modeling of Large-Scale Math Tutoring Dialogues — a Bayesian framework for analyzing cognitive load in math tutoring, applied to MathMentorDB's 5.4M messages across 200K+ conversations.
- Feb 2025: Awarded $12,435 from the Academic Innovation Fund to develop AI-powered technical-interview practice tools for data science students.
- Jan 2025: Talk at JMM 2025 (Seattle): Text-as-Data in Mathematics Education: Harnessing LLMs to Analyze Student Conversations at Scale.
- Sep 2024: Started as a postdoctoral fellow at the University of Michigan School of Information.
- Mar 2024: Defended my PhD at the University of Michigan — Beyond the Classroom: Exploring Mathematics Engagement in Online Communities with Natural Language Processing.