Mike Ion

Postdoctoral Research Fellow · University of Michigan School of Information

I build statistical methods for human interaction data at scale: the infrastructure that organizes it, the features that make it analyzable, and the evaluation frameworks that tell us whether the AI systems built on top are doing what we asked them to.

Current research

  1. Evaluation of LLM output distributions. Per-prompt similarity metrics (cosine, BERTScore) can systematically misrank instruction-tuned models: instruction tuning shifts the center of the output distribution while preserving its spread, and cosine tracks the spread but misses the shift. This has direct implications for training-data filtering pipelines.
  2. Temporal structure in multi-party conversation. Lightweight single-pass features (burst coefficient, cluster density, response acceleration, interaction momentum, timing consistency) recover six conversational archetypes on MathMentorDB, whose resolution rates range from above 90% to under 30%. An earlier Bayesian hierarchical Hawkes analysis was presented at JSM 2025.
  3. Discourse classification with LLMs. A 32-move pedagogical taxonomy applied by three production LLMs to thousands of MathMentorDB messages. Inter-model agreement is high for coarse categories but diverges on fine-grained moves; discourse-move sequences predict whether conversations resolve or are abandoned.
  4. Conversational data infrastructure. MathMentorDB: 200,332 structured math-tutoring conversations reconstructed from 5.5M raw Discord messages via a disentanglement pipeline combining message timing, participant co-occurrence, and topic coherence. Submitted to LREC-COLING 2026 for public release.
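
As a concrete example of the lightweight timing features in item 2: a burst coefficient can be computed in one pass over message timestamps. The sketch below uses the common Goh–Barabási form, B = (σ − μ)/(σ + μ) over inter-message gaps, which may differ in detail from the features actually used in the paper:

```python
import numpy as np

def burst_coefficient(timestamps):
    """Burstiness of a message stream from inter-message gaps.
    -1 = perfectly regular, ~0 = Poisson-like, -> +1 = highly bursty.
    (Goh-Barabasi definition; an illustrative stand-in, not the paper's exact feature.)"""
    gaps = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    mu, sigma = gaps.mean(), gaps.std()
    return (sigma - mu) / (sigma + mu)

regular = burst_coefficient([0, 60, 120, 180, 240])    # evenly spaced messages
bursty = burst_coefficient([0, 2, 4, 600, 602, 1200])  # clustered bursts of replies
```

Evenly spaced messages give B = −1 exactly (zero gap variance), while clustered reply bursts separated by long silences push B above zero.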

Selected papers

in preparation
The Evaluation Blind Spot: Per-Prompt Similarity Can Systematically Misrank Instruction-Tuned Text Distributions

Across 8,290 paired human and model continuations, cosine similarity ranks instruction-tuned models as most human-like in 5 of 6 genres, while kernel MMD² ranks those same models as least human-like. A 67-feature linguistic decomposition explains the disagreement: instruction tuning shifts the distribution's center without changing its spread. Per-prompt similarity is insensitive to the shift; population-level distance responds to both.

Ion, M. & Godfrey, J. · 2026
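
The center-shift mechanism is easy to reproduce on synthetic embeddings. A minimal sketch, using toy Gaussian "continuations" that share a prompt component (not the paper's data, features, or bandwidth choices): per-prompt cosine barely moves under a mean shift, while a Gaussian-kernel MMD² separates the shifted set cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 50
prompt = rng.normal(size=(n, d))                        # shared prompt-driven content
human  = prompt + 0.2 * rng.normal(size=(n, d))
human2 = prompt + 0.2 * rng.normal(size=(n, d))         # a second human continuation
model  = prompt + 0.25 + 0.2 * rng.normal(size=(n, d))  # same spread, shifted center

def sqd(A, B):
    """Pairwise squared Euclidean distances."""
    return (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T

def mean_cosine(X, Y):
    """Per-prompt similarity: average cosine between paired rows."""
    num = (X * Y).sum(1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return float(np.mean(num / den))

def mmd2(X, Y, sigma2):
    """Population-level distance: Gaussian-kernel MMD^2 with self-pairs
    dropped (including the prompt-paired Kxy diagonal, since rows are coupled)."""
    Kxx = np.exp(-sqd(X, X) / (2 * sigma2))
    Kyy = np.exp(-sqd(Y, Y) / (2 * sigma2))
    Kxy = np.exp(-sqd(X, Y) / (2 * sigma2))
    m = len(X)
    off = lambda K: (K.sum() - np.trace(K)) / (m * (m - 1))
    return off(Kxx) + off(Kyy) - 2 * off(Kxy)

sigma2 = np.median(sqd(human, human))  # median-heuristic bandwidth

cos_hh, cos_hm = mean_cosine(human, human2), mean_cosine(human, model)
mmd_hh, mmd_hm = mmd2(human, human2, sigma2), mmd2(human, model, sigma2)
# cosine: the shifted model looks nearly as "human-like" as a second human draw;
# MMD^2: near zero for human-vs-human, clearly positive for human-vs-model.
```

This is the blind spot in miniature: the paired metric averages away a translation of the whole distribution that the two-sample distance is built to detect.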
ACM Learning@Scale 2026
Measuring Simulation Fidelity via Statistical Detectability: A Diagnostic Framework for AI-Generated Tutoring Conversations

A propensity-score MSE framework for assessing whether LLM-generated conversations match the distribution of real ones. L1-regularized classifiers on interpretable feature subsets reveal that surface and cognitive fidelity are coupled: improving one without attending to the other can worsen overall fidelity. The framework turns a scalar quality score into a diagnostic for iterative prompt refinement.

Ion, M. & Collins-Thompson, K. · 2026
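
The core of the detectability idea can be sketched in a few lines. Assuming a Brier-style MSE of cross-validated propensity scores (an illustrative reading of the framework, with made-up feature vectors): an MSE near 0.25 means the classifier is stuck at p = 0.5 and the generated data is indistinguishable; lower values mean it is detectable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def propensity_mse(real_feats, gen_feats, C=1.0):
    """MSE of cross-validated real-vs-generated propensity scores.
    ~0.25 -> indistinguishable (classifier predicts p=0.5 everywhere);
    lower -> generated conversations are statistically detectable."""
    X = np.vstack([real_feats, gen_feats])
    y = np.concatenate([np.zeros(len(real_feats)), np.ones(len(gen_feats))])
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    p = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return float(np.mean((p - y) ** 2))

rng = np.random.default_rng(0)
real    = rng.normal(size=(400, 6))        # stand-in interpretable features
matched = rng.normal(size=(400, 6))        # generator matches the real distribution
shifted = rng.normal(size=(400, 6)) + 1.0  # generator is off on every feature
```

Run on different interpretable feature subsets (say, surface vs. cognitive), this turns one scalar into a per-facet diagnostic, which is the move the paper makes.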
AIED 2026 · Late-Breaking Results poster
From Moves to Pathways: Characterizing Pedagogical Discourse Dynamics in Online Tutoring with Bayesian Generative Modeling

A hierarchical Bayesian Hidden Markov Model fit to 2,437 math tutoring conversations (51k+ messages) from MathMentorDB, using LLM-classified discourse moves. Four latent pedagogical states emerge unsupervised (Problem Introduction, Exploration, Lecturing, Working). Resolved and unresolved sessions differ less in what tutors say than in whether students are actively attempting inference — replicating impasse-driven learning findings at two orders of magnitude greater scale than prior manually-annotated studies.

Light, M., Ion, M. & Collins-Thompson, K. · 2026
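
The hierarchical Bayesian machinery above rests on the standard HMM forward recursion for sequence likelihoods. A minimal categorical-emission sketch with toy parameters (four states and a 32-symbol move vocabulary, echoing the paper's setup but not its fitted model):

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the scaled forward algorithm.
    pi: (S,) initial state probs; A: (S, S) transitions; B: (S, V) emissions."""
    alpha = pi * B[:, obs[0]]        # joint prob of state and first observation
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()             # rescale to avoid underflow on long chats
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

S, V = 4, 32                         # e.g. 4 latent states over 32 discourse moves
pi = np.full(S, 1 / S)
A = np.full((S, S), 1 / S)           # toy uniform dynamics, for illustration only
B = np.full((S, V), 1 / V)
ll = forward_loglik([3, 17, 5], pi, A, B)
```

With uniform toy parameters every length-T move sequence has likelihood (1/V)^T, which makes the recursion easy to sanity-check before fitting real dynamics.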

Full publication list →

Latest writing

All posts →

News