

Decoding the Black Box of LLM Mathematical Reasoning: A Deep Dive into the ThinkARM Framework

What is the fundamental problem with evaluating AI reasoning today? We obsess over final accuracy and token counts while remaining blind to the internal cognitive structure that separates effective thinking from mere text generation. The ThinkARM framework reveals that the difference between reasoning and non-reasoning models is not how much they write, but how they structure their thinking into distinct functional episodes.

As reasoning models like o1 and DeepSeek-R1 dominate the headlines, we face a paradox: we’ve never had more visibility into AI thought processes, yet we still struggle to answer basic questions about what makes reasoning work. When a model generates 9,000 tokens to solve a math problem, which parts represent genuine insight versus wasted computation? ThinkARM provides the Rosetta Stone for this problem, translating raw text into a structured cognitive map based on decades of human problem-solving research.

Why Token-Level Analysis Fails to Capture Reasoning Quality

Why can’t we judge reasoning by length or accuracy alone? Because two models can reach the same correct answer through radically different cognitive paths—one via systematic analysis and verification, another through aimless trial-and-error—and token counts treat both as equivalent.

Traditional evaluation metrics operate at the wrong level of abstraction. Accuracy tells you the outcome but not the journey. Token count measures output volume but not organizational quality. This is like judging a surgeon’s skill by operation duration alone—you miss the critical difference between efficient precision and chaotic fumbling.

Consider two models solving a geometry problem. Model A spends 2,000 tokens in a tight loop: analyze the diagram, explore one approach, verify it fails, return to analysis, then implement the correct solution. Model B burns through 5,000 tokens brute-forcing dozens of calculations without ever reflecting on why they might work. Current metrics would penalize Model A for being “less efficient” while rewarding Model B for thoroughness, missing that Model A demonstrates genuine metacognitive control.

This opacity becomes dangerous when we deploy these systems in high-stakes domains. A medical diagnosis assistant might generate plausible reasoning but lack the verification steps that catch subtle contradictions. A legal analysis tool could produce lengthy arguments without the exploratory phase that considers alternative interpretations. ThinkARM gives us the vocabulary to distinguish these cases systematically.

The ThinkARM Framework: Applying 40 Years of Cognitive Science to AI

How can we transform a stream of text tokens into a meaningful description of reasoning stages? By adopting Schoenfeld’s Episode Theory—a framework validated through hundreds of hours of human think-aloud protocols—and extending it with two AI-specific categories, creating an eight-episode taxonomy that captures everything from initial reading to final answer submission.

The foundation of ThinkARM is Alan Schoenfeld’s 1985 Episode Theory, developed by analyzing videotapes of students and mathematicians solving problems. Unlike simplistic stage models, Schoenfeld discovered that successful problem-solving is defined by dynamic regulation—knowing when to shift between different cognitive modes. The theory identifies six core episodes:

  • Read: Extracting and restating given information without inference
  • Analyze: Constructing representations, recalling theories, making logical deductions
  • Plan: Announcing solution strategies and next steps
  • Implement: Executing calculations, substitutions, and concrete operations
  • Explore: Generating tentative ideas, making guesses, exploring alternatives with uncertainty
  • Verify: Checking correctness, evaluating results, confirming validity

For AI analysis, ThinkARM adds two crucial categories:

  • Monitor: Short meta-cognitive interjections like “Wait, let me think” that signal self-awareness
  • Answer: Explicit answer statements that mark convergence

Application scenario: When analyzing a model’s response to “Find all integer solutions to x² + y² = 25,” the framework would tag “The question asks for integer pairs where the sum of squares equals 25” as Read, “This is a circle with radius 5, so we need lattice points on the circle” as Analyze, “Maybe we can test x values from -5 to 5” as Explore, “Testing x=3 gives y²=16, so y=±4” as Implement, and “Let me verify we found all possibilities” as Verify.
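
In data terms, such a trace becomes a simple sequence of tagged sentences. The snippet below is an illustrative Python rendering of that idea, reusing the key names the framework writes to its label files (described under "Interpreting Results"); it is not output from the pipeline itself.

# Illustrative sentence-level annotation for the x² + y² = 25 example above;
# key names match the framework's label files, the content is hand-written here.
annotated_trace = [
    {"sentence": "The question asks for integer pairs where the sum of squares equals 25.",
     "sentence-category": "Read"},
    {"sentence": "This is a circle with radius 5, so we need lattice points on the circle.",
     "sentence-category": "Analyze"},
    {"sentence": "Maybe we can test x values from -5 to 5.",
     "sentence-category": "Explore"},
    {"sentence": "Testing x=3 gives y²=16, so y=±4.",
     "sentence-category": "Implement"},
    {"sentence": "Let me verify we found all possibilities.",
     "sentence-category": "Verify"},
]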

The genius of this approach is its sentence-level granularity. By tagging each sentence individually rather than whole paragraphs, ThinkARM enables large-scale statistical aggregation while preserving the temporal sequence needed for transition analysis. This design makes it feasible to analyze 410,991 sentences from 15 models solving 100 math problems—a dataset that would be impossible to annotate manually.

The Automated Annotation Pipeline: From Gold Standard to GPT-5 at Scale

How can we possibly annotate hundreds of thousands of sentences without spending months and fortunes? By building a high-quality human-annotated gold standard of 7,067 sentences, evaluating multiple state-of-the-art models as automated annotators, and selecting GPT-5 for its highest agreement with human experts—achieving 86.33% accuracy and 82.85% Kappa.

The path to scalable annotation follows a rigorous validation process. The team first manually annotated 7,067 sentences across 99 representative problems, creating a gold standard that embodies the refined episode definitions. This wasn’t simple labeling—annotators had to consider each sentence’s function within the full reasoning context. A phrase like “Let me try a different approach” could be Explore if it’s abandoning a failed path, or Plan if it’s committing to a new strategy.

With this gold standard, they evaluated four candidate annotators: GPT-4.1, GPT-5, Gemini-2.5-Flash, and Gemini-2.5-Pro. GPT-5 emerged as the clear winner, particularly for reasoning model traces where its Kappa reached 82.54%. This finding itself is fascinating—it means one AI can learn to understand another’s cognitive structure with high reliability.

The annotation process uses a carefully engineered prompt that provides:

  • The complete annotation guidebook with definitions and examples
  • The original math problem
  • All previously annotated context
  • Batch of sentences to label
  • JSON output format specification

For each sentence, GPT-5 must provide not just the episode tag but a justification. This requirement for explainability dramatically improves reliability—you can audit why a sentence was classified as Explore versus Analyze.
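
For intuition, here is a minimal sketch of what one batched annotation call might look like, assuming the standard OpenAI Python client; the prompt wording, batching, and response parsing are simplified stand-ins for what method.label actually does.

# Simplified sketch of one batched annotation call (assumption: OpenAI Python client;
# the real prompt construction and parsing live in method.label).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EPISODES = "Read, Analyze, Plan, Implement, Explore, Verify, Monitor, Answer"

def annotate_batch(guidebook, problem, prior_annotations, sentences):
    """Label one batch of sentences, given the guidebook and prior context."""
    prompt = (
        guidebook
        + "\n\nProblem:\n" + problem
        + "\n\nPreviously annotated sentences:\n" + json.dumps(prior_annotations, indent=2)
        + "\n\nLabel each sentence below with one episode (" + EPISODES + ") and a short "
        + "justification. Return a JSON list of objects with keys 'sentence', "
        + "'sentence-category', and 'sentence-category-reason'.\n\nSentences:\n"
        + json.dumps(sentences, indent=2)
    )
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)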

Operational example: To annotate DeepSeek-R1’s responses, you would run:

export OPENAI_API_KEY="your-key"
python -m method.label --annotate_model gpt-5 --response_model deepseek-r1

The system automatically:

  1. Loads data/raw/deepseek-r1.json containing the model’s 100 responses
  2. Splits responses into sentence sequences
  3. Processes in batches of 20-50 sentences with full context
  4. Writes results to data/label/deepseek-r1/ with filenames matching problem IDs
  5. Generates a quality report comparing automatic labels to gold standard

For 15 models, this completes in 6-8 hours at a cost of $200-300—a fraction of human annotation expenses.

The Three-Phase Heartbeat: A Universal Signature of Machine Reasoning

Do all successful reasoning models share a universal temporal structure? Yes. ThinkARM analysis reveals a strikingly consistent “heartbeat” pattern across diverse models: an initialization phase dominated by Read, Analyze, and Plan; an execution phase where Implement peaks; and a convergence phase marked by surging Verify and Monitor activity.

When the team normalized each reasoning trace to a 0-100% progress scale and computed episode frequencies over time, they discovered what they call the “cognitive heartbeat of machine reasoning.” This pattern appears in DeepSeek-R1, QwQ-32B, Phi-4, and others, regardless of architecture or size.
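
The normalization idea is straightforward to reproduce. The sketch below assumes you have a list of per-sentence episode labels for each trace; the bin count is an illustrative choice rather than the paper's exact setting.

# Sketch: normalize a trace to 0-100% progress and bin episode frequencies
# (bin count is illustrative, not the paper's exact procedure).
from collections import Counter

EPISODES = ["Read", "Analyze", "Plan", "Implement", "Explore", "Verify", "Monitor", "Answer"]

def temporal_profile(labels, n_bins=20):
    """labels: one episode tag per sentence, in order. Returns per-bin episode frequencies."""
    profile = []
    for b in range(n_bins):
        lo = b * len(labels) // n_bins
        hi = (b + 1) * len(labels) // n_bins
        window = labels[lo:hi]
        counts = Counter(window)
        total = max(1, len(window))
        profile.append({ep: counts[ep] / total for ep in EPISODES})
    return profile

# Plotting each episode's frequency across bins should reveal the heartbeat:
# Implement rising toward mid-trace, Verify and Monitor climbing near the end.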

Initialization Phase (0-20% progress):

  • Read drops sharply after the first few sentences—models quickly process the problem statement
  • Analyze and Plan decay gradually, showing that strategic reasoning persists beyond initial planning
  • Explore is front-loaded, representing hypothesis search that narrows as execution proceeds

Execution Phase (20-80% progress):

  • Implement forms a characteristic bell curve, peaking in the middle. This sustained concrete operation provides the backbone of the solution
  • Verify and Monitor appear as frequent small loops attached to this backbone, creating micro-cycles of exploration and validation

Convergence Phase (80-100% progress):

  • Verify increases steadily, showing that evaluation intensifies toward the end
  • Monitor follows a U-shape—elevated at the beginning (initial uncertainty) and end (final checks)
  • Answer appears as a sharp step function in the final 5%, marking commitment

Application scenario: Comparing two models on the same problem, you might find that Model A’s Implement phase starts at 25% progress while Model B’s starts at 40%. This reveals Model A spent more time in exploration and planning, which often correlates with higher correctness on complex problems. A model that rushes to implementation is like a student who starts calculating before understanding the question.

The most compelling visualization is the temporal dynamics plot (Figure 3 in the original), which shows these curves as smooth, distinct traces that together form the heartbeat signature. This isn’t just descriptive—it’s diagnostic. Deviations from this pattern, like a flat Verify line or absent Monitor phase, immediately flag potential reasoning deficiencies.

Reasoning vs Non-Reasoning: The Structural Chasm Beneath Token Counts

What fundamentally separates reasoning models from standard instruction-following models? It’s not response length but cognitive architecture. Non-reasoning models allocate over 85% of tokens to Implement, creating a feed-forward pipeline, while reasoning models distribute effort across Analyze, Explore, and Verify, forming iterative loops that interleave exploration with evaluation.

The token allocation data tells a stark story (Table 2). Standard models like GPT-4o and Gemini-2.0-Flash dedicate 89-98% of their output to Implement, with minimal Explore (<3%) and negligible Verify (<5%). Their reasoning traces are monotonic marches from reading to calculation to answer.

In contrast, open-source reasoning models show balanced distributions:

  • DeepSeek-R1: Implement 36%, Analyze 31%, Explore 9%, Verify 10%
  • QwQ-32B: Implement 36%, Analyze 25%, Explore 14%, Verify 11%
  • Even distilled versions like R1-Distill-Qwen-1.5B preserve this profile closely

Application scenario: When deploying a model for automated theorem proving, a non-reasoning model might quickly generate a “proof” that looks correct but lacks the verification steps that catch subtle flaws. The reasoning model’s 10% Verify allocation means it’s explicitly checking its work, similar to a mathematician who re-reads each line for logical gaps.

The proprietary reasoning models (o1-mini, o3-mini) present an interesting case. When only their final answers are observable, their episode distribution looks much more like non-reasoning models—Implement-heavy with little Explore or Verify. This suggests that the reasoning process happens in hidden “thought tokens” we can’t see, and the visible output is just the distilled conclusion. For open-source models where we can access full traces, we see the complete cognitive structure.

Author’s reflection: What struck me most here is the preservation of structure through distillation. The fact that a 1.5B parameter student model can mirror its 671B teacher’s episode distribution suggests that reasoning is transferable as a “cognitive style” independent of scale. This challenges the brute-force scaling orthodoxy and hints that smaller, more efficient reasoning models are achievable not just through model compression but through transferring reasoning patterns. The implication for resource-constrained deployments is profound: we don’t need massive models if we can teach smaller ones to think in the right structured way.

Exploration: The Critical Branching Point Between Success and Failure

Which episode most strongly predicts solution correctness? Exploration serves as the critical branching point. Correct solutions consistently route exploration into monitoring or re-analysis, while incorrect solutions let exploration continue aimlessly or terminate prematurely into execution.

The correctness-oriented case study provides perhaps the most actionable insight. By extracting 73 features from 500 reasoning traces—including episode ratios and transition matrices—and training a Lasso-regularized logistic regression, the team identified which behaviors correlate with success.
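
A minimal sketch of that feature extraction is shown below, assuming per-sentence episode labels; the feature naming is illustrative, and the paper's full 73-feature set also includes global statistics such as token counts.

# Sketch: turn a labeled trace into 8 episode-ratio features and 64 transition features.
from collections import Counter

EPISODES = ["Read", "Analyze", "Plan", "Implement", "Explore", "Verify", "Monitor", "Answer"]

def trace_features(labels):
    """labels: one episode tag per sentence, in order."""
    n = len(labels)
    tag_counts = Counter(labels)
    transition_counts = Counter(zip(labels, labels[1:]))
    features = {"ratio_" + ep: tag_counts[ep] / n for ep in EPISODES}
    features.update({
        "trans_{}->{}".format(a, b): transition_counts[(a, b)] / max(1, n - 1)
        for a in EPISODES for b in EPISODES
    })
    return features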

Positive predictors (increasing correctness):

  • Explore→Monitor (+0.41 coefficient): After exploring uncertain territory, successful models pause to reflect on their process
  • Explore→Analyze (+0.31): Exploration leads to renewed conceptual analysis, not just more calculations
  • Monitor→Analyze (+0.28): Self-monitoring triggers deeper analysis, showing metacognitive control

Negative predictors (decreasing correctness):

  • High Explore ratio (-0.54): Sustained exploration without resolution signals confusion
  • Explore→Verify (-0.45): Verifying before exploration has stabilized leads to false confidence
  • Implement→Read (-0.33): Being forced to re-read the problem mid-execution indicates breakdown

Application scenario: Imagine you’re building a tutoring system that watches students solve problems. When you detect a student has been in Explore mode for 10 sentences without a Monitor or Analyze transition, you could intervene: “It seems you’re trying several approaches. Let’s step back and examine what these attempts have in common.” The same applies to AI models—this framework gives us the diagnostic signals for targeted intervention.

The mechanism is clear: Exploration reflects uncertainty. The difference between expert and novice problem-solvers isn’t avoiding uncertainty but managing it. Experts use exploration to generate hypotheses, then immediately engage monitoring to evaluate those hypotheses and analysis to refine them. Novices either avoid exploration altogether (rushing to calculation) or get lost in it (trying random things without reflection).

This insight reframes “overthinking.” What we call overthinking is often undirected exploration without metacognitive control. But directed exploration paired with monitoring is the hallmark of sophisticated reasoning. ThinkARM gives us the vocabulary to distinguish them.

Practical Deployment: Running ThinkARM on Your Own Models

How can researchers deploy ThinkARM to analyze their custom models? The framework is designed as a turnkey solution: install dependencies, format your model outputs, run the annotation pipeline, and execute analysis modules—all with modular components that can be used independently or as a complete workflow.

Installation and Setup

The environment requirements are minimal, making it accessible for most research labs:

# Clone and install
git clone https://github.com/MingLiiii/ThinkARM
cd ThinkARM
pip install -r requirements.txt  # ~20 packages including openai, google-generativeai, pandas

# Configure API access
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="..."

The dependency list is intentionally lean: core data handling (pandas, numpy), visualization (matplotlib, seaborn), and API clients. No GPU requirements or complex deep learning frameworks needed.

Operational example: A university lab with limited compute resources can run the full analysis on a laptop. The heavy lifting is done by API calls to GPT-5, which means your local machine only handles data formatting and aggregation.

Data Formatting for Your Models

Your model outputs must follow a simple JSON structure:

{
  "problem_id": 42,
  "question": "Find the sum of all positive divisors of 48.",
  "response": "Let me start by factoring 48... [full reasoning trace]"
}

Place these files in data/raw/your_model_name.json. The framework expects one file per model containing all 100 problems, but you can adapt this for custom benchmarks.
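
A quick sanity check before annotation can save an expensive re-run. The sketch below assumes the file is a JSON array of records shaped like the example above; adapt it if your layout differs.

# Sketch: sanity-check data/raw/your_model_name.json before annotation
# (assumption: the file is a JSON array of response records).
import json

REQUIRED = {"problem_id", "question", "response"}

with open("data/raw/your_model_name.json") as f:
    records = json.load(f)

for i, rec in enumerate(records):
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError("Record {} is missing fields: {}".format(i, sorted(missing)))

print("OK: {} responses ready for annotation".format(len(records)))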

Application scenario: You’re evaluating a fine-tuned version of Llama-3 for mathematical reasoning. After generating responses to your benchmark, you format them as above. The clear separation of raw data, annotations, and results means you can re-run analyses without touching the original model outputs.

Executing the Full Pipeline

# 1. Automatic annotation (6-8 hours for 15 models)
python -m method.label --annotate_model gpt-5 --response_model your_model --batch_size 30

# 2. Correctness evaluation (if you have ground truth)
python -m analysis.correctness_eval --model your_model --evaluator_model gpt-4o

# 3. Temporal dynamics analysis
python -m analysis.temporal --model your_model --output_dir results/

# 4. Word cloud generation
python -m analysis.word_cloud --model your_model

# 5. Diagnostic analysis for correctness
python -m analysis.diagnostic --models your_model deepseek-r1 qwq-32b

Operational example: When analyzing a new efficiency method like “EarlyExit,” you would:

  1. Generate responses with and without the method
  2. Run both through annotation
  3. Use analysis.episode_ngram_discriminate to identify which transition patterns are lost
  4. Compare Verify and Analyze ratios to quantify the cognitive cost

The batch processing handles rate limits gracefully, and you can resume interrupted runs. Costs are predictable: about $0.50-1.00 per 1,000 sentences with GPT-5.

Interpreting Results

After annotation, your data/label/your_model/ directory contains JSON files with detailed annotations:

{
  "sentence": "Perhaps we should try modular arithmetic here.",
  "sentence-category": "Explore",
  "sentence-category-reason": "Uses tentative language 'perhaps' and 'try' indicating uncertainty and hypothesis generation without commitment."
}

The framework aggregates these into actionable reports. A typical finding might be: “Your model allocates only 3% to Verify versus 10% in top-performing models, suggesting insufficient self-checking.”
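
A minimal sketch of the kind of aggregation behind such a report, assuming each label file holds a JSON array of the per-sentence objects shown above:

# Sketch: aggregate per-sentence labels into episode ratios
# (assumes each file in data/label/<model>/ is a JSON array of annotation objects).
import json
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("data/label/your_model").glob("*.json"):
    for item in json.loads(path.read_text()):
        counts[item["sentence-category"]] += 1

total = sum(counts.values())
for episode, count in counts.most_common():
    print("{:10s} {:5.1f}%".format(episode, 100 * count / total))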

Author’s reflection: What I appreciate most about this implementation is its modularity. You don’t need to buy into the entire framework to get value. A researcher struggling with model overthinking could run just the temporal analysis to see if their model’s Explore phase decays too slowly. Another focused on safety could extract only Verify patterns. This “use what you need” design respects that research priorities vary, and it lowers the barrier to adoption.

Case Study: Using Episode Patterns to Predict Correctness and Optimize Efficiency

Can episode patterns predict outcomes before the final answer appears? Yes—a Lasso logistic regression model trained on episode features achieves 78% accuracy in mid-trajectory correctness prediction, with Explore→Monitor transitions providing a far stronger signal than raw token count or calculation volume.

Predicting Correctness from Behavior

The team created a binary classification task: given a partially completed reasoning trace, can we predict if the final answer will be correct? They engineered features across three categories:

  • Global statistics (total tokens, thinking ratio)
  • Episode intensities (8 token ratio features)
  • Transition features (64-dimensional matrix of episode-to-episode shifts)

The Lasso model selected the most predictive features automatically. Results showed that transition features dominated token counts in importance. The top positive predictors were all about how exploration gets resolved:

  • Explore→Monitor: +0.41
  • Explore→Analyze: +0.31
  • Monitor→Analyze: +0.28

Application scenario: In production systems, you could implement early stopping or dynamic compute allocation. When a model’s real-time trace shows no Explore→Monitor transitions after 50 sentences, you could trigger a “rethinking” prompt or allocate more test-time compute. This is far more sophisticated than naively limiting response length.

The negative predictors were equally revealing. A high Explore ratio (-0.54) was the strongest risk factor, but specific harmful patterns included Explore→Verify (-0.45), suggesting premature validation, and Implement→Read (-0.33), indicating mid-execution confusion.
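
For readers who want to reproduce this style of analysis, here is a minimal sketch of an L1-penalized ("Lasso") logistic regression over such features, assuming scikit-learn and feature dictionaries like those in the earlier extraction sketch; the hyperparameters are illustrative, not the paper's.

# Sketch: L1-regularized logistic regression over episode features (scikit-learn assumed).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def fit_correctness_model(feature_dicts, labels):
    """feature_dicts: one feature dict per trace; labels: 1 if the final answer was correct."""
    model = make_pipeline(
        DictVectorizer(sparse=False),
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5),  # C is illustrative
    )
    model.fit(feature_dicts, labels)
    return model

# The L1 penalty zeroes out uninformative features; the surviving transition
# coefficients (e.g. Explore->Monitor) are the behavioral signals discussed above.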

The Real Cost of Efficiency Methods

Efficiency techniques like L1 and ThinkPrune promise shorter reasoning chains, but ThinkARM reveals they achieve this by selectively deleting specific cognitive episodes.

Table: Episode Allocation Comparison

Model                           | Implement | Analyze | Explore | Verify
R1-Distill-Qwen-1.5B (baseline) | 34.39%    | 26.93%  | 13.23%  | 11.43%
L1-optimized                    | 34.03%    | 18.34%  | 15.14%  | 6.99%
ThinkPrune-optimized            | 39.39%    | 20.75%  | 10.74%  | 8.37%
arora2025training               | 36.90%    | 28.32%  | 8.76%   | 9.94%

The mutual information analysis of lost transition patterns is striking:

  • L1 suppresses complex verification loops like N-V-N (Analyze→Verify→Analyze) with MI scores up to 0.376, indicating nearly complete elimination of recursive reflection
  • ThinkPrune shows similar but weaker suppression
  • arora2025training preserves most transition topology, with peak MI scores only ~0.10
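
One way to quantify this kind of pattern suppression is to measure the mutual information between a binary "trace contains the pattern" indicator and a binary method label. The sketch below is an interpretation of that idea, not the repository's episode_ngram_discriminate implementation.

# Interpretive sketch; see analysis.episode_ngram_discriminate for the framework's own method.
from sklearn.metrics import mutual_info_score

def contains_pattern(labels, pattern=("Analyze", "Verify", "Analyze")):
    """True if the trace's episode sequence contains the given n-gram (here N-V-N)."""
    k = len(pattern)
    return any(tuple(labels[i:i + k]) == pattern for i in range(len(labels) - k + 1))

def pattern_mi(baseline_traces, optimized_traces):
    """MI between pattern presence and model group (baseline vs. efficiency-optimized)."""
    has_pattern = [contains_pattern(t) for t in baseline_traces + optimized_traces]
    group = [0] * len(baseline_traces) + [1] * len(optimized_traces)
    return mutual_info_score(has_pattern, group)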

Application scenario: If you’re building a financial analysis assistant where accuracy is paramount, L1’s 40% reduction in Verify tokens is unacceptable. But for a customer service chatbot handling simple queries, this trade-off might be reasonable. ThinkARM gives you the quantitative basis for these decisions instead of relying on gut feelings.

Author’s reflection: This efficiency analysis gave me pause about the industry’s rush toward shorter reasoning. We’re seeing a classic speed-quality tradeoff, but the cognitive mechanism is now clear: we’re not uniformly compressing—we’re amputating the model’s ability to doubt itself. The Verify episode is essentially the model’s inner critic. Removing it doesn’t make the model more efficient; it makes it more arrogant. I suspect future work will need to reintroduce verification in more compact forms, perhaps through single-token “pause-to-think” signals rather than explicit natural language checks.

Action Checklist: Implementing ThinkARM in Your Research

Prerequisites

  • [ ] Python 3.8+ environment
  • [ ] OpenAI API key with GPT-5 access (or GPT-4.1 for cost savings)
  • [ ] At least 100 model responses for statistical validity

Setup Phase (15 minutes)

  1. Clone repository: git clone https://github.com/MingLiiii/ThinkARM
  2. Install dependencies: pip install -r requirements.txt
  3. Set API keys: export OPENAI_API_KEY="sk-..."
  4. Verify installation: python -c "import method.label; print('Ready')"

Data Preparation (30 minutes)

  1. Format model outputs as JSON with problem_id, question, response fields
  2. Place in data/raw/your_model.json
  3. (Optional) Create subset files for domain-specific analysis
  4. Run quick sanity check on first 5 responses

Annotation Phase (45-60 minutes for 1 model)

  1. Execute: python -m method.label --annotate_model gpt-5 --response_model your_model
  2. Monitor progress and API usage
  3. Review quality report comparing to gold standard
  4. If Kappa < 0.75, consider manual review of ambiguous episodes

Analysis Phase (20 minutes)

  1. Run temporal analysis: python -m analysis.temporal --model your_model
  2. Generate transition patterns: python -m analysis.episode_ngram_preprocess
  3. Compare to baselines: python -m analysis.episode_ngram_discriminate your_model.json baseline.json
  4. Extract key metrics: Verify ratio, Explore→Monitor frequency, Implement dwell time

Interpretation Phase (15 minutes)

  1. Identify deviations from three-phase heartbeat pattern
  2. Check if Verify ratio falls below 5% (risk flag)
  3. Look for suppression of complex loops (N-V-N patterns)
  4. Correlate episode patterns with your custom correctness metrics

Optimization Loop

  1. Based on findings, adjust model training or decoding
  2. Re-run annotation on improved version
  3. Use diagnostic features (Table 4) to verify improvement in target areas
  4. Iterate until episode distribution aligns with high-performing models

Cost Management

  • Budget $250-300 for full 15-model analysis
  • Use GPT-4.1 instead of GPT-5 to halve costs with minimal quality loss
  • Batch size of 30-50 sentences optimizes API efficiency
  • Cache results aggressively; without caching, re-running the annotation costs as much as the first run

One-Page Overview

What is ThinkARM?
A scalable framework that translates LLM reasoning traces into structured cognitive episodes (Read, Analyze, Plan, Implement, Explore, Verify, Monitor, Answer) based on Schoenfeld’s validated problem-solving theory, enabling systematic analysis of reasoning quality beyond accuracy metrics.

Core Innovation
Sentence-level automated annotation using GPT-5 reaches 86.33% accuracy and a Cohen’s Kappa of 82.85% against human experts, making large-scale cognitive analysis practical for the first time.

Key Findings

  • Universal three-phase “heartbeat”: Analyze/Explore decay → Implement peaks → Verify/Monitor surge
  • Reasoning vs non-reasoning models differ in structure, not length: 36% vs 89% Implement allocation
  • Exploration is healthy uncertainty—success depends on routing it to Monitor/Analyze, not avoiding it
  • Efficiency methods like L1 destroy verification loops; smarter methods preserve cognitive topology

Practical Value

  • Diagnosis: Identify why models fail on specific problem types by analyzing episode transitions
  • Optimization: Use behavioral signatures to guide training and decode strategies
  • Evaluation: Predict correctness mid-trajectory using transition patterns (78% accuracy)
  • Efficiency: Quantify the cognitive cost of speed improvements to make informed trade-offs

How to Use

  1. Format model responses as JSON
  2. Run python -m method.label with GPT-5
  3. Execute analysis modules (temporal, diagnostic, n-gram)
  4. Interpret episode distributions and transition patterns against benchmarks
  5. Apply insights to improve model reasoning structure

Data Provided

  • 410,991 annotated sentences from 15 models
  • 7,067 sentence human-verified gold standard
  • Pre-built analysis pipelines for reproducible research

Frequently Asked Questions

Q1: How long does it take to analyze a new model with ThinkARM?
A: The automated annotation takes about 4-5 hours for a single model’s 100 responses, processing roughly 3,000-9,000 sentences. Analysis modules run in 10-20 minutes. If you’re analyzing multiple models, the annotation phase can be parallelized by running separate processes for each model, reducing total wall-clock time.

Q2: Can I use ThinkARM for non-mathematical reasoning tasks?
A: The framework is theoretically domain-agnostic since Schoenfeld’s theory applies to general problem-solving. However, the annotation guidebook and gold standard are math-specific. You would need to create a new gold standard of 500+ sentences in your target domain and verify that GPT-5 achieves Kappa > 0.75 on that domain before trusting the automatic annotations.

Q3: What’s the cost difference between using GPT-5 vs GPT-4.1 for annotation?
A: GPT-5 achieves 86.33% accuracy vs GPT-4.1’s 86.10%—a marginal difference. However, GPT-5 shows slightly better agreement on reasoning model traces (82.54% vs 82.39% Kappa). In practice, GPT-4.1 costs about 40% less per token and is perfectly adequate for most analyses. Reserve GPT-5 for final validation or when working with novel model architectures.

Q4: How do I know if my model’s episode distribution is “good”?
A: Compare against the high-performing open-source reasoning models: Implement should be 30-40%, Analyze 25-35%, Explore 8-15%, Verify 8-12%. More importantly, check for the three-phase heartbeat pattern in temporal analysis. If your Verify is <5% or you lack clear Explore→Monitor transitions, your model likely lacks robust self-checking mechanisms.
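
If you want to automate that check, a small helper like the one below works; the reference ranges are simply the figures quoted in the answer above.

# Sketch: flag episode ratios outside the reference ranges quoted above (illustrative thresholds).
HEALTHY_RANGES = {
    "Implement": (0.30, 0.40),
    "Analyze": (0.25, 0.35),
    "Explore": (0.08, 0.15),
    "Verify": (0.08, 0.12),
}

def episode_health(ratios):
    """ratios: {episode: fraction of tokens}. Returns a list of warning strings."""
    warnings = []
    for episode, (lo, hi) in HEALTHY_RANGES.items():
        r = ratios.get(episode, 0.0)
        if not lo <= r <= hi:
            warnings.append("{} at {:.0%} is outside the {:.0%}-{:.0%} reference range".format(
                episode, r, lo, hi))
    return warnings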

Q5: Can episode analysis detect specific failure modes like “hallucinated theorems”?
A: Yes. Hallucinations often appear as Analyze sentences that make confident but false claims, followed by Implement sentences that build on this faulty foundation. The episode pattern would show Analyze→Implement transitions without intervening Verify or Monitor. Comparing your model’s Analyze→Verify ratio to baselines can quantify this risk.

Q6: Is the 78% prediction accuracy for correctness sufficient for practical applications?
A: It depends on use case. For early stopping in low-stakes scenarios (homework help), 78% is valuable—you can abort likely-wrong trajectories early. For safety-critical applications (medical diagnosis), you’d want >95% and would combine episode features with other signals. The framework’s value is more in diagnostics than real-time control.

Q7: How does ThinkARM handle models with different tokenizers or languages?
A: The framework uses the GPT-4 tokenizer (‘cl100k_base’) for token counting consistency. For multilingual analysis, you should verify that sentence segmentation works correctly for your language. The episode definitions are conceptual and should translate, but the annotation model’s understanding of linguistic cues may vary. Always validate against a small human-annotated sample first.
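
For reference, counting tokens with that encoding takes two lines, assuming the tiktoken package is installed:

# Token counting with the cl100k_base encoding used by the framework (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
num_tokens = len(enc.encode("Let me verify we found all possibilities."))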

Q8: Can I extend the episode taxonomy beyond the eight categories?
A: The framework is extensible, but proceed with caution. Adding “Debug” or “Backtrack” episodes might capture AI-specific behaviors. However, the automatic annotator was trained on the eight-category taxonomy; expanding it requires re-annotating the gold standard and re-validating GPT-5’s performance. For most research questions, the existing taxonomy plus transition analysis is sufficient.

Q9: What’s the minimum dataset size needed to get reliable episode statistics?
A: For stable episode ratio estimates, you need at least 50-100 diverse problems. For transition patterns, 1,000+ sentences is recommended. The full 100-problem, 15-model study provides 410,991 sentences, which yields highly reliable statistics. If you’re analyzing a specialized domain, aim for 10+ problems per domain category.

Q10: How do efficiency methods like pruning affect episode distributions?
A: ThinkARM analysis shows that aggressive pruning (L1, ThinkPrune) disproportionately targets Verify and Analyze episodes, destroying the metacognitive loops essential for reliability. This explains why compressed models often show surprising failures—they’ve lost their inner critic. When evaluating efficiency methods, always check if they preserve Verify→Monitor and Explore→Analyze transitions.
