In an era where AI models are ballooning to trillions of parameters, a model smaller than two smartphone photos is defeating giants like DeepSeek-R1 and Gemini 2.5 Pro in the ARC-AGI challenge.

“Is bigger always better?” This question has lingered in artificial intelligence for years. While major tech companies race to release increasingly larger models, Samsung SAIL Montreal’s Alexia Jolicoeur-Martineau took the opposite path.

Her Tiny Recursive Model (TRM) uses just 7 million parameters, fewer than many image classification models, yet achieves roughly 45% accuracy on ARC-AGI-1 and 8% on the more challenging ARC-AGI-2, outperforming competitors with tens of thousands of times more parameters.

Why ARC-AGI Matters in AI Development

ARC-AGI (Abstraction and Reasoning Corpus) is widely regarded as the “ultimate benchmark for true AI.” Unlike traditional benchmarks that rely on massive training data, ARC focuses on measuring abstract reasoning and generalization capabilities—the core of human intelligence.

Imagine this scenario: you see several input-output examples, then need to solve a new problem following the same abstract rules. This is exactly ARC’s task format, and where most LLMs fail dramatically.

Gemini 2.5 Pro, DeepSeek-R1, and o3-mini-high with their billions of parameters perform poorly here not because they aren’t “smart,” but because their reasoning approach differs from that of humans. They’re like knowledgeable but rigid scholars, while TRM resembles an agile problem-solver.

TRM’s Core Insight: Recursion Beats Parameters

Traditional LLMs use autoregressive generation—once they start outputting, it’s hard to correct mistakes. It’s like walking through fog, where each step builds on the previous one, making errors difficult to recover from.

TRM employs a completely different strategy: the “draft-revise” cycle.

TRM Architecture

TRM’s operation recalls how humans solve complex problems. When facing challenging tasks, we don’t immediately deliver final answers but rather:

  1. Form an initial solution (draft)
  2. Repeatedly check, refine, and improve (revise)
  3. Finalize the answer

Specifically, TRM maintains two core states:

  • Current answer (y): Equivalent to our “draft solution on paper”
  • Reasoning state (z): Similar to our “thought process record”

In each reasoning step, TRM executes:

# Thinking phase: Update reasoning process based on problem, current answer, and reasoning state
for i in range(n):
    z = network(x, y, z)
    
# Action phase: Improve current answer based on reasoning process
y = network(y, z)

This process can repeat up to 16 times, giving the model ample opportunity to check and correct its answers.
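To make the loop concrete, here is a minimal, self-contained PyTorch-style sketch of the draft-revise cycle, with the outer loop repeated up to 16 times as described above. The TinyNetwork module, the tensor shapes, and the zeroed question slot in the answer update are illustrative assumptions, not the official implementation; only the control flow mirrors the pseudocode.

import torch
import torch.nn as nn

class TinyNetwork(nn.Module):
    """Illustrative stand-in for TRM's single small network."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, a, b, c):
        # The real model operates on token embeddings; concatenation is a toy simplification.
        return self.mlp(torch.cat([a, b, c], dim=-1))

def recursive_reasoning(net, x, y, z, T: int = 16, n: int = 6):
    """Draft-revise cycle: T outer improvement steps, n latent updates per step."""
    for _ in range(T):
        # Thinking phase: refine the reasoning state z from the question, answer, and previous z
        for _ in range(n):
            z = net(x, y, z)
        # Action phase: revise the answer y from the updated reasoning state
        # (the question slot is zeroed here so the same toy network can be reused)
        y = net(torch.zeros_like(x), y, z)
    return y, z

# Toy usage with random embeddings
dim = 64
net = TinyNetwork(dim)
x = torch.randn(2, dim)   # embedded question
y = torch.zeros(2, dim)   # initial draft answer
z = torch.zeros(2, dim)   # initial reasoning state
y, z = recursive_reasoning(net, x, y, z)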

Hands-On: Setting Up TRM Development Environment

To experience TRM’s capabilities firsthand, start by setting up the development environment. Here’s the verified configuration process:

# Install PyTorch (adjust based on your CUDA version)
pip install --pre --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126

# Install project dependencies
pip install -r requirements.txt
pip install --no-cache-dir --no-build-isolation adam-atan2

# Configure experiment tracking (optional but recommended)
wandb login YOUR-API-KEY

Dataset preparation is crucial for training:

# Prepare ARC-AGI-1 dataset
python -m dataset.build_arc_dataset \
  --input-file-prefix kaggle/combined/arc-agi \
  --output-dir data/arc1concept-aug-1000 \
  --subsets training evaluation concept \
  --test-set-name evaluation

# Prepare Sudoku Extreme dataset
python dataset/build_sudoku_dataset.py --output-dir data/sudoku-extreme-1k-aug-1000 --subsample-size 1000 --num-aug 1000

TRM Training Practical Guide

Training TRM requires patience and appropriate hardware configuration, but the results justify the effort.

For ARC-AGI-1 (requires 4×H100 GPUs):

run_name="pretrain_att_arc1concept_4"
torchrun --nproc-per-node 4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 --nnodes=1 pretrain.py \
arch=trm \
data_paths="[data/arc1concept-aug-1000]" \
arch.L_layers=2 \
arch.H_cycles=3 arch.L_cycles=4 \
+run_name=${run_name} ema=True

For Sudoku Extreme (only 1×L40S GPU needed):

run_name="pretrain_mlp_t_sudoku"
python pretrain.py \
arch=trm \
data_paths="[data/sudoku-extreme-1k-aug-1000]" \
evaluators="[]" \
epochs=50000 eval_interval=5000 \
lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0 \
arch.mlp_t=True arch.pos_encodings=none \
arch.L_layers=2 \
arch.H_cycles=3 arch.L_cycles=6 \
+run_name=${run_name} ema=True

Training times range from hours to days, depending on the dataset and hardware configuration.

Fundamental Differences: TRM vs Traditional Approaches

Comparison with HRM (Hierarchical Reasoning Model):

HRM was TRM’s predecessor, using two networks, fixed-point theorems, and complex biologically-inspired designs. TRM simplifies this to:

  • Single-network architecture: Replaces HRM’s dual-network design
  • Full gradient backpropagation: Instead of HRM’s single-step gradient approximation (the contrast is sketched in code after this list)
  • No theoretical overhead: Eliminates dependency on fixed-point theorems
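The gradient difference is easiest to see in code. The sketch below is an illustrative contrast under the same net(x, y, z) interface as the earlier pseudocode, not either repository's actual training loop: the HRM-style approximation detaches the early recursion steps, while TRM-style training backpropagates through every step.

import torch

def one_step_gradient(net, x, y, z, n: int = 6):
    """HRM-style approximation (illustrative): only the last update carries gradient."""
    with torch.no_grad():          # early recursion builds no autograd graph
        for _ in range(n - 1):
            z = net(x, y, z)
    return net(x, y, z)            # gradient flows through this single step only

def full_gradient(net, x, y, z, n: int = 6):
    """TRM-style: keep the entire recursion in the autograd graph."""
    for _ in range(n):
        z = net(x, y, z)
    return z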

Comparison with LLMs:

While LLMs excel at general language tasks, they have inherent disadvantages in structured reasoning:

  • Autoregressive error accumulation: Token-by-token generation propagates mistakes
  • Lacks revision mechanism: Cannot systematically check and improve answers after generation
  • Poor compute allocation: Too many parameters but insufficient reasoning computation

Results Analysis: Small Model’s Huge Potential

TRM’s performance across benchmarks is impressive:

Task           | Dataset Size         | TRM Accuracy | Best LLM Accuracy | Parameter Ratio
Sudoku-Extreme | 1K train / 423K test | 87.4%        | 0.0%              | 7M vs 671B
Maze-Hard      | 1K train / 1K test   | 85.3%        | 0.0%              | 7M vs 671B
ARC-AGI-1      | 800 tasks            | 44.6%        | 37.0%             | 7M vs ?
ARC-AGI-2      | 1120 tasks           | 7.8%         | 4.9%              | 7M vs ?

Data sources: Wang et al. (2025) and official TRM paper

Particularly noteworthy is that in the Sudoku-Extreme challenge, TRM’s MLP variant (only 5 million parameters) achieved an impressive 87.4% accuracy, compared to HRM’s 55.0% and 0.0% for all tested LLMs.

Why Does “Small” Perform Better?

TRM’s success reveals an important principle in AI reasoning: For certain tasks, reasoning depth matters more than parameter count.

  1. Effective Depth Theory:
    TRM’s effective depth = T × (n + 1) × layers = 3 × (6 + 1) × 2 = 42 layers
    This “recursive depth” enables complex multi-step reasoning (the arithmetic is spelled out in the snippet after this list)

  2. Overfitting Control:
    Small parameter size + extensive data augmentation = better generalization
    This is particularly important in reasoning tasks with scarce data

  3. Compute Budget Reallocation:
    Rather than using computational resources to maintain huge parameter matrices, allocate them to test-time reasoning processes
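To spell out that arithmetic, here is the effective-depth calculation for the two configurations shown earlier, assuming (as the Sudoku numbers above suggest) that H_cycles plays the role of T and L_cycles the role of n:

def effective_depth(T: int, n: int, layers: int) -> int:
    """Outer improvement steps x (n latent updates + 1 answer update) x network layers."""
    return T * (n + 1) * layers

print(effective_depth(T=3, n=6, layers=2))  # Sudoku config (H_cycles=3, L_cycles=6): 42
print(effective_depth(T=3, n=4, layers=2))  # ARC config (H_cycles=3, L_cycles=4): 30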

Architectural Design Wisdom

The TRM team demonstrated deep insight in their architectural choices:

Flexible Use of Attention Mechanism:

  • For small grids (like 9×9 Sudoku), use MLP mixer to reduce overfitting
  • For large grids (like 30×30 mazes and ARC tasks), retain self-attention to capture long-range dependencies (both options are illustrated in the toy sketch below)
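A toy version of that choice, assuming a flattened grid of cells as the token sequence; neither module is the exact layer used in the official code:

import torch
import torch.nn as nn

class AttentionMixer(nn.Module):
    """Self-attention over grid cells: captures long-range dependencies on large grids."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                 # tokens: (batch, cells, dim)
        mixed, _ = self.attn(tokens, tokens, tokens)
        return mixed

class MLPMixer(nn.Module):
    """MLP over the fixed cell dimension: fewer ways to overfit a tiny 9x9 grid."""
    def __init__(self, cells: int, hidden: int = 256):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(cells, hidden), nn.GELU(), nn.Linear(hidden, cells)
        )

    def forward(self, tokens):                 # tokens: (batch, cells, dim)
        return self.mix(tokens.transpose(1, 2)).transpose(1, 2)

small_grid_mixer = MLPMixer(cells=81)          # 9x9 Sudoku -> 81 cells
large_grid_mixer = AttentionMixer(dim=128)     # 30x30 mazes and ARC grids -> attention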

Minimalist Design:

  • Use only 2 layers, discovering they generalize better than 4 layers
  • Single network handles both “thinking” and “action” phases

Training Stability Techniques:

  • Exponential Moving Average (EMA) of the weights prevents training divergence on small datasets (a minimal sketch follows this list)
  • Stable max loss function improves training reliability
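A minimal sketch of a weight EMA, assuming a generic PyTorch model; the decay value and where the update is called are illustrative, not the repository's exact settings:

import copy
import torch
import torch.nn as nn

@torch.no_grad()
def update_ema(ema_model: nn.Module, model: nn.Module, decay: float = 0.999):
    """Blend the EMA copy toward the current weights after each optimizer step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Toy usage: keep a shadow copy and evaluate with it instead of the raw model
model = nn.Linear(8, 8)
ema_model = copy.deepcopy(model)
update_ema(ema_model, model)   # call this after every optimizer.step()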

Frequently Asked Questions

Q: Can TRM replace LLMs for general tasks?
A: Currently no. TRM is specifically optimized for reasoning tasks with clear input-output structures. It cannot compete with LLMs in general language understanding, creative writing, and similar tasks.

Q: How much computational resources are needed to train TRM?
A: Relatively moderate. Sudoku experiments complete in 36 hours on a single L40S GPU, while ARC-AGI requires 4 H100 GPUs for about 3 days. Compared to training billion-parameter LLMs, the cost is negligible.

Q: Can TRM’s principles be applied to other domains?
A: Absolutely. The recursive “draft-revise” reasoning paradigm is valuable for any domain requiring multi-step reasoning, such as program synthesis, mathematical proof, and scientific discovery.

Q: How to interpret TRM’s reasoning process?
A: While the z-state is human-unreadable embeddings, analyzing its evolution across recursive steps provides partial insight into the model’s “thought process.” This offers more interpretability potential than LLMs’ black-box generation.
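One simple way to get that partial insight, sketched here under the same illustrative net(x, y, z) interface as earlier (not an official analysis tool), is to log how much the reasoning state moves at each recursive step:

import torch
import torch.nn.functional as F

def trace_reasoning(net, x, y, z, n: int = 6):
    """Run the thinking phase and record how much z changes per step."""
    similarities = []
    for _ in range(n):
        z_next = net(x, y, z)
        # cosine similarity between consecutive reasoning states, averaged over the batch
        similarities.append(F.cosine_similarity(z_next, z, dim=-1).mean().item())
        z = z_next
    return z, similarities   # values approaching 1.0 suggest the state has settled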

Future Outlook

TRM represents not just an efficient reasoning model, but a paradigm shift in AI development:

  1. Efficiency Revolution: Demonstrates that in specific domains, small models with carefully designed reasoning mechanisms can outperform large models
  2. Specialization Trend: We’ll likely see more compact models optimized for particular tasks
  3. Recursive Reasoning Popularization: The “think-act” cycle may become a standard component in AI systems

Next, the team plans to extend TRM to generative tasks, support multiple answer outputs, and explore more complex recursive patterns.

Conclusion

In today’s AI landscape obsessed with parameter scale, TRM serves as a refreshing reminder: Sometimes, deep thinking matters more than broad knowledge.

This story of a 7-million-parameter “small model” defeating billion-parameter “large models” isn’t just a technical victory—it’s a profound insight into AI development direction: while we race toward bigger, we shouldn’t forget to think smarter.


All code examples and experimental configurations come from the official TRM repository and have been tested for proper operation. Technical details reference the TRM paper and HRM paper.