Teaching One Model Two Ways: How Agentic-R1 Makes Math Both Fast and Accurate

A plain-language walk-through of the DualDistill framework, complete setup guide, and honest look at what still needs work.

A student switching between pen and laptop while solving equations

If you have ever stared at a page-long integral, you know the dilemma:

  • Work it out by hand and risk a careless mistake, or
  • Fire up Python, write a quick script, and hope the logic inside that script is sound.

Large language models face the same fork in the road. Some excel at long, careful reasoning in plain English. Others are wizards at calling a code interpreter to crunch numbers. Agentic-R1, released in July 2025 by Carnegie Mellon University, is the first open-weight model that refuses to choose sides. Instead, it learns when to think and when to type import numpy as np.

Below you will find:

  1. Why a single strategy is never enough
  2. How DualDistill trains one student from two very different teachers
  3. Exact commands to reproduce the model and evaluation pipeline
  4. Real examples of the model switching tactics mid-problem
  5. Current shortcomings and the roadmap ahead

No hype, no jargon, and no external facts beyond what the authors themselves have published.


1 The Two Extremes: Pure Reasoning vs. Pure Tools

| Approach | Strengths | Pain Points |
|---|---|---|
| Long-form chain-of-thought (long-CoT) | Great for abstract logic, proofs, and symbolic manipulation. | Slow; arithmetic errors creep in; can "over-think" and spiral. |
| Tool-augmented agents (code, calculator, APIs) | Accurate numeric results; can run algorithms or simulations. | Struggles with high-level insight; code can be syntactically correct yet logically wrong. |

The paper's experiments show that both skills are needed, often within the same evaluation suite. For example:

  • DeepMath-L problems require exact integer answers larger than 100 000.
  • Combinatorics300 demands counts that involve huge factorials.
  • Yet MATH500 and AIME still reward careful symbolic reasoning.

Instead of building yet another specialist, the CMU team asked: Can we teach one model to pick the right tool for the job?


2 DualDistill in One Paragraph

DualDistill is a two-stage, low-data recipe:

  1. Teacher distillation

    • Teacher A: OpenHands agent (Claude-3.5-Sonnet) — strong at writing code.
    • Teacher R: DeepSeek-R1 — strong at long, textual reasoning.
    • Both teachers solve the same 2 600 curated math problems.
    • Only the useful parts of their solution traces are stitched together with short transition sentences such as “The code seems off, let me try algebra instead.”
  2. Self-distillation

    • The student (a 7 B model) now tries the same problems on its own.
    • Trajectories that reach the correct answer are verified and kept; incorrect ones are replaced with teacher-corrected versions.
    • A second, lighter fine-tuning round locks in these corrections.

The final checkpoint is Agentic-R1-7B-SD (“SD” for self-distilled).
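To make the stitching in stage 1 concrete, here is a minimal sketch of the composition logic. The truncate_at_failure helper and the exact control flow are our illustrative assumptions; the repository's actual data scripts may differ.

# Minimal sketch of DualDistill-style trajectory composition (not repo code).
TRANSITION = "Wait, the code is not correct, let's try text reasoning."

def truncate_at_failure(trace: str, max_chars: int = 2000) -> str:
    """Hypothetical stand-in: keep only a prefix of the failed attempt."""
    return trace[:max_chars]

def compose(agent_trace: str, agent_ok: bool,
            reasoning_trace: str, reasoning_ok: bool) -> str | None:
    """Stitch the useful parts of two teacher traces into one training example."""
    if agent_ok:                      # tool-use teacher solved it outright
        return agent_trace
    if reasoning_ok:                  # keep the failed tool attempt as context,
        prefix = truncate_at_failure(agent_trace)
        return prefix + "\n" + TRANSITION + "\n" + reasoning_trace
    return None                       # neither teacher succeeded: discard

The point of keeping the failed prefix is that the student sees not just the winning strategy but the moment a switch became necessary.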


3 How Big Is the Gain?

| Benchmark | Needs Code? | DeepSeek-R1-Distill-7B (pure CoT) | Qwen2.5-7B-Instruct w/ tools | Agentic-R1-7B | Agentic-R1-7B-SD |
|---|---|---|---|---|---|
| DeepMath-L | Yes | 56.3 % | 34.7 % | 59.3 % | 65.3 % |
| Combinatorics300 | Yes | 44.5 % | 28.9 % | 49.4 % | 52.0 % |
| MATH500 | Mixed | 89.2 % | 70.2 % | 82.4 % | 93.3 % |
| AIME 2025 | Mixed | 40.7 % | 14.7 % | 40.7 % | 40.7 % |
| AMC | Rarely | 84.8 % | 51.1 % | 82.2 % | 85.8 % |

Figures taken from the paper’s Table 1 under the 32 768-token budget.

Notice two patterns:

  • On heavy-compute tasks, Agentic-R1 outperforms both single-strategy baselines.
  • On simpler tasks, the model gracefully falls back to concise reasoning, avoiding the extra cost of unnecessary code calls.

4 From Theory to Terminal: Reproducing the Pipeline

Everything below is copied verbatim from the official repository except for minor path clarifications.

4.1 One-Time Setup

# 1. Clone
git clone https://github.com/StigLidu/DualDistill.git
cd DualDistill

# 2. Environment (Python 3.11 recommended)
conda create -n dualdistill python=3.11
conda activate dualdistill

# 3. Dependencies
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

4.2 Download Base Model and Teacher Data

# Pull the 7 B base model
python script/data_script/model_download.py \
  --repo_id deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --local_dir ./models

# Fetch the cleaned teacher trajectories (~2.6 k)
python script/data_script/teacher_data_download.py

4.3 First Fine-Tune: Teacher Distillation

bash script/sft_script/SFT.sh

  • Hardware: 4×A6000 GPUs
  • Wall time: ≈12.7 h
  • Output: ./checkpoints/agentic-r1-7b

4.4 Second Fine-Tune: Self-Distillation

# 1. Spin up an inference server
bash script/eval_script/start_inference_server.sh \
  ./checkpoints/agentic-r1-7b agentic-r1 8080

# 2. Generate 16 roll-outs per problem (slow)
python sft/self_distillation_sampler.py \
  --server_url http://localhost:8080/v1 \
  --model_name agentic-r1 \
  --model_path ./checkpoints/agentic-r1-7b \
  --save_path ./traj_cache/

# 3. Package new training set
python script/data_script/extract_training_solution.py
python script/data_script/processing_self_distillation_traj.py

# 4. Short second fine-tune
bash script/sft_script/expert_iteration.sh \
  ./checkpoints/agentic-r1-7b ./traj_cache ./checkpoints/agentic-r1-7b-sd

4.5 Evaluation

# Start server (same as above)
bash script/eval_script/start_inference_server.sh \
  ./checkpoints/agentic-r1-7b-sd agentic-r1-sd 8080

# Run on any split
bash script/eval_script/eval_remote_server.sh \
  "http://localhost:8080/v1" "agentic-r1-sd" \
  "dataset/test/math.json" "true" "4096"

The script returns exact-match accuracy under the specified token budget.
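Once a server is running, you can also query the model directly. The /v1 suffix in the URLs above suggests an OpenAI-compatible endpoint, so the sketch below uses the openai Python client; the placeholder API key and the sample prompt are our assumptions, not something the repository documents.

# Minimal sketch: query the local inference server directly.
# Assumption: the server speaks the OpenAI-compatible API
# (suggested by the /v1 URLs above); "EMPTY" is a placeholder key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="agentic-r1-sd",   # must match the name used when starting the server
    messages=[{"role": "user",
               "content": "Evaluate 3**7 - 5 and explain briefly."}],
    max_tokens=4096,         # mirror the evaluation token budget
)
print(response.choices[0].message.content)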


5 Real Trajectory Snapshots

Example A: Code → Text

User prompt:
Evaluate the limit

\[
\lim_{n\to\infty}\sqrt[n]{n^{4n}+(4n)^n}\left[\left(2+\frac{1}{n^2}\right)^{18}-\left(4+\frac{4}{n^2}\right)^9\right]
\]

Model behavior (abridged):

  1. Tries numeric evaluation in Python.
  2. Hits floating-point noise.
  3. Switches: “Wait, the code is not correct, let’s try text reasoning.”
  4. Derives exact symbolic limit 589 824.
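For readers who want the algebra, here is a compact reconstruction of the symbolic argument (ours, not the model's verbatim trace). Factoring the dominant term out of the root gives

\[
\sqrt[n]{n^{4n}+(4n)^n}=n^{4}\Bigl(1+\bigl(\tfrac{4}{n^{3}}\bigr)^{n}\Bigr)^{1/n}=n^{4}\,\bigl(1+o(1)\bigr),
\]

and with a = (2 + 1/n²)² = 4 + 4/n² + 1/n⁴, the bracket becomes

\[
a^{9}-\Bigl(a-\tfrac{1}{n^{4}}\Bigr)^{9}=\frac{9a^{8}}{n^{4}}+O\!\left(\frac{1}{n^{8}}\right).
\]

Since a → 4, the product tends to 9 · 4⁸ = 589 824.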

Example B: Text → Code

User prompt:
Count the 26-tuples (k₁, …, k₂₆) where each kᵢ ∈ {0, 1, 3} and the sum is 15.

Model behavior (abridged):

  1. Attempts inclusion–exclusion on paper.
  2. Realizes enumeration is tedious: “Use text reasoning is too tedious, let’s try code reasoning.”
  3. Implements a 27×16 DP table in Python.
  4. Returns 853 423 740 in under 2 s.
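The appendix prints the model's generated script; the snippet below is our own minimal reconstruction of that dynamic program, built only from the problem statement.

# Count 26-tuples (k1, ..., k26) with each ki in {0, 1, 3} summing to 15.
# dp[s] = number of partial tuples whose coordinates so far sum to s.
TARGET, LENGTH, VALUES = 15, 26, (0, 1, 3)

dp = [0] * (TARGET + 1)
dp[0] = 1                                # empty tuple: one way to reach sum 0
for _ in range(LENGTH):                  # place one coordinate at a time
    new_dp = [0] * (TARGET + 1)
    for s, ways in enumerate(dp):
        if ways == 0:
            continue
        for v in VALUES:
            if s + v <= TARGET:
                new_dp[s + v] += ways
    dp = new_dp

print(dp[TARGET])                        # 853423740

Conceptually this fills the same 27×16 table the trace describes: 27 row-states (0–26 coordinates placed) by 16 partial sums (0–15).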

Both trajectories are included in the paper’s appendix and demonstrate the model’s learned policy:

  • Use tools when numbers explode.
  • Fall back to algebraic insight when the symbolic path is clearer.

6 Ablation: Does Composition Really Matter?

To verify that stitched trajectories outperform pure teacher traces, the authors trained a variant without composition. Results under the 32 k budget:

| Dataset | No Composition | With Composition |
|---|---|---|
| DeepMath-L | 40.0 % | 59.3 % |
| AIME | 34.0 % | 40.7 % |
| AMC | 50.8 % | 82.2 % |

Composition is consistently better, confirming that showing the student when a strategy switch occurs is more valuable than adding more single-strategy data.


7 Current Limitations (Straight from the Authors)

  1. Hand-crafted transition sentences
    Phrases such as “Wait, let’s switch to code” are manually written. They occasionally sound stilted and could be made smoother.

  2. Small curriculum size
    2 600 examples are enough for a model already pretrained on both code and text, but would be insufficient to teach a new modality from scratch.

  3. Domain scope
    All experiments are on math. Physics word problems, data analysis, or business logic remain untested.


8 How to Cite and Where to Find the Data

All artifacts are released under the MIT license.

BibTeX:

@article{du2025agentic,
  title={Agentic-R1: Distilled Dual-Strategy Reasoning},
  author={Du, Weihua and Aggarwal, Pranjal and Welleck, Sean and Yang, Yiming},
  journal={arXiv preprint arXiv:2507.05707},
  year={2025}
}

9 Take-away

Agentic-R1 does not try to outrun calculators or out-reason mathematicians. Instead, it learns a simple meta-skill: look at the problem, then pick the right tool for that problem. In doing so, a 7 B model punches above its weight class on tasks that traditionally required either massive compute or human intuition.

If you are building applications that need reliable numeric answers and symbolic insight, the open-source pipeline above is a practical place to start.