R-Zero: Teaching Large Language Models to Reason—Without Any Data
A step-by-step guide for practitioners who want a self-improving LLM that starts from nothing but a base checkpoint.
1. The Problem We All Share
Training a model to reason has always looked like this:
- Collect thousands of exam questions.
- Pay experts to write detailed, correct answers.
- Fine-tune the model on those answers.
- Hope the model generalises.
That pipeline is slow, expensive, and hard to scale. R-Zero removes steps 1–2 entirely. It shows how one base model can act as both teacher and student, producing its own curriculum and steadily getting better—no human labels required.
2. A 60-Second Overview of R-Zero
Imagine two copies of the same model: one plays the Challenger, which invents new questions, and the other plays the Solver, which tries to answer them.
They take turns (see the sketch after this list):
- Challenger round – frozen Solver, training Challenger.
- Solver round – frozen Challenger, training Solver.
- Repeat.
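A minimal, runnable sketch of that outer loop is below. The four callables are stand-ins for the repository's real training and generation steps, not its actual API; they only make the alternation explicit.

def r_zero_loop(base_model, train_challenger, generate_questions,
                filter_questions, train_solver, iterations=3):
    # Two roles, one starting checkpoint.
    challenger, solver = base_model, base_model
    for _ in range(iterations):
        # Challenger round: Solver frozen, Challenger rewarded for questions
        # the current Solver answers correctly about half of the time.
        challenger = train_challenger(challenger, frozen_solver=solver)
        # Build the next curriculum from the updated Challenger.
        candidates = generate_questions(challenger, n=8000)
        curriculum = filter_questions(candidates, solver)
        # Solver round: Challenger frozen, Solver trained on the curriculum.
        solver = train_solver(solver, curriculum)
    return solver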
After three full cycles the model outperforms the original baseline on both math and broad reasoning tasks.
3. Why It Works
3.1 The “50 % Sweet Spot”
Learning research shows that humans and machines learn fastest when their success rate sits around 50 %. R-Zero turns that insight into a reward function (a tiny numerical sketch follows the list):
- The Challenger receives a high reward when the Solver's accuracy on a new question is close to 50 %.
- Too easy (accuracy ≈ 100 %) and the reward drops.
- Too hard (accuracy ≈ 0 %) and the reward drops.
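To make the shape concrete, here is a tiny numerical sketch of a reward with exactly these properties; the precise formula in the paper may differ, the point is only the peak at 50 %.

def uncertainty_reward(solver_accuracy):
    # Peaks at 0.5 and falls linearly to 0 at accuracies of 0.0 and 1.0.
    return 1.0 - 2.0 * abs(solver_accuracy - 0.5)

for acc in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(acc, uncertainty_reward(acc))   # 0.0, 0.5, 1.0, 0.5, 0.0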
3.2 Curriculum Without Humans
Traditional curricula are fixed. R-Zero’s curriculum is adaptive: as the Solver gets stronger, the Challenger automatically writes harder questions. This prevents both boredom and impossible tasks.
4. Detailed Training Flow
4.1 Preparing the Environment
git clone https://github.com/Chengsong-Huang/R-Zero.git
cd R-Zero
pip install -r requirements.txt
export STORAGE_PATH="/path/to/fast/disk" # 100 GB+ recommended
export HUGGINGFACENAME="your_hf_username"
mkdir -p "$STORAGE_PATH"/{evaluation,models,generated_question,temp_results}
4.2 API Keys
- tokens.json – Hugging Face and Weights & Biases tokens.
- evaluation/results_recheck.py – OpenAI key for GPT-4o evaluation (used only for benchmarks).
4.3 One-Command Reproduction
bash scripts/main.sh Qwen/Qwen3-4B-Base qwen3-4b
The script runs three complete iterations.
Estimated wall time on 8×A100: < 12 hours.
5. Examining the Results
The paper tests four base models on two task families.
5.1 Math Benchmarks
AMC, AIME-2024/25, MATH-500, GSM8K, Olympiad-Bench, Minerva
5.2 General-Domain Reasoning
MMLU-Pro, SuperGPQA, BBEH
Key insight: math-focused training transfers cleanly to general tasks.
6. Inside the Algorithm
6.1 Group Relative Policy Optimisation (GRPO)
Instead of a separate value network, GRPO:
- Samples G answers for the same prompt.
- Normalises rewards with a z-score across the group.
- Updates the policy with a PPO-style clipped objective.
This keeps training stable without extra memory.
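A minimal sketch of the group-relative part, assuming the rewards for the G sampled answers to one prompt are already computed; the clipped update is shown only as a single-sample surrogate.

import statistics

def group_relative_advantages(rewards, eps=1e-6):
    # Z-score each reward against its own group of G samples,
    # so no separate value network is needed.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    # PPO-style clipped objective for one sample; ratio is pi_new / pi_old.
    clipped_ratio = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Example: 4 answers to one prompt, scored 1 for correct and 0 for wrong.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))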
6.2 Reward Design for the Challenger
The total reward for a new question is (a self-contained sketch follows the list):
r = max(0, r_uncertainty - r_repetition)
- r_uncertainty peaks when the Solver's accuracy is ≈ 50 %.
- r_repetition penalises near-duplicate questions using BLEU clustering.
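In the sketch below, the repetition term uses word n-gram Jaccard overlap as a cheap stand-in for the BLEU-based clustering, so treat it as an illustration rather than the repository's implementation.

def uncertainty_reward(acc):
    # Peaks when the Solver is right about half the time.
    return 1.0 - 2.0 * abs(acc - 0.5)

def ngram_overlap(a, b, n=3):
    # Jaccard overlap of word n-grams; a rough proxy for BLEU similarity.
    def grams(s):
        words = s.split()
        return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / max(1, len(ga | gb))

def repetition_penalty(question, batch, threshold=0.5):
    # Fraction of other questions in the batch that look near-duplicate.
    others = [q for q in batch if q is not question]   # exclude the question itself (by identity)
    if not others:
        return 0.0
    return sum(ngram_overlap(question, q) > threshold for q in others) / len(others)

def challenger_reward(question, batch, solver_accuracy):
    # r = max(0, r_uncertainty - r_repetition)
    return max(0.0, uncertainty_reward(solver_accuracy) - repetition_penalty(question, batch))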
6.3 Dataset Filtering
After each Challenger round:
- Generate 8,000 candidate questions.
- Let the current Solver answer each one 10 times.
- Keep questions where 3–7 answers agree (≈ 50 % consistency).
This removes trivial or ambiguous items.
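A minimal sketch of the filter, assuming the Solver's 10 sampled answers per question are already collected; the helper is illustrative, not the repository's code.

from collections import Counter

def keep_question(sampled_answers, low=3, high=7):
    # Keep a question only if the most common answer appears 3-7 times
    # out of 10 samples, i.e. the Solver is neither certain nor clueless.
    majority = Counter(sampled_answers).most_common(1)[0][1]
    return low <= majority <= high

print(keep_question(["12"] * 8 + ["13", "15"]))                                   # False: 8 agree, too easy
print(keep_question(["12", "12", "15", "12", "13", "12", "14", "12", "7", "9"]))  # True: 5 agree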
7. Ablation Study: What Actually Matters?
Removing one component at a time (Qwen3-4B):
Take-away: RL for the Challenger and quality filtering are non-negotiable.
8. Interaction with Supervised Data
R-Zero still cooperates with human labels.
Experiment:
- Baseline: fine-tune on 10 k labelled examples.
- R-Zero first, then fine-tune on the same 10 k examples.
Result: +2.35 extra points.
Interpretation: self-evolution “pre-conditions” the model, making later supervised learning more effective.
9. Installation Troubleshooting Guide
10. Frequently Asked Questions
Q1: How much GPU memory do I need?
A single 80 GB A100 can train 4 B–8 B models. Smaller GPUs work if you reduce batch size and enable gradient checkpointing.
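For example, with the Hugging Face transformers API, gradient checkpointing can be switched on for any loaded model (model name reused from the paper's setup; memory savings vary by model):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Base", torch_dtype="auto")
model.gradient_checkpointing_enable()   # trade extra compute for lower activation memory
model.config.use_cache = False          # the generation KV cache is incompatible with checkpointing during training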
Q2: Can I use a different base model?
Yes. Any Hugging Face causal LM with a ForCausalLM class is compatible. Replace the model name in scripts/main.sh.
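A quick compatibility check before editing the script, using an arbitrary Hub model purely as an example:

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.3"                  # example only; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)  # raises if the architecture has no *ForCausalLM class
print(type(model).__name__)                         # e.g. "MistralForCausalLM"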
Q3: Is the generated dataset public?
Not automatically. Each run writes the questions to generated_question/iter{1,2,3}. You can push them to the Hub by setting --push_to_hub.
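If you would rather push manually, here is a sketch with the datasets library; the file name and JSON Lines format are assumptions, so adjust them to whatever your run actually wrote under $STORAGE_PATH.

from datasets import load_dataset

ds = load_dataset("json", data_files="generated_question/iter1/questions.jsonl")  # path is illustrative
ds.push_to_hub("your_hf_username/r-zero-iter1-questions", private=True)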
Q4: Does R-Zero work for languages other than English?
The paper focuses on English math problems, but the framework is language-agnostic. Ensure your base model supports the target language.
Q5: How do I stop after one iteration?
Add --max_iterations 1 to the launch command. The checkpoint is still compatible with later continuation.
11. Limitations and Next Steps
12. Ethical and Practical Considerations
- Data Leakage: Generated questions are new, but may overlap with existing benchmarks. Always run a de-duplication step before production use (a minimal sketch follows this list).
- Compute Cost: Three iterations of an 8 B model take roughly 300 GPU-hours. Consider carbon offsets or spot instances.
- Model Release: If you publish derivative weights, include a note that they originate from R-Zero self-evolution.
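A minimal de-duplication sketch using only the standard library; the 0.85 similarity threshold is an arbitrary choice, not a value from the paper.

import difflib

def is_near_duplicate(question, benchmark, threshold=0.85):
    # Flag a generated question if it is highly similar to any benchmark item.
    return any(difflib.SequenceMatcher(None, question.lower(), ref.lower()).ratio() >= threshold
               for ref in benchmark)

generated = ["What is 2 + 2?", "A train travels 60 km in 45 minutes. What is its average speed?"]
benchmark = ["What is 2 + 2?"]
clean = [q for q in generated if not is_near_duplicate(q, benchmark)]
print(clean)   # the exact duplicate is dropped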
13. Citation
If you use R-Zero in your work, please cite:
@misc{huang2025rzeroselfevolvingreasoningllm,
title={R-Zero: Self-Evolving Reasoning LLM from Zero Data},
author={Chengsong Huang and Wenhao Yu and Xiaoyang Wang and Hongming Zhang and Zongxia Li and Ruosen Li and Jiaxin Huang and Haitao Mi and Dong Yu},
year={2025},
eprint={2508.05004},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.05004}
}
14. Final Thoughts
R-Zero turns the classic data-hungry pipeline on its head. Instead of asking, “Where can we get more labelled problems?” it asks, “What if the model creates the problems it needs most?”
For educators, researchers, and engineers, that shift opens the door to truly autonomous improvement loops. Clone the repo, run the script, and watch your base model evolve—no textbooks required.