R-Zero: Teaching Large Language Models to Reason—Without Any Data

A step-by-step guide for practitioners who want a self-improving LLM that starts from nothing but a base checkpoint.


1. The Problem We All Share

Training a model to reason has always looked like this:

  1. Collect thousands of exam questions.
  2. Pay experts to write detailed, correct answers.
  3. Fine-tune the model on those answers.
  4. Hope the model generalises.

That pipeline is slow, expensive, and hard to scale. R-Zero removes steps 1–2 entirely. It shows how one base model can act as both teacher and student, producing its own curriculum and steadily getting better—no human labels required.


2. A 60-Second Overview of R-Zero

Imagine two copies of the same model:

Role         Nickname            Job Description
-----------  ------------------  ----------------------------------------------------------------
Challenger   “Question Writer”   Generates new problems that are just hard enough for the Solver.
Solver       “Student”           Solves those problems and learns from its own mistakes.

They take turns:

  1. Challenger round – the Solver is frozen while the Challenger is trained.
  2. Solver round – the Challenger is frozen while the Solver is trained.
  3. Repeat.

After three full cycles the model outperforms the original baseline on both math and broad reasoning tasks.


3. Why It Works

3.1 The “50 % Sweet Spot”

Research on learning suggests that both humans and machines improve fastest when their success rate sits near 50 %. R-Zero turns that insight into a reward function:

  • Challenger receives high reward when the Solver’s accuracy on a new question is close to 50 %.
  • Too easy (≈ 100 %) and reward drops.
  • Too hard (≈ 0 %) and reward drops.
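
To make this concrete, here is a minimal sketch of a reward that peaks at 50 % Solver accuracy and falls off linearly toward 0 % and 100 %. The linear shape is an assumption for illustration; the paper defines the exact formula.

def uncertainty_reward(solver_accuracy: float) -> float:
    # Highest when the frozen Solver answers a new question correctly
    # about half the time. The linear shape is illustrative only.
    return 1.0 - 2.0 * abs(solver_accuracy - 0.5)

print(uncertainty_reward(0.5))  # 1.0 -> "just hard enough"
print(uncertainty_reward(1.0))  # 0.0 -> too easy
print(uncertainty_reward(0.0))  # 0.0 -> too hard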

3.2 Curriculum Without Humans

Traditional curricula are fixed. R-Zero’s curriculum is adaptive: as the Solver gets stronger, the Challenger automatically writes harder questions. This prevents both boredom and impossible tasks.


4. Detailed Training Flow

4.1 Preparing the Environment

git clone https://github.com/Chengsong-Huang/R-Zero.git
cd R-Zero
pip install -r requirements.txt

export STORAGE_PATH="/path/to/fast/disk"   # 100 GB+ recommended
export HUGGINGFACENAME="your_hf_username"

mkdir -p "$STORAGE_PATH"/{evaluation,models,generated_question,temp_results}

4.2 API Keys

  • tokens.json – Hugging Face & Weights & Biases tokens.
  • evaluation/results_recheck.py – OpenAI key for GPT-4o evaluation (used only for benchmarks).

4.3 One-Command Reproduction

bash scripts/main.sh Qwen/Qwen3-4B-Base qwen3-4b

The script runs three complete iterations.
Estimated wall time on 8×A100: < 12 hours.


5. Examining the Results

The paper tests four base models on two task families.

5.1 Math Benchmarks

AMC, AIME-2024/25, MATH-500, GSM8K, Olympiad-Bench, Minerva

Model            Base Score   After 3 Iterations   Gain
---------------  -----------  -------------------  ------
Qwen3-4B-Base    42.58        49.07                +6.49
Qwen3-8B-Base    49.18        54.69                +5.51
OctoThinker-3B   26.64        29.32                +2.68
OctoThinker-8B   32.11        38.52                +6.41

5.2 General-Domain Reasoning

MMLU-Pro, SuperGPQA, BBEH

Model            Base Score   After 3 Iterations   Gain
---------------  -----------  -------------------  -------
Qwen3-4B-Base    27.10        34.64                +7.54
Qwen3-8B-Base    34.49        38.73                +4.24
OctoThinker-3B   12.27        15.67                +3.40
OctoThinker-8B   16.81        26.88                +10.07

Key insight: math-focused training transfers cleanly to general tasks.


6. Inside the Algorithm

6.1 Group Relative Policy Optimisation (GRPO)

Instead of training a separate value network, GRPO:

  • Samples G answers for the same prompt.
  • Normalises rewards with a z-score across the group.
  • Updates the policy with a PPO-style clipped objective.

This keeps training stable without extra memory.
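
As a concrete illustration, here is a minimal PyTorch sketch of the two ingredients described above: group-relative (z-score) advantages and the PPO-style clipped objective. It is a simplified stand-in, not the repository's training code; sequence-level log-probabilities and the toy batch shapes are assumptions.

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards has shape (num_prompts, G): G sampled answers per prompt.
    # Z-score each reward within its own group of G samples.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # PPO-style clipped surrogate on the group-normalised advantages.
    # Sequence-level log-probs are assumed here for brevity.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage: 2 prompts, G = 4 sampled answers each, binary correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = grpo_advantages(rewards)
loss = grpo_clipped_loss(torch.randn(2, 4), torch.randn(2, 4), advantages)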

6.2 Reward Design for the Challenger

The total reward for a new question is:

r = max(0, r_uncertainty - r_repetition)
  • r_uncertainty peaks when Solver accuracy ≈ 50 %.
  • r_repetition penalises near-duplicate questions using BLEU clustering.
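
A minimal sketch of how the two terms combine is shown below. To keep the example self-contained, the BLEU-based duplicate measure is replaced with a crude word-overlap proxy; the repository's actual penalty and thresholds differ.

def uncertainty_term(solver_accuracy: float) -> float:
    # Peaks when the Solver is right about half the time (illustrative shape).
    return 1.0 - 2.0 * abs(solver_accuracy - 0.5)

def repetition_term(question: str, earlier_questions: list[str]) -> float:
    # Stand-in for the BLEU-based penalty: fraction of earlier questions
    # whose word sets largely overlap with the new one.
    words = set(question.lower().split())
    if not earlier_questions:
        return 0.0
    def overlap(other: str) -> float:
        other_words = set(other.lower().split())
        return len(words & other_words) / max(len(words | other_words), 1)
    return sum(overlap(q) > 0.8 for q in earlier_questions) / len(earlier_questions)

def challenger_reward(solver_accuracy, question, earlier_questions):
    # r = max(0, r_uncertainty - r_repetition), as stated above.
    return max(0.0, uncertainty_term(solver_accuracy)
                    - repetition_term(question, earlier_questions))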

6.3 Dataset Filtering

After each Challenger round:

  1. Generate 8 000 candidate questions.
  2. Let the current Solver answer each 10 times.
  3. Keep questions where 3–7 answers agree (≈ 50 % consistency).
    This removes trivial or ambiguous items.
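
A minimal sketch of this consistency filter, assuming each candidate question already comes with 10 sampled Solver answers (the sampling and answer-normalisation steps are omitted):

from collections import Counter

def keep_question(sampled_answers: list[str],
                  min_agree: int = 3, max_agree: int = 7) -> bool:
    # Keep a question only if its most common answer appears between
    # 3 and 7 times out of 10, i.e. roughly 50% self-consistency.
    most_common_count = Counter(sampled_answers).most_common(1)[0][1]
    return min_agree <= most_common_count <= max_agree

print(keep_question(["12"] * 10))                  # False -> trivial
print(keep_question(["12"] * 5 + ["7"] * 5))       # True  -> informative
print(keep_question([str(i) for i in range(10)]))  # False -> ambiguous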

7. Ablation Study: What Actually Matters?

Removing one component at a time (Qwen3-4B):

Removed Part            Math Score Δ   General Score Δ
----------------------  -------------  ----------------
No RL for Challenger    -3.70          -4.10
No Repetition Penalty   -1.30          -2.85
No Filtering            -0.71          -6.15

Take-away: RL for the Challenger and quality filtering are non-negotiable.


8. Interaction with Supervised Data

R-Zero also complements human labels when you have them.

Experiment:

  1. Baseline: fine-tune on 10 k labelled examples.
  2. R-Zero first → then fine-tune on the same 10 k examples.

Result: an additional +2.35 points over the supervised-only baseline.
Interpretation: self-evolution “pre-conditions” the model, making later supervised learning more effective.


9. Installation Troubleshooting Guide

Symptom                                   Likely Cause                Fix
----------------------------------------  --------------------------  ----------------------------------------
Script hangs during question generation   math_verify infinite loop   Restart; the checkpoint auto-resumes
CUDA OOM                                  Batch size too large        Lower batch_size in configs/*.yaml
WandB login fails                         Missing token               Add "wandb": "your_key" to tokens.json
Unicode errors                            Locale mismatch             export PYTHONIOENCODING=utf-8

10. Frequently Asked Questions

Q1: How much GPU memory do I need?

A single 80 GB A100 can train 4 B–8 B models. Smaller GPUs work if you reduce batch size and enable gradient checkpointing.
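
If you are memory-constrained, the two standard levers are half precision and gradient checkpointing. A minimal Hugging Face sketch (the model name is just an example; batch sizes live in the repo's configs):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Base"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,    # roughly halves memory vs. fp32
)
model.gradient_checkpointing_enable()  # trades extra compute for activation memory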

Q2: Can I use a different base model?

Yes. Any Hugging Face causal LM with a ForCausalLM class is compatible. Replace the model name in scripts/main.sh.
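
Before editing scripts/main.sh, you can sanity-check a candidate checkpoint by inspecting its registered architecture. A quick, hedged check (the model name is an example):

from transformers import AutoConfig

name = "Qwen/Qwen3-8B-Base"  # swap in your candidate base model
config = AutoConfig.from_pretrained(name)

# Causal LMs register an architecture name ending in "ForCausalLM".
print(config.architectures)  # e.g. ['Qwen3ForCausalLM']
assert any(arch.endswith("ForCausalLM") for arch in (config.architectures or []))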

Q3: Is the generated dataset public?

Not automatically. Each run writes the questions to generated_question/iter{1,2,3}. You can push to the Hub by setting --push_to_hub.

Q4: Does R-Zero work for languages other than English?

The paper focuses on English math problems, but the framework is language-agnostic. Ensure your base model supports the target language.

Q5: How do I stop after one iteration?

Add --max_iterations 1 to the launch command. The checkpoint is still compatible with later continuation.


11. Limitations and Next Steps

Current Limitation                             Future Direction
---------------------------------------------  --------------------------------------------------------
Needs verifiable answers (math/code)           Extend to open-ended tasks via learned reward models
Label accuracy falls as questions get harder   Use ensembles or human verification for final rounds
Training cost grows linearly with iterations   Investigate early stopping and better sample efficiency

12. Ethical and Practical Considerations

  • Data Leakage: Generated questions are new, but may overlap with existing benchmarks. Always run a de-duplication step before production use (a minimal sketch follows this list).
  • Compute Cost: Three iterations of an 8 B model ≈ 300 GPU-hours. Consider carbon offsets or spot instances.
  • Model Release: If you publish derivative weights, include a note that they originate from R-Zero self-evolution.
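
For the de-duplication step mentioned under Data Leakage, here is a minimal n-gram-overlap sketch. It is a rough contamination heuristic, not the method used in the paper; the tokenisation and thresholds are assumptions.

def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps_benchmark(question: str, benchmark_questions: list[str],
                       n: int = 8, threshold: float = 0.5) -> bool:
    # Flag a generated question if a large fraction of its n-grams also
    # appears in any benchmark item (crude contamination check).
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    return any(
        len(q_grams & ngrams(ref, n)) / len(q_grams) >= threshold
        for ref in benchmark_questions
    )

# Toy data; in practice load your generated questions and the benchmark sets.
generated_questions = [
    "What is the sum of the first 100 positive integers?",
    "A train travels 60 km in 45 minutes. What is its average speed in km/h?",
]
benchmark_questions = ["What is the sum of the first 100 positive integers?"]
clean = [q for q in generated_questions
         if not overlaps_benchmark(q, benchmark_questions)]
print(clean)  # only the train question survives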

13. Citation

If you use R-Zero in your work, please cite:

@misc{huang2025rzeroselfevolvingreasoningllm,
  title={R-Zero: Self-Evolving Reasoning LLM from Zero Data},
  author={Chengsong Huang and Wenhao Yu and Xiaoyang Wang and Hongming Zhang and Zongxia Li and Ruosen Li and Jiaxin Huang and Haitao Mi and Dong Yu},
  year={2025},
  eprint={2508.05004},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.05004}
}

14. Final Thoughts

R-Zero turns the classic data-hungry pipeline on its head. Instead of asking, “Where can we get more labelled problems?” it asks, “What if the model creates the problems it needs most?”

For educators, researchers, and engineers, that shift opens the door to truly autonomous improvement loops. Clone the repo, run the script, and watch your base model evolve—no textbooks required.