From Quick Guesses to Thoughtful Drafts: How MetaStone-S1 Makes a 32 B Model Rival OpenAI o3-mini


1. Why Do Large Language Models Need Scratch Paper?

Imagine you are taking a tough math final.
If you must write the final answer in one shot, you will probably lose points.
Give yourself scratch paper, let yourself jot down three different approaches, and then hand in the cleanest version—your score jumps.

Large language models (LLMs) face the same problem.
Traditional models generate one answer and stop.
A newer idea called Test-Time Scaling (TTS) lets the model create many “draft solutions” at inference time, score them, and return the best one.

MetaStone-S1 (nicknamed XBai o4 in the open-source release) pushes this idea further:

  • Only 32 billion parameters
  • No human step-by-step labels required
  • One network drafts, scores, and chooses
  • Matches or beats OpenAI o3-mini-medium on math, coding, and Chinese reasoning tasks

Below you will find the entire story in everyday terms, plus copy-paste commands to run it yourself.


2. Two Flavors of Test-Time Scaling

  • Internal TTS (long chain-of-thought): the model writes a very long monologue before the final answer. Pain points: it can “over-think,” burns tokens, and stays blind to its own errors.
  • External TTS (draft + scorer): generate many short drafts in parallel and use a second model to pick the best. Pain point: it needs a separate scoring model, which adds cost and latency.

MetaStone-S1 keeps the external spirit but removes the extra cost.


3. Meet the “Reflective Generative Form”

Think of one neural network wearing two hats:

  1. Policy Hat – writes k different scratch solutions inside special <think></think> blocks
  2. Scorer Hat – reads each scratch line-by-line and returns a quality score

The same transformer body is used for both hats.
Only 53 M extra weights (a tiny classifier head) are added on top of a 32 B model.
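
Here is a minimal PyTorch-style sketch of the idea. The module and variable names (ReflectiveModel, score_head, and so on) are illustrative, not the paper's actual code:

import torch
import torch.nn as nn

class ReflectiveModel(nn.Module):
    """Illustrative sketch: one transformer trunk, two heads."""
    def __init__(self, backbone, hidden_size, vocab_size):
        super().__init__()
        self.backbone = backbone                      # shared trunk used by both hats
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)   # "policy hat"
        self.score_head = nn.Sequential(              # "scorer hat": tiny extra classifier
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids, step_positions):
        hidden = self.backbone(input_ids)             # [batch, seq, hidden]
        logits = self.lm_head(hidden)                 # next-token logits for drafting
        step_scores = torch.sigmoid(                  # one quality score per reasoning step
            self.score_head(hidden[:, step_positions]).squeeze(-1)
        )
        return logits, step_scores

Because both hats share the backbone, scoring a draft only adds the tiny head on top, instead of a second full-size reward model.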

3.1 How the scorer learns without human labels

Classic Process Reward Models (PRMs) need humans to mark every step right or wrong.
MetaStone-S1’s Self-supervised Process Reward Model (SPRM) only needs the final answer label (correct / wrong).

Training trick (a minimal code sketch follows this list):

  1. For every draft, compute a score for each step.
  2. Compare the weighted average of step scores to the final answer.
  3. Keep the gradient only when the model’s own guess agrees with the final label.
  4. Over time the score gap between good and bad steps widens—researchers call this the “aha moment”.
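
Here is a minimal sketch of that filtering idea, assuming each draft comes with per-step scores in (0, 1) and a single correct/wrong label for its final answer (the exact weighting and aggregation in the paper may differ):

import torch
import torch.nn.functional as F

def sprm_loss(step_scores, final_label):
    """Self-supervised process-reward loss, sketched.
    step_scores: tensor of shape [num_steps], scores in (0, 1) for one draft
    final_label: 1.0 if the draft's final answer is correct, else 0.0
    """
    target = torch.full_like(step_scores, final_label)
    # Keep the gradient only on steps whose own guess (score > 0.5)
    # already agrees with the final outcome label.
    agree = ((step_scores > 0.5).float() == final_label).float()
    per_step = F.binary_cross_entropy(step_scores, target, reduction="none")
    # Guard against the early phase where no step agrees yet.
    return (agree * per_step).sum() / agree.sum().clamp(min=1.0)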

4. Three-Step Inference Pipeline

The pipeline has three steps, each with its own purpose and example command:

  1. Start the scorer API – turns the SPRM head into a micro-service (python test/score_model_queue.py …)
  2. Start the generator API – serves the policy head (python test/policy_model_queue.py …)
  3. Run the benchmark – sends questions to both APIs and picks the best draft (python test/inference.py …)

All scripts are in the official repo.
They work out of the box on a single A100, or even on an RTX 4090 if you pick the 1.5 B variant.
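
The selection logic itself is small. Below is an illustrative client-side sketch of best-of-k selection against the two APIs; the JSON field names ("prompt", "text", "response", "score") are assumptions, so check the repo's scripts for the real request format:

import requests

GEN_URL = "http://localhost:8000"          # policy (generator) API
SCORE_URL = "http://localhost:8001/score"  # SPRM (scorer) API

def best_of_k(question, k=8):
    """Generate k drafts, score each with the SPRM head, return the best one."""
    drafts = []
    for _ in range(k):
        reply = requests.post(GEN_URL, json={"prompt": question})
        drafts.append(reply.json()["text"])            # field name is an assumption
    scored = []
    for draft in drafts:
        reply = requests.post(SCORE_URL, json={"prompt": question, "response": draft})
        scored.append((reply.json()["score"], draft))  # field name is an assumption
    return max(scored)[1]                              # highest-scoring draft wins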


5. What the Numbers Say

5.1 Benchmarks used

  • AIME24 & AIME25 – elite high-school math contests
  • LiveCodeBench v5 – real programming problems from LeetCode, AtCoder, CodeForces
  • C-EVAL – Chinese language and reasoning tasks

Metric: Pass@1 – one answer per question, averaged over 64 random seeds.
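
In code, that metric is just an average of per-seed correctness. A tiny sketch (deciding whether an answer is correct is left to the benchmark's own grader):

def pass_at_1(per_seed_correct):
    """per_seed_correct: list of 64 booleans, one per random seed,
    each saying whether that seed's single answer was correct."""
    return 100.0 * sum(per_seed_correct) / len(per_seed_correct)

# Example: correct on 51 of 64 seeds -> 79.69
print(pass_at_1([True] * 51 + [False] * 13))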

5.2 Main results (averaged)

Model                      AIME24   AIME25   LiveCodeBench v5   C-EVAL
OpenAI o3-mini-medium       79.6     74.8     66.3               75.9
MetaStone-S1 32 B – high    85.2     73.6     64.2               89.7
Delta                       +5.6     –1.2     –2.1               +13.8

MetaStone-S1 wins on math and Chinese tasks, stays competitive on code.

5.3 Smaller footprints still punch above their weight

Size    Model                AIME24   Beats
1.5 B   MetaStone-S1-high    57.9     R1-Distill-Qwen-7B
7 B     MetaStone-S1-high    70.2     QwQ-32B

This means students or indie developers with limited GPUs can still run a strong reasoner.


6. Deeper Dive: The “Aha Moment” and Scaling Laws

6.1 The aha moment visualized

Early in training, good and bad scratch solutions receive almost the same score.
Around 10 k–50 k training steps (depending on model size) the curves suddenly split; the paper marks the split with a green dashed line.

(Figure 4 in the paper: score trajectories before and after the split.)

After the split, the SPRM reliably assigns low scores to faulty algebra or buggy code.

6.2 Scaling law

Define compute budget C = model parameters × total draft tokens.
Across 1.5 B, 7 B, and 32 B versions, accuracy grows roughly as log(C) until about 32× the base token budget.
Beyond that, gains flatten—so Best-of-32 is the sweet spot.
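
A quick back-of-the-envelope sketch of that budget bookkeeping (the token count per draft is illustrative, not a number from the paper):

import math

def compute_budget(n_params, tokens_per_draft, n_drafts):
    """C = model parameters x total draft tokens spent on one question."""
    return n_params * tokens_per_draft * n_drafts

# Accuracy grows roughly with log(C) up to about Best-of-32, then flattens.
for k in (2, 8, 32, 64):
    C = compute_budget(32e9, 4096, k)   # 32 B params, 4096-token drafts (assumed)
    print(f"Best-of-{k:<2}  C = {C:.2e}  log10(C) = {math.log10(C):.2f}")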


7. Ablation: Why Not Just Use a Big External PRM?

Scoring Model       Extra Params   AIME24 (1.5 B)   AIME24 (7 B)
Qwen2.5-Math-PRM    72 B           56.7             68.8
SPRM (ours)         5 M / 26 M     57.9             70.2

A 72-billion-parameter external scorer is slower, costlier, and weaker than the tiny SPRM head inside MetaStone-S1.


8. Zero-Shot Generalization

The SPRM head was trained only on math problems.
Without further tuning, it is dropped into LiveCodeBench (code) and C-EVAL (Chinese) and still outperforms separate reward models.
This suggests the head learns domain-agnostic reasoning patterns rather than narrow tricks.


9. Quick Start Guide

Below are the exact commands from the repo README.

9.1 Install

conda create -n xbai_o4 python=3.10
conda activate xbai_o4
pip install -e verl
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1

9.2 Single-GPU training (example)

export WANDB_API_KEY=YOUR_KEY
bash ./scripts/run_single_node.sh

9.3 Convert checkpoint to Hugging Face format

cd ./verl/scripts
bash model_merger.sh

9.4 Evaluate on AIME24 (Best-of-2 low mode)

# scorer API
CUDA_VISIBLE_DEVICES=0 python test/score_model_queue.py \
  --model_path path/to/XBai-o4 \
  --score_model_dim 1536 --lang en \
  --ip 0.0.0.0 --port 8001

# generator API
CUDA_VISIBLE_DEVICES=1 python test/policy_model_queue.py \
  --model_path path/to/XBai-o4 \
  --ip 0.0.0.0 --port 8000

# run inference
python test/inference.py \
  --task aime24 --input_file data/aime24.jsonl \
  --output_file result.jsonl --n_samples 16 \
  --branch 2 \
  --response_api_url "http://localhost:8000" \
  --score_api_url "http://localhost:8001/score"

# compute score
python test/compute_metric.py \
  --task aime24 --result_paths result.jsonl --N 2

Replace N with 2, 8, 32 to reproduce low, medium, high modes respectively.
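
If you want all three modes in one go, a small wrapper over the same command works (it assumes result.jsonl already contains enough sampled drafts per question for the largest N):

import subprocess

for n in (2, 8, 32):   # low, medium, high modes
    subprocess.run(
        ["python", "test/compute_metric.py",
         "--task", "aime24",
         "--result_paths", "result.jsonl",
         "--N", str(n)],
        check=True,
    )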


10. Frequently Asked Questions

Q1: I only have a 24 GB gaming card. Can I run the 32 B model?
Not for training. For inference, even 8-bit weights of the 32 B model are roughly 32 GB, so on a 24 GB card you will need 4-bit quantization or CPU offloading. Use the 1.5 B or 7 B variant for full-precision training.
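
For the quantized-inference route, here is a minimal sketch using Hugging Face transformers with bitsandbytes 4-bit loading; the model path is the same placeholder used in the commands above, and the repo may recommend a different serving setup:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "path/to/XBai-o4"   # placeholder, as in the commands above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",            # spills layers to CPU if the GPU runs out of room
)

inputs = tokenizer("Solve: 2x + 3 = 11", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=256)[0]))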

Q2: Do I need human labels for my own dataset?
No. Provide only the final answer (correct / wrong). The SPRM head learns step-level signals automatically.

Q3: How is this different from DeepSeek-R1?
R1 uses internal long chain-of-thought; MetaStone-S1 uses external parallel drafts + shared scorer. The latter gives better accuracy per parameter.

Q4: Can I plug MetaStone-S1 into an MCTS search?
Yes. The paper demonstrates a light-weight MCTS that uses SPRM scores as node values. Accuracy on AIME24 rises from 39.3 → 52.8 on the 1.5 B model.
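
For intuition, here is a rough sketch of how SPRM scores could act as node values in a UCT-style selection rule. This is only an illustration of the idea, not the paper's actual search procedure:

import math

class Node:
    def __init__(self, partial_solution, parent=None):
        self.partial_solution = partial_solution
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0          # accumulated SPRM scores

    def uct(self, c=1.0):
        """Standard UCT: exploit the average SPRM score, explore rarely visited nodes."""
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def backpropagate(node, sprm_score):
    """Push the SPRM score of a newly expanded step up the tree."""
    while node is not None:
        node.visits += 1
        node.value_sum += sprm_score
        node = node.parent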


11. Take-away Checklist

  • One network, two hats – no separate 70 B reward model
  • Label-free training – only correct / wrong answers needed
  • Outperforms o3-mini-medium on math & Chinese tasks
  • Runs on a single GPU at 1.5 B / 7 B sizes
  • Apache-2.0 license – commercial use allowed


Citation

If you use this work, please cite:

@misc{wang2025testtimescalingreflectivegenerative,
  title={Test-Time Scaling with Reflective Generative Model},
  author={Zixiao Wang and Yuxin Wang and Xiaorui Wang and Mengting Xing and Jie Gao and Jianjun Xu and Guangcan Liu and Chenhui Jin and Zhuo Wang and Shengzhuo Zhang and Hongtao Xie},
  year={2025},
  eprint={2507.01951},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.01951}
}