From Quick Guesses to Thoughtful Drafts: How MetaStone-S1 Makes a 32 B Model Rival OpenAI o3-mini


1. Why Do Large Language Models Need Scratch Paper?

Imagine you are taking a tough math final.
If you must write the final answer in one shot, you will probably lose points.
Give yourself scratch paper, let yourself jot down three different approaches, and then hand in the cleanest version—your score jumps.

Large language models (LLMs) face the same problem.
Traditional models generate one answer and stop.
A newer idea called Test-Time Scaling (TTS) lets the model create many “draft solutions” at inference time, score them, and return the best one.

MetaStone-S1 (nicknamed XBai o4 in the open-source release) pushes this idea further:

  • Only 32 billion parameters
  • No human step-by-step labels required
  • One network drafts, scores, and chooses
  • Matches or beats OpenAI o3-mini-medium on math, coding, and Chinese reasoning tasks

Below you will find the entire story in everyday terms, plus copy-paste commands to run it yourself.


2. Two Flavors of Test-Time Scaling

  • Internal TTS (long chain-of-thought): the model writes a very long monologue before the final answer. Pain points: it can “over-think,” burns tokens, and stays blind to its own errors.
  • External TTS (draft + scorer): generate many short drafts in parallel and use a second model to pick the best. Pain point: it needs a separate scoring model, which adds cost and latency.

MetaStone-S1 keeps the external spirit but removes the extra cost.


3. Meet the “Reflective Generative Form”

Think of one neural network wearing two hats:

  1. Policy Hat – writes k different scratch solutions inside special <think></think> blocks
  2. Scorer Hat – reads each scratch line-by-line and returns a quality score

The same transformer body is used for both hats.
Only 53 M extra weights (a tiny classifier head) are added on top of a 32 B model.
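
Here is a minimal PyTorch-style sketch of the idea. The module and variable names (ReflectiveModel, score_head, and so on) are illustrative, not the paper's actual code:

import torch
import torch.nn as nn

class ReflectiveModel(nn.Module):
    """Illustrative sketch: one transformer trunk, two heads."""
    def __init__(self, backbone, hidden_size, vocab_size):
        super().__init__()
        self.backbone = backbone                      # shared trunk used by both hats
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)   # "policy hat"
        self.score_head = nn.Sequential(              # "scorer hat": tiny extra classifier
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids, step_positions):
        hidden = self.backbone(input_ids)             # [batch, seq, hidden]
        logits = self.lm_head(hidden)                 # next-token logits for drafting
        step_scores = torch.sigmoid(                  # one quality score per reasoning step
            self.score_head(hidden[:, step_positions]).squeeze(-1)
        )
        return logits, step_scores

Because both hats share the backbone, scoring a draft only adds the tiny head on top, instead of a second full-size reward model.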

3.1 How the scorer learns without human labels

Classic Process Reward Models (PRMs) need humans to mark every step right or wrong.
MetaStone-S1’s Self-supervised Process Reward Model (SPRM) only needs the final answer label (correct / wrong).

Training trick (a minimal code sketch follows this list):

  1. For every draft, compute a score for each step.
  2. Compare the weighted average of step scores to the final answer.
  3. Keep the gradient only when the model’s own guess agrees with the final label.
  4. Over time the score gap between good and bad steps widens—researchers call this the “aha moment”.
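
Here is a minimal sketch of that filtering idea, assuming each draft comes with per-step scores in (0, 1) and a single correct/wrong label for its final answer (the exact weighting and aggregation in the paper may differ):

import torch
import torch.nn.functional as F

def sprm_loss(step_scores, final_label):
    """Self-supervised process-reward loss, sketched.
    step_scores: tensor of shape [num_steps], scores in (0, 1) for one draft
    final_label: 1.0 if the draft's final answer is correct, else 0.0
    """
    target = torch.full_like(step_scores, final_label)
    # Keep the gradient only on steps whose own guess (score > 0.5)
    # already agrees with the final outcome label.
    agree = ((step_scores > 0.5).float() == final_label).float()
    per_step = F.binary_cross_entropy(step_scores, target, reduction="none")
    # Guard against the early phase where no step agrees yet.
    return (agree * per_step).sum() / agree.sum().clamp(min=1.0)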

4. Three-Step Inference Pipeline

The pipeline has three steps, each with its own purpose and example command:

  1. Start the scorer API – turns the SPRM head into a micro-service (python test/score_model_queue.py …)
  2. Start the generator API – serves the policy head (python test/policy_model_queue.py …)
  3. Run the benchmark – sends questions to both APIs and picks the best draft (python test/inference.py …)

All scripts are in the official repo.
They work out of the box on a single A100, or even on an RTX 4090 if you pick the 1.5 B variant.
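
The selection logic itself is small. Below is an illustrative client-side sketch of best-of-k selection against the two APIs; the JSON field names ("prompt", "text", "response", "score") are assumptions, so check the repo's scripts for the real request format:

import requests

GEN_URL = "http://localhost:8000"          # policy (generator) API
SCORE_URL = "http://localhost:8001/score"  # SPRM (scorer) API

def best_of_k(question, k=8):
    """Generate k drafts, score each with the SPRM head, return the best one."""
    drafts = []
    for _ in range(k):
        reply = requests.post(GEN_URL, json={"prompt": question})
        drafts.append(reply.json()["text"])            # field name is an assumption
    scored = []
    for draft in drafts:
        reply = requests.post(SCORE_URL, json={"prompt": question, "response": draft})
        scored.append((reply.json()["score"], draft))  # field name is an assumption
    return max(scored)[1]                              # highest-scoring draft wins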


5. What the Numbers Say

5.1 Benchmarks used

  • AIME24 & AIME25 – elite high-school math contests
  • LiveCodeBench v5 – real programming problems from LeetCode, AtCoder, CodeForces
  • C-EVAL – Chinese language and reasoning tasks

Metric: Pass@1 – one answer per question, averaged over 64 random seeds.
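
In code, that metric is just an average of per-seed correctness. A tiny sketch (deciding whether an answer is correct is left to the benchmark's own grader):

def pass_at_1(per_seed_correct):
    """per_seed_correct: list of 64 booleans, one per random seed,
    each saying whether that seed's single answer was correct."""
    return 100.0 * sum(per_seed_correct) / len(per_seed_correct)

# Example: correct on 51 of 64 seeds -> 79.69
print(pass_at_1([True] * 51 + [False] * 13))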

5.2 Main results (averaged)

Model                      AIME24   AIME25   LiveCodeBench v5   C-EVAL
OpenAI o3-mini-medium       79.6     74.8     66.3               75.9
MetaStone-S1 32 B – high    85.2     73.6     64.2               89.7
Delta                       +5.6     –1.2     –2.1               +13.8

MetaStone-S1 wins on math and Chinese tasks, stays competitive on code.

5.3 Smaller footprints still punch above their weight

Size    Model                AIME24   Beats
1.5 B   MetaStone-S1-high    57.9     R1-Distill-Qwen-7B
7 B     MetaStone-S1-high    70.2     QwQ-32B

This means students or indie developers with limited GPUs can still run a strong reasoner.


6. Deeper Dive: The “Aha Moment” and Scaling Laws

6.1 The aha moment visualized

Early in training, good and bad scratch solutions receive almost the same score.
Around 10 k–50 k training steps (depending on model size) the curves suddenly split; the paper marks the split with a green dashed line.

(Figure 4 in the paper: score trajectories before and after the split.)

After the split, the SPRM reliably assigns low scores to faulty algebra or buggy code.

6.2 Scaling law

Define compute budget C = model parameters × total draft tokens.
Across 1.5 B, 7 B, and 32 B versions, accuracy grows roughly as log(C) until about 32× the base token budget.
Beyond that, gains flatten—so Best-of-32 is the sweet spot.
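
A quick back-of-the-envelope sketch of that budget bookkeeping (the token count per draft is illustrative, not a number from the paper):

import math

def compute_budget(n_params, tokens_per_draft, n_drafts):
    """C = model parameters x total draft tokens spent on one question."""
    return n_params * tokens_per_draft * n_drafts

# Accuracy grows roughly with log(C) up to about Best-of-32, then flattens.
for k in (2, 8, 32, 64):
    C = compute_budget(32e9, 4096, k)   # 32 B params, 4096-token drafts (assumed)
    print(f"Best-of-{k:<2}  C = {C:.2e}  log10(C) = {math.log10(C):.2f}")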


7. Ablation: Why Not Just Use a Big External PRM?

Scoring Model       Extra Params   AIME24 (1.5 B)   AIME24 (7 B)
Qwen2.5-Math-PRM    72 B           56.7             68.8
SPRM (ours)         5 M / 26 M     57.9             70.2

A 72-billion-parameter external scorer is slower, costlier, and weaker than the tiny SPRM head inside MetaStone-S1.


8. Zero-Shot Generalization

The SPRM head was trained only on math problems.
Without further tuning, it is dropped into LiveCodeBench (code) and C-EVAL (Chinese) and still outperforms separate reward models.
This suggests the head learns domain-agnostic reasoning patterns rather than narrow tricks.


9. Quick Start Guide

Below are the exact commands from the repo README.

9.1 Install

conda create -n xbai_o4 python=3.10
conda activate xbai_o4
pip install -e verl
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1

9.2 Single-GPU training (example)

export WANDB_API_KEY=YOUR_KEY
bash ./scripts/run_single_node.sh

9.3 Convert checkpoint to Hugging Face format

cd ./verl/scripts
bash model_merger.sh

9.4 Evaluate on AIME24 (Best-of-2 low mode)

# scorer API
CUDA_VISIBLE_DEVICES=0 python test/score_model_queue.py \
  --model_path path/to/XBai-o4 \
  --score_model_dim 1536 --lang en \
  --ip 0.0.0.0 --port 8001

# generator API
CUDA_VISIBLE_DEVICES=1 python test/policy_model_queue.py \
  --model_path path/to/XBai-o4 \
  --ip 0.0.0.0 --port 8000

# run inference
python test/inference.py \
  --task aime24 --input_file data/aime24.jsonl \
  --output_file result.jsonl --n_samples 16 \
  --branch 2 \
  --response_api_url "http://localhost:8000" \
  --score_api_url "http://localhost:8001/score"

# compute score
python test/compute_metric.py \
  --task aime24 --result_paths result.jsonl --N 2

Replace N with 2, 8, 32 to reproduce low, medium, high modes respectively.
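
If you want all three modes in one go, a small wrapper over the same command works (it assumes result.jsonl already contains enough sampled drafts per question for the largest N):

import subprocess

for n in (2, 8, 32):   # low, medium, high modes
    subprocess.run(
        ["python", "test/compute_metric.py",
         "--task", "aime24",
         "--result_paths", "result.jsonl",
         "--N", str(n)],
        check=True,
    )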


10. Frequently Asked Questions

Q1: I only have a 24 GB gaming card. Can I run the 32 B model?
Not for training. For inference, even 8-bit weights of the 32 B model are roughly 32 GB, so on a 24 GB card you will need 4-bit quantization or CPU offloading. Use the 1.5 B or 7 B variant for full-precision training.
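
For the quantized-inference route, here is a minimal sketch using Hugging Face transformers with bitsandbytes 4-bit loading; the model path is the same placeholder used in the commands above, and the repo may recommend a different serving setup:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "path/to/XBai-o4"   # placeholder, as in the commands above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",            # spills layers to CPU if the GPU runs out of room
)

inputs = tokenizer("Solve: 2x + 3 = 11", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=256)[0]))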

Q2: Do I need human labels for my own dataset?
No. Provide only the final answer (correct / wrong). The SPRM head learns step-level signals automatically.

Q3: How is this different from DeepSeek-R1?
R1 uses internal long chain-of-thought; MetaStone-S1 uses external parallel drafts + shared scorer. The latter gives better accuracy per parameter.

Q4: Can I plug MetaStone-S1 into an MCTS search?
Yes. The paper demonstrates a light-weight MCTS that uses SPRM scores as node values. Accuracy on AIME24 rises from 39.3 → 52.8 on the 1.5 B model.
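
For intuition, here is a rough sketch of how SPRM scores could act as node values in a UCT-style selection rule. This is only an illustration of the idea, not the paper's actual search procedure:

import math

class Node:
    def __init__(self, partial_solution, parent=None):
        self.partial_solution = partial_solution
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0          # accumulated SPRM scores

    def uct(self, c=1.0):
        """Standard UCT: exploit the average SPRM score, explore rarely visited nodes."""
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def backpropagate(node, sprm_score):
    """Push the SPRM score of a newly expanded step up the tree."""
    while node is not None:
        node.visits += 1
        node.value_sum += sprm_score
        node = node.parent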


11. Take-away Checklist

  • One network, two hats – no separate 70 B reward model
  • Label-free training – only correct / wrong answers needed
  • Outperforms o3-mini-medium on math & Chinese tasks
  • Runs on a single GPU at 1.5 B / 7 B sizes
  • Apache-2.0 license – commercial use allowed


Citation

If you use this work, please cite:

@misc{wang2025testtimescalingreflectivegenerative,
  title={Test-Time Scaling with Reflective Generative Model},
  author={Zixiao Wang and Yuxin Wang and Xiaorui Wang and Mengting Xing and Jie Gao and Jianjun Xu and Guangcan Liu and Chenhui Jin and Zhuo Wang and Shengzhuo Zhang and Hongtao Xie},
  year={2025},
  eprint={2507.01951},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.01951}
}