How a 7-Billion-Parameter Model Cracked Olympiad Programming: Inside Microsoft’s rStar-Coder
In May 2025, a research team quietly released a data set that changed the conversation around small language models (SLMs) and competitive programming. Named rStar-Coder, the project delivers 418 000 verified competition-grade code problems and 580 000 step-by-step reasoning solutions. Fine-tuned on this data, the modest Qwen2.5-Coder-7B reached 57.3 % on LiveCodeBench and the 14 B variant climbed from 23 % to 62.5 %, outperforming OpenAI o3-mini (low); on USACO 2025 the 7 B model even edged past QWQ-32B, the 32-billion-parameter powerhouse that generated the training rationales in the first place.
This article explains—without marketing fluff—how the authors built the data set, why it works, and how you can reproduce or extend their pipeline. Everything below is drawn directly from the arXiv paper and the accompanying repository; no external facts have been added.
The Challenge: Verified Data at Scale
Competitive programming is an acid test for code reasoning. A correct submission must:
- Compile and run without crashes.
- Pass hidden test cases that stress both correctness and time/memory limits.
- Handle edge cases that are rarely disclosed in public statements.
Traditional data sets fall short:
Source | Size | Test Coverage | Limitation |
---|---|---|---|
Codeforces public archive | 50 k+ problems | 2-3 sample cases | Missing large or adversarial inputs |
WizardCoder/Magicoder | 400 k+ prompts | Usually none | Focus on basic function completion |
HumanEval | 164 tasks | 10 unit tests | Too small for deep reasoning |
The rStar-Coder team set out to synthesize new problems that are both difficult and verifiable, then pair each problem with diverse, reliable test cases.
Three Core Building Blocks
1. Curated Seed Problems → Synthetic Variants
Step 1: Harvest
The authors collected 57 215 competition problems from:
- Codeforces
- AtCoder
- CodeChef
- LeetCode
- USA Computing Olympiad (USACO)
- International Olympiad in Informatics (IOI)
After de-duplication and filtering out tasks without reference solutions, 37 754 high-quality seeds remained.
Step 2: Rewrite
Instead of prompting a large model with “generate a new problem”, the authors provided both the original statement and its official solution. The prompt instructed the model to:
- Extract the key algorithmic insight.
- Create a fresh narrative that tests the same skill.
- Specify new constraints to keep the task solvable yet non-trivial.
This produced 1.57 M candidate problems. Roughly 380 k survived later filtering, about a quarter of the candidates and roughly ten accepted synthetic problems per seed.
Example transformation
Original: Minimum spanning tree on an undirected graph.
Rewritten: A city wants to connect neighborhoods with roads, but each road has both cost and environmental impact. Minimize total cost while respecting an environmental budget.
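The exact rewrite prompt is not published. The sketch below shows one way such a request could be assembled; the template wording and the seed record fields (statement, solution) are illustrative assumptions, not the authors' prompt.
# Hypothetical rewrite-prompt builder; wording and field names are assumptions.
REWRITE_TEMPLATE = """You are given a competitive programming problem and its official solution.
1. Extract the key algorithmic insight behind the solution.
2. Write a brand-new problem statement with a fresh narrative that tests the same skill.
3. Specify new input constraints that keep the task solvable yet non-trivial.

Original problem:
{statement}

Official solution:
{solution}
"""

def build_rewrite_prompt(seed):
    # `seed` is assumed to be a dict holding the original statement and reference solution.
    return REWRITE_TEMPLATE.format(statement=seed["statement"], solution=seed["solution"])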
2. Reliable Test-Case Generation
Generating inputs is easy; generating correct outputs without an oracle is not. rStar-Coder decouples the process into three clear stages:
Stage | Purpose | Tooling |
---|---|---|
1 | Produce valid, scalable inputs | GPT-4o writes two utility scripts |
2 | Control difficulty | Logarithmic scale sampling (1 → 100 000) |
3 | Validate outputs | Mutual-verification majority vote |
Stage 1: Utility Scripts
For every problem, the model outputs:
- generate_test_input(scale): uses the open-source CYaRon library to emit a properly formatted input string.
- validate_test_input(raw_string): parses and checks constraints, returning True or False.
Stage 2: Scale Sampling
Input sizes are chosen from {1, 2, …, 9} ∪ {10⁰, 10¹, …, 10⁵} to guarantee both tiny edge cases and full-scale stress tests.
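As a quick sketch (not the authors' code), that scale set can be enumerated and sampled in a few lines of Python:
import random

# Small explicit sizes for edge cases plus powers of ten up to 1e5 for stress tests.
SCALES = sorted(set(range(1, 10)) | {10**k for k in range(6)})
# -> [1, 2, ..., 9, 10, 100, 1000, 10000, 100000]

def sample_scale():
    return random.choice(SCALES)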
Stage 3: Output Labelling via Mutual Verification
Because synthetic problems have no ground-truth solution, the team:
- Samples 16 long chain-of-thought solutions from a frontier model (QWQ-32B).
- Runs each solution on ≥50 inputs.
- Keeps an input-output pair only if ≥60 % of solutions agree on the same output for that input.
- Keeps a solution only if it passes all retained inputs.
The vote acts as a natural filter: incorrect solutions tend to diverge, whereas correct ones converge.
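A simplified sketch of this two-way filter is shown below; the run callable and the data structures are assumptions for illustration, not the authors' implementation.
from collections import Counter

def mutually_verify(solutions, test_inputs, run, agree_ratio=0.6):
    # Step 1: label each input with the majority output, keeping it only if
    # at least `agree_ratio` of the candidate solutions agree on that output.
    labelled_tests = []
    for test_input in test_inputs:
        outputs = [run(sol, test_input) for sol in solutions]
        majority, votes = Counter(outputs).most_common(1)[0]
        if votes >= agree_ratio * len(solutions):
            labelled_tests.append((test_input, majority))
    # Step 2: keep only the solutions that match the majority output on every retained input.
    verified = [sol for sol in solutions
                if all(run(sol, t) == out for t, out in labelled_tests)]
    return labelled_tests, verified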
Empirical reliability
Across 3 150 sampled inputs, mutual verification achieved 96.8 % accuracy compared with 12.7 % when GPT-4o wrote outputs directly.
3. Augmenting Expert Problems with Rich Reasoning
Even the 37 k seed problems often ship with concise reference code—perfectly correct but lacking the reflective reasoning that teaches models how to think. The authors:
- Re-generated 16 chain-of-thought solutions per seed problem.
- Validated each solution against the newly created, large-scale test suite.
- Retained only solutions that passed all tests.
- For problems where QWQ-32B repeatedly failed, kept all traces to preserve partial insights.
The final mix is:
- 37 754 seed problems × 1–16 verified solutions
- 380 560 synthetic problems × 1 fastest solution (by CPU time)
De-duplication against public benchmarks (HumanEval, MBPP, LiveCodeBench, USACO 2025) removed any problem sharing a 16-gram overlap with benchmark data, ensuring fair evaluation.
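A 16-gram overlap check of this kind is easy to sketch. The version below uses whitespace tokenization, which is an assumption; the paper does not specify the tokenizer.
def ngrams(text, n=16):
    tokens = text.split()  # whitespace tokenization is an assumption
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps_benchmark(problem, benchmark_problems, n=16):
    # Flag a problem if it shares any 16-gram with a benchmark problem.
    grams = ngrams(problem, n)
    return any(grams & ngrams(b, n) for b in benchmark_problems)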
Training Recipe
Hyper-parameter | Value |
---|---|
Base model | Qwen2.5-Coder-Instruct (1.5 B / 7 B / 14 B) |
Epochs | 6 |
Global batch | 96 |
Max length | 16 384 tokens |
Optimizer | AdamW, lr 4e-5, cosine decay |
Hardware | 8× or 32× MI300X AMD GPUs |
Framework | FlashAttention-2 + DeepSpeed ZeRO-0 |
No reinforcement learning or distillation was used—pure supervised fine-tuning on the curated data.
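For orientation, each supervised fine-tuning record pairs a problem statement with one verified long chain-of-thought solution. The JSONL sketch below uses illustrative field and file names, not the repository's actual schema.
import json

# Illustrative only: field names and file name are assumptions.
record = {
    "prompt": "Given an array of n integers, ...",        # problem statement
    "response": "Step-by-step reasoning ... final code",  # verified long-CoT solution
}
with open("sft_example.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")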
Evaluation Highlights
LiveCodeBench v5 (2024-08 to 2025-02)
Model | Parameters | Pass@1 |
---|---|---|
GPT-4o | unknown | 30.0 % |
o3-mini (low) | unknown | 59.4 % |
QWQ-32B | 32 B | 63.4 % |
rStar-Coder-14B | 14 B | 62.5 % |
rStar-Coder-7B | 7 B | 57.3 % |
rStar-Coder-1.5B | 1.5 B | 40.1 % |
USA Computing Olympiad 2025
Model | Bronze | Silver | Gold | Platinum | Avg |
---|---|---|---|---|---|
OpenAI o3 | 72.9 % | 28.1 % | 27.1 % | 0 % | 32.0 % |
DeepSeek-R1 | 58.3 % | 22.9 % | 6.2 % | 0 % | 21.9 % |
QWQ-32B | 43.8 % | 12.5 % | 6.3 % | 0 % | 15.6 % |
rStar-Coder-7B | 47.9 % | 4.2 % | 12.5 % | 0 % | 16.2 % |
rStar-Coder-14B | 47.9 % | 12.5 % | 8.3 % | 0 % | 17.2 % |
While absolute scores look low, remember that USACO 2025 Platinum problems are designed for high-school national-team selection; even frontier models rarely solve them.
Ablation Studies
Problem Diversity vs. Solution Count
Fixing 37 k seed problems and varying the number of long-CoT solutions per problem:
Unique Problems | Solutions per Problem | Total Examples | LiveCodeBench | USACO |
---|---|---|---|---|
37 754 | 1 | 37 754 | 40.8 % | 0.52 % |
37 754 | 8 | 302 032 | 51.1 % | 3.64 % |
37 754 | 16 | 603 064 | 54.7 % | 10.4 % |
418 314 (mixed) | ≈1.4 | 580 000 | 57.3 % | 16.2 % |
Take-away: Increasing problem coverage yields better returns than merely multiplying solutions for the same tasks.
Input Generation Quality
Direct GPT-4o prompting produced inputs clustered at small scales (10⁰–10²).
The three-step pipeline shifted the distribution up to 10⁵, improving final model performance across easy, medium, and hard LiveCodeBench splits.
Reproduction Guide (Minimal Set-Up)
The official repository is fully open-source. Below is a condensed walk-through for an Ubuntu 22.04 + A100 80 GB node.
1. Environment
# Python 3.11
conda create -y -n rstar python=3.11
conda activate rstar
# Core requirements
pip install --upgrade pip
pip install -r requirements.txt # from the repo
pip install flash-attn --no-build-isolation # optional speed-up
2. Evaluation Toolkit
git clone https://github.com/MARIO-Math-Reasoning/MARIO_EVAL.git
cd MARIO_EVAL && pip install -e . && cd ..
3. Generate New Problems (Optional)
MODEL=deepseek-ai/DeepSeek-Coder-V2-Instruct
CFG=config/sample_mcts.yaml
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
--qaf seed_problems.jsonl \
--custom_cfg $CFG \
--model_dir $MODEL
4. Build Test Cases for a Single Problem
# generate_test_input.py (auto-created by GPT-4o)
import cyaron as cy

def generate_test_input(n):
    # Reject scales outside the problem's constraints.
    if not (1 <= n <= 100000):
        return None
    # cy.Vector.random returns n one-element vectors for a single value range.
    a = cy.Vector.random(n, [(1, 10**9)])
    return f"{n}\n{' '.join(str(x[0]) for x in a)}"

# validate_test_input.py
def validate_test_input(s):
    lines = s.strip().split('\n')
    if len(lines) != 2:
        return False
    n = int(lines[0])
    nums = list(map(int, lines[1].split()))
    return len(nums) == n and all(1 <= x <= 10**9 for x in nums)

# Emit test inputs at the scales described in Stage 2.
for scale in [1, 10, 100, 1000, 10000, 100000]:
    inp = generate_test_input(scale)
    if inp is not None and validate_test_input(inp):
        with open(f"tests/{scale}.in", "w") as f:
            f.write(inp)
5. Mutual Verification Script (Minimal Example)
import subprocess, sys
from collections import Counter

def run_code(solution_path, input_text):
    # Execute one candidate solution on the test input and capture its stdout.
    proc = subprocess.run([sys.executable, solution_path], input=input_text,
                          capture_output=True, text=True, timeout=30)
    return proc.stdout.strip()

test_input = open("tests/100.in").read()
outputs = [run_code(f"sol_{i}.py", test_input) for i in range(16)]

# Accept the majority output only if >=60 % of the 16 solutions agree (>=10 votes).
answer, votes = Counter(outputs).most_common(1)[0]
if votes >= 10:
    open("tests/100.out", "w").write(answer)
6. Fine-Tune (7 B Example)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nproc_per_node=8 train/train_SFT.py \
--model_name_or_path Qwen/Qwen2.5-Coder-7B-Instruct \
--data_path rstar_train.jsonl \
--output_dir ./rstar-coder-7b \
--num_train_epochs 6 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 4e-5 \
--bf16 True \
--fsdp "full_shard auto_wrap"
Training with flash-attention and ZeRO-0 completes in ~20 hours on 8×A100 80 GB.
Current Limitations
- Compute cost: Both problem synthesis and test-case generation rely on frontier models (GPT-4o, QWQ-32B).
- Implicit constraints: Some competition tasks imply limits via story context; LLMs occasionally miss them.
- Scalability: Roughly 30 % of synthesized problems are discarded after mutual verification, adding overhead.
Future releases plan to incorporate additional platforms (AtCoder, TopCoder) and refine constraint parsing.
Broader Impact
By lowering the barrier to high-quality, verified code-reasoning data, rStar-Coder enables:
- Universities to train course-specific models on smaller budgets.
- Contest preparation platforms to auto-generate fresh practice sets.
- Research groups to study reasoning without multi-billion-parameter overheads.
As with any large generative system, users should monitor for hallucinations or insecure code suggestions.
Key Take-aways for Practitioners
- Data quality trumps model size. A carefully filtered 580 k-example set propelled a 7 B model past 32 B baselines.
- Mutual verification scales. Majority voting across multiple model outputs provides reliable labels without ground-truth oracles.
- Problem diversity is king. Expanding the set of unique problems outperforms endlessly re-rolling solutions on the same tasks.
References and Links
- Paper: rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset
- Repository & Data: https://github.com/microsoft/rStar
- Evaluation Toolkit: MARIO_EVAL (https://github.com/MARIO-Math-Reasoning/MARIO_EVAL)
Use them, extend them, and show that smaller models can still play in the big leagues.