How a 7-Billion-Parameter Model Cracked Olympiad Programming: Inside Microsoft’s rStar-Coder


In May 2025, a research team quietly released a data set that changed the conversation around small language models (SLMs) and competitive programming. Named rStar-Coder, the project delivers 418 000 verified competition-grade code problems and 580 000 step-by-step reasoning solutions. When the team fine-tuned the modest Qwen2.5-Coder-7B on this data, the model leapt from 17.4 % to 57.3 % on LiveCodeBench, while the 14 B variant reached 62.5 %—surpassing OpenAI o3-mini (low) and nearly matching QWQ-32B, the 32-billion-parameter powerhouse that generated the training rationales in the first place.

This article explains—without marketing fluff—how the authors built the data set, why it works, and how you can reproduce or extend their pipeline. Everything below is drawn directly from the arXiv paper and the accompanying repository; no external facts have been added.


The Challenge: Verified Data at Scale

Competitive programming is an acid test for code reasoning. A correct submission must:

  1. Compile and run without crashes.
  2. Pass hidden test cases that stress both correctness and time/memory limits.
  3. Handle edge cases that are rarely disclosed in public statements.

Traditional data sets fall short:

Source | Size | Test Coverage | Limitation
--- | --- | --- | ---
Codeforces public archive | 50 k+ problems | 2–3 sample cases | Missing large or adversarial inputs
WizardCoder / Magicoder | 400 k+ prompts | Usually none | Focus on basic function completion
HumanEval | 164 tasks | 10 unit tests | Too small for deep reasoning

The rStar-Coder team set out to synthesize new problems that are both difficult and verifiable, then pair each problem with diverse, reliable test cases.


Three Core Building Blocks

1. Curated Seed Problems → Synthetic Variants

Step 1: Harvest
The authors collected 57 215 competition problems from:

  • Codeforces
  • AtCoder
  • CodeChef
  • LeetCode
  • USA Computing Olympiad (USACO)
  • International Olympiad in Informatics (IOI)

After de-duplication and filtering out tasks without reference solutions, 37 754 high-quality seeds remained.

Step 2: Rewrite
Instead of prompting a large model with “generate a new problem”, the authors provided both the original statement and its official solution. The prompt instructed the model to:

  • Extract the key algorithmic insight.
  • Create a fresh narrative that tests the same skill.
  • Specify new constraints to keep the task solvable yet non-trivial.

This produced 1.57 M candidate problems; roughly 380 k survived later filtering—about ten verified synthetic problems for every seed.

Example transformation
Original: Minimum spanning tree on an undirected graph.
Rewritten: A city wants to connect neighborhoods with roads, but each road has both cost and environmental impact. Minimize total cost while respecting an environmental budget.
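
To make the rewriting step concrete, here is a minimal sketch of how such a prompt could be assembled. The wording and the build_rewrite_prompt helper are illustrative assumptions, not the exact template from the paper.

def build_rewrite_prompt(statement: str, solution: str) -> str:
    """Pair a seed problem with its official solution and ask for a rewritten variant."""
    return (
        "You are given a competitive programming problem and its reference solution.\n"
        "1. Extract the key algorithmic insight.\n"
        "2. Write a brand-new problem statement that tests the same skill with a fresh narrative.\n"
        "3. Specify input constraints that keep the task solvable but non-trivial.\n\n"
        f"Problem statement:\n{statement}\n\nReference solution:\n{solution}\n"
    )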

2. Reliable Test-Case Generation

Generating inputs is easy; generating correct outputs without an oracle is not. rStar-Coder decouples the process into three clear stages:

Stage | Purpose | Tooling
--- | --- | ---
1 | Produce valid, scalable inputs | GPT-4o writes two utility scripts
2 | Control difficulty | Logarithmic scale sampling (1 → 100 000)
3 | Validate outputs | Mutual-verification majority vote

Stage 1: Utility Scripts

For every problem, the model outputs:

  • generate_test_input(scale)—uses the open-source CYaRon library to emit a properly formatted input string.
  • validate_test_input(raw_string)—parses and checks constraints, returning True or False (a driver combining the two is sketched below).
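
The paper does not show the orchestration code, but a minimal driver under those assumptions could simply retry generation until the validator accepts an input (make_input is an illustrative name; both utility functions appear in the Reproduction Guide below):

def make_input(scale, max_tries=5):
    """Ask the GPT-4o-written generator for an input, re-checking each candidate with the validator."""
    for _ in range(max_tries):
        raw = generate_test_input(scale)
        if raw is not None and validate_test_input(raw):
            return raw
    return None  # no valid input at this scale; skip it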

Stage 2: Scale Sampling

Input sizes are chosen from
{1,2,…,9} ∪ {10⁰,10¹,…,10⁵}
to guarantee both tiny edge cases and full-scale stress tests.
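
As a quick illustration (a sketch, not code from the repository), the sampled scale set can be built in one line:

# Edge-case sizes 1-9 plus powers of ten up to 10**5
SCALES = sorted(set(range(1, 10)) | {10**k for k in range(6)})
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 1000, 10000, 100000]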

Stage 3: Output Labelling via Mutual Verification

Because synthetic problems have no ground-truth solution, the team:

  1. Samples 16 long-chain-of-thought solutions from a frontier model (QWQ-32B).
  2. Runs each solution on ≥50 inputs.
  3. Keeps an input-output pair only if ≥60 % of solutions agree on the same output for that input.
  4. Keeps the solution if it passes all retained inputs.

The vote acts as a natural filter: incorrect solutions tend to diverge, whereas correct ones converge.
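
A minimal sketch of this voting rule, assuming every solution's outputs have already been collected (mutual_verify and its signature are illustrative, not the repository's API):

from collections import Counter

def mutual_verify(outputs, threshold=0.6):
    """outputs[i][j] is the output of solution i on input j (None if it crashed or timed out)."""
    n_sol, n_inp = len(outputs), len(outputs[0])
    labels = {}
    for j in range(n_inp):
        votes = Counter(outputs[i][j] for i in range(n_sol) if outputs[i][j] is not None)
        if votes and votes.most_common(1)[0][1] >= threshold * n_sol:
            labels[j] = votes.most_common(1)[0][0]      # keep input j, labelled by the majority
    kept = [i for i in range(n_sol) if all(outputs[i][j] == lab for j, lab in labels.items())]
    return labels, kept                                 # retained input labels and solution indices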

Empirical reliability
Across 3 150 sampled inputs, mutual verification achieved 96.8 % accuracy compared with 12.7 % when GPT-4o wrote outputs directly.

3. Augmenting Expert Problems with Rich Reasoning

Even the 37 k seed problems often ship with concise reference code—perfectly correct but lacking the reflective reasoning that teaches models how to think. The authors:

  • Re-generated 16 chain-of-thought solutions per seed problem.
  • Validated each solution against the newly created, large-scale test suite.
  • Retained only solutions that passed all tests.
  • For problems where QWQ-32B repeatedly failed, they kept all traces to preserve partial insights (this retention rule is sketched below).
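
In code, that retention rule is roughly the following (a sketch assuming per-test pass/fail results are already available; select_traces is an illustrative name):

def select_traces(traces, results):
    """results[i][j] is True if chain-of-thought trace i passes test j."""
    fully_verified = [t for t, r in zip(traces, results) if all(r)]
    # If no trace passes every test, keep them all to preserve partial insights
    return fully_verified if fully_verified else list(traces)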

The final mix is:

  • 37 754 seed problems × 1–16 verified solutions
  • 380 560 synthetic problems × 1 fastest solution (by CPU time)

De-duplication against public benchmarks (HumanEval, MBPP, LiveCodeBench, USACO 2025) removed any 16-gram overlap, ensuring fair evaluation.
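
A minimal sketch of that 16-gram decontamination check (whitespace tokenization and the function names are simplifying assumptions):

def ngrams(text, n=16):
    """All contiguous n-grams of whitespace-separated tokens."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlaps_benchmark(problem_text, benchmark_texts, n=16):
    """True if the problem shares any 16-gram with a benchmark item and must be dropped."""
    grams = ngrams(problem_text, n)
    return any(grams & ngrams(b, n) for b in benchmark_texts)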


Training Recipe

Hyper-parameter | Value
--- | ---
Base model | Qwen2.5-Coder-Instruct (1.5 B / 7 B / 14 B)
Epochs | 6
Global batch size | 96
Max sequence length | 16 384 tokens
Optimizer | AdamW, learning rate 4e-5, cosine decay
Hardware | 8× or 32× AMD MI300X GPUs
Framework | FlashAttention-2 + DeepSpeed ZeRO-0

No reinforcement learning or distillation was used—pure supervised fine-tuning on the curated data.
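
The exact record schema is not spelled out in the paper; a plausible minimal training example for this kind of supervised fine-tuning (field names are assumptions) pairs one problem statement with one verified long chain-of-thought solution:

import json

# Hypothetical structure of one rstar_train.jsonl record (field names assumed)
record = {
    "problem": "A city wants to connect neighborhoods with roads ...",
    "response": "<step-by-step reasoning>\n\n# final verified Python solution\n...",
}
print(json.dumps(record))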


Evaluation Highlights

LiveCodeBench v5 (2024-08 to 2025-02)

Model | Parameters | Pass@1
--- | --- | ---
GPT-4o | undisclosed | 30.0 %
o3-mini (low) | undisclosed | 59.4 %
QWQ-32B | 32 B | 63.4 %
rStar-Coder-14B | 14 B | 62.5 %
rStar-Coder-7B | 7 B | 57.3 %
rStar-Coder-1.5B | 1.5 B | 40.1 %

USA Computing Olympiad 2025

Model | Bronze | Silver | Gold | Platinum | Avg
--- | --- | --- | --- | --- | ---
OpenAI o3 | 72.9 % | 28.1 % | 27.1 % | 0 % | 32.0 %
DeepSeek-R1 | 58.3 % | 22.9 % | 6.2 % | 0 % | 21.9 %
QWQ-32B | 43.8 % | 12.5 % | 6.3 % | 0 % | 15.6 %
rStar-Coder-7B | 47.9 % | 4.2 % | 12.5 % | 0 % | 16.2 %
rStar-Coder-14B | 47.9 % | 12.5 % | 8.3 % | 0 % | 17.2 %

While absolute scores look low, remember that USACO 2025 Platinum problems are designed for high-school national-team selection; even frontier models rarely solve them.


Ablation Studies

Problem Diversity vs. Solution Count

Fixing 37 k seed problems and varying the number of long-CoT solutions per problem:

Unique Problems | Solutions per Problem | Total Examples | LiveCodeBench | USACO
--- | --- | --- | --- | ---
37 754 | 1 | 37 754 | 40.8 % | 0.52 %
37 754 | 8 | 302 032 | 51.1 % | 3.64 %
37 754 | 16 | 603 064 | 54.7 % | 10.4 %
418 314 (mixed) | ≈1.4 | 580 000 | 57.3 % | 16.2 %

Take-away: Increasing problem coverage yields better returns than merely multiplying solutions for the same tasks.

Input Generation Quality

Direct GPT-4o prompting produced inputs clustered at small scales (10⁰–10²).
The three-step pipeline shifted the distribution up to 10⁵, improving final model performance across easy, medium, and hard LiveCodeBench splits.


Reproduction Guide (Minimal Set-Up)

The official repository is fully open-source. Below is a condensed walk-through for an Ubuntu 22.04 + A100 80 GB node.

1. Environment

# Python 3.11
conda create -y -n rstar python=3.11
conda activate rstar

# Core requirements
pip install --upgrade pip
pip install -r requirements.txt          # from the repo
pip install flash-attn --no-build-isolation  # optional speed-up

2. Evaluation Toolkit

git clone https://github.com/MARIO-Math-Reasoning/MARIO_EVAL.git
cd MARIO_EVAL && pip install -e . && cd ..

3. Generate New Problems (Optional)

MODEL=deepseek-ai/DeepSeek-Coder-V2-Instruct
CFG=config/sample_mcts.yaml
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
    --qaf seed_problems.jsonl \
    --custom_cfg $CFG \
    --model_dir $MODEL

4. Build Test Cases for a Single Problem

# generate_test_input.py  (auto-created by GPT-4o for a "read n, then n integers" problem)
import cyaron as cy

def generate_test_input(n):
    if not (1 <= n <= 100000):
        return None
    # Vector.random returns a list of 1-element vectors; flatten them to plain ints
    a = [v[0] for v in cy.Vector.random(n, [(1, 10**9)])]
    return f"{n}\n{' '.join(map(str, a))}"

# validate_test_input.py
def validate_test_input(s):
    try:
        lines = s.strip().split('\n')
        if len(lines) != 2:
            return False
        n = int(lines[0])
        nums = list(map(int, lines[1].split()))
    except (AttributeError, ValueError):
        return False
    return 1 <= n <= 100000 and len(nums) == n and all(1 <= x <= 10**9 for x in nums)

# Emit one input per sampled scale (tiny edge cases through full-scale stress tests)
import os
os.makedirs("tests", exist_ok=True)
for scale in [1, 10, 100, 1000, 10000, 100000]:
    inp = generate_test_input(scale)
    if inp and validate_test_input(inp):
        open(f"tests/{scale}.in", "w").write(inp)

5. Mutual Verification Script (Simplified)

import subprocess
from collections import Counter

def run_code(source, stdin_text, timeout=10):
    """Run one candidate solution on an input; return its stdout, or None on failure."""
    try:
        p = subprocess.run(["python3", "-c", source], input=stdin_text,
                           capture_output=True, text=True, timeout=timeout)
        return p.stdout.strip() if p.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None

inputs = open("tests/100.in").read()
solutions = [open(f"sol_{i}.py").read() for i in range(16)]
freq = Counter(o for o in (run_code(s, inputs) for s in solutions) if o is not None)
# Accept the majority output only if >=60% of the 16 solutions agree (>=10 votes)
if freq and freq.most_common(1)[0][1] >= 10:
    open("tests/100.out", "w").write(freq.most_common(1)[0][0])

6. Fine-Tune (7 B Example)

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nproc_per_node=8 train/train_SFT.py \
  --model_name_or_path Qwen/Qwen2.5-Coder-7B-Instruct \
  --data_path rstar_train.jsonl \
  --output_dir ./rstar-coder-7b \
  --num_train_epochs 6 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 4e-5 \
  --bf16 True \
  --fsdp "full_shard auto_wrap"

Training with flash-attention completes in roughly 20 hours on 8× A100 80 GB; note that the command above shards with FSDP, whereas the paper's own runs used DeepSpeed ZeRO-0 (see the Training Recipe table).


Current Limitations

  • Compute cost: Both problem synthesis and test-case generation rely on frontier models (GPT-4o, QWQ-32B).
  • Implicit constraints: Some competition tasks imply limits via story context; LLMs occasionally miss them.
  • Scalability: Roughly 30 % of synthesized problems are discarded after mutual verification, adding overhead.

Future releases plan to incorporate additional platforms (AtCoder, TopCoder) and refine constraint parsing.


Broader Impact

By lowering the barrier to high-quality, verified code-reasoning data, rStar-Coder enables:

  • Universities to train course-specific models on smaller budgets.
  • Contest preparation platforms to auto-generate fresh practice sets.
  • Research groups to study reasoning without multi-billion-parameter overheads.

As with any large generative system, users should monitor for hallucinations or insecure code suggestions.


Key Take-aways for Practitioners

  1. Data quality trumps model size. A carefully filtered 580 k-example set propelled a 7 B model past 32 B baselines.
  2. Mutual verification scales. Majority voting across multiple model outputs provides reliable labels without ground-truth oracles.
  3. Problem diversity is king. Expanding the set of unique problems outperforms endlessly re-rolling solutions on the same tasks.

The data set, the code, and the paper are all openly available. Use them, extend them, and show that smaller models can still play in the big leagues.