A conversation starter
“Can a model small enough to fit on four gaming GPUs beat the latest 120-billion-parameter heavyweights at high-school math competitions?”
The Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) just proved the answer is ‘yes’.
Below is a fully transparent walk-through of their K2-Think recipe (data, code, training budget, safety filters and all), rewritten for junior-college graduates and busy engineers who simply want facts, numbers and reproducible steps.


1. Thirty-second summary

  • Base model: Qwen2.5-32B (completely open weights)
  • Post-training data: one open-source set, 92 k problems with automatically checkable answers
  • Training stages: long-chain supervised fine-tuning → verifiable-reward RL → simple test-time tricks
  • Hardware for inference: Cerebras Wafer-Scale Engine (WSE), 2 000 tokens/s per user
  • Result: AIME 2024/2025, HMMT25 and Omni-MATH-HARD micro-average 67.99, higher than GPT-OSS-120B (67.20) and DeepSeek-V3.1-671B (64.43)

No secret data, no proprietary hardware lock-in, no million-dollar price tag. The entire stack is on GitHub and ready to clone.


2. Why should you care?

Pain point | K2-Think fix
Big models are slow and expensive | 32 B params + WSE = 10× speed, 1/4 energy
RL needs human labels | Verifiable rewards (right/wrong = 1/0), no preference labeling
Long prompts hurt latency | Speculative decoding + on-chip weights remove the bottleneck
Safety worries | Open-sourced safety report + red-team scores included

If you build chatbots, tutoring apps, or coding assistants, this is a reference design you can actually afford to host.


3. Six pillars at a glance

  1. Long-chain-of-thought supervised fine-tuning (SFT)
  2. Reinforcement Learning with Verifiable Rewards (RLVR)
  3. Plan-Before-You-Think prompt scaffolding
  4. Best-of-N sampling at serve time
  5. Speculative decoding for cheaper long outputs
  6. Cerebras WSE deployment (optional but blazing fast)

Sections 4-9 show the exact commands and config files; no steps are skipped.


4. Data: the only corpus you need

Dataset name: a-m-team/AM-Thinking-v1-Distilled
Commit hash: 3697c1829816a2b8d4d25995ed6d5d27ffb49b30
Domains: Math, Code, Science, Logic, Tabular, Simulation
Size: 92 k prompts, each paired with a final answer and a chain-of-thought trace

All answers are numerical or short strings, so a 10-line Python script can act as the reward function—no human preference labeling required.
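A minimal sketch of such a grader (illustrative only, not the exact script used in the paper): normalize both strings, then compare.

# Sketch of the auto-grading idea: exact match after light normalisation.
# Not the grader shipped with K2-Think; adjust normalisation to your answer formats.
def normalize(ans: str) -> str:
    return ans.strip().lower().replace(",", "").rstrip(".")

def reward(predicted: str, reference: str) -> int:
    return int(normalize(predicted) == normalize(reference))

print(reward(" 337 ", "337"))   # 1
print(reward("x = 2", "2"))     # 0 -- formatting mismatches still count as wrong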

4.1 Quick download & format

git clone https://github.com/MBZUAI-IFM/K2-Think-SFT.git && cd K2-Think-SFT
pip install -r requirements.txt          # LLaMA-Factory underneath
python get_am_dataset.py \
    --revision 3697c1829816a2b8d4d25995ed6d5d27ffb49b30 \
    data/AM-Thinking-3697c18.parquet

The script merges six unevenly sized subsets and registers "AM-Thinking-3697c18" in data/dataset_info.json. If you skip this step, load_dataset() throws a schema error.
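Before training, a quick sanity check with the Hugging Face datasets library confirms the parquet landed intact (the column names are whatever get_am_dataset.py wrote, so print them rather than assuming):

# Sanity-check the downloaded parquet; inspect the column names instead of guessing them.
from datasets import load_dataset

ds = load_dataset("parquet", data_files="data/AM-Thinking-3697c18.parquet", split="train")
print(len(ds))              # expect roughly 92 k rows
print(ds.column_names)      # prompt, chain-of-thought and final-answer fields
print(ds[0])                # eyeball one full example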


5. Chat template: teach the model to “think out loud”

File: src/llamafactory/data/template.py (already patched)

Special tokens

  • <|im_start|> – start of a message
  • <|im_end|> – end of a message

System prompt (literally copied)

You are a helpful assistant. To answer the user’s question, you first think about the reasoning process and then provide the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.

Example tokenised turn

<|im_start|>system
You are a helpful assistant. To answer ...<|im_end|>
<|im_start|>user
AIME 2025 Problem: ...<|im_end|>
<|im_start|>assistant
<think> Let me first factor the quadratic ...</think>
<answer> 337 </answer><|im_end|>

The model learns to output thousands of tokens inside <think> before giving the final <answer>.
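At evaluation time you need to pull the final answer back out of that format. A small Python helper (a sketch; the paper's own parsing may differ) does it with two regexes:

import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def split_turn(completion: str):
    """Return (reasoning, answer) from a completion that follows the template."""
    think = THINK_RE.search(completion)
    answer = ANSWER_RE.search(completion)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else completion.strip()  # fall back to raw text
    return reasoning, final

_, ans = split_turn("<think> Factor the quadratic ... </think> <answer> 337 </answer>")
print(ans)   # 337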


6. Supervised fine-tuning config

YAML file: examples/train_full/Qwen2.5-32B-base-AM-Thinking-v1-Distilled-3697c18.yaml

Key | Value | Comment
model_name_or_path | your/path/Qwen2.5-32B | Must point to the base (not instruct) version
max_length | 32 768 | Anything shorter truncates long proofs
per_device_batch_size | 2 | One 80 GB GPU holds exactly two 32 k sequences
gradient_accumulation_steps | 1 | Global batch = nodes × GPUs × 2 = 512
learning_rate | 1×10⁻⁴ | Cosine schedule, 5 % warmup
num_epochs | 2 | Beyond that the AIME score plateaus (Figure 2 of the paper)

Slurm launcher example (32 nodes, 8 GPUs each)

#!/bin/bash
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=1        # one torchrun launcher per node; it spawns the 8 GPU workers itself
#SBATCH --gpus-per-node=8
conda activate k2
# The rendezvous wiring below is one common pattern for multi-node torchrun under Slurm;
# adapt it to however your cluster exposes the head-node address.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun torchrun --nnodes 32 --nproc_per_node 8 \
    --rdzv_backend c10d --rdzv_endpoint "$MASTER_ADDR:29500" \
    src/train.py examples/train_full/Qwen2.5-32B-base-AM-Thinking-v1-Distilled-3697c18.yaml

Change the paths inside the YAML and the bash script to match your cluster. On first run the framework tokenises everything to tokenized_path; expect 30 min of preprocessing.
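A quick back-of-the-envelope, using only the numbers above (92 k examples, global batch 512, 2 epochs), shows how short the schedule actually is:

# Optimizer steps implied by the SFT config above.
examples, global_batch, epochs = 92_000, 512, 2
steps = examples * epochs // global_batch
print(steps)                  # ~359 optimizer steps in total
print(round(0.05 * steps))    # ~18 warmup steps at 5 %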


7. Reinforcement-learning follow-up

Observations reported in the paper (no external guesswork):

  • RLVR from the base model → +40 % on AIME 2024 within 2 500 steps
  • RLVR from the SFT checkpoint → +5 % with early saturation

Lesson: if you want the larger RL gains, start RL from the base model, not from the already-strong SFT checkpoint.

Dataset for RL: Guru-92k (a separate 92 k-prompt corpus covering the same six domains)
Reward: exact-match binary reward, 0 or 1
Algorithm: GRPO implemented in the open-source verl library
Code drop: https://github.com/MBZUAI-IFM/K2-Think-RL
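The RL code itself lives in verl; the snippet below only illustrates GRPO's core idea with the 0/1 reward (it is not verl's API): sample a group of completions per prompt and normalize each reward against the group's mean and standard deviation, so no value network is needed.

# Group-relative advantages as used by GRPO, shown with binary rewards.
import statistics

def group_advantages(rewards, eps=1e-6):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Eight sampled solutions to one prompt, graded by the exact-match reward (three correct).
print(group_advantages([1, 0, 0, 1, 0, 0, 1, 0]))
# Correct samples get positive advantages, wrong ones negative;
# the policy gradient then upweights the reasoning traces that led to correct answers.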


8. Test-time tricks: plan first, sample three times

8.1 Plan-Before-You-Think

An external lightweight LLM (could even be 7 B) is prompted to:

  1. Extract key concepts
  2. Write a high-level solution sketch

That sketch is prepended to the user prompt before K2-Think generates.
Effect: average output shrinks 6–12 %, accuracy rises 3–4 %.
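A sketch of the scaffold in Python; generate() is a placeholder for whatever inference client you use (vLLM, an OpenAI-compatible endpoint, the k2think.ai API), and the planner prompt wording is illustrative, not the paper's exact prompt.

def generate(model: str, prompt: str) -> str:
    """Placeholder: wire this up to your own inference stack."""
    raise NotImplementedError

PLANNER_PROMPT = (
    "List the key concepts in the problem below, then write a short high-level "
    "solution plan. Do not solve the problem.\n\nProblem:\n{problem}"
)

def plan_then_solve(problem: str, planner: str = "small-7b-model", solver: str = "K2-Think") -> str:
    plan = generate(planner, PLANNER_PROMPT.format(problem=problem))
    scaffolded = f"High-level plan:\n{plan}\n\nNow solve the problem step by step:\n{problem}"
    return generate(solver, scaffolded)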

8.2 Best-of-3 sampling

  • Temperature = 1.0 → diverse answers
  • Generate three independent chains
  • Ask a verifier LLM to compare the candidates pairwise; the winner of the final comparison is the answer (see the sketch below)

Cost: 16 s×3 = 48 s on WSE, still real-time.
Gain: +4–6 percentage points across all four math benchmarks.
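The selection logic is a tiny knockout tournament. In the sketch below, generate() and judge_better() are placeholders for your own model calls; only the control flow is the point.

def generate(model: str, prompt: str, temperature: float = 1.0) -> str:
    """Placeholder: sample one full solution from the solver."""
    raise NotImplementedError

def judge_better(verifier: str, problem: str, a: str, b: str) -> str:
    """Placeholder: ask the verifier LLM which of two candidates is better; return it."""
    raise NotImplementedError

def best_of_three(problem: str, solver: str = "K2-Think", verifier: str = "verifier-llm") -> str:
    candidates = [generate(solver, problem, temperature=1.0) for _ in range(3)]
    winner = judge_better(verifier, problem, candidates[0], candidates[1])
    return judge_better(verifier, problem, winner, candidates[2])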


9. Serving: from three minutes to sixteen seconds

Typical 32 k-token generation

Hardware | Time | Tokens/s
NVIDIA H100 | ~160 s | ~200
Cerebras WSE | ~16 s | ~2 000

Why so fast?

  • All 32 B weights stay in on-chip memory
  • Memory bandwidth 25 PB/s (versus 8 TB/s for H100)
  • Speculative decoding adds another ~2× factor
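A rough estimate (assuming 16-bit weights, ignoring the KV cache, and using the bandwidth figures quoted above) shows why keeping weights on-chip matters: every generated token has to stream the full parameter set past the compute units.

# Back-of-the-envelope decode-speed ceiling from memory bandwidth alone.
params = 32e9
bytes_per_token = params * 2          # ~64 GB of weights read per generated token (bf16)

h100_bw = 8e12                        # TB/s figure quoted in the text
wse_bw = 25e15                        # PB/s figure quoted in the text

print(h100_bw / bytes_per_token)      # ~125 tokens/s ceiling per device
print(wse_bw / bytes_per_token)       # ~390,000 tokens/s -- bandwidth stops being the limit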

Interactive “show your work” tutoring becomes practical.


10. Head-to-head numbers (average of 16 runs)

Benchmark | K2-Think 32B | GPT-OSS-120B | DeepSeek-V3.1-671B | GPT-5-high
AIME 2024 | 90.8 % | 89.6 % | 91.9 % | 94.8 %
AIME 2025 | 81.2 % | 84.6 % | 82.5 % | 92.2 %
HMMT25 | 73.8 % | 81.9 % | 83.5 % | 91.8 %
Omni-MATH-HARD | 60.7 % | 57.8 % | 53.2 % | 73.6 %
Micro-average | 68.0 % | 67.2 % | 64.4 % | 80.2 %

On the micro-average and the hardest Omni-MATH problems, K2-Think edges out open models 4–21× its size while running roughly 10× faster in production.


11. Safety check: red-team scorecard

Test sets: Do-Not-Answer, HarmBench, PhysicalSafety, SimpleSafetyTests, ToxiGen, CoNA, HarmfulQ, DialogueSafety, HH-RLHF, DICES350, PersonalInfoLeak, CyberattackAssistance, PromptExtractionRobustness plus nine jail-break templates.
Sampling: 100 prompts per set.

Consolidated Safety-4 score

Aspect | Macro-average
High-risk content refusal | 0.83
Conversational robustness | 0.89
Cyber-security & privacy | 0.56
Jail-break resistance | 0.72
Overall | 0.75

Main gaps: cyber-attack assistance (0.47) and prompt extraction (0.35). The team will ship extra safety filters before the public API graduates from beta.
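As a sanity check, the Overall figure in the table above matches the unweighted mean of the four aspect scores:

# (0.83 + 0.89 + 0.56 + 0.72) / 4 = 0.75
aspects = [0.83, 0.89, 0.56, 0.72]
print(round(sum(aspects) / len(aspects), 2))   # 0.75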


12. One-page recap (print or pin)

  • Data: 92 k problems, auto-graded, already on Hugging Face
  • SFT: 2 epochs, 32 k length, cosine 1e-4, batch 512
  • RL: start from base, not SFT, for +40 % math gain
  • Serve: plan → generate 3× → pick best; WSE gives 2 k tokens/s
  • Result: 68 % on AIME/HMMT/Omni-HARD, tops all open-source, tiny budget
  • Safety: 0.75 composite, weaknesses documented, fixes underway

13. Frequently-asked questions

Q1: I only have four RTX 4090s. Can I still train?
A: Yes. Reduce the per-GPU batch to 1 and enable gradient checkpointing. Training stretches from about 3 days to roughly 10 days, but the paper reports less than a 1 % score drop.

Q2: Is the Cerebras hardware mandatory?
A: No. The weights run on PyTorch + CUDA. WSE is one deployment option—albeit the fastest currently public.

Q3: How much does the entire pipeline cost in the cloud?
A: 32 nodes × 8 × A100-80GB for 60 hours ≈ 15k USD at AWS on-demand prices. Inference on WSE is free during the beta API.

Q4: Which base models besides Qwen2.5-32B were tried?
A: Llama-3-70B and Qwen3-235B were tested; both converged slower under the same 32 k-length recipe. Qwen2.5-32B gave the best accuracy-per-dollar.

Q5: Will a 7B version be released?
A: The authors mention a follow-up distillation project. Stay tuned to their GitHub org page.


14. Ready-to-run commands (copy-paste)

# 1. Clone
git clone https://github.com/MBZUAI-IFM/K2-Think-SFT.git && cd K2-Think-SFT

# 2. Env
conda create -n k2 python=3.10 && conda activate k2
pip install -r requirements.txt

# 3. Data
python get_am_dataset.py \
    --revision 3697c1829816a2b8d4d25995ed6d5d27ffb49b30 \
    data/AM-Thinking-3697c18.parquet

# 4. Edit paths in YAML (model folder, deepspeed config, output dir)

# 5. Train (single-node example)
torchrun --nproc_per_node=8 src/train.py \
    examples/train_full/Qwen2.5-32B-base-AM-Thinking-v1-Distilled-3697c18.yaml

# 6. RL stage (optional but recommended)
git clone https://github.com/MBZUAI-IFM/K2-Think-RL.git
cd K2-Think-RL && bash scripts/run_grpo_qwen32b.sh

# 7. Local inference (no Cerebras)
python inference.py --model_path $OUTPUT_DIR --prompt "AIME 2025 Problem..."

15. Where to grab the artefacts

Item | Link
SFT & data scripts | https://github.com/MBZUAI-IFM/K2-Think-SFT
RLVR scripts | https://github.com/MBZUAI-IFM/K2-Think-RL
Inference Docker | https://github.com/MBZUAI-IFM/K2-Think-Inference
Weights on HF | https://huggingface.co/LLM360/K2-Think
Demo & API | https://k2think.ai

16. Citation (bibtex)

@misc{k2think2025,
  title={{K2-Think}: A Parameter-Efficient Reasoning System},
  author={Cheng, Zhoujun and Fan, Richard and Hao, Shibo and Kilian, Taylor W. and Li, Haonan and Sun, Suqi and Ren, Hector and Moreno, Alexander and Zhang, Daqian and Zhong, Tianjun and Xiong, Yuxin and Hu, Yuanzhe and Xie, Yutao and Han, Xudong and Wang, Yuqi and Pimpalkhute, Varad and Zhuang, Yonghao and Singh, Aaryamonvikram and Liang, Xuezhi and Xie, Anze and She, Jianshu and Fan, Desai and Gao, Chengqian and Ma, Liqun and Yurochkin, Mikhail and Maggs, John and Ma, Xuezhe and He, Guowei and Hu, Zhiting and Liu, Zhengzhong and Xing, Eric P.},
  howpublished={arXiv:2509.07604},
  year={2025}
}

Bottom line: K2-Think delivers frontier-class competition-math performance from a 32 B open model that a modest multi-GPU server can host, with 16-second interactive latency on the Cerebras WSE. Grab the data, run the scripts, and join the growing crowd proving that smart post-training beats blind scale.