Teaching Models to Correct Themselves: A Complete Guide to On-Policy Distillation
What is the cheapest way to make a small language model as good as a big one at narrow tasks?
Let the small model generate its own answers, then let the big model grade every single token in real time. On-policy distillation does exactly this—online, dense, and 5-30× cheaper than RL.
Table of Contents
- Why Post-Training Needs a Third Way
- Algorithm in One Breath
- Math Reasoning: 60 % → 70 % with 1/10 the GPU Hours
- Company Assistant: Add Private Knowledge, Then Get Chat Skills Back for Free
- Author’s Reflection: Four Traps We Fell Into
- Action Checklist & One-Page Overview
- FAQ
1. Why Post-Training Needs a Third Way
Core question: If supervised fine-tuning (SFT) and reinforcement learning (RL) already exist, why invent on-policy distillation?
| Method | Data Source | Reward Density | Key Pain Point |
|---|---|---|---|
| SFT (off-policy) | teacher trajectories | high per-token | compounding error when student leaves teacher manifold |
| RL (on-policy) | student roll-outs | 1 bit per answer | sample-inefficient, credit-assignment nightmare |
| On-policy distillation | student roll-outs | high per-token | needs teacher log-probs, but GPU-parallelisable |
SFT teaches the student to copy the teacher, but only inside states the teacher has seen.
RL lets the student learn from its own mistakes, but only tells it “right/wrong” after the whole episode.
On-policy distillation samples trajectories from the student and immediately supplies a dense teacher signal for every token—combining the best of both worlds.
Application vignette:
Imagine teaching a 4-billion-parameter model to solve AMC math problems. With RL, you generate 256 roll-outs, check the final answer, and back-propagate a single reward. The model knows “21” is wrong but has no idea which of the 200 tokens was the first mistake. With on-policy distillation, the 32B teacher gives a negative score to the exact token where the student first mis-applied the quadratic formula—like a tutor circling the wrong line in red pen.
2. Algorithm in One Breath
Core question: What exactly do I need to change in my training script?
2.1 Loss = reverse KL per token
reverse_kl[t] = log π_θ(x_t | x_<t) − log π_teacher(x_t | x_<t)
loss = mean(reverse_kl)
- A discount factor of zero (each token's advantage uses only its own immediate reverse KL, with no credit flowing to later tokens) worked best in our runs.
- The teacher only ever runs a forward pass; the student does all the sampling.
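A minimal PyTorch sketch of the per-token quantity defined above; the tensor names are ours, and both inputs are log-probabilities gathered at the tokens the student actually sampled:

```python
import torch

def per_token_reverse_kl(student_logp: torch.Tensor,
                         teacher_logp: torch.Tensor) -> torch.Tensor:
    """Single-sample estimate of KL(student || teacher) at each sampled token.

    Both tensors have shape (batch, seq_len) and hold log π(x_t | x_<t) for the
    student's own samples; teacher_logp is computed without gradients.
    """
    return student_logp - teacher_logp

# With a discount factor of zero, each token's advantage is just its own
# negative reverse KL, and the scalar reported as the training loss is the plain mean:
# loss = per_token_reverse_kl(student_logp, teacher_logp).mean()
```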
2.2 Four lines of pseudocode (Tinker API shown)
```python
# One training step: the student has already sampled `student_rollout`.
teacher_logp = teacher.compute_logprobs(student_rollout)   # teacher forward pass only, no backward
student_logp = student_rollout.logprobs                    # recorded at sampling time
advantage = -(student_logp - teacher_logp)                 # per-token negative reverse KL
trainer.update(advantage, loss_fn="importance_sampling")   # reuse the RL framework's update
```
Application vignette:
We started from an RL script that already computed student log-probs for a KL penalty against a frozen reference model. We literally swapped the reference-model path to the teacher checkpoint and replaced the KL penalty with the reverse KL against the teacher. The rest of the hyperparameters (batch size 64, 4 roll-outs per prompt, LoRA rank 128) stayed identical. The training loss (mean reverse KL) fell to near zero within 150 steps.
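For readers not using the Tinker API, here is one plausible expansion of loss_fn="importance_sampling" in plain PyTorch. The function name, its arguments, and the ratio clamp are our own illustrative choices, not something prescribed by the pseudocode above:

```python
import torch

def importance_sampling_distill_loss(
    student_logp_new: torch.Tensor,   # log π_θ(x_t) under the current parameters (requires grad)
    student_logp_old: torch.Tensor,   # log π_θ(x_t) recorded when the rollout was sampled (no grad)
    teacher_logp: torch.Tensor,       # log π_teacher(x_t) on the same sampled tokens (no grad)
    max_ratio: float = 5.0,           # illustrative clamp to keep stale-sample updates bounded
) -> torch.Tensor:
    # Per-token advantage: the negative reverse KL, treated as a fixed reward signal.
    advantage = (teacher_logp - student_logp_old).detach()
    # Importance weight corrects for the sampler lagging one update behind the learner.
    ratio = torch.exp(student_logp_new - student_logp_old.detach()).clamp(max=max_ratio)
    # REINFORCE-style surrogate: raise the probability of tokens the teacher scores well.
    return -(ratio * advantage).mean()
```

With fresh rollouts the ratio is 1 and the update reduces to weighting each sampled token's log-probability by its advantage.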
3. Math Reasoning: 60 % → 70 % with 1/10 the GPU Hours
Core question: How much compute does it really take to push an 8B model from 60 % to 70 % on AIME’24?
3.1 Setup
- Student: Qwen3-8B-Base
- Teacher: Qwen3-32B
- Starting checkpoint: SFT on 400 k prompts → 60 % on AIME’24
| Route | Samples / Compute Used | Teacher FLOPs | Student FLOPs | AIME’24 | Relative Cost |
|---|---|---|---|---|---|
| More SFT (2 M prompts) | ~2 M samples | 3.4×10²¹ | 1.5×10²¹ | 70 % | 1× |
| RL (Qwen3 report) | 17 920 GPU hours | — | — | 68 % | ~1× |
| On-policy distillation | 77 k prompts × 4 roll-outs | 8.4×10¹⁹ | 8.2×10¹⁹ | 70 % | 0.11× |
3.2 Learning curve snapshot
- SFT follows log-linear scaling: cheap at first, brutally expensive later.
- Distillation saturates in fewer than 150 steps; the reverse KL drops below 0.002.
Application vignette:
Our cluster queue was full, so we ran the teacher on 8×A100-40G and the student update on 4×A100-80G. Because teacher log-probs are embarrassingly parallel, we finished the 150-step experiment overnight instead of the usual week-long RL window.
4. Company Assistant: Add Private Knowledge, Then Get Chat Skills Back for Free
Core question: After I fine-tune on internal documents, IF-eval drops. Must I rerun costly RL?
4.1 Task definition
- Knowledge eval: 43 internal QA pairs (completely unseen during pre-training).
- Behaviour eval: IF-eval (541 format-following instructions).
4.2 Mid-training outcome (70 % docs + 30 % chat SFT)
- Knowledge: up from 18 % to 36 %
- IF-eval: down from 85 % to 79 %, and still falling
4.3 Self-distillation rescue
Use the original pre-SFT weights as the teacher and run one more round of on-policy distillation on open-domain chat prompts (a minimal sketch of this rescue loop follows the table below):
- IF-eval recovers to 83 % (2 points below the original 85 %).
- Knowledge does not regress; it actually rises from 36 % to 41 %.
| Stage | Internal QA | IF-eval | Notes |
|---|---|---|---|
| Base Qwen3-8B | 18 % | 85 % | zero private knowledge |
| +SFT (70 % docs) | 36 % | 79 % | knowledge up, format down |
| +distill (self) | 41 % | 83 % | format back, knowledge kept |
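A minimal sketch of the rescue loop; sample, token_logprobs, chat_prompt_loader, and model_before_sft are placeholder names of ours, and the real run used the RL framework from Section 2 rather than this bare loop:

```python
import copy
import torch

# Frozen teacher = the checkpoint *before* the document SFT; student = the SFT'd model.
teacher = copy.deepcopy(model_before_sft).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

for prompts in chat_prompt_loader:                         # open-domain chat prompts, no labels needed
    rollouts = sample(student, prompts, temperature=0.3)   # student generates its own answers
    with torch.no_grad():
        t_logp = token_logprobs(teacher, rollouts)         # teacher scores the student's own tokens
    s_logp = token_logprobs(student, rollouts)             # same tokens, current student parameters
    advantage = (t_logp - s_logp.detach())                 # per-token negative reverse KL (no grad)
    loss = -(advantage * s_logp).mean()                    # Section 2.2 update with a fresh sampler (ratio = 1)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```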
Application vignette:
Legal team pushed 3 k new clauses on a Friday. We injected them via SFT over the weekend, saw IF-eval slip Monday morning, and ran self-distillation during lunch break. The assistant passed both the knowledge regression test and the format regression test before the coffee cooled.
5. Author’s Reflection: Four Traps We Fell Into
- LoRA rank too small: rank 8 loses 15 % AIME vs full fine-tuning; rank 32 loses only 6 % and is still 3× faster.
- Teacher as a real-time bottleneck: we now cache teacher log-probs to disk and replay them across GPUs, a 3× end-to-end speed-up (see the sketch after this list).
- Over-fearing single-prompt training: distillation targets the distribution, not exact answers; 20 epochs on one prompt copied the teacher's style without memorising a single solution.
- Zero-temperature sampling: identical trajectories give zero gradient variance; bumping the temperature to 0.3 stabilised convergence.
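For the second trap, a minimal sketch of the cache-and-replay idea, assuming a token_logprobs helper and a cache directory of our own naming:

```python
import os
import torch

CACHE_DIR = "teacher_logp_cache"        # illustrative location, one file per rollout
os.makedirs(CACHE_DIR, exist_ok=True)

def teacher_logprobs_cached(rollout_id: str, tokens: torch.Tensor) -> torch.Tensor:
    """Run the teacher forward pass once per rollout, then replay the result from disk."""
    path = os.path.join(CACHE_DIR, f"{rollout_id}.pt")
    if os.path.exists(path):
        return torch.load(path)                     # replay: no teacher GPU time at all
    with torch.no_grad():
        logp = token_logprobs(teacher, tokens)      # placeholder helper: forward pass only
    torch.save(logp.cpu(), path)
    return logp
```

Because every rollout is scored exactly once, the teacher can sit on a separate set of GPUs and stream its results back, which is what made the overnight run in Section 3 possible.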
6. Action Checklist & One-Page Overview
Implementation Steps
- Pick a student checkpoint that at least produces legal JSON/format.
- Ensure the teacher exposes compute_logprobs(batch_of_tokens); no backward pass is needed.
- Sampling loop: temperature 0.3, 2–8 roll-outs per prompt, store token-level log-probs.
- Training loop: importance sampling, advantage = −(student_logp − teacher_logp), LR 5e-7, LoRA rank ≥ 32 (a config sketch follows this list).
- Stop when the mean reverse KL drops below 0.002 or the downstream metric plateaus.
- For continual learning: alternate SFT (new knowledge) → self-distillation (old behaviour).
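The checklist's hyperparameters collected into one place, as a sketch; the dataclass and field names are ours:

```python
from dataclasses import dataclass

@dataclass
class DistillConfig:
    # Values mirror the checklist above; treat them as starting points, not laws.
    temperature: float = 0.3          # non-zero so rollouts differ and gradients have variance
    rollouts_per_prompt: int = 4      # 2-8 worked for us; more rollouts smooth the per-prompt signal
    learning_rate: float = 5e-7
    lora_rank: int = 32               # rank 8 cost us 15 % AIME; stay at 32 or above
    batch_size: int = 64              # prompt batch size from the Section 2 run
    stop_reverse_kl: float = 0.002    # stop once the mean reverse KL falls below this
```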
One-Page Overview
| Aspect | Summary |
|---|---|
| Essence | Student rolls out → teacher scores each token → minimise reverse KL. |
| Benefit | On-policy error correction + dense per-token reward; 5–30× cheaper than RL. |
| Speed | 150 steps, 77 k prompts, ~1 800 GPU-hours to gain +10 % AIME. |
| Continual Learning | Self-distillation restores IF-eval after SFT; no human labels, no RL. |
| Code Change | ~4 lines; swap the reward model for teacher log-probs inside any RL framework. |
7. FAQ
- Reverse vs forward KL: why prefer reverse? Reverse KL is mode-seeking: the student locks onto the teacher's best mode instead of spreading mass over every low-probability region (both objectives are written out after this list).
- Does the teacher have to be larger? No. The original weights can teach the SFT-damaged version of the same model, even at identical size.
- Can I discard the RL framework entirely? Yes; just backpropagate through reverse_kl.mean(). But RL scripts already offer logging, checkpointing, and importance sampling for free.
- Will the student copy teacher biases? Yes: systematically wrong teacher tokens will transfer. Use ensemble teachers or a later RL polishing pass if this is critical.
- Is on-policy distillation safe for multi-epoch training? Empirically yes. Because the objective is distribution matching, not exact-answer cloning, we observed no memorisation even after 20 epochs on a single prompt.
- Minimum GPU memory for an 8B student plus a 32B teacher? Roughly 60 GB if both are loaded together; off-load the teacher to CPU or pre-compute its log-probs to cut the requirement to about 24 GB.
- Any special tokenisers needed? Both models must share a vocabulary; if not, remap the logits with a simple alignment matrix before computing the KL.
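For reference, the two divergences from the first FAQ item written out (standard definitions, using the same π_θ / π_teacher notation as Section 2):

```latex
% Forward KL: expectation under the teacher; the student must cover every region the teacher visits.
D_{\mathrm{KL}}\!\left(\pi_{\text{teacher}} \,\|\, \pi_\theta\right)
  = \mathbb{E}_{x \sim \pi_{\text{teacher}}}\!\left[\log \pi_{\text{teacher}}(x) - \log \pi_\theta(x)\right]

% Reverse KL: expectation under the student; the student is free to commit to the teacher's best mode.
D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{teacher}}\right)
  = \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \pi_\theta(x) - \log \pi_{\text{teacher}}(x)\right]
```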

