On-Policy Distillation: The Cheap Way to Supercharge Small Language Models

高效码农

Teaching Models to Correct Themselves: A Complete Guide to On-Policy Distillation

What is the cheapest way to make a small language model as good as a big one at narrow tasks? Let the small model generate its own answers, then let the big model grade every single token in real time. On-policy distillation does exactly this: online, dense, and 5-30× cheaper than RL. (A minimal code sketch of the idea follows the table of contents.)

Table of Contents

Why Post-Training Needs a Third Way
Algorithm in One Breath
Math Reasoning: 60% → 70% with 1/10 the GPU Hours
Company Assistant: Add Private Knowledge, Then Get Chat Skills Back for Free
Author's …
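To make the "student generates, teacher grades every token" loop concrete, here is a minimal sketch of one on-policy distillation training step. It assumes Hugging Face transformers and PyTorch; the model names, prompt, learning rate, and the choice of full-vocabulary per-token reverse KL as the dense loss are illustrative assumptions, not the article's exact recipe.

```python
# Minimal sketch of one on-policy distillation step (illustrative assumptions,
# not the article's exact recipe): the student samples its own completion,
# the teacher grades every token of it.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-0.5B-Instruct"  # small model being trained (placeholder)
teacher_name = "Qwen/Qwen2.5-7B-Instruct"    # big model doing the grading (placeholder)

tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompt = "Compute 17 * 24. Show your reasoning."
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

# 1) On-policy: the STUDENT samples its own answer (no gradient through sampling).
with torch.no_grad():
    sequence = student.generate(**inputs, max_new_tokens=64, do_sample=True)

# 2) Dense grading: both models score every position of the sampled sequence.
student_logits = student(sequence).logits[:, :-1]   # position t predicts token t+1
with torch.no_grad():
    teacher_logits = teacher(sequence).logits[:, :-1]

student_logp = F.log_softmax(student_logits, dim=-1)
teacher_logp = F.log_softmax(teacher_logits, dim=-1)

# 3) Per-token reverse KL(student || teacher): a graded signal at EVERY token,
#    in contrast to a single sequence-level reward in RL.
reverse_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)

# Train only on the completion the student generated, not the prompt.
loss = reverse_kl[:, prompt_len - 1:].mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice one would batch prompts, mask padding, and possibly approximate the teacher's distribution with its top-k logits, but the shape of the signal is the point: every sampled token gets its own per-token loss, which is what makes the method dense and cheap relative to RL.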