On-Policy Distillation: The Cheap Way to Supercharge Small Language Models

1 month ago 高效码农

Teaching Models to Correct Themselves: A Complete Guide to On-Policy Distillation

What is the cheapest way to make a small language model as good as a big one at narrow tasks? Let the small model generate its own answers, then let the big model grade every single token in real time. On-policy distillation does exactly this: online, dense, and 5-30× cheaper than RL.

Table of Contents

- Why Post-Training Needs a Third Way
- Algorithm in One Breath
- Math Reasoning: 60% → 70% with 1/10 the GPU Hours
- Company Assistant: Add Private Knowledge, Then Get Chat Skills Back for Free
- Author’s …
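The excerpt's one-line description already contains the whole loop: the student samples a rollout, and the teacher grades every token of it. Below is a minimal sketch of that loop, assuming Hugging Face-style causal LMs named `student` and `teacher` that share a tokenizer; the function name, sampling settings, and per-token reverse-KL objective are illustrative assumptions, not the article's exact recipe.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student, teacher, prompt_ids, max_new_tokens=128):
    prompt_len = prompt_ids.shape[1]

    # 1. On-policy: the student samples its own continuation.
    with torch.no_grad():
        rollout = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                   do_sample=True)

    # 2. Dense: both models score every generated position of that rollout.
    #    Logits at position t predict the token at position t + 1.
    student_logits = student(rollout).logits[:, prompt_len - 1:-1]
    with torch.no_grad():
        teacher_logits = teacher(rollout).logits[:, prompt_len - 1:-1]

    # 3. Per-token reverse KL(student || teacher): the "grade" for each token
    #    the student actually produced.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    per_token_kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)

    # Minimising the mean KL pulls the student toward the teacher only on
    # states the student itself visits.
    return per_token_kl.mean()
```

In a real pipeline the per-token KL would also be masked over padding and fed to an optimizer; the sketch only shows why the signal is both online (scored on the student's own samples) and dense (one value per generated token).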

GRPO Reinforcement Learning: Boost LLM Reasoning Accuracy 23.5% with Single-GPU Training

6 months ago 高效码农

Mastering GRPO Reinforcement Learning: Train Your LLM to Reason Like DeepSeek Using Unsloth

Executive Summary: Key Findings

- Reasoning breakthrough: GRPO increased math reasoning accuracy by 23.5% on the GSM8K benchmark
- Hardware democratization: Unsloth + TRL enables single-GPU training of 14B models, reducing costs by 87% vs. traditional PPO
- Critical insight: 1B models hit reasoning ceilings (PSLE accuracy <20%)
- Reward function synergy: format + partial correctness > single accuracy reward (+41% convergence speed); see the sketch below
- Training risk: incorrect KL penalties trigger reward collapse (observed 17.3% performance degradation)
- Industry shift: federated learning solves data silos (Flower AI trials underway)

The Reasoning Revolution: Why GRPO Changes Everything

The …
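The "format + partial correctness" bullet is the part most readers will want to reproduce, so here is a minimal sketch of such a blended reward, assuming a `<think>…</think><answer>…</answer>` output format and a 0.2/0.8 weighting; the tag format, weights, and tolerances are illustrative assumptions, not the article's exact settings.

```python
import re

def format_reward(completion: str) -> float:
    # Reward well-formed outputs: reasoning in <think>...</think>,
    # final result in <answer>...</answer>.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0

def partial_correctness_reward(completion: str, gold: str) -> float:
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if not match:
        return 0.0
    pred = match.group(1).strip()
    try:
        # Full credit for a numerically correct final answer...
        return 1.0 if abs(float(pred) - float(gold)) < 1e-6 else 0.0
    except ValueError:
        # ...partial credit if the gold answer at least appears in the text.
        return 0.5 if gold.strip() and gold.strip() in pred else 0.0

def combined_reward(completion: str, gold: str) -> float:
    # Blend a cheap format signal with graded correctness instead of
    # relying on a single 0/1 accuracy reward.
    return 0.2 * format_reward(completion) + 0.8 * partial_correctness_reward(completion, gold)

if __name__ == "__main__":
    print(combined_reward("<think>18 - 7 = 11</think><answer>11</answer>", "11"))  # 1.0
```

In an Unsloth/TRL GRPO run these functions would be wrapped to match the trainer's reward-function interface; the design point is simply that a cheap format signal plus graded correctness gives a smoother learning signal than a single 0/1 accuracy reward.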