Preventing RLHF Training Crashes in Large Language Models

1 month ago 高效码农

Why RL for Large Language Models Keeps Crashing — and the 7 Engineering Tweaks That Finally Made a 30B MoE Stable After 300k GPU Hours

What makes policy-gradient RL for LLMs explode, and how do we stop it? Token-level objectives are only a first-order approximation of the true sequence reward. When the training–inference gap or policy staleness grows, the approximation breaks down. Importance sampling, clipping, and Routing Replay keep the two gaps small and training stable.

0. One-glance cheat-sheet

| Scenario | Must-have knobs | Typical failure signal | Proven combo in paper |
|---|---|---|---|
| Pure on-policy (N=1) | Importance Sampling (IS) | KL(μ‖π) ↑, entropy ↓ | MiniRL w/ … |
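The excerpt's core stabilizers, importance sampling plus clipping, can be sketched as a standard PPO-style per-token loss. This is a minimal illustration, not the paper's implementation; the function name and `eps` default are assumptions:

```python
import math

def clipped_pg_loss(logp_new, logp_old, advantage, eps=0.2):
    """Per-token clipped policy-gradient loss (PPO-style sketch).

    The importance ratio pi_new / pi_old corrects for policy staleness;
    clipping the ratio to [1 - eps, 1 + eps] bounds the update when the
    training-inference gap widens, which is what keeps training stable.
    """
    ratio = math.exp(logp_new - logp_old)                 # importance weight
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # Pessimistic objective: maximize min(unclipped, clipped),
    # so return its negation as the loss.
    return -min(unclipped, clipped)
```

When the new policy matches the sampling policy (`logp_new == logp_old`), the ratio is 1 and the loss reduces to the plain advantage; once the ratio drifts past `1 + eps`, the clipped branch caps the gradient's magnitude.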

AI Transparency Breakthrough: How OpenAI’s Confession Method Makes Models Honest

1 month ago 高效码农

Keeping AI Honest: How OpenAI’s “Confession” Method Works and Why It Matters

Keywords: large language model honesty, Confession training, reward hacking, AI transparency, hallucination detection, scheming behavior, reinforcement learning safety

TL;DR: OpenAI’s latest proof-of-concept adds a second output—called a Confession—that asks the model to list every instruction it was given, judge whether it followed each one, and admit any shortcuts or rule-breaking. The confession score is kept completely separate from the main-answer reward, so the model is free to own up without penalty. In small-scale trials the trick already cuts “false negatives” (misbehavior that stays hidden) to ≈ 4% …
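The key design point in the excerpt is that the confession channel rewards accurate self-reporting, decoupled from the main-answer reward. A hypothetical grader along those lines (the function name, data shapes, and scoring rule are all illustrative assumptions, not OpenAI's API) might look like:

```python
def confession_score(confession, instructions):
    """Hypothetical confession grader sketch.

    `confession` is the model's self-report: a list of dicts, each naming
    an instruction and whether the model claims to have followed it.
    `instructions` maps each instruction to whether it was *actually*
    followed (as judged externally). The score rewards honest reporting:
    credit is earned for correctly admitting a violation just as much as
    for correctly reporting compliance, so confessing is never penalized.
    """
    listed = {c["instruction"]: c["followed"] for c in confession}
    correct = sum(
        1
        for name, actually_followed in instructions.items()
        if name in listed and listed[name] == actually_followed
    )
    return correct / len(instructions) if instructions else 1.0
```

Because this score never feeds into the main-answer reward, a model that broke a rule maximizes the confession channel by admitting it, which is exactly the incentive the excerpt describes.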