Why RL for Large Language Models Keeps Crashing — and the 7 Engineering Tweaks That Finally Made a 30B MoE Stable After 300k GPU Hours
What makes policy-gradient RL for LLMs explode, and how do we stop it?
Token-level objectives are only a first-order approximation of the true sequence reward. When the training-inference gap or policy staleness grows, the approximation breaks. Importance sampling, clipping and Routing Replay keep the two gaps small and training stable.
0. One-glance cheat-sheet
- Training-inference logit gap (different kernels, FP8 vs BF16, MoE routing) → token-level importance sampling (Knob #1)
- Policy staleness from mini-batch re-use → PPO-style clipping with ε_low ≈ 0.2, ε_high ≈ 0.27 (Knob #2)
- MoE expert-routing drift → Routing Replay: R2 for N≤2, R3 for N≥4 (Knob #3)
- Health checks: keep entropy above ~0.25 and KL(μ‖π_old) below ~0.01; once training is stable, the cold-start choice barely matters
1. Why is the “real” sequence-level objective so hard to optimise?
Core question: If we want to maximise the expected reward of full responses, why not plug the sequence probability into REINFORCE and be done?
Short answer: The sequence likelihood ratio π_θ(y|x)/μ_old(y|x) has gigantic range and variance; gradient noise becomes unusable.
The paper shows that even with 1024-sample baselines, the gradient estimator’s SNR drops below 1 after a few hundred steps. In practice the training curve starts oscillating or collapses almost immediately unless corrective terms are added.
Author reflection: I once thought “just increase batch size” would drown the variance. We burned 10k A100 hours before admitting defeat — variance scales with the square of the range, not batch size.
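A minimal numerical sketch of that reflection (pure NumPy, made-up drift numbers): even a small per-token drift compounds multiplicatively over a long sequence, so the sequence-level ratio spans orders of magnitude, while batch averaging only buys a linear 1/B variance reduction.

import numpy as np

rng = np.random.default_rng(0)
seq_len, batch = 2048, 1024

# hypothetical per-token log-ratio drift: mean 0, std 0.02
delta = rng.normal(0.0, 0.02, size=(batch, seq_len))

# sequence-level ratio = product of token ratios = exp(sum of per-token log-ratios)
seq_ratio = np.exp(delta.sum(axis=1))

print(f"ratio range across the batch: {seq_ratio.min():.3g} .. {seq_ratio.max():.3g}")
print(f"ratio variance:               {seq_ratio.var():.3g}")
# averaging over the batch divides estimator variance by B (linear),
# but the ratio's spread above grows exponentially with sequence length
print(f"variance of the batch mean:   {seq_ratio.var() / batch:.3g}")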
2. Token-level surrogate — a first-order approximation that sometimes works
Core question: How can a sum over token logits approximate the whole-sequence ratio?
Short answer: Taylor-expand the product of token ratios; drop second-order terms and you get Σ_t δ_t. This is valid only if every δ_t (the per-token logit drift) is tiny.
The paper derives:
π_θ(y|x)/μ_old(y|x) ≈ 1 + Σ_t [π_θ(y_t|…)/μ_old(y_t|…) – 1]
so the gradient of the sequence objective collapses to the gradient of the token objective when θ≈θ_old and training logits ≈ inference logits.
Two gaps can still blow this up:
- training-inference discrepancy: different kernels, numerical precision, MoE routing tables
- policy staleness: the rollout policy is older than the policy you are updating
Keep both gaps small and the surrogate stays unbiased; let either grow and the noise floor rises exponentially.
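A quick numerical check of that expansion (NumPy, hypothetical per-token drifts): while every δ_t is tiny, the exact product of token ratios and the first-order form 1 + Σ_t δ_t agree closely; as the drift grows, the two diverge and the surrogate degrades.

import numpy as np

def compare(drift_std, seq_len=512, seed=0):
    rng = np.random.default_rng(seed)
    # delta_t = pi_theta(y_t|..)/mu_old(y_t|..) - 1 : per-token ratio drift (made up)
    delta = rng.normal(0.0, drift_std, size=seq_len)
    exact = np.prod(1.0 + delta)       # true sequence-level ratio
    first_order = 1.0 + delta.sum()    # what the token-level surrogate implicitly uses
    return exact, first_order

for drift_std in (1e-4, 1e-3, 1e-2):
    exact, approx = compare(drift_std)
    print(f"drift={drift_std:.0e}  exact={exact:.4f}  first-order={approx:.4f}")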
3. Knob #1 — keep the training-inference gap tiny with token-level Importance Sampling
Core question: I already use the same weights for train and rollout — why do I still see KL spikes?
Short answer: Different software stacks (Megatron vs vLLM, FP8 vs BF16, MoE routing kernels) give different logits even with identical parameters. Token IS corrects that drift.
Example from the paper:
FP8 inference engine + BF16 training engine → KL(μ‖π_old) ≈ 0.008 at step 0. Without IS correction, the first parameter update sees an effective learning-rate 3× larger than intended and entropy dives; training never recovers. With IS, KL stays <0.001 and loss curve stays smooth.
Code snippet (PyTorch-like pseudocode; train_engine and infer_engine are placeholders for the training and rollout stacks):

# per-token log-probs of the sampled tokens under each engine
logp_train = train_engine.log_probs(tokens)   # BF16 training stack
logp_roll  = infer_engine.log_probs(tokens)   # FP8 rollout stack

# token-level IS weight pi_theta / mu_old; detach so it only re-weights the REINFORCE term
ratio = (logp_train - logp_roll).detach().exp()
loss  = -(ratio * advantage * logp_train).mean()
Reflection: We initially disabled IS to save one all-reduce; the run collapsed at step 312. Rebooting cost more than the saved 2% compute — classic penny-wise, pound-foolish.
4. Knob #2 — fight policy staleness with PPO-style clipping
Core question: My global batch is huge; I split it into mini-batches to reuse rollouts. Entropy still tanks — why?
Short answer: Later mini-batches are updated against stale rollouts. Clipping the token ratio stops aggressive steps and keeps the approximation valid.
The paper tests N = 2/4/8 mini-batches. Without clipping, entropy falls below 0.15 before step 1500 for N=4; with clipping (ε_low=0.2, ε_high=0.27) the same run reaches 4000 steps and keeps entropy ≈0.35.
Practical tip:
Use decoupled-PPO clipping: decide per token whether to clip, but still use the original (not clipped) ratio for the IS weight. This keeps gradient unbiased while capping the worst-case drift.
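A minimal PyTorch-style sketch of that tip (the tensor inputs are hypothetical placeholders; ε values taken from the numbers above): the clip decision is made on the per-token staleness ratio π_θ/π_old, while the train-inference IS weight from Knob #1 stays unclipped.

import torch

def decoupled_clip_loss(logp_new, logp_old, logp_roll, advantages,
                        eps_low=0.2, eps_high=0.27):
    """Sketch: PPO-style clipping with a decoupled, unclipped IS weight.

    logp_new  : per-token log-probs under the current policy pi_theta (training engine)
    logp_old  : per-token log-probs under the rollout-time policy pi_old (training engine)
    logp_roll : per-token log-probs actually used for sampling, mu_old (inference engine)
    """
    # staleness ratio pi_theta / pi_old: this is what gets clipped, per token
    ratio = (logp_new - logp_old).exp()
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high)

    # train-inference IS weight pi_old / mu_old: left unclipped and detached
    is_weight = (logp_old - logp_roll).exp().detach()

    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -(is_weight * surrogate).mean()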
5. Knob #3 — MoE expert routing makes both gaps worse; fix it with Routing Replay
Core question: My dense model trains fine; once I switch to MoE the same code collapses — what changed?
Short answer: Expert choices now differ between engines and between old/new params, adding routing discrepancy on top of logit discrepancy. Routing Replay freezes the expert pattern during the update, turning the MoE into a “virtual dense” model for that step.
Two flavours:
- Vanilla Routing Replay (R2): reuse the training-engine expert mask e^π_old
- Rollout Routing Replay (R3): reuse the inference-engine expert mask e^μ_old
R2 leaves the first mini-batch’s target policy intact; R3 does not.
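A conceptual sketch of what “replaying” a mask means, assuming a hypothetical top-k MoE layer that can accept a pre-recorded expert-index tensor (capacity limits and load-balancing terms are omitted):

import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, replay_indices=None, top_k=2):
    """Hypothetical top-k MoE layer with optional Routing Replay.

    x              : [tokens, hidden]
    replay_indices : [tokens, top_k] expert ids recorded earlier
                     (R2: recorded by the training engine under pi_old,
                      R3: recorded by the inference engine under mu_old);
                     None means route normally with the current router.
    """
    logits = router(x)                                  # [tokens, n_experts]
    if replay_indices is None:
        topk_idx = logits.topk(top_k, dim=-1).indices   # fresh routing decision
    else:
        topk_idx = replay_indices                       # frozen ("replayed") routing
    # gate weights still come from the *current* router logits;
    # only the expert selection is frozen during the update
    gates = F.softmax(logits.gather(-1, topk_idx), dim=-1)
    out = torch.zeros_like(x)
    for k in range(top_k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, k] == e
            if mask.any():
                out[mask] += gates[mask, k].unsqueeze(-1) * expert(x[mask])
    return out, topk_idx

During rollout you store topk_idx alongside the tokens; during the update you pass it back as replay_indices. Whose mask you stored (training engine or inference engine) is exactly the R2/R3 distinction above.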
Experimentally:
- N=2 → R2 slightly better (0.75 vs 0.73)
- N≥4 → R3 wins and stays stable; R2 collapses at ~2500 steps
Author reflection: We picked R2 “because it felt safer” and lost a week replaying the wrong mask. Pick R3 once the off-policy factor exceeds 4; the math is unambiguous.
6. Experimental proof-of-work: 30B MoE, 300k GPU hours, math reasoning task
Core question: Do the three knobs actually move the final score, or are they just cosmetic?
Short answer: They decide whether you reach any meaningful score at all. Stable runs converge to ~0.77 accuracy on AIME-style problems regardless of cold-start; unstable runs plateau ≤0.55 or collapse to entropy 0.05.
Setup
- Base model: Qwen3-30B-A3B (MoE)
- Prompts: 4096 competition-level math problems, binary reward
- Hardware: 1024 A100 80GB GPUs, FP8 inference, BF16 training
- Generation length: 32k tokens
- Metrics: accuracy on HMMT25/AIME25/AIME24 (90 questions), entropy, KL(μ‖π_old)
Key curves (see Figures 1-4 in the paper)
- on-policy without IS → collapse at step 200, KL 0.02
- off-policy N=8 without clipping/R3 → collapse at step 600
- off-policy N=8 with IS + clipping + R3 → stable to 5000 steps, final score 0.77
7. Cold-start choice is washed out once training is stable
Core question: Should I spend weeks distilling the “perfect” teacher model for RL initialization?
Short answer: No. Three different teachers (Qwen3-Max, DeepSeek-R1, GPT-oss-120B) reach the same 0.86 accuracy within 600 gradient steps when the stable recipe is used; difference <0.01.
Implication: invest marginal engineering hours in stabilising RL, not in hunting the ultimate cold-start. The long RL phase cancels initial bias.
8. Action Checklist / Implementation Steps
- Always compute the token-level IS weight whenever the training and inference engines differ in any way, even “only FP8”.
- Set clipping bounds ε_low ≈ 0.2, ε_high ≈ 0.27; apply the clip decision per token; leave the IS weight unclipped.
- MoE model + off-policy factor N≥4 → enable R3; N≤2 → R2 is enough and slightly faster.
- Monitor entropy and KL(μ‖π_old); stop and restart if entropy <0.25 or KL >0.01 (see the sketch after this list).
- Don’t over-engineer the teacher model; any reasonable cold-start converges under stable RL.
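A minimal sketch of the monitoring rule in the checklist (thresholds from this post; the metric keys are placeholders for whatever your logger emits):

ENTROPY_FLOOR = 0.25   # entropy below this foreshadows collapse within ~100 steps
KL_CEILING = 0.01      # KL(mu || pi_old) above this means the engines/policies have drifted

def should_halt(metrics: dict) -> bool:
    """Return True if the run should be stopped and restarted from the last
    healthy checkpoint. `metrics` is a per-step dict such as
    {"entropy": 0.31, "kl_mu_pi_old": 0.004} (hypothetical key names)."""
    return (metrics["entropy"] < ENTROPY_FLOOR
            or metrics["kl_mu_pi_old"] > KL_CEILING)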
9. One-page Overview
We derive that the ubiquitous token-level policy gradient is a first-order approximation of the desired sequence-level objective. The approximation stays valid only while (i) training logits stay close to inference logits and (ii) the rollout policy remains near the current policy. Empirically, three fixes keep these gaps small:
- Token Importance Sampling removes engine-level discrepancy
- PPO clipping caps policy staleness introduced by mini-batch re-use
- Routing Replay (R2/R3) neutralises expert-routing drift in MoE models
Extensive 30B-MoE experiments (300k GPU h) show that runs equipped with the full stabilising suite consistently reach 0.77 accuracy on competition-math benchmarks and survive >5k steps, whereas ablated variants collapse within hundreds of steps. Once stability is achieved, final performance becomes independent of cold-start origin, suggesting that engineering effort should focus on robust RL implementation rather than perfect initialisation.
10. FAQ
Q1: Can I just increase batch size to solve the variance problem?
A: No. The range of the sequence likelihood ratio grows exponentially with length, while batch averaging only reduces the estimator’s variance linearly (by a factor of 1/B).
Q2: Is clipping alone sufficient for MoE models?
A: Clipping handles staleness but not expert-routing discrepancy; you still need Routing Replay for heavy off-policy updates.
Q3: Do I need R3 for pure on-policy training?
A: No. R3 slightly hurts performance in the N=1 regime; use it only once N≥4.
Q4: Why not use value-based PPO?
A: Reliable per-token value estimates are hard to learn from sparse sequence-level rewards; the paper found the resulting variance to be even higher.
Q5: Which ε values worked best?
A: ε_low=0.2, ε_high=0.27 were stable across all off-policy factors tested.
Q6: How do I detect collapse early?
A: Watch for entropy <0.25 or KL(μ‖π_old) >0.01; either one foreshadows divergence within ~100 steps.
Q7: Does length normalisation help?
A: No. It breaks the first-order equality; the paper reports a 0.1 accuracy drop versus the un-normalised loss.
Q8: Is the recipe specific to math tasks?
A: The theory is task-agnostic; math was chosen for clean binary rewards, but the stabilising mechanisms apply to any sequence-level reward.

