A Breakthrough in Large Language Model Training: How the GSPO Algorithm Solves Reinforcement Learning Stability Issues

Introduction: Why Reinforcement Learning Is Key to Upgrading Large Models

In recent years, top-tier large language models (LLMs) like Qwen3 have achieved breakthroughs in complex tasks such as mathematical reasoning and programming. Reinforcement Learning (RL) technology has been instrumental in this progress. By allowing models to receive feedback after generating answers and optimize their strategies, RL has helped LLMs transition from “knowledge memorization” to “deep reasoning.”

However, as models scale beyond billions of parameters, training stability issues have become increasingly prominent. Similar to an athlete losing balance while learning acrobatics, models may suddenly collapse during training—experiencing a sharp performance drop that cannot be recovered. This phenomenon is particularly severe in models with Mixture-of-Experts (MoE) architectures.

This article takes a close look at Group Sequence Policy Optimization (GSPO), the algorithm proposed by Alibaba's Qwen team, and explains how its sequence-level optimization strategy resolves the stability challenges of large-model reinforcement learning.


1. The Dilemma of Traditional Algorithms: Why Does GRPO Cause Model Collapse?

1.1 The “Off-Policy” Challenge in Reinforcement Learning

In RL training, we typically:

  1. Generate numerous response samples using the old policy (πθ_old)
  2. Compute gradients for model updates using the new policy (πθ) on these samples

This process resembles “using historical experience to guide future decisions,” but introduces off-policy bias—like planning today’s activities based on yesterday’s weather forecast, which may lead to discrepancies.

1.2 Defects in GRPO’s “Token-Level” Weighting Mechanism

GRPO (Group Relative Policy Optimization), as the previous state-of-the-art algorithm, employs the following strategy:

  • Token-level importance weights: Calculating the contribution of each word (token) to policy updates individually
  • Token-level clipping mechanism: Restricting the weight of each token to prevent outliers from interfering

Critical Issue:
When processing long texts (e.g., code, complex mathematical problems):

  • The importance weight of an individual token can swing widely (e.g., from 0.1 to 2.0)
  • Over a long response, these small deviations accumulate like a snowball
  • The resulting policy gradient estimate ends up severely distorted

Analogy for Understanding:
Think of compounding a noisy factor 1,000 times: if each step multiplies the estimate by a value that is off by up to ±10%, the final result can be off by orders of magnitude, even though every individual step looks nearly harmless.
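
The compounding effect can be made concrete with a small, purely illustrative simulation (not taken from the paper): per-token ratios that each wobble by only ±10% yield a wildly dispersed product over a long response, whereas a length-normalized sequence-level ratio of the kind GSPO uses stays on a stable scale.

# Toy simulation (illustrative only, not from the paper): independent per-token
# importance-ratio noise compounds multiplicatively over a long response, while
# a length-normalized sequence-level ratio stays on a stable scale.
import math
import random
import statistics

random.seed(0)

def ratio_spread(seq_len, trials=5000, noise=0.1):
    raw, normalized = [], []
    for _ in range(trials):
        # each token's ratio pi_new/pi_old fluctuates around 1.0 by up to +/- noise
        log_ratios = [math.log(1.0 + random.uniform(-noise, noise)) for _ in range(seq_len)]
        raw.append(math.exp(sum(log_ratios)))                   # product of token-level ratios
        normalized.append(math.exp(sum(log_ratios) / seq_len))  # length-normalized ratio
    return statistics.pstdev(raw), statistics.pstdev(normalized)

for length in (10, 100, 1000):
    raw_sd, norm_sd = ratio_spread(length)
    print(f"length={length:4d}  raw ratio std={raw_sd:10.3f}  normalized std={norm_sd:6.4f}")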


2. Core Innovation of GSPO: How Sequence-Level Optimization Achieves Stable Training

2.1 A Paradigm Shift from “Per-Token” to “Whole-Sequence” Optimization

GSPO’s breakthrough lies in elevating the optimization unit from individual tokens to complete sequences, featuring three key designs:

2.1.1 Sequence-Level Importance Weight

# Traditional GRPO: token-level importance weight
w_token = π_new(token | context) / π_old(token | context)

# GSPO: sequence-level importance weight (length-normalized)
s_sequence = ( π_new(complete_response) / π_old(complete_response) ) ** (1 / response_length)

Key Improvement:
A single, length-normalized importance ratio is computed for the whole response, instead of a separate weight for every token. This keeps the weight on a stable scale and suppresses error accumulation in long texts.
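
As a minimal sketch (illustrative, not the official implementation), the sequence-level weight can be computed from per-token log-probabilities like this:

# Sketch: length-normalized sequence-level importance ratio from per-token log-probs.
import math

def sequence_importance_weight(new_token_logps, old_token_logps):
    """s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|), computed in log space."""
    assert len(new_token_logps) == len(old_token_logps) and len(new_token_logps) > 0
    log_ratio = sum(n - o for n, o in zip(new_token_logps, old_token_logps))
    return math.exp(log_ratio / len(new_token_logps))

# Example: a 3-token response
print(sequence_importance_weight([-1.2, -0.7, -2.1], [-1.0, -0.9, -2.0]))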

2.1.2 Dynamic Standardized Reward Mechanism

# Calculate standardized advantage value for each response
A_i = (response_reward - group_average_reward) / group_reward_standard_deviation

Function:
Automatically balances quality differences between responses, preventing single high-reward samples from dominating the training process.
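
A minimal sketch of the group-standardized advantage, assuming the per-response rewards for one query are already available:

# Sketch: group-relative advantages (rewards standardized within a group of
# responses to the same query; a small epsilon guards against zero variance).
import statistics

def group_advantages(rewards, eps=1e-6):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1.0, 0.0, 0.5, 1.0]))  # e.g. rewards for 4 responses to one query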

2.1.3 Sequence-Level Clipping Strategy

# Apply the clipping judgment to the entire response
if s_sequence > 1 + ε:      # deviates too far above the old policy
    s_sequence = 1 + ε
elif s_sequence < 1 - ε:    # deviates too far below the old policy
    s_sequence = 1 - ε

Effect:
Responses that deviate too far from the old policy as a whole have their contribution to the gradient capped (effectively excluded), preventing “bad samples” from polluting the update.

2.2 Mathematical Principles: Why Is Sequence-Level Optimization More Stable?

The gradient estimate of traditional GRPO contains a built-in source of instability:

\nabla_\theta J_{\text{GRPO}} \propto \sum_{t} \underbrace{w_{t}}_{\text{high-variance token weight}} \cdot \underbrace{\nabla_\theta \log \pi_\theta(y_t)}_{\text{per-token gradient}}

GSPO achieves stability through:

\nabla_\theta J_{\text{GSPO}} \propto \underbrace{s_{\text{seq}}}_{\text{low-variance sequence weight}} \cdot \sum_{t} \underbrace{\nabla_\theta \log \pi_\theta(y_t)}_{\text{per-token gradient}}

Core Difference:
GSPO applies one consistent weight to the entire sequence, so the gradient direction stays coherent; GRPO’s per-token weights inject independent random fluctuations into every term of the sum.


3. Experimental Verification: How GSPO Surpasses GRPO

3.1 Training Stability Comparison

Figure 1 in the paper shows:

  • GRPO curve: frequent, severe fluctuations during training (model collapse)
  • GSPO curve: a smooth upward trend (stable optimization)

Key Metric:
To reach comparable performance, GSPO needs roughly 30-40% less training time under the same computational resources.

3.2 Special Value for MoE Models

3.2.1 Unique Challenges of MoE Architecture

Mixture-of-Experts models dynamically activate different parameter subsets (experts), leading to:

  • The same input may activate different experts at different training stages
  • Traditional token-level importance weights break down, because changes in expert routing make the old and new per-token probabilities incomparable

3.2.2 GSPO’s Solution

  • Focuses on overall sequence generation quality rather than individual token paths
  • Eliminates dependence on expert activation paths
  • Completely removes the need for complex “routing replay” techniques

Actual Effect:
In Qwen3-30B MoE model training:

  • Achieves convergence without additional stability strategies
  • Training efficiency more than doubles

4. Technical Details: Complete Workflow of GSPO

4.1 Training Process Diagram

graph TD
    A[Sample responses using old policy] --> B{Group processing}
    B --> C[Calculate sequence-level importance weights]
    C --> D[Compute standardized rewards]
    D --> E[Sequence-level clipping]
    E --> F[Gradient update]
    F --> A
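
The diagram maps onto a training step roughly like the following sketch. The policy objects and reward_fn are hypothetical placeholders, and group_advantages / sequence_importance_weight refer to the helpers sketched in Section 2; a real implementation would operate on batched tensors.

# Schematic GSPO training step corresponding to the diagram above (placeholders only).
def gspo_training_step(policy, old_policy, queries, reward_fn, group_size=8, eps=0.2):
    per_sequence_losses = []
    for query in queries:
        # 1. Sample a group of responses from the old (behavior) policy
        responses = [old_policy.sample(query) for _ in range(group_size)]
        # 2. Group-standardized rewards give the advantages
        rewards = [reward_fn(query, resp) for resp in responses]
        advantages = group_advantages(rewards)
        # 3-4. Sequence-level importance weight, clipping, and surrogate loss
        for resp, adv in zip(responses, advantages):
            s = sequence_importance_weight(policy.token_logps(query, resp),
                                           old_policy.token_logps(query, resp))
            s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
            per_sequence_losses.append(-min(s * adv, s_clipped * adv))
    # 5. The optimizer step (gradient update) happens outside this sketch
    return sum(per_sequence_losses) / len(per_sequence_losses)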

4.2 Core Formula Analysis

4.2.1 Sequence Importance Weight Calculation

s_i = exp( (1/response_length_i) * Σ_t log( π_new(token_t) / π_old(token_t) ) )

4.2.2 Objective Function

J = (1/G) * Σ_i min( s_i * A_i, clip(s_i, 1-ε, 1+ε) * A_i )   # objective to maximize; the training loss is -J
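
A small worked example of this objective with made-up numbers shows how clipping caps the influence of a sequence whose weight drifts too far from 1:

# Worked numeric example of the objective above (illustrative values only).
eps = 0.2
s = [0.95, 1.40, 1.05]   # sequence-level importance weights
A = [0.8, 1.2, -0.5]     # group-standardized advantages
clipped = [min(max(si, 1 - eps), 1 + eps) for si in s]
J = sum(min(si * ai, ci * ai) for si, ai, ci in zip(s, A, clipped)) / len(s)
print(clipped, J)  # the second sequence is clipped at 1.2, capping its influence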

5. Practical Applications: What Breakthroughs Can GSPO Bring?

5.1 Model Performance Improvement

Benchmark                    GRPO baseline     GSPO improvement
AIME’24 math competition     baseline score    +15%
LiveCodeBench                baseline score    +22%
CodeForces Elo               baseline score    +18%

5.2 Training Efficiency Optimization

  • Improved Sample Utilization: Processes more data with the same computational resources
  • Hyperparameter Robustness: Less sensitive to learning rates, batch sizes, etc.
  • Long-Text Friendly: Supports generating longer and more complex responses

6. Frequently Asked Questions (FAQ)

Q1: Is GSPO suitable for all types of language models?

A: Yes. GSPO performs exceptionally well on both dense models and MoE models, particularly suitable for scenarios requiring long text generation (e.g., programming assistants, legal document generation).

Q2: What is the essential difference between GSPO and PPO algorithms?

A: PPO requires training a separate value model to estimate the advantage function, whereas GSPO (like GRPO) derives advantages from group-relative rewards and therefore needs no value model at all. On top of that, GSPO performs importance weighting and clipping at the sequence level rather than the token level.

Q3: What should be noted in actual deployment?

A:

  • Refresh the sampling policy (πθ_old) every 500 steps
  • Set the sequence clipping threshold ε to a typical value of 0.2
  • Use at least 8 responses per query (group size)
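
Purely as an illustration, these recommendations could be collected in a configuration object like the one below (all field names are invented for this sketch, not taken from any specific framework):

# Hypothetical configuration capturing the recommendations above.
gspo_config = {
    "old_policy_refresh_steps": 500,   # re-sync the sampling policy πθ_old
    "clip_epsilon": 0.2,               # sequence-level clipping threshold ε
    "responses_per_query": 8,          # group size G for group-relative advantages
}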

Q4: How to verify if training is stable?

A:

  • Observe if the training reward curve continues to rise
  • Check if gradient norms remain within reasonable ranges (typically <0.1)
  • Monitor whether expert activation patterns remain consistent
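
For the gradient-norm check, a minimal PyTorch-style sketch might look like this (assuming a standard torch model; the 0.1 threshold is the rough value quoted above):

# Sketch: monitor the global gradient norm after backward() (PyTorch assumed).
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# After loss.backward():
# if global_grad_norm(model) > 0.1:
#     print("Warning: gradient norm outside the expected range")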

7. Future Prospects: How GSPO Drives AI Development

GSPO’s success validates the effectiveness of the “sequence-level optimization” paradigm, bringing new ideas to large model training:

  • Support for Larger Models: Provides a stable framework for training models at the trillion-parameter scale
  • Multimodal Extension: Can be applied to multimodal reinforcement learning involving text, images, and audio in the future
  • Online Learning: Enables real-time policy optimization during interactions

Conclusion

The GSPO algorithm successfully addresses the stability challenges in large model reinforcement learning by elevating the optimization unit from tokens to sequence level. This “holistic” approach not only brings technological breakthroughs but also inspires us to think: When facing complex system problems, sometimes we need to step back from local details and seek solutions from a holistic perspective.

With GSPO’s successful application in models like Qwen3, we can expect to see more large language models with deep reasoning capabilities emerge, driving the deep application of AI in fields such as scientific research, education, and professional services.