# Pangu Pro MoE: How Grouped Experts Revolutionize Load Balancing in Giant AI Models

*Huawei’s breakthrough MoGE architecture achieves perfect device-level workload distribution at 72B parameters, boosting decode throughput by up to 97%*

## The Critical Challenge: Why Traditional MoE Fails in Distributed Systems

When scaling large language models (LLMs), Mixture of Experts (MoE) has become essential for managing computational cost. The core principle is elegant: not every input token needs to activate the full model, much like a hospital triage system where only the relevant specialists see each case. But this routing process hides a fundamental flaw:

```mermaid
graph TD
    A[Input Token] --> B(Router)
    B --> C{Expert Selection}
    C --> D[Expert 1]
    C --> E[Expert 2]
    C --> F[...]
    D --> G[Output]
    E --> G
    F --> G
```

In distributed systems, traditional MoE reveals critical weaknesses:

  • Hotspot Experts: Frequently activated experts create bottlenecks (like crowded cardiology wards)
  • Device Underutilization: Experts rarely selected idle their hardware (empty pediatric clinics)
  • Systemic Inefficiency: Overall speed constrained by slowest device (the “bucket effect”)

Huawei quantifies this imbalance through the Imbalance Score (IS):

$$
\mathrm{IS}(X)=\frac{1}{|X|}\left[\max_{i}T_i(X)-\min_{i}T_i(X)\right]
$$

where $T_i(X)$ is the computation load on device $i$ for input batch $X$.

Experimental data (Figure 2b) shows staggering results: with 64 experts, 8 activated per token, and 8 devices, traditional Top-K routing drives the probability of device-level imbalance toward 100%. Picture supermarket checkout lanes where some have 20 customers and others sit empty; capacity is wasted catastrophically.
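
To make this concrete, here is a minimal NumPy sketch (the function names `imbalance_score` and `simulate_global_topk` are illustrative, not from the paper) that routes a small batch with conventional global Top-K and measures the resulting imbalance under the 64-expert, 8-device setting:

```python
import numpy as np

def imbalance_score(device_loads: np.ndarray, num_tokens: int) -> float:
    """IS(X) = (max_i T_i(X) - min_i T_i(X)) / |X|, where T_i is the number
    of expert activations handled by device i for the batch X."""
    return (device_loads.max() - device_loads.min()) / num_tokens

def simulate_global_topk(num_tokens=16, num_experts=64, top_k=8, num_devices=8, seed=0):
    """Route each token to its Top-K experts from random router logits and count
    how many activations land on each device (experts sharded evenly, 8 per device)."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(num_tokens, num_experts))
    topk_experts = np.argsort(-logits, axis=1)[:, :top_k]          # (tokens, K)
    device_of_expert = np.arange(num_experts) // (num_experts // num_devices)
    device_loads = np.bincount(device_of_expert[topk_experts].ravel(),
                               minlength=num_devices)
    return imbalance_score(device_loads, num_tokens), device_loads

score, loads = simulate_global_topk()
print("per-device activations:", loads)      # a balanced split would be 16 per device
print("imbalance score:", round(score, 3))   # > 0 almost surely at small batch sizes
```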

## Architectural Breakthrough: Inside Huawei’s MoGE Design

### Core Mechanics of Mixture of Grouped Experts

The solution operates like a precision traffic control system:

```mermaid
graph LR
    A[Input Token] --> B[Global Router]
    B --> C{Expert Partitioning}
    C --> D[Device 1 - Group 1]
    C --> E[Device 2 - Group 2]
    C --> F[Device 3 - Group 3]
    D --> G[Intra-Group Top-K Selection]
    E --> H[Intra-Group Top-K Selection]
    F --> I[Intra-Group Top-K Selection]
    G --> J[Weighted Output]
    H --> J
    I --> J
```

Key innovations:

  1. Expert Partitioning: distribute $N$ experts evenly across $M$ devices (e.g., 64 experts → 8 groups of 8)
  2. Grouped Routing: each token selects the Top-$K'$ experts within every group, where $K' = K/M$
  3. Structural Balance: every device receives the same number of activations per token by construction (see the sketch below)
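
A minimal sketch of how such grouped routing can be implemented (NumPy; the global-softmax-then-local-Top-K ordering and the name `grouped_topk_routing` are assumptions made for illustration, not the paper's exact recipe):

```python
import numpy as np

def grouped_topk_routing(logits: np.ndarray, num_groups: int, k_per_group: int):
    """Grouped Top-K routing sketch.

    logits: (num_tokens, num_experts) router scores. Experts are split into
    `num_groups` equal groups (one group per device); each token activates
    exactly `k_per_group` experts inside every group, so the per-device
    activation count is identical by construction.
    Returns gating weights with zeros for unselected experts.
    """
    num_tokens, num_experts = logits.shape
    group_size = num_experts // num_groups
    scores = np.exp(logits - logits.max(axis=1, keepdims=True))
    scores /= scores.sum(axis=1, keepdims=True)               # softmax over all experts

    gates = np.zeros_like(scores)
    grouped = scores.reshape(num_tokens, num_groups, group_size)
    topk = np.argsort(-grouped, axis=2)[:, :, :k_per_group]   # Top-K' inside each group
    np.put_along_axis(gates.reshape(num_tokens, num_groups, group_size),
                      topk, np.take_along_axis(grouped, topk, axis=2), axis=2)
    return gates

# Pangu Pro MoE setting: 64 experts, 8 groups, 1 expert activated per group.
gates = grouped_topk_routing(np.random.randn(4, 64), num_groups=8, k_per_group=1)
print((gates.reshape(4, 8, 8) > 0).sum(axis=2))  # every token: exactly 1 expert per group
```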

Pangu Pro MoE configuration (Table 1):

| Parameter            | Value  |
|----------------------|--------|
| Total experts        | 64     |
| Device groups (M)    | 8      |
| Experts per group    | 8      |
| Activated per group  | 1      |
| Activated parameters | 16.5B  |
| Total parameters     | 71.99B |

### Dual-Layer Balancing System

```mermaid
graph TB
    A[Structural Balance] -->|Device-Level| B(IS=0)
    C[Auxiliary Loss] -->|Intra-Group| D(Prevents Expert Underuse)
```

Auxiliary loss function:

$$
\ell_{\text{aux}} = \alpha \sum_{i} f_i \cdot p_i
$$

where $f_i$ is the token allocation rate of expert $i$ and $p_i$ is its average gating score.

This acts like a manager’s KPI: it keeps every expert in the group utilized while still letting the router pick the best experts for each token.
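
A hedged NumPy sketch of this penalty (the name `aux_balance_loss`, and reading the "token allocation rate" `f_i` as the share of dispatched token-expert pairs, are assumptions):

```python
import numpy as np

def aux_balance_loss(gate_probs: np.ndarray, expert_assignment: np.ndarray,
                     num_experts: int, alpha: float = 0.01) -> float:
    """l_aux = alpha * sum_i f_i * p_i  (intra-group balancing penalty).

    gate_probs:        (num_tokens, num_experts) routing probabilities
    expert_assignment: (num_tokens, K) indices of the experts each token activates
    f_i: token allocation rate of expert i (its share of dispatched pairs)
    p_i: average gating score of expert i over the batch
    """
    counts = np.bincount(expert_assignment.ravel(), minlength=num_experts)
    f = counts / expert_assignment.size      # token allocation rate per expert
    p = gate_probs.mean(axis=0)              # average gating score per expert
    return float(alpha * (f @ p))
```

The product `f @ p` grows when a few experts receive both most of the tokens and most of the gating mass, so minimizing it pushes routing toward an even spread within each group.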

## Hardware Synergy: Ascend NPU Optimization Secrets

### Training System Mastery (4K Ascend 800T A2 Cluster)

```python
# Parallel configuration
parallel_config = {
    "Tensor Parallelism": 8,
    "Expert Parallelism": 2,
    "Pipeline Parallelism": 5,
    "Virtual Pipeline": 5,
    "Context Parallelism": 1
}
```

Revolutionary techniques:

  • Hierarchical EP Comm: 50% less communication volume
  • Adaptive Pipeline Overlap: Compute-communication parallelism
  • Operator Fusion: Custom Matmul kernels boost cube utilization by 10%

Results (Table 2):

| Configuration | MFU Gain | Key Tech                 |
|---------------|----------|--------------------------|
| Baseline      | –        | TP1, EP8, CP8, PP6, VPP4 |
| Optimized     | +35%     | TP8, EP2, CP1, PP5, VPP5 |

### Inference Acceleration Triad

  1. H²P Hybrid Parallelism

    • Attention: DP2+TP4
    • Experts: TP2+EP4 hybrid
    • Shared Experts: TP8 dense compute
  2. Expert-Aware Quantization
    Solves MoE-specific quantization challenges with a per-channel smoothing scale shared by all experts and the router (see the sketch after this list):

    $$
    \overline{s}_j=\max\left(\underbrace{\max_{i}\left(\frac{\max(|\mathbf{x}_j|)^{\alpha}}{\max(|\mathbf{W}^s_j|)^{1-\alpha}}\right)}_{\text{Expert Requirement}},\;\underbrace{\max(|\mathbf{x}_j|)^{\alpha}}_{\text{Router Requirement}}\right)
    $$

  3. Kernel Fusion Breakthroughs

    • MulAttention: 4.5x speedup

      ```mermaid
      sequenceDiagram
          Memory->>Cache: Bulk KV Transfer (Steps 1,5)
          Cache->>Compute: Dual-Loop Pipeline (Steps 2-4,6-8)
          Compute->>Output: 89% Utilization via Ping-Pong Buffering
      ```
    • SwiftGMM: 95% hardware utilization

      ```mermaid
      graph LR
          A[Dynamic Load] --> B{Predict Tiling Params}
          B --> C[GEMV/GEMM Mode Switch]
          C --> D[Full-Matrix L1 Cache]
          D --> E[Dual-Buffer Overlap]
      ```
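
As a sketch of the expert-aware smoothing scale from item 2 above (NumPy; the function name `expert_aware_scale`, the tensor layout, and the default α = 0.5 are assumptions for illustration):

```python
import numpy as np

def expert_aware_scale(x_absmax: np.ndarray, w_absmax: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Per-channel smoothing scale for W8A8 quantization of a MoE layer.

    x_absmax: (C,)             activation abs-max per channel, max(|x_j|)
    w_absmax: (num_experts, C) weight abs-max per channel for every expert, max(|W^s_j|)

    s_j = max( max_s max(|x_j|)^alpha / max(|W^s_j|)^(1-alpha),  max(|x_j|)^alpha ),
    i.e. one shared scale per channel that satisfies the most demanding expert
    (left term) while still covering the router's requirement (right term).
    """
    expert_req = (x_absmax ** alpha) / (w_absmax ** (1.0 - alpha))  # (num_experts, C)
    router_req = x_absmax ** alpha                                  # (C,)
    return np.maximum(expert_req.max(axis=0), router_req)
```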

## Performance Benchmark: Redefining State-of-the-Art

### Dominating LLM Leaderboards (Tables 3,4)
| Benchmark               | Pangu Score | Best Competitor | Advantage |
|-------------------------|-------------|-----------------|-----------|
| MMLU (Understanding)    | 87.4        | 84.2 (Qwen2.5)  | +3.2      |
| C-Eval (Chinese Knowledge)| 90.6        | 87.7 (Qwen2.5)  | +2.9      |
| GSM8K (Math Reasoning)  | 86.5        | 85.4 (GLM4)     | +1.1      |
| HumanEval (Coding)      | 63.7        | 59.1 (GLM4)     | +4.6      |

Critical finding: **Matches 32B dense models using only 16.5B activated parameters!**

### Inference Speed Revolution
**Ascend 800I A2 Results (Tables 5,6)**:
| Phase     | Batch Size | Pangu Throughput | 72B Dense Model | Gain  |
|-----------|------------|------------------|-----------------|-------|
| Prefill   | 2          | 4,828 tokens/s   | 1,596 tokens/s  | +203% |
| Decode    | 456        | 1,148 tokens/s   | 583 tokens/s    | +97%  |
| Decode*   | 584        | 1,528 tokens/s   | -               | -     |

(*With multi-token prediction optimization)

**Ascend 300I Duo Cost Efficiency (Table 7)**:
| Phase    | Batch Size | Latency   | Throughput     |
|----------|------------|-----------|----------------|
| Prefill  | 2          | 1,940 ms  | 1,055 tokens/s |
| Decode   | 80         | 99.5 ms   | 201 tokens/s   |
| Decode*  | 128        | 99.7 ms   | 321 tokens/s   |

### Expert Behavior Analysis
**Specialization Evolution (Figure 7)**:
```mermaid
graph LR
    A[Layer 0] -->|Uniform| B[General Feature Extraction]
    B --> C[Layer 23]
    C -->|Emerging Specialization| D[Layer 47]
    D -->|Task-Specific| E[Targeted Processing]
```

  • Language tasks: Balanced expert activation
  • Math/coding: High expert specialization

**Load Balance Proof (Figure 10)**:

  • Traditional MoE: 30% expert load variance
  • MoGE: each expert stays near the uniform 12.5% share (σ < 2%)

## Real-World Case Studies: Beyond Benchmarks

### Cultural Understanding (Table 13)

Question: Which hand faces outward in Chinese “gongshou” etiquette?
Competitor Error: “Right hand = friendliness” (culturally inaccurate)
Pangu’s Correct Response:

“Left hand outward, right hand inward—rooted in ancient Chinese tradition where ‘left’ denotes respect (e.g., Left Chancellor > Right Chancellor)”

### Mathematical Reasoning (Table 10)

Complex Expression: 28.97-(35%)-82*40-58.87
Critical Challenge: Contextual interpretation of “%” symbol
Pangu’s Solution:

Step 1: 35% → 0.35 (scalar conversion)
Step 2: 28.97 - 0.35 = 28.62
Step 3: 82*40 = 3,280
Step 4: 28.62 - 3,280 = -3,251.38
Step 5: -3,251.38 - 58.87 = -3,310.25
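
A two-line Python check of the chain of steps (assuming, as above, that "35%" is read as the scalar 0.35):

```python
result = 28.97 - 0.35 - 82 * 40 - 58.87
print(round(result, 2))  # -3310.25, matching Step 5
```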

### Instruction Compliance (Table 12)

Strict Requirements:

  1. Reproduce query verbatim
  2. Include 6+ “!”
  3. No prefixed text

Pangu’s Flawless Execution:

Write a short startup pitch... [exact query reproduction]
Presenting "Sunnis Ice Cream"-- the revolutionary treat!!! ... stomach-approved!!! Join the revolution!!!

## The Road Ahead: Implications for Sparse Models

Pangu Pro MoE’s breakthroughs demonstrate how architectural innovation solves systemic constraints:

  1. Hardware-Aware Design: Communication topology optimized for Ascend NPUs
  2. Dynamic Load Balancing: MoGE eliminates device-level imbalance by design
  3. Efficiency-Accuracy Tradeoff: <0.8% accuracy loss under W8A8 quantization (Table 8)

```mermaid
pie
    title Sub-100B Parameter Model Ecosystem
    "MoGE Architecture" : 45
    "Traditional MoE" : 30
    "Dense Models" : 25
```

## Technical Appendix: Key Questions Answered

**Q1: Why does traditional MoE fail in distributed systems?**

A: Global Top-K routing concentrates activations on a few hotspot experts. At batch size 16 the probability of device-level imbalance reaches 99%, like assigning 10 patients to one doctor while the others sit idle.

**Q2: How does MoGE prevent expert underutilization?**

A: Dual safeguards:

  1. Structural Enforcement: Fixed activations per group
  2. Auxiliary Loss: Penalizes intra-group imbalance via $\ell_{\text{aux}} = \alpha \sum_i f_i \cdot p_i$

**Q3: Can 72B models run on edge devices?**

A: Ascend 300I Duo achieves 201 tokens/s via:

  • W8A8 Quantization: 50% memory reduction
  • H²P Parallelism: Optimized 4-card deployment
  • KV Cache Compression: 63% less communication

**Q4: What does expert specialization reveal?**

A: Layer-wise analysis (Figure 7) shows:

  • Early layers: General processing (uniform activation)
  • Deep layers: Task-specific focus (e.g., math experts)
  • Balance maintenance: <5% intra-group utilization variance (Figure 9)

Paper: https://gitcode.com/ascend-tribe/pangu-pro-moe