# Pangu Pro MoE: How Grouped Experts Revolutionize Load Balancing in Giant AI Models
Huawei’s breakthrough MoGE architecture achieves perfect device workload distribution at 72B parameters, boosting inference speed by 97%
## The Critical Challenge: Why Traditional MoE Fails in Distributed Systems
When scaling large language models (LLMs), Mixture of Experts (MoE) has become essential for managing computational costs. The core principle is elegant: Not every input token requires full model activation. Imagine a hospital triage system where specialists handle specific cases. But this “routing” process hides a fundamental flaw:
```mermaid
graph TD
    A[Input Token] --> B(Router)
    B --> C{Expert Selection}
    C --> D[Expert 1]
    C --> E[Expert 2]
    C --> F[...]
    D --> G[Output]
    E --> G
    F --> G
```
In distributed systems, traditional MoE reveals critical weaknesses:
- Hotspot Experts: Frequently activated experts create bottlenecks (like crowded cardiology wards)
- Device Underutilization: Rarely selected experts leave their hardware sitting idle (empty pediatric clinics)
- Systemic Inefficiency: Overall speed is constrained by the slowest device (the “bucket effect”)
Huawei quantifies this imbalance through the Imbalance Score (IS):
$$IS(X)=\frac{1}{|X|}\left[\max_{i}T_i(X)-\min_{i}T_i(X)\right]$$

where $T_i(X)$ denotes the computation load on device $i$.
Experimental data (Figure 2b) shows staggering results: With 64 experts, 8 activated per token, and 8 devices, traditional Top-K routing approaches 100% imbalance probability. Picture checkout lanes—some with 20 customers, others empty—wasting resources catastrophically.
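To make the imbalance score concrete, here is a minimal sketch that simulates conventional global Top-K routing and evaluates IS for a small batch; the random router logits, the PyTorch usage, and the counting convention are illustrative assumptions, not the paper's exact simulation setup.

```python
import torch

def imbalance_score(load: torch.Tensor, num_tokens: int) -> float:
    """IS(X) = (max_i T_i(X) - min_i T_i(X)) / |X|, following the formula above."""
    return ((load.max() - load.min()) / num_tokens).item()

# Traditional global Top-K routing: 64 experts, 8 activated per token,
# experts sharded evenly across 8 devices (8 experts per device).
num_experts, top_k, num_devices, batch = 64, 8, 8, 16
logits = torch.randn(batch, num_experts)                 # stand-in router scores
_, chosen = logits.topk(top_k, dim=-1)                   # global Top-K per token
device_of_expert = torch.arange(num_experts) // (num_experts // num_devices)
load = torch.bincount(device_of_expert[chosen].flatten(), minlength=num_devices)
print("per-device load:", load.tolist())
print("imbalance score:", imbalance_score(load.float(), batch))
```

Under MoGE's grouped routing (next section), every token activates exactly one expert per device group, so the same measurement returns an IS of zero by construction.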
## Architectural Breakthrough: Inside Huawei’s MoGE Design
### Core Mechanics of Mixture of Grouped Experts
The solution operates like a precision traffic control system:
```mermaid
graph LR
    A[Input Token] --> B[Global Router]
    B --> C{Expert Partitioning}
    C --> D[Device 1 - Group 1]
    C --> E[Device 2 - Group 2]
    C --> F[Device 3 - Group 3]
    D --> G[Intra-Group Top-K Selection]
    E --> H[Intra-Group Top-K Selection]
    F --> I[Intra-Group Top-K Selection]
    G --> J[Weighted Output]
    H --> J
    I --> J
```
Key innovations:
- Expert Partitioning: Distribute $N$ experts equally across $M$ devices (e.g., 64 experts → 8 groups)
- Grouped Routing: Select $K'$ experts per group, where $K' = K/M$
- Structural Balance: Guarantee equal workload on every device
Pangu Pro MoE configuration (Table 1):
| Parameter | Value |
|---|---|
| Total Experts | 64 |
| Device Groups (M) | 8 |
| Experts per Group | 8 |
| Activated per Group | 1 |
| Activated Params | 16.5B |
| Total Params | 71.99B |
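As a concrete illustration of the grouped selection described above, the sketch below implements intra-group Top-K routing in PyTorch using the Table 1 configuration; the function name, the global-softmax-then-select ordering, and the tensor layout are assumptions for this demo, not the released Pangu implementation.

```python
import torch

def grouped_topk_routing(router_logits: torch.Tensor,
                         num_groups: int,
                         k_per_group: int) -> torch.Tensor:
    """MoGE-style grouped routing sketch.

    router_logits: [num_tokens, num_experts] scores from the global router.
    Experts are split into `num_groups` equal groups (one group per device) and
    the top `k_per_group` experts are chosen independently inside every group,
    so each device handles the same number of activations for every token.
    """
    num_tokens, num_experts = router_logits.shape
    experts_per_group = num_experts // num_groups

    # Softmax over all experts, then view scores as [tokens, groups, experts/group]
    probs = torch.softmax(router_logits, dim=-1)
    grouped = probs.view(num_tokens, num_groups, experts_per_group)

    # Intra-group Top-K: experts not selected within their group receive a zero gate
    topk_vals, topk_idx = grouped.topk(k_per_group, dim=-1)
    gates = torch.zeros_like(grouped).scatter_(-1, topk_idx, topk_vals)
    return gates.view(num_tokens, num_experts)

# Table 1 configuration: 64 routed experts, 8 groups, 1 expert activated per group
gates = grouped_topk_routing(torch.randn(4, 64), num_groups=8, k_per_group=1)
assert (gates > 0).sum(dim=-1).eq(8).all()   # exactly one active expert per group
```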
### Dual-Layer Balancing System
```mermaid
graph TB
    A[Structural Balance] -->|Device-Level| B(IS=0)
    C[Auxiliary Loss] -->|Intra-Group| D(Prevents Expert Underuse)
```
Auxiliary loss function:
$$\ell_{\text{aux}} = \alpha \sum_{i} f_i \cdot p_i$$

where $f_i$ is the token allocation rate of expert $i$ and $p_i$ is its average gating score.
This acts like a manager’s KPI: Maximize expert utilization while ensuring optimal token routing.
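A minimal sketch of such an auxiliary balance loss is shown below, assuming a Switch-Transformer-style formulation of the term above; the function name, the 0/1 selection mask, and the default α are illustrative choices, not the exact training recipe.

```python
import torch

def balance_auxiliary_loss(gates: torch.Tensor,
                           selected: torch.Tensor,
                           alpha: float = 0.01) -> torch.Tensor:
    """Auxiliary balance loss sketch: l_aux = alpha * sum_i f_i * p_i.

    gates:    [num_tokens, num_experts] gating scores (source of p_i).
    selected: [num_tokens, num_experts] 0/1 mask of experts actually chosen (source of f_i).
    Because MoGE fixes the number of activations per group, any residual
    imbalance is intra-group, which is exactly what this term penalizes.
    """
    f = selected.float().mean(dim=0)   # f_i: fraction of tokens routed to expert i
    p = gates.mean(dim=0)              # p_i: average gating score of expert i
    return alpha * torch.sum(f * p)
```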
## Hardware Synergy: Ascend NPU Optimization Secrets
### Training System Mastery (4K Ascend 800T A2 Cluster)
```python
# Parallel configuration
parallel_config = {
    "Tensor Parallelism": 8,
    "Expert Parallelism": 2,
    "Pipeline Parallelism": 5,
    "Virtual Pipeline": 5,
    "Context Parallelism": 1,
}
```
Revolutionary techniques:
- Hierarchical EP Comm: 50% less communication volume
- Adaptive Pipeline Overlap: Compute-communication parallelism
- Operator Fusion: Custom Matmul kernels boost cube utilization by 10%
Results (Table 2):
| Configuration | MFU Gain | Key Tech |
|---|---|---|
| Baseline | – | TP1, EP8, CP8, PP6, VPP4 |
| Optimized | +35% | TP8, EP2, CP1, PP5, VPP5 |
### Inference Acceleration Triad
1. H²P Hybrid Parallelism
   - Attention: DP2 + TP4
   - Experts: TP2 + EP4 hybrid
   - Shared Experts: TP8 dense compute
2. Expert-Aware Quantization

   Solves MoE-specific challenges with a per-channel smoothing scale that must satisfy both the experts and the router (a sketch follows this list):

   $$\overline{s}_j=\max\left(\underbrace{\max_{i}\left(\frac{\max(|\mathbf{x}_j|)^{\alpha}}{\max(|\mathbf{W}^s_j|)^{1-\alpha}}\right)}_{\text{Expert Requirement}},\;\underbrace{\max(|\mathbf{x}_j|)^{\alpha}}_{\text{Router Requirement}}\right)$$
3. Kernel Fusion Breakthroughs
   - MulAttention: 4.5x speedup

     ```mermaid
     sequenceDiagram
         Memory->>Cache: Bulk KV Transfer (Steps 1,5)
         Cache->>Compute: Dual-Loop Pipeline (Steps 2-4,6-8)
         Compute->>Output: 89% Utilization via Ping-Pong Buffering
     ```

   - SwiftGMM: 95% hardware utilization

     ```mermaid
     graph LR
         A[Dynamic Load] --> B{Predict Tiling Params}
         B --> C[GEMV/GEMM Mode Switch]
         C --> D[Full-Matrix L1 Cache]
         D --> E[Dual-Buffer Overlap]
     ```
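To ground the quantization formula in item 2, here is a minimal sketch of how such an expert-aware smoothing scale could be computed from calibration data; the function name, tensor layouts, and the default α are assumptions for illustration, not the paper's released code.

```python
import torch

def expert_aware_smoothing_scale(x: torch.Tensor,
                                 expert_weights: list[torch.Tensor],
                                 alpha: float = 0.5) -> torch.Tensor:
    """Per-channel smoothing scale covering expert and router requirements.

    x:              [num_tokens, hidden] calibration activations seen by the
                    router and the routed experts.
    expert_weights: one [hidden, ffn_dim] weight matrix per routed expert.
    For channel j the scale is the max over experts of the smoothing ratio
    max(|x_j|)^alpha / max(|W_j|)^(1-alpha), floored by max(|x_j|)^alpha so the
    shared router activations stay representable as well.
    """
    act_max = x.abs().amax(dim=0)                       # max(|x_j|) per channel
    expert_req = torch.zeros_like(act_max)
    for w in expert_weights:
        w_max = w.abs().amax(dim=1)                     # max(|W_j|) per input channel
        ratio = act_max.pow(alpha) / w_max.pow(1.0 - alpha)
        expert_req = torch.maximum(expert_req, ratio)   # outer max over experts i
    router_req = act_max.pow(alpha)
    return torch.maximum(expert_req, router_req)
```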
## Performance Benchmark: Redefining State-of-the-Art
### Dominating LLM Leaderboards (Tables 3,4)
| Benchmark | Pangu Score | Best Competitor | Advantage |
|-------------------------|-------------|-----------------|-----------|
| MMLU (Understanding) | 87.4 | 84.2 (Qwen2.5) | +3.2 |
| C-Eval (Chinese Knowledge)| 90.6 | 87.7 (Qwen2.5) | +2.9 |
| GSM8K (Math Reasoning) | 86.5 | 85.4 (GLM4) | +1.1 |
| HumanEval (Coding) | 63.7 | 59.1 (GLM4) | +4.6 |
Critical finding: **Matches 32B dense models using only 16B activated parameters!**
### Inference Speed Revolution
**Ascend 800I A2 Results (Tables 5,6)**:
| Phase | Batch Size | Pangu Throughput | 72B Dense Model | Gain |
|-----------|------------|------------------|-----------------|-------|
| Prefill | 2 | 4,828 tokens/s | 1,596 tokens/s | +203% |
| Decode | 456 | 1,148 tokens/s | 583 tokens/s | +97% |
| Decode* | 584 | 1,528 tokens/s | - | - |
(*With multi-token prediction optimization)
**Ascend 300I Duo Cost Efficiency (Table 7)**:
| Phase | Batch Size | Latency | Throughput |
|----------|------------|-----------|----------------|
| Prefill | 2 | 1,940 ms | 1,055 tokens/s |
| Decode | 80 | 99.5 ms | 201 tokens/s |
| Decode* | 128 | 99.7 ms | 321 tokens/s |
### Expert Behavior Analysis
**Specialization Evolution (Figure 7)**:
```mermaid
graph LR
    A[Layer 0] -->|Uniform| B[General Feature Extraction]
    B --> C[Layer 23]
    C -->|Emerging Specialization| D[Layer 47]
    D -->|Task-Specific| E[Targeted Processing]
```

- Language tasks: Balanced expert activation
- Math/coding: High expert specialization
**Load Balance Proof (Figure 10)**:

- Traditional MoE: 30% expert load variance
- MoGE: near-uniform distribution, with each of the 8 groups handling ≈12.5% of tokens (σ < 2%)
## Real-World Case Studies: Beyond Benchmarks
### Cultural Understanding (Table 13)
Question: Which hand faces outward in Chinese “gongshou” etiquette?
Competitor Error: “Right hand = friendliness” (culturally inaccurate)
Pangu’s Correct Response:
“Left hand outward, right hand inward—rooted in ancient Chinese tradition where ‘left’ denotes respect (e.g., Left Chancellor > Right Chancellor)”
### Mathematical Reasoning (Table 10)
Complex Expression: 28.97-(35%)-82*40-58.87
Critical Challenge: Contextual interpretation of “%” symbol
Pangu’s Solution:
Step 1: 35% → 0.35 (scalar conversion)
Step 2: 28.97 - 0.35 = 28.62
Step 3: 82*40 = 3,280
Step 4: 28.62 - 3,280 = -3,251.38
Step 5: -3,251.38 - 58.87 = -3,310.25
### Instruction Compliance (Table 12)
Strict Requirements:
- Reproduce the query verbatim
- Include 6+ “!”
- No prefixed text
Pangu’s Flawless Execution:

> Write a short startup pitch... [exact query reproduction]
>
> Presenting "Sunnis Ice Cream"-- the revolutionary treat!!! ... stomach-approved!!! Join the revolution!!!
## The Road Ahead: Implications for Sparse Models
Pangu Pro MoE’s breakthroughs demonstrate how architectural innovation solves systemic constraints:
- Hardware-Aware Design: Communication topology optimized for Ascend NPUs
- Dynamic Load Balancing: MoGE eliminates device-level imbalance by design
- Efficiency-Accuracy Tradeoff: <0.8% accuracy loss under W8A8 quantization (Table 8)
```mermaid
pie
    title Sub-100B Parameter Model Ecosystem
    "MoGE Architecture" : 45
    "Traditional MoE" : 30
    "Dense Models" : 25
```
## Technical Appendix: Key Questions Answered
Q1: Why does traditional MoE fail in distributed systems?
A: Random global Top-K routing causes uneven expert activation. At batch size=16, imbalance probability reaches 99%—like assigning 10 patients to one doctor while others sit idle.
Q2: How does MoGE prevent expert underutilization?
A: Dual safeguards:
- Structural Enforcement: Fixed activations per group
- Auxiliary Loss: Penalizes intra-group imbalance via $\ell_{\text{aux}} = \alpha \sum_i f_i \cdot p_i$
Q3: Can 72B models run on edge devices?
A: Ascend 300I Duo achieves 201 tokens/s via:
- W8A8 Quantization: 50% memory reduction
- H²P Parallelism: Optimized 4-card deployment
- KV Cache Compression: 63% less communication
Q4: What does expert specialization reveal?
A: Layer-wise analysis (Figure 7) shows:
- Early layers: General processing (uniform activation)
- Deep layers: Task-specific focus (e.g., math experts)
- Balance maintenance: <5% intra-group utilization variance (Figure 9)
Paper: https://gitcode.com/ascend-tribe/pangu-pro-moe