Efficient LLM Deployment on Ascend NPUs: Pangu Embedded & Pangu Pro MoE

In this post, we explore two complementary solutions from Huawei’s Pangu team—Pangu Embedded and Pangu Pro MoE—designed for low-latency and high-throughput inference on Ascend NPUs. Drawing on the official technical reports, we adapt the core concepts into clear, accessible English while preserving the details of system design, training methodology, and deployment best practices.


Table of Contents

  1. Why Efficient Inference Matters

  2. Pangu Embedded: Fast & Slow Thinking with Metacognition

  3. Pangu Pro MoE: Balanced Sparse Experts at Scale

  4. Practical Deployment Tips on Ascend NPUs

  5. Choosing the Right LLM for Your Use Case

  6. Conclusion & Future Directions


Why Efficient Inference Matters

Large language models (LLMs) have set new benchmarks in tasks such as question answering, summarization, and code generation. Yet their computational demands and inference latency pose real barriers for real‑time applications—be it chatbots, on‑device assistants, or interactive analytics. Typical models with billions of parameters require vast memory bandwidth and incur delays that undermine user experience.

  • Compute constraints: Model sizes often exceed tens of billions of parameters, straining GPU/NPU memory and compute resources.
  • Latency vs. throughput: Batching boosts tokens‑per‑second but adds per‑query delay. Users expect sub‑100 ms responses in conversational settings.
  • Sparse‑expert imbalance: Mixture‑of‑Experts (MoE) architectures promise higher capacity at lower cost, yet uneven expert activation leads to resource underutilization.

Ascend NPUs—Huawei’s high‑performance chips—bring specialized tensor pipelines and on‑chip memory, providing fertile ground for optimization. The Pangu team’s two offerings address these pain points:

  1. Pangu Embedded: A 7 B‑parameter dual‑system LLM with “fast” and “slow” thinking modes, switching dynamically based on request complexity while maintaining a single unified model.
  2. Pangu Pro MoE: A 72 B‑parameter sparse LLM using a novel Mixture of Grouped Experts (MoGE) to balance load across experts for higher tokens/sec and cost‑effective scaling.

We now dive into each solution, unpacking design rationale, training methodology, and real‑world performance.


Pangu Embedded: Fast & Slow Thinking with Metacognition

Pangu Embedded tackles the latency–quality trade‑off through a dual‑system architecture inspired by cognitive science: a lightweight “fast thinking” path for routine queries and a deeper “slow thinking” path for complex inference. A metacognition module routes each request to the appropriate path, balancing responsiveness and reasoning depth.

Dual‑System Framework

  • Fast mode: A streamlined execution flow optimized for minimal operator graph complexity. Ideal for simple queries like factual lookups or short completions.
  • Slow mode: Invokes deeper transformer layers with richer context integration, suitable for logic puzzles, multi‑step reasoning, or long‑form summarization.
  • Single backbone: Both modes share core weights; no duplicate sub‑networks. This reduces maintenance overhead and ensures consistency in output quality.

When a query arrives, the metacognition module assesses its complexity and latency budget, then selects fast or slow mode—all within a unified 7 B‑parameter model.
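
The report does not publish reference code for this mechanism, but a minimal sketch helps make the idea concrete. Below, a single PyTorch backbone serves both modes: fast mode executes only a shallow prefix of the shared layers, while slow mode runs the full stack. The class name, layer counts, and the depth split are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only: one shared transformer backbone serving both modes.
# Fast mode runs a shallow prefix of the shared layers; slow mode runs the
# full stack. Layer counts and the depth split are assumptions made for clarity.
import torch
import torch.nn as nn

class DualModeBackbone(nn.Module):
    def __init__(self, d_model=512, n_layers=24, fast_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.fast_layers = fast_layers  # prefix length used in fast mode

    def forward(self, x, mode="fast"):
        depth = self.fast_layers if mode == "fast" else len(self.layers)
        for layer in self.layers[:depth]:  # same weights, different depth
            x = layer(x)
        return x

# The metacognition module (next section) decides which mode each request uses.
hidden = torch.randn(1, 16, 512)                     # (batch, seq, d_model)
fast_out = DualModeBackbone()(hidden, mode="fast")   # shallow, low-latency pass
```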

Two‑Stage Training Pipeline

To equip Pangu Embedded with both speed and depth, the team devised a two‑stage training workflow (iterative distillation, then reinforcement optimization), followed by a final model‑merging step:

  1. Iterative Distillation

    • A high‑capacity “teacher” LLM guides the Embedded model on routine tasks.
    • Multi‑round distillation refines the lightweight variant, preserving answer quality in fast mode.
  2. Reinforcement Optimization

    • A Multi‑source Adaptive Reward System (MARS) drives slow‑mode improvements via reinforcement learning.
    • MARS integrates feedback from diverse data sources to sharpen complex reasoning capabilities.
  3. Model Merging

    • Fast and slow weights are seamlessly merged, maintaining a single network.

This pipeline ensures stable baseline performance for everyday queries, while enhancing specialized reasoning where depth matters.
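
The post does not spell out the merging rule, so the sketch below is only a common baseline: element‑wise linear interpolation between the distilled ("fast") checkpoint and the RL‑tuned ("slow") checkpoint. The blending weight `alpha` is an assumption.

```python
# Baseline checkpoint-merging sketch; the exact rule used by the Pangu team is
# not given in the post. `alpha` is an assumed blending weight.
def merge_state_dicts(fast_sd, slow_sd, alpha=0.5):
    """Blend two checkpoints with identical architecture into a single one."""
    merged = {}
    for name, fast_param in fast_sd.items():
        merged[name] = (1.0 - alpha) * fast_param + alpha * slow_sd[name]
    return merged

# Usage with the illustrative backbone from the previous sketch:
# model.load_state_dict(merge_state_dicts(fast_ckpt, slow_ckpt, alpha=0.5))
```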

Metacognition Module

The metacognition module mimics human self‑monitoring:

  • Input analysis: Measures query length, lexical complexity, and preliminary scoring from a tiny heuristic model.
  • Cache lookup: Detects if similar queries were answered recently via a fast on‑chip cache.
  • Latency estimation: Predicts expected compute time for fast vs. slow path.

A combined score triggers mode selection, keeping ordinary requests within 50 ms while still handling the hardest queries in 200–300 ms.
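
As a rough illustration of how such a combined score might drive mode selection, here is a small Python sketch. The features, weights, thresholds, and the `lru_cache` stand‑in for the on‑chip cache are all assumptions; the post only states that complexity signals, a cache lookup, and a latency estimate feed the decision.

```python
# Illustrative routing sketch. Feature choices, weights, and the 0.5 threshold
# are assumptions, not the actual metacognition module.
from functools import lru_cache

SLOW_BUDGET_MS = 300                          # worst-case slow-mode latency quoted above

def complexity_score(query: str) -> float:
    tokens = query.split()
    length_term = min(len(tokens) / 100.0, 1.0)             # longer -> harder
    lexical_term = len(set(tokens)) / max(len(tokens), 1)   # vocabulary richness
    return 0.6 * length_term + 0.4 * lexical_term

@lru_cache(maxsize=4096)                      # stand-in for the on-chip cache
def cached_route(query: str) -> str:
    return "fast" if complexity_score(query) < 0.5 else "slow"

def select_mode(query: str, latency_budget_ms: float) -> str:
    mode = cached_route(query)
    # Fall back to the fast path when the caller cannot afford slow mode.
    if mode == "slow" and latency_budget_ms < SLOW_BUDGET_MS:
        return "fast"
    return mode

print(select_mode("What is the capital of France?", latency_budget_ms=80))  # fast
```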

Benchmark Results

On Ascend NPUs, Pangu Embedded exhibits:

  • Model size: 7 B parameters

  • Latency:

    • Fast mode: < 50 ms for simple prompts
    • Slow mode: ~ 200–300 ms for complex tasks
  • Quality: Outperforms Qwen3‑8B and GLM4‑9B on AIME, GPQA, and other reasoning benchmarks.

By reducing average latency by up to 50%, Embedded empowers real‑time applications such as chatbots, virtual assistants, and interactive analytics pipelines.


Pangu Pro MoE: Balanced Sparse Experts at Scale

When task demands shift to high‑capacity scenarios—batch summarization, document drafting, or advanced research helpers—a larger LLM is needed. Yet deploying a dense 72 B‑parameter model can be cost‑prohibitive. Mixture‑of‑Experts (MoE) architectures offer a sparse alternative but often suffer from imbalanced expert activation. Pangu Pro MoE introduces Mixture of Grouped Experts (MoGE) to evenly distribute load and maximize throughput on Ascend NPUs.

MoE Background & Challenges

  • Sparse activation: Each token activates only a small subset of experts, cutting per‑token computation.
  • Load imbalance: Naïve routing can oversubscribe popular experts while leaving others idle, underutilizing hardware.
  • Resource waste: Some NPU cores sit idle, undermining cost‑effectiveness.

To harness MoE at scale, Pangu Pro MoE must ensure balanced expert utilization.
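
To see why naive routing wastes hardware, the synthetic experiment below (not from the report) biases a random router toward two "popular" experts and counts how many tokens each expert receives under plain top‑k selection. All numbers are made up for illustration.

```python
# Synthetic demonstration of load imbalance under naive top-k routing.
import torch

torch.manual_seed(0)
num_tokens, num_experts, top_k = 1024, 8, 2
bias = torch.tensor([2.0, 1.5, 0, 0, 0, 0, 0, 0])      # two over-favored experts
logits = torch.randn(num_tokens, num_experts) + bias

choices = logits.topk(top_k, dim=-1).indices            # naive global top-k
load = torch.bincount(choices.flatten(), minlength=num_experts).float()
print("tokens per expert:", load.tolist())
print("max/mean load ratio:", round((load.max() / load.mean()).item(), 2))
```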

Mixture of Grouped Experts (MoGE)

MoGE restructures expert routing via grouped activation:

  1. Expert grouping

    • Divide N experts into M disjoint groups of equal size.
  2. Equal activation constraint

    • During routing, enforce that each token activates the same number of experts per group.
  3. Multi‑phase scheduling

    • Combine static assignment (initial grouping) with dynamic adjustments based on real‑time load metrics.

This design achieves balanced utilization across groups, dramatically improving overall throughput on Ascend platforms.
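
A minimal sketch of group‑balanced routing in this spirit is shown below: experts are split into equal groups and every token picks the same number of experts from each group, so no single group can be oversubscribed. The group and expert counts are illustrative, and the function is not the exact MoGE recipe.

```python
# Group-balanced top-k routing sketch (illustrative, not the exact MoGE recipe).
import torch

def grouped_topk(router_logits, num_groups, k_per_group):
    """router_logits: (tokens, num_experts) -> selected global expert ids."""
    tokens, num_experts = router_logits.shape
    group_size = num_experts // num_groups
    grouped = router_logits.view(tokens, num_groups, group_size)
    local_idx = grouped.topk(k_per_group, dim=-1).indices      # per-group top-k
    # Convert group-local indices back to global expert ids.
    offsets = torch.arange(num_groups).view(1, num_groups, 1) * group_size
    return (local_idx + offsets).flatten(1)                    # (tokens, groups*k)

logits = torch.randn(4, 16)                                # 4 tokens, 16 experts
print(grouped_topk(logits, num_groups=4, k_per_group=2))   # 2 experts per group
```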

Inference Throughput & Accuracy

Pangu Pro MoE scales effectively:

  • Total parameters: 72 B

  • Active parameters per token: 16 B

  • Throughput (Ascend 800I A2):

    • Base: 1,148 tokens/s per card
    • With speculative acceleration: up to 1,528 tokens/s
  • Cost efficiency (Ascend 300I Duo): ~ 30% compute saving vs. dense 72 B model.

Public benchmarks confirm top‑tier performance among open models of comparable scale, making Pro MoE ideal for high‑volume tasks like bulk generation and long‑document processing.
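
Two quick back‑of‑the‑envelope figures follow directly from the numbers quoted above: the fraction of parameters activated per token, and the speedup contributed by speculative acceleration.

```python
# Derived only from the figures quoted above.
total_params, active_params = 72e9, 16e9
base_tps, speculative_tps = 1148, 1528

print(f"activated fraction per token: {active_params / total_params:.1%}")  # ~22.2%
print(f"speculative-decoding speedup: {speculative_tps / base_tps:.2f}x")   # ~1.33x
```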


Practical Deployment Tips on Ascend NPUs

Both Embedded and Pro MoE benefit from tailored hardware optimizations. Here are key considerations:

Leveraging Hardware Features

  • Operator fusion

    • Merge adjacent attention and feed‑forward operators to reduce memory copies and scheduler overhead.
  • On‑chip caching

    • Pin frequently accessed tensors in SRAM to cut off‑chip memory traffic.
  • Pipeline parallelism

    • Split long sequences across pipeline stages for concurrent execution.

These techniques boost arithmetic intensity and lower end‑to‑end latency.
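
Ascend‑specific fusion passes live in Huawei's own compiler stack and are not shown in the post. As a framework‑level analogy only, the PyTorch 2.x snippet below hands a transformer block to a graph compiler, which is then free to fuse adjacent kernels where it can instead of launching each operator separately.

```python
# Framework-level analogy for operator fusion (not the Ascend toolchain).
import torch
import torch.nn as nn

block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
fused_block = torch.compile(block)       # compiled graph; fusion is automatic

x = torch.randn(1, 128, 512)
with torch.no_grad():
    y = fused_block(x)                   # first call triggers compilation
```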

Latency vs. Throughput Trade‑offs

  • Real‑time apps

    • Use batch size 1 and enable fast mode (Pangu Embedded) or minimal MoGE group activation (Pangu Pro MoE) to keep responses under 100 ms.
  • Batch processing

    • Increase batch size and leverage multi‑device sharding to maximize tokens/sec for summarization or indexing jobs.
  • Dynamic thresholds

    • Adjust Embedded’s metacognition thresholds based on current load to balance user experience and hardware utilization; one possible policy is sketched below.
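
The post does not prescribe a concrete adjustment rule, so the following is just an assumed policy: as the request queue fills, the complexity threshold rises and more queries are routed to the fast path, keeping latency targets intact under load.

```python
# Assumed load-aware threshold policy; the formula and constants are not from
# the report, which only says thresholds should track current load.
def adaptive_threshold(base_threshold: float, queue_depth: int,
                       max_queue: int = 64) -> float:
    """Raise the complexity threshold as the request queue fills, so more
    queries take the fast path and latency targets still hold under load."""
    load = min(queue_depth / max_queue, 1.0)
    return base_threshold + (1.0 - base_threshold) * load

print(adaptive_threshold(0.5, queue_depth=4))    # lightly loaded: ~0.53
print(adaptive_threshold(0.5, queue_depth=60))   # heavily loaded: ~0.97
```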

Scaling & Parallelism

  • Embedded

    • 7 B model runs optimally on 8–16 Ascend cards with tensor‑model parallelism.
  • Pro MoE

    • Each expert group can span 2–4 cards; overall system uses a mix of tensor and data parallel strategies to saturate multi‑card clusters.

Adopt a hybrid parallel approach—model + data—to fully exploit NPU bandwidth.
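
A tiny sketch of how such a hybrid layout can be expressed: given a card count and a tensor‑parallel group size, the helper below enumerates the tensor‑parallel and data‑parallel groups. It is a generic illustration, not an Ascend API, and the card counts simply follow the rough guidance above.

```python
# Generic hybrid-parallel layout sketch (not an Ascend API).
def build_groups(num_devices: int, tp_size: int):
    """Return (tensor-parallel groups, data-parallel groups) of device ids."""
    assert num_devices % tp_size == 0
    dp_size = num_devices // tp_size
    tp_groups = [list(range(r * tp_size, (r + 1) * tp_size)) for r in range(dp_size)]
    dp_groups = [[r * tp_size + c for r in range(dp_size)] for c in range(tp_size)]
    return tp_groups, dp_groups

tp, dp = build_groups(num_devices=16, tp_size=8)
print("TP groups:", tp)   # [[0..7], [8..15]] -> shards of one model replica
print("DP groups:", dp)   # pairs like [0, 8] that hold the same shard
```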


Choosing the Right LLM for Your Use Case

  Model               Size          Strengths                           Best Fit Scenarios
  Pangu Embedded 7B   7 B params    Ultra‑low latency, adaptive depth   Chatbots, Q&A, short‑form generation
  Pangu Pro MoE 72B   72 B params   High throughput, expressive power   Bulk generation, long‑form summaries, research pipelines

  • Real‑time interaction? Go with Embedded for millisecond‑level responses.
  • Volume‑driven tasks? Choose Pro MoE to handle massive token throughput with cost savings.

Conclusion & Future Directions

Huawei’s Pangu Embedded and Pangu Pro MoE exemplify how specialized software and hardware synergy can overcome classic LLM deployment challenges. Embedded’s dual‑system reasoning cuts latency by up to 50%, while MoGE’s balanced sparse experts deliver 40%+ throughput gains over dense peers—all on Ascend NPUs.

Looking ahead, we anticipate:

  • Adaptive routing: Automatically tuning group sizes and activation counts per workload.
  • Cross‑platform co‑deployment: Seamless handover between cloud NPUs and edge Ascend chips.
  • End‑to‑end auto‑tuning: Closed‑loop training and inference pipelines that self‑optimize based on live performance metrics.

By following these practices, you can accelerate adoption of LLMs in production, delivering responsive, high‑capacity AI services that scale economically.

All content above is strictly based on official Pangu technical reports.