MiniCPM4: Run Powerful Language Models on Your Phone or Laptop
Achieve 128K context processing with 78% less training data using 0.5B/8B parameter models optimized for edge devices
Why We Need On-Device Language Models
While cloud-based AI models like ChatGPT dominate the landscape, edge devices (smartphones, laptops, IoT systems) have remained largely excluded due to computational constraints. Traditional large language models face three fundamental barriers:
- Compute Overload: Processing 128K context requires calculating all pairwise token relationships
- Memory Constraints: Loading an 8B-parameter model demands ~32GB of RAM
- Training Costs: Standard models require 36 trillion training tokens
MiniCPM Team’s breakthrough solution, MiniCPM4, shatters these limitations through four-dimensional innovation:
- 📌 Model Architecture: Trainable sparse attention (InfLLM v2)
- 📌 Training Data: UltraClean filtering + UltraChat v2 instruction dataset
- 📌 Training Algorithms: ModelTunnel v2 hyperparameter search + BitCPM quantization
- 📌 Inference Systems: CPM.cu engine + ArkInfer cross-platform deployment
1. Core Technology Breakdown
1.1 InfLLM v2: Dynamic Sparse Attention
The Challenge: Standard attention mechanisms face quadratic computation growth. Processing 128K tokens requires >16 billion relationship calculations.
Our Solution:
```mermaid
graph LR
    A[Input Text] --> B[Split into 128-token blocks]
    B --> C[Construct Semantic Kernels]
    C --> D[Dynamically Select Top-K Relevant Blocks]
    D --> E[Compute Attention Only on Critical Blocks]
```
- Semantic Kernels: Overlapping text segments (32 tokens per kernel, stride 16)
- Dynamic Block Selection: Each query token processes only the Top-6 most relevant blocks
- Local Preservation: Mandatory inclusion of initial tokens + a sliding window of local context
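To make the block-selection idea concrete, here is a minimal PyTorch sketch of selecting and attending to the Top-K blocks for a single query token. The tensor shapes, function names, and the mean-pooled kernel representation are illustrative assumptions, not the released InfLLM v2 kernel:

```python
import torch

def select_blocks(query, key_blocks, top_k=6):
    """Score each 128-token block by its semantic-kernel vector and keep the Top-K."""
    # key_blocks: (num_blocks, block_size, dim); query: (dim,)
    kernels = key_blocks.mean(dim=1)              # one summary vector per block
    scores = kernels @ query                      # relevance of each block to this query
    k = min(top_k, scores.numel())
    return scores.topk(k).indices

def block_sparse_attention(query, key_blocks, value_blocks, top_k=6):
    """Attend only to the selected blocks instead of the full context."""
    idx = select_blocks(query, key_blocks, top_k)
    keys = key_blocks[idx].reshape(-1, key_blocks.shape[-1])
    values = value_blocks[idx].reshape(-1, value_blocks.shape[-1])
    weights = torch.softmax(keys @ query / keys.shape[-1] ** 0.5, dim=0)
    return weights @ values

# Toy usage: 1,024 blocks of 128 tokens each, 64-dimensional heads
kb = torch.randn(1024, 128, 64)
vb = torch.randn(1024, 128, 64)
q = torch.randn(64)
out = block_sparse_attention(q, kb, vb)           # shape (64,)
```

Only the selected blocks' keys and values ever enter the softmax, which is where the reported operation savings come from.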
Performance Benchmarks:
Context Length | Standard Attention | InfLLM v2 | Speedup |
---|---|---|---|
4K | 16M operations | 3.2M operations | 5× |
128K | 16.4B operations | 1.2B operations | 13.7× |
On Jetson AGX Orin hardware, MiniCPM4 runs roughly 7× faster than Qwen3-8B while using about 80% less memory.
🔍 Technical Insight: Kernel optimization avoids token-level computation using uint64 bitmask compression for tree-based decoding.
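As a rough illustration of the bitmask trick (generic Python, not the actual CUDA kernel), a set of selected block indices can be packed into a single 64-bit integer, so combining the block sets of different draft branches becomes one bitwise operation:

```python
def blocks_to_bitmask(block_ids):
    """Pack selected block indices (< 64) into one integer bitmask."""
    mask = 0
    for b in block_ids:
        mask |= 1 << b
    return mask

def shared_blocks(mask_a, mask_b):
    """Blocks needed by both draft branches: a single AND instead of a set intersection."""
    return mask_a & mask_b

print(bin(shared_blocks(blocks_to_bitmask([0, 3, 5]), blocks_to_bitmask([3, 5, 9]))))  # 0b101000
```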
1.2 UltraClean: High-Quality Training Data Pipeline
The Challenge: Web-crawled datasets contain ads, templated text, and low-value content requiring trillions of tokens for baseline performance.
Our Method:
```python
# High-level pseudocode of the UltraClean pipeline; helper functions stand in for full components.
def UltraClean_workflow(raw_data):
    # Efficient verification: fine-tune a nearly-trained model on candidate data
    candidate_data = load_corpus("FineWeb")
    model = load_pretrained("MiniCPM-1.2B")
    perf_gain = evaluate_finetune(model, candidate_data)

    # Classifier training: treat high-gain data as positive samples
    high_quality_seeds = filter_by_gain(perf_gain, threshold=0.3)
    classifier = train_fastText_model(high_quality_seeds)

    # Iterative optimization: dynamically rebalance samples until quality stabilizes
    filtered_data = raw_data
    while not stabilized(perf_gain):
        filtered_data = classifier.apply(raw_data)
        perf_gain = validate(filtered_data)
        adjust_classifier(classifier, perf_gain)

    return filtered_data  # the resulting UltraFineWeb corpus
```
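The classifier step in the sketch above can be approximated with the off-the-shelf `fasttext` package; the training-file format, labels, and keep/drop threshold below are illustrative assumptions, not the team's released classifier:

```python
import fasttext

# fastText expects one "__label__<class> <text>" example per line.
with open("quality_train.txt", "w") as f:
    f.write("__label__high Peer-reviewed explanation of attention mechanisms ...\n")
    f.write("__label__low Click here to win a free prize !!!\n")

model = fasttext.train_supervised(input="quality_train.txt", epoch=25, lr=0.5, wordNgrams=2)

labels, probs = model.predict("A detailed derivation of the softmax gradient")
keep = labels[0] == "__label__high" and probs[0] > 0.5   # simple keep/drop rule
```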
Dataset Comparison:
Dataset | MMLU Score | C-Eval Score | Training Tokens |
---|---|---|---|
FineWeb | 42.28 | 33.95 | 36T |
FineWeb-edu | 44.56 | 34.17 | 24T |
UltraFineWeb | 45.89 | 34.26 | 8T |
Achieves superior performance with 78% less data than conventional approaches.
1.3 ModelTunnel v2: Hyperparameter Search Revolution
Innovations:
- ScalingBench Metric: GPT-4o-generated reasoning chains replace raw loss as the evaluation signal
- μP Architecture: Maximal update parametrization (μP) enables hyperparameter transfer from small proxy models to larger ones
- Multi-Token Prediction: Simultaneous prediction of 4 future tokens
```mermaid
graph TB
    S[Small Model Experiments] --> A[Search Learning Rate/Batch Size]
    A --> B[Establish ScalingBench Performance Mapping]
    B --> C[Transfer to 8B Model]
    C --> D[FP8 Mixed-Precision Training]
    D --> E[Multi-Token Supervision]
```
Experimental result: Reduces hyperparameter search cost from 4,600 GPU hours to 110 hours with <0.5% error margin.
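As a toy illustration of the μP transfer rule referenced above: the width scaling of hidden-layer learning rates is the standard μP heuristic, while the proxy values and the exact ModelTunnel v2 search space are assumptions here.

```python
def transfer_hidden_lr(proxy_lr: float, proxy_width: int, target_width: int) -> float:
    """Under μP, hidden-layer learning rates shrink in proportion to model width."""
    return proxy_lr * proxy_width / target_width

# Hyperparameters found on a small proxy model...
proxy = {"width": 512, "lr": 1e-2}
# ...transferred to a much wider target model without re-searching.
target_lr = transfer_hidden_lr(proxy["lr"], proxy["width"], target_width=4096)
print(f"Transferred hidden-layer LR: {target_lr:.2e}")   # 1.25e-03
```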
1.4 BitCPM4: Ternary Quantized LLMs
Edge Deployment Challenge: 16GB storage needed for FP16 8B models.
Our Approach:
- Two-Stage Quantization:
  1. Full-precision training
  2. 350B-token QAT fine-tuning
- Weight Ternary Encoding: Parameters constrained to {-1, 0, +1}
- Full-Precision Activations: Prevents error accumulation
Quantization Results:
Model | Parameters | Precision | MMLU | Storage |
---|---|---|---|---|
Qwen3 | 0.6B | BF16 | 42.95 | 1.2GB |
BitCPM4 | 0.5B | Ternary | 49.88 | 0.1GB |
BitNet | 2B | Ternary | 53.17 | 0.25GB |
💡 Key Technique: Learning rate rewarmup (5e-3) with 128×128 block quantization.
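Here is a minimal sketch of the ternarization step using the 128×128 block layout mentioned above; the absmean scaling rule is an assumption for illustration, and BitCPM4's full QAT recipe lives in the released training scripts:

```python
import torch

def ternarize(weights: torch.Tensor, block: int = 128):
    """Quantize a weight matrix to {-1, 0, +1} with one scale per 128x128 block."""
    q = torch.empty_like(weights)
    scales = torch.empty(weights.shape[0] // block, weights.shape[1] // block)
    for i in range(0, weights.shape[0], block):
        for j in range(0, weights.shape[1], block):
            tile = weights[i:i + block, j:j + block]
            s = tile.abs().mean() + 1e-8                 # per-block absmean scale
            q[i:i + block, j:j + block] = torch.clamp(torch.round(tile / s), -1, 1)
            scales[i // block, j // block] = s
    return q, scales

q, s = ternarize(torch.randn(256, 256))
print(q.unique())   # tensor([-1., 0., 1.])
```

At inference time only the ternary codes and per-block scales need to be stored, which is where the roughly 90% storage reduction comes from.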
2. Inference Deployment Guide
2.1 CPM.cu: Edge-Optimized Inference Engine
Core Architecture:
```python
# High-level sketch of the CPM.cu inference pipeline.
class CPM_cu:
    def __init__(self, model):
        self.model = model
        self.fr_spec = FR_Spec(top_vocab=0.25)            # draft over the top 25% most frequent tokens
        self.p_gptq = PrefixAwareGPTQ(ignore_first=4)     # skip initial outlier tokens during calibration
        self.sparse_attn = InfLLMv2_Kernel(block_size=128)

    def execute(self, input_text):
        # Prefix-aware quantization of the model weights
        quantized_weights = self.p_gptq.compress(self.model)
        # Frequency-ranked speculative drafting
        draft_sequence = self.fr_spec.generate(input_text)
        # Verify the draft tokens with sparse attention
        output = self.sparse_attn.verify(draft_sequence)
        return output
```
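The frequency-ranked drafting idea (FR-Spec) can be illustrated in a few lines: restrict the draft model's output space to the most common tokens so its language-model head is much cheaper to evaluate. The corpus counter, toy token IDs, and 25% ratio below are illustrative, not the shipped implementation:

```python
from collections import Counter

def build_draft_vocab(corpus_token_ids, full_vocab_size, keep_ratio=0.25):
    """Keep only the top `keep_ratio` of the vocabulary, ranked by corpus frequency."""
    counts = Counter(corpus_token_ids)
    keep = max(1, int(full_vocab_size * keep_ratio))
    return {tok for tok, _ in counts.most_common(keep)}

# During drafting, logits of tokens outside this set are masked out, so the draft model
# only proposes high-frequency tokens; the full model still verifies every draft token.
draft_vocab = build_draft_vocab([3, 3, 7, 7, 7, 42, 42, 42, 42, 5], full_vocab_size=8)
print(draft_vocab)   # e.g. {42, 7}
```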
Speed Comparison (RTX 4090):
Model | Context Length | Throughput (tokens/sec) |
---|---|---|
Qwen3-8B | 32K | 18.7 |
GLM4-9B | 32K | 22.4 |
MiniCPM4 | 32K | 56.3 |
MiniCPM4 | 128K | 41.9 |
2.2 ArkInfer: Universal Deployment System
Cross-Platform Design:
```mermaid
graph LR
    A[Unified Tensor API] --> B[Abstract Executor]
    B --> C[Adaptation Layer]
    C --> D[NeuroPilot-MTK]
    C --> E[Genie-Qualcomm]
    C --> F[RK-LLM-Rockchip]
    C --> G[llama.cpp-CPU]
```
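One way to read the adaptation layer is as a backend registry behind a single executor interface. The sketch below shows the pattern only; the class names, backend strings, and `Executor` interface are made up for illustration:

```python
from typing import Callable, Dict

_BACKENDS: Dict[str, Callable[[], "Executor"]] = {}

def register_backend(name: str):
    """Decorator that maps an accelerator name to an executor class."""
    def wrap(cls):
        _BACKENDS[name] = cls
        return cls
    return wrap

class Executor:
    def run(self, tensors):
        raise NotImplementedError

@register_backend("llama.cpp-cpu")
class CpuExecutor(Executor):
    def run(self, tensors):
        return f"ran {len(tensors)} tensors on CPU"

@register_backend("rk-llm")
class RockchipExecutor(Executor):
    def run(self, tensors):
        return f"ran {len(tensors)} tensors on the RK3588 NPU"

def load_executor(accelerator: str) -> Executor:
    return _BACKENDS[accelerator]()      # same call site for every hardware target

print(load_executor("rk-llm").run([1, 2, 3]))
```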
Key Features:
- Toolchain Validation: Auto-verify tool functionality (Claude-4-generated test cases)
- Constrained Decoding: Enforce JSON/SQL output formats
- Speculative Execution: BiTA algorithm integration
Real-world performance: Snapdragon 8 Gen3 smartphones process 128K context in <8 seconds.
3. Real-World Applications
3.1 MiniCPM4-Survey: Automated Research Assistant
Workflow:
User Query → Outline Planning → Knowledge Retrieval → Multi-pass Writing → RL Optimization
Training Methodology:
- Supervised Fine-Tuning: 61,684 academic samples
- Chapter-Level RL: 12-dimensional reward (structure, factuality, novelty)
- Document-Level RL: Parallel environment interactions
Performance (SurveyEval Benchmark):
System | Content Quality | FactScore |
---|---|---|
Naive RAG | 3.04 | 43.68 |
OpenAI Deep Research | 3.50 | – |
MiniCPM4-Survey | 3.50 | 68.73 |
3.2 MiniCPM4-MCP: Tool Integration Protocol
Protocol Advantages:
- Standardized Schema: Unified tool description format
- Dynamic Discovery: Docker-containerized tool ecosystem
- Cross-Tool Chaining: A single query triggers multi-tool workflows
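For a feel of what a standardized tool description and call look like, here is a small MCP-style example in plain Python; the field names follow the general Model Context Protocol convention, and the tool itself is hypothetical:

```python
import json

# A tool advertised by an MCP server: name, human-readable description, JSON Schema input.
tool_description = {
    "name": "execute_python",
    "description": "Run a Python snippet in a sandbox and return stdout.",
    "inputSchema": {
        "type": "object",
        "properties": {"code": {"type": "string"}},
        "required": ["code"],
    },
}

# The corresponding call the model emits after deciding to use the tool.
tool_call = {"name": "execute_python", "arguments": {"code": "print(2 + 2)"}}
print(json.dumps(tool_call, indent=2))
```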
Accuracy Comparison:
Tool Category | GPT-4o | Qwen3-8B | MiniCPM4-MCP |
---|---|---|---|
Code Execution | 85.0% | 80.0% | 90.0% |
Academic Search | 85.7% | 81.8% | 57.1% |
Document Processing | 95.8% | 94.9% | 95.1% |
Overall Accuracy | 80.2% | 83.5% | 88.3% |
4. Comprehensive Performance Evaluation
4.1 Benchmark Comparisons
Model | Parameters | Training Data | MMLU | CMMLU | GSM8K | Average |
---|---|---|---|---|---|---|
Llama3.2 | 1B | 9T | 46.89 | 23.73 | 39.76 | 34.76 |
Qwen3 | 0.6B | 36T | 42.95 | 42.05 | 61.71 | 44.93 |
MiniCPM4 | 0.5B | 8T | 55.55 | 65.22 | 52.08 | 52.99 |
Phi4 | 14B | 10T | 81.61 | 67.56 | 94.77 | 78.47 |
MiniCPM4 | 8B | 8T | 75.83 | 80.62 | 91.51 | 81.13 |
📌 Key Insight: The 8B version outperforms Phi4-14B by 13 points on Chinese reasoning (CMMLU) while training on only 8T tokens versus Phi4's 10T.
4.2 Long-Context Processing
Needle-in-Haystack Test (128K context):
- Accuracy: 100% (vs. 99.2% for full-attention baselines)
- Attention Sparsity: Only 5% of context blocks activated
- Positional Encoding: Trained at 32K, extrapolated to 128K (YaRN technique)
5. Developer Resources
5.1 Open-Source Assets
# Model Access
- HuggingFace repositories:
  - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B)
  - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B)

# Codebase
- GitHub project: [https://github.com/openbmb/minicpm](https://github.com/openbmb/minicpm)

# Deployment Platforms
- ArkInfer supported hardware:
  - NVIDIA Jetson series
  - Qualcomm Snapdragon
  - Rockchip RK3588
5.2 Edge Deployment Example
```bash
# Install ArkInfer
pip install arkinfer
```

```python
# Load the model
from arkinfer import Executor

model = Executor.load("MiniCPM4-0.5B", accelerator="rk3588")

# Long-context inference
result = model.generate(
    input_text="Summarize this 128K document's key arguments...",
    max_length=4096,
    sparse_attention=True,
)
```
Technical FAQ
❓ How to choose between 0.5B and 8B versions?
Deployment Recommendations:
- Mobile/IoT Devices → MiniCPM4-0.5B (0.1GB storage)
- Desktops/Workstations → MiniCPM4-8B (higher accuracy)
- Extreme Memory Constraints → BitCPM4 (ternary quantization)
❓ What are the advantages of on-device deployment?
Comparative Analysis:
Factor | Cloud Models | MiniCPM4 On-Device |
---|---|---|
Latency | 200-500ms | <50ms |
Privacy | Data transmission | Complete local processing |
Offline Usage | Network-dependent | Fully offline |
Cost | $0.01/1K tokens | One-time deployment |
❓ How to reproduce research results?
Key Steps:
- Data Preparation: Apply UltraClean to the FineWeb dataset
- Hyperparameter Search: Execute the ModelTunnel v2 workflow
- Four-Stage Training:
  - Stable pretraining (7T tokens)
  - Annealing phase (1T tokens)
  - Long-context extension (20B tokens)
  - SFT + RL fine-tuning
- Quantization: Use the open-sourced BitCPM4 training scripts
Conclusion: MiniCPM4’s algorithm-data-system co-design lets a model that averages above 80 on standard benchmarks run efficiently on edge devices. This breakthrough unlocks new possibilities for mobile AI, privacy-preserving computing, and real-time decision making, accelerating the transition from cloud-centric to edge-native intelligent systems.