MiniCPM4: Run Powerful Language Models on Your Phone or Laptop
Achieve 128K context processing with 78% less training data using 0.5B/8B parameter models optimized for edge devices
Why We Need On-Device Language Models
While cloud-based AI models like ChatGPT dominate the landscape, edge devices (smartphones, laptops, IoT systems) have remained largely excluded due to computational constraints. Traditional large language models face three fundamental barriers:
- Compute Overload: Processing 128K context requires calculating all pairwise token relationships
- Memory Constraints: Loading an 8B-parameter model demands ~32GB of RAM
- Training Costs: Standard models require 36 trillion training tokens
MiniCPM Team’s breakthrough solution, MiniCPM4, shatters these limitations through four-dimensional innovation:
- 📌 Model Architecture: Trainable sparse attention (InfLLM v2)
- 📌 Training Data: UltraClean filtering + UltraChat v2 instruction dataset
- 📌 Training Algorithms: ModelTunnel v2 hyperparameter search + BitCPM quantization
- 📌 Inference Systems: CPM.cu engine + ArkInfer cross-platform deployment
1. Core Technology Breakdown
1.1 InfLLM v2: Dynamic Sparse Attention
The Challenge: Standard attention mechanisms face quadratic computation growth. Processing 128K tokens requires >16 billion relationship calculations.
Our Solution:
```mermaid
graph LR
    A[Input Text] --> B[Split into 128-token blocks]
    B --> C[Construct Semantic Kernels]
    C --> D[Dynamically Select Top-K Relevant Blocks]
    D --> E[Compute Attention Only on Critical Blocks]
```
- Semantic Kernels: Overlapping text segments (32 tokens per kernel, stride 16)
- Dynamic Block Selection: Each query token processes only the Top-6 most relevant blocks
- Local Preservation: Mandatory inclusion of initial tokens + a sliding window of local context
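To make the block-selection idea concrete, here is a minimal PyTorch sketch of selecting and attending to the Top-K blocks for a single query token. The tensor shapes, function names, and the mean-pooled kernel representation are illustrative assumptions, not the released InfLLM v2 kernel:

```python
import torch

def select_blocks(query, key_blocks, top_k=6):
    """Score each 128-token block by its semantic-kernel vector and keep the Top-K."""
    # key_blocks: (num_blocks, block_size, dim); query: (dim,)
    kernels = key_blocks.mean(dim=1)              # one summary vector per block
    scores = kernels @ query                      # relevance of each block to this query
    k = min(top_k, scores.numel())
    return scores.topk(k).indices

def block_sparse_attention(query, key_blocks, value_blocks, top_k=6):
    """Attend only to the selected blocks instead of the full context."""
    idx = select_blocks(query, key_blocks, top_k)
    keys = key_blocks[idx].reshape(-1, key_blocks.shape[-1])
    values = value_blocks[idx].reshape(-1, value_blocks.shape[-1])
    weights = torch.softmax(keys @ query / keys.shape[-1] ** 0.5, dim=0)
    return weights @ values

# Toy usage: 1,024 blocks of 128 tokens each, 64-dimensional heads
kb = torch.randn(1024, 128, 64)
vb = torch.randn(1024, 128, 64)
q = torch.randn(64)
out = block_sparse_attention(q, kb, vb)           # shape (64,)
```

Only the selected blocks' keys and values ever enter the softmax, which is where the reported operation savings come from.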
Performance Benchmarks:
Context Length | Standard Attention | InfLLM v2 | Speedup |
---|---|---|---|
4K | 16M operations | 3.2M operations | 5× |
128K | 16.4B operations | 1.2B operations | 13.7× |
On Jetson AGX Orin hardware, MiniCPM4 runs roughly 7× faster than Qwen3-8B while using about 80% less memory.
🔍 Technical Insight: Kernel optimization avoids token-level computation using uint64 bitmask compression for tree-based decoding.
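As a rough illustration of the bitmask trick (generic Python, not the actual CUDA kernel), a set of selected block indices can be packed into a single 64-bit integer, so combining the block sets of different draft branches becomes one bitwise operation:

```python
def blocks_to_bitmask(block_ids):
    """Pack selected block indices (< 64) into one integer bitmask."""
    mask = 0
    for b in block_ids:
        mask |= 1 << b
    return mask

def shared_blocks(mask_a, mask_b):
    """Blocks needed by both draft branches: a single AND instead of a set intersection."""
    return mask_a & mask_b

print(bin(shared_blocks(blocks_to_bitmask([0, 3, 5]), blocks_to_bitmask([3, 5, 9]))))  # 0b101000
```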
1.2 UltraClean: High-Quality Training Data Pipeline
The Challenge: Web-crawled datasets contain ads, templated text, and low-value content requiring trillions of tokens for baseline performance.
Our Method:
```python
# High-level pseudocode of the UltraClean pipeline; helper functions stand in for full components.
def UltraClean_workflow(raw_data):
    # Efficient verification: fine-tune a nearly-trained model on candidate data
    candidate_data = load_corpus("FineWeb")
    model = load_pretrained("MiniCPM-1.2B")
    perf_gain = evaluate_finetune(model, candidate_data)

    # Classifier training: treat high-gain data as positive samples
    high_quality_seeds = filter_by_gain(perf_gain, threshold=0.3)
    classifier = train_fastText_model(high_quality_seeds)

    # Iterative optimization: dynamically rebalance samples until quality stabilizes
    filtered_data = raw_data
    while not stabilized(perf_gain):
        filtered_data = classifier.apply(raw_data)
        perf_gain = validate(filtered_data)
        adjust_classifier(classifier, perf_gain)

    return filtered_data  # the resulting UltraFineWeb corpus
```
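The classifier step in the sketch above can be approximated with the off-the-shelf `fasttext` package; the training-file format, labels, and keep/drop threshold below are illustrative assumptions, not the team's released classifier:

```python
import fasttext

# fastText expects one "__label__<class> <text>" example per line.
with open("quality_train.txt", "w") as f:
    f.write("__label__high Peer-reviewed explanation of attention mechanisms ...\n")
    f.write("__label__low Click here to win a free prize !!!\n")

model = fasttext.train_supervised(input="quality_train.txt", epoch=25, lr=0.5, wordNgrams=2)

labels, probs = model.predict("A detailed derivation of the softmax gradient")
keep = labels[0] == "__label__high" and probs[0] > 0.5   # simple keep/drop rule
```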
Dataset Comparison:
Dataset | MMLU Score | C-Eval Score | Training Tokens |
---|---|---|---|
FineWeb | 42.28 | 33.95 | 36T |
FineWeb-edu | 44.56 | 34.17 | 24T |
UltraFineWeb | 45.89 | 34.26 | 8T |
Achieves superior performance with 78% less data than conventional approaches.
1.3 ModelTunnel v2: Hyperparameter Search Revolution
Innovations:
- ScalingBench Metric: GPT-4o-generated reasoning chains replace raw loss as the evaluation signal
- μP Architecture: Maximal update parametrization (μP) enables hyperparameter transfer from small proxy models to larger ones
- Multi-Token Prediction: Simultaneous prediction of 4 future tokens
```mermaid
graph TB
    S[Small Model Experiments] --> A[Search Learning Rate/Batch Size]
    A --> B[Establish ScalingBench Performance Mapping]
    B --> C[Transfer to 8B Model]
    C --> D[FP8 Mixed-Precision Training]
    D --> E[Multi-Token Supervision]
```
Experimental result: Reduces hyperparameter search cost from 4,600 GPU hours to 110 hours with <0.5% error margin.
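As a toy illustration of the μP transfer rule referenced above: the width scaling of hidden-layer learning rates is the standard μP heuristic, while the proxy values and the exact ModelTunnel v2 search space are assumptions here.

```python
def transfer_hidden_lr(proxy_lr: float, proxy_width: int, target_width: int) -> float:
    """Under μP, hidden-layer learning rates shrink in proportion to model width."""
    return proxy_lr * proxy_width / target_width

# Hyperparameters found on a small proxy model...
proxy = {"width": 512, "lr": 1e-2}
# ...transferred to a much wider target model without re-searching.
target_lr = transfer_hidden_lr(proxy["lr"], proxy["width"], target_width=4096)
print(f"Transferred hidden-layer LR: {target_lr:.2e}")   # 1.25e-03
```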
1.4 BitCPM4: Ternary Quantized LLMs
Edge Deployment Challenge: 16GB storage needed for FP16 8B models.
Our Approach:
- Two-Stage Quantization:
  1. Full-precision training
  2. 350B-token QAT fine-tuning
- Weight Ternary Encoding: Parameters constrained to {-1, 0, +1}
- Full-Precision Activations: Prevents error accumulation
Quantization Results:
Model | Parameters | Precision | MMLU | Storage |
---|---|---|---|---|
Qwen3 | 0.6B | BF16 | 42.95 | 1.2GB |
BitCPM4 | 0.5B | Ternary | 49.88 | 0.1GB |
BitNet | 2B | Ternary | 53.17 | 0.25GB |
💡 Key Technique: Learning rate rewarmup (5e-3) with 128×128 block quantization.
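Here is a minimal sketch of the ternarization step using the 128×128 block layout mentioned above; the absmean scaling rule is an assumption for illustration, and BitCPM4's full QAT recipe lives in the released training scripts:

```python
import torch

def ternarize(weights: torch.Tensor, block: int = 128):
    """Quantize a weight matrix to {-1, 0, +1} with one scale per 128x128 block."""
    q = torch.empty_like(weights)
    scales = torch.empty(weights.shape[0] // block, weights.shape[1] // block)
    for i in range(0, weights.shape[0], block):
        for j in range(0, weights.shape[1], block):
            tile = weights[i:i + block, j:j + block]
            s = tile.abs().mean() + 1e-8                 # per-block absmean scale
            q[i:i + block, j:j + block] = torch.clamp(torch.round(tile / s), -1, 1)
            scales[i // block, j // block] = s
    return q, scales

q, s = ternarize(torch.randn(256, 256))
print(q.unique())   # tensor([-1., 0., 1.])
```

At inference time only the ternary codes and per-block scales need to be stored, which is where the roughly 90% storage reduction comes from.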
2. Inference Deployment Guide
2.1 CPM.cu: Edge-Optimized Inference Engine
Core Architecture:
```python
# High-level sketch of the CPM.cu inference pipeline.
class CPM_cu:
    def __init__(self, model):
        self.model = model
        self.fr_spec = FR_Spec(top_vocab=0.25)            # draft over the top 25% most frequent tokens
        self.p_gptq = PrefixAwareGPTQ(ignore_first=4)     # skip initial outlier tokens during calibration
        self.sparse_attn = InfLLMv2_Kernel(block_size=128)

    def execute(self, input_text):
        # Prefix-aware quantization of the model weights
        quantized_weights = self.p_gptq.compress(self.model)
        # Frequency-ranked speculative drafting
        draft_sequence = self.fr_spec.generate(input_text)
        # Verify the draft tokens with sparse attention
        output = self.sparse_attn.verify(draft_sequence)
        return output
```
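The frequency-ranked drafting idea (FR-Spec) can be illustrated in a few lines: restrict the draft model's output space to the most common tokens so its language-model head is much cheaper to evaluate. The corpus counter, toy token IDs, and 25% ratio below are illustrative, not the shipped implementation:

```python
from collections import Counter

def build_draft_vocab(corpus_token_ids, full_vocab_size, keep_ratio=0.25):
    """Keep only the top `keep_ratio` of the vocabulary, ranked by corpus frequency."""
    counts = Counter(corpus_token_ids)
    keep = max(1, int(full_vocab_size * keep_ratio))
    return {tok for tok, _ in counts.most_common(keep)}

# During drafting, logits of tokens outside this set are masked out, so the draft model
# only proposes high-frequency tokens; the full model still verifies every draft token.
draft_vocab = build_draft_vocab([3, 3, 7, 7, 7, 42, 42, 42, 42, 5], full_vocab_size=8)
print(draft_vocab)   # e.g. {42, 7}
```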
Speed Comparison (RTX 4090):
Model | Context Length | Throughput (tokens/sec) |
---|---|---|
Qwen3-8B | 32K | 18.7 |
GLM4-9B | 32K | 22.4 |
MiniCPM4 | 32K | 56.3 |
MiniCPM4 | 128K | 41.9 |
2.2 ArkInfer: Universal Deployment System
Cross-Platform Design:
```mermaid
graph LR
    A[Unified Tensor API] --> B[Abstract Executor]
    B --> C[Adaptation Layer]
    C --> D[NeuroPilot-MTK]
    C --> E[Genie-Qualcomm]
    C --> F[RK-LLM-Rockchip]
    C --> G[llama.cpp-CPU]
```
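One way to read the adaptation layer is as a backend registry behind a single executor interface. The sketch below shows the pattern only; the class names, backend strings, and `Executor` interface are made up for illustration:

```python
from typing import Callable, Dict

_BACKENDS: Dict[str, Callable[[], "Executor"]] = {}

def register_backend(name: str):
    """Decorator that maps an accelerator name to an executor class."""
    def wrap(cls):
        _BACKENDS[name] = cls
        return cls
    return wrap

class Executor:
    def run(self, tensors):
        raise NotImplementedError

@register_backend("llama.cpp-cpu")
class CpuExecutor(Executor):
    def run(self, tensors):
        return f"ran {len(tensors)} tensors on CPU"

@register_backend("rk-llm")
class RockchipExecutor(Executor):
    def run(self, tensors):
        return f"ran {len(tensors)} tensors on the RK3588 NPU"

def load_executor(accelerator: str) -> Executor:
    return _BACKENDS[accelerator]()      # same call site for every hardware target

print(load_executor("rk-llm").run([1, 2, 3]))
```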
Key Features:
- Toolchain Validation: Auto-verify tool functionality (Claude-4-generated test cases)
- Constrained Decoding: Enforce JSON/SQL output formats
- Speculative Execution: BiTA algorithm integration
Real-world performance: Snapdragon 8 Gen3 smartphones process 128K context in <8 seconds.
3. Real-World Applications
3.1 MiniCPM4-Survey: Automated Research Assistant
Workflow:
User Query → Outline Planning → Knowledge Retrieval → Multi-pass Writing → RL Optimization
Training Methodology:
- Supervised Fine-Tuning: 61,684 academic samples
- Chapter-Level RL: 12-dimensional reward (structure, factuality, novelty)
- Document-Level RL: Parallel environment interactions
Performance (SurveyEval Benchmark):
System | Content Quality | FactScore |
---|---|---|
Naive RAG | 3.04 | 43.68 |
OpenAI Deep Research | 3.50 | – |
MiniCPM4-Survey | 3.50 | 68.73 |
3.2 MiniCPM4-MCP: Tool Integration Protocol
Protocol Advantages:
- Standardized Schema: Unified tool description format
- Dynamic Discovery: Docker-containerized tool ecosystem
- Cross-Tool Chaining: A single query triggers multi-tool workflows
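For a feel of what a standardized tool description and call look like, here is a small MCP-style example in plain Python; the field names follow the general Model Context Protocol convention, and the tool itself is hypothetical:

```python
import json

# A tool advertised by an MCP server: name, human-readable description, JSON Schema input.
tool_description = {
    "name": "execute_python",
    "description": "Run a Python snippet in a sandbox and return stdout.",
    "inputSchema": {
        "type": "object",
        "properties": {"code": {"type": "string"}},
        "required": ["code"],
    },
}

# The corresponding call the model emits after deciding to use the tool.
tool_call = {"name": "execute_python", "arguments": {"code": "print(2 + 2)"}}
print(json.dumps(tool_call, indent=2))
```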
Accuracy Comparison:
Tool Category | GPT-4o | Qwen3-8B | MiniCPM4-MCP |
---|---|---|---|
Code Execution | 85.0% | 80.0% | 90.0% |
Academic Search | 85.7% | 81.8% | 57.1% |
Document Processing | 95.8% | 94.9% | 95.1% |
Overall Accuracy | 80.2% | 83.5% | 88.3% |
4. Comprehensive Performance Evaluation
4.1 Benchmark Comparisons
Model | Parameters | Training Data | MMLU | CMMLU | GSM8K | Average |
---|---|---|---|---|---|---|
Llama3.2 | 1B | 9T | 46.89 | 23.73 | 39.76 | 34.76 |
Qwen3 | 0.6B | 36T | 42.95 | 42.05 | 61.71 | 44.93 |
MiniCPM4 | 0.5B | 8T | 55.55 | 65.22 | 52.08 | 52.99 |
Phi4 | 14B | 10T | 81.61 | 67.56 | 94.77 | 78.47 |
MiniCPM4 | 8B | 8T | 75.83 | 80.62 | 91.51 | 81.13 |
📌 Key Insight: The 8B version outperforms Phi4-14B by 13 points on Chinese reasoning (CMMLU) while training on only 8T tokens versus Phi4's 10T.
4.2 Long-Context Processing
Needle-in-Haystack Test (128K context):
- Accuracy: 100% (vs. 99.2% for full-attention baselines)
- Attention Sparsity: Only 5% of context blocks activated
- Positional Encoding: Trained at 32K, extrapolated to 128K (YaRN technique)
5. Developer Resources
5.1 Open-Source Assets
# Model Access
- HuggingFace repositories:
  - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B)
  - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B)

# Codebase
- GitHub project: [https://github.com/openbmb/minicpm](https://github.com/openbmb/minicpm)

# Deployment Platforms
- ArkInfer supported hardware:
  - NVIDIA Jetson series
  - Qualcomm Snapdragon
  - Rockchip RK3588
5.2 Edge Deployment Example
```bash
# Install ArkInfer
pip install arkinfer
```

```python
# Load the model
from arkinfer import Executor

model = Executor.load("MiniCPM4-0.5B", accelerator="rk3588")

# Long-context inference
result = model.generate(
    input_text="Summarize this 128K document's key arguments...",
    max_length=4096,
    sparse_attention=True,
)
```
Technical FAQ
❓ How to choose between 0.5B and 8B versions?
Deployment Recommendations:
- Mobile/IoT Devices → MiniCPM4-0.5B (0.1GB storage)
- Desktops/Workstations → MiniCPM4-8B (higher accuracy)
- Extreme Memory Constraints → BitCPM4 (ternary quantization)
❓ What are the advantages of on-device deployment?
Comparative Analysis:
Factor | Cloud Models | MiniCPM4 On-Device |
---|---|---|
Latency | 200-500ms | <50ms |
Privacy | Data transmission | Complete local processing |
Offline Usage | Network-dependent | Fully offline |
Cost | $0.01/1K tokens | One-time deployment |
❓ How to reproduce research results?
Key Steps:
- Data Preparation: Apply UltraClean to the FineWeb dataset
- Hyperparameter Search: Execute the ModelTunnel v2 workflow
- Four-Stage Training:
  - Stable pretraining (7T tokens)
  - Annealing phase (1T tokens)
  - Long-context extension (20B tokens)
  - SFT + RL fine-tuning
- Quantization: Use the open-sourced BitCPM4 training scripts
Conclusion: MiniCPM4’s algorithm-data-system co-design lets a model that averages above 80 on standard benchmarks run efficiently on edge devices. This breakthrough unlocks new possibilities for mobile AI, privacy-preserving computing, and real-time decision making, accelerating the transition from cloud-centric to edge-native intelligent systems.