MiniCPM4: Run Powerful Language Models on Your Phone or Laptop

Achieve 128K context processing with 78% less training data using 0.5B/8B parameter models optimized for edge devices

Why We Need On-Device Language Models

While cloud-based AI models like ChatGPT dominate the landscape, edge devices (smartphones, laptops, IoT systems) have remained largely excluded due to computational constraints. Traditional large language models face three fundamental barriers:

  1. Compute Overload: Processing 128K context requires calculating all token relationships
  2. Memory Constraints: Loading an 8B parameter model demands ~32GB RAM
  3. Training Costs: Standard models require 36 trillion training tokens

MiniCPM Team’s breakthrough solution, MiniCPM4, shatters these limitations through four-dimensional innovation:

  • 📌 Model Architecture: Trainable sparse attention (InfLLM v2)
  • 📌 Training Data: UltraClean filtering + UltraChat v2 instruction dataset
  • 📌 Training Algorithms: ModelTunnel v2 hyperparameter search + BitCPM quantization
  • 📌 Inference Systems: CPM.cu engine + ArkInfer cross-platform deployment

1. Core Technology Breakdown

1.1 InfLLM v2: Dynamic Sparse Attention

The Challenge: Standard attention mechanisms face quadratic computation growth. Processing 128K tokens requires >16 billion relationship calculations.

Our Solution:

graph LR
A[Input Text] --> B[Split into 128-token blocks]
B --> C[Construct Semantic Kernels] 
C --> D[Dynamically Select Top-K Relevant Blocks]
D --> E[Compute Attention Only on Critical Blocks]
  • Semantic Kernels: Overlapping text segments (32 tokens/kernel, stride 16)
  • Dynamic Block Selection: Each query token processes only Top-6 relevant blocks
  • Local Preservation: Mandatory inclusion of initial tokens + sliding window context
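
To make the mechanism concrete, here is a minimal NumPy sketch of the block selection and sparse attention described above, for a single query vector and a single head. The function names, mean-pooling choice, and default parameters are illustrative and are not the released CUDA kernel:

import numpy as np

def select_blocks(query, keys, block_size=128, kernel_size=32, stride=16,
                  top_k=6, num_init_blocks=1, num_local_blocks=2):
    """Score key blocks for one query vector and return the block indices to attend to."""
    num_blocks = int(np.ceil(len(keys) / block_size))

    # Semantic kernels: mean-pool overlapping 32-token windows (stride 16),
    # then keep the best kernel score inside each 128-token block.
    block_scores = np.full(num_blocks, -np.inf)
    for start in range(0, max(len(keys) - kernel_size + 1, 1), stride):
        kernel_repr = keys[start:start + kernel_size].mean(axis=0)
        score = float(query @ kernel_repr)
        block_idx = start // block_size
        block_scores[block_idx] = max(block_scores[block_idx], score)

    # Dynamic selection: top-k highest-scoring blocks for this query.
    selected = set(np.argsort(block_scores)[-top_k:].tolist())

    # Local preservation: always include the initial blocks and a sliding
    # window of the most recent blocks.
    selected.update(range(min(num_init_blocks, num_blocks)))
    selected.update(range(max(0, num_blocks - num_local_blocks), num_blocks))
    return sorted(selected)

def sparse_attention(query, keys, values, block_size=128):
    """Attend only over the selected blocks (single query vector, single head)."""
    idx = np.concatenate([np.arange(b * block_size, min((b + 1) * block_size, len(keys)))
                          for b in select_blocks(query, keys, block_size)])
    scores = keys[idx] @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values[idx]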

Performance Benchmarks:

| Context Length | Standard Attention | InfLLM v2 | Speedup |
|---|---|---|---|
| 4K | 16M operations | 3.2M operations | 5× |
| 128K | 16.4B operations | 1.2B operations | 13.7× |

On Jetson AGX Orin hardware, MiniCPM4 outperforms Qwen3-8B while reducing memory usage by 80%.

🔍 Technical Insight: The optimized kernel avoids token-level computation during tree-based (speculative) decoding by compressing per-token attention masks into uint64 bitmasks.
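
As a rough illustration of the bitmask idea (not the actual CUDA kernel), a speculative-decoding tree mask can be packed into one uint64 per draft token like this:

def build_tree_masks(parents):
    """Pack each draft token's ancestor set into a single uint64 bitmask.

    parents[i] is the index of token i's parent in the speculation tree
    (-1 for the root). Token j may attend to token i iff bit i of masks[j] is set.
    Assumes at most 64 draft tokens, so one uint64 per token suffices.
    """
    masks = []
    for i, p in enumerate(parents):
        mask = 1 << i                       # every token attends to itself
        if p >= 0:
            mask |= masks[p]                # inherit all of the parent's ancestors
        masks.append(mask)
    return masks

# Example: a root with two children; child 1 spawns a grandchild.
#   0 -> 1 -> 3
#   0 -> 2
masks = build_tree_masks([-1, 0, 0, 1])
assert masks[3] == 0b1011                   # token 3 sees {0, 1, 3} but not 2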

1.2 UltraClean: High-Quality Training Data Pipeline

The Challenge: Web-crawled datasets contain ads, templated text, and low-value content requiring trillions of tokens for baseline performance.

Our Method:

def UltraClean_workflow(raw_data, max_rounds=5):
    # Efficient verification: fine-tune a nearly-trained model on candidate data
    candidate_data = load_corpus('FineWeb')
    model = load_pretrained('MiniCPM-1.2B')
    perf_gain = evaluate_finetune(model, candidate_data)

    # Classifier training: treat high-gain data as positive samples
    high_quality_seeds = filter_by_gain(perf_gain, threshold=0.3)
    classifier = train_fastText_model(high_quality_seeds)

    # Iterative optimization: re-filter and rebalance until gains stabilize
    for _ in range(max_rounds):
        filtered_data = classifier.apply(raw_data)
        perf_gain = validate(filtered_data)
        if has_stabilized(perf_gain):
            break
        adjust_classifier(classifier, perf_gain)
    return filtered_data  # released as UltraFineWeb
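
The `train_fastText_model` step above might look roughly like this with the open-source `fasttext` package; the file name, labels, and hyperparameters are assumptions for illustration, not the released recipe:

import fasttext

# Hypothetical file layout: one document per line, prefixed with a fastText label,
# e.g. "__label__high_quality <text>" or "__label__low_quality <text>".
classifier = fasttext.train_supervised(
    input="quality_seeds.txt",   # high-gain seeds as positives, random web text as negatives
    epoch=5,
    lr=0.1,
    wordNgrams=2,
    dim=100,
)

label, prob = classifier.predict("An example web document to score ...")
keep = label[0] == "__label__high_quality" and prob[0] > 0.5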

Dataset Comparison:

| Dataset | MMLU Score | C-Eval Score | Training Tokens |
|---|---|---|---|
| FineWeb | 42.28 | 33.95 | 36T |
| FineWeb-edu | 44.56 | 34.17 | 24T |
| UltraFineWeb | 45.89 | 34.26 | 8T |

Achieves superior performance with 78% less data than conventional approaches.

1.3 ModelTunnel v2: Hyperparameter Search Revolution

Innovations:

  • ScalingBench Metric: GPT-4o generated reasoning chains replace loss metrics
  • μP Architecture: Parameterization scaling enables cross-model hyperparameter transfer
  • Multi-Token Prediction: Simultaneous prediction of 4 future tokens
graph TB
S[Small Model Experiments] --> A[Search Learning Rate/Batch Size]
A --> B[Establish ScalingBench Performance Mapping]
B --> C[Transfer to 8B Model]
C --> D[FP8 Mixed-Precision Training]
D --> E[Multi-Token Supervision]

Experimental result: Reduces hyperparameter search cost from 4,600 GPU hours to 110 hours with <0.5% error margin.
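
For intuition, here is a toy sketch of the μP-style transfer rule that lets hyperparameters found on a small proxy model carry over to the 8B model; the widths and learning rates below are invented, and the real ModelTunnel v2 workflow transfers more than just the learning rate:

def transfer_lr(base_lr, base_width, target_width):
    """μP-style rule of thumb: learning rates for matrix-like (hidden) weights scale
    inversely with width when moving from a narrow proxy model to a wider target;
    vector-like parameters (embeddings, norms) keep the base learning rate."""
    return base_lr * base_width / target_width

# Hypothetical numbers: hyperparameters searched on a narrow proxy model ...
proxy_lr, proxy_width = 1e-2, 512
# ... then transferred to the full-width target model without re-searching.
target_lr = transfer_lr(proxy_lr, proxy_width, target_width=4096)  # -> 1.25e-3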

1.4 BitCPM4: Ternary Quantized LLMs

Edge Deployment Challenge: 16GB storage needed for FP16 8B models.

Our Approach:

  • Two-Stage Quantization: Full-precision training followed by quantization-aware training (QAT) on 350B tokens
  • Weight Ternary Encoding: Parameters constrained to {-1, 0, +1}
  • Full-Precision Activations: Prevents error accumulation

Quantization Results:

| Model | Parameters | Precision | MMLU | Storage |
|---|---|---|---|---|
| Qwen3 | 0.6B | BF16 | 42.95 | 1.2GB |
| BitCPM4 | 0.5B | Ternary | 49.88 | 0.1GB |
| BitNet | 2B | Ternary | 53.17 | 0.25GB |

💡 Key Technique: Learning rate rewarmup (5e-3) with 128×128 block quantization.
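
A rough NumPy sketch of ternary weight encoding with per-128×128-block scaling (the absmean scale follows the BitNet line of work; BitCPM4's exact QAT procedure lives in the released training scripts):

import numpy as np

def ternarize_blockwise(W, block=128, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with one FP scale per 128x128 block."""
    Wq = np.zeros_like(W, dtype=np.int8)
    scales = np.zeros((int(np.ceil(W.shape[0] / block)), int(np.ceil(W.shape[1] / block))))
    for bi, i in enumerate(range(0, W.shape[0], block)):
        for bj, j in enumerate(range(0, W.shape[1], block)):
            tile = W[i:i + block, j:j + block]
            scale = np.abs(tile).mean() + eps            # per-block absmean scale
            scales[bi, bj] = scale
            Wq[i:i + block, j:j + block] = np.clip(np.round(tile / scale), -1, 1)
    return Wq, scales

def dequantize(Wq, scales, block=128):
    """Reconstruct an FP approximation; activations stay full precision at inference."""
    W = np.zeros(Wq.shape, dtype=np.float32)
    for bi, i in enumerate(range(0, Wq.shape[0], block)):
        for bj, j in enumerate(range(0, Wq.shape[1], block)):
            W[i:i + block, j:j + block] = Wq[i:i + block, j:j + block] * scales[bi, bj]
    return W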


2. Inference Deployment Guide

2.1 CPM.cu: Edge-Optimized Inference Engine

Core Architecture:

class CPM_cu:
    def __init__(self, model):
        self.model = model
        self.fr_spec = FR_Spec(top_vocab=0.25)         # draft over the top 25% highest-frequency tokens
        self.p_gptq = PrefixAwareGPTQ(ignore_first=4)  # skip initial-token outliers during calibration
        self.sparse_attn = InfLLMv2_Kernel(block_size=128)

    def execute(self, input_text):
        # Prefix-aware quantization of the model weights
        quantized_weights = self.p_gptq.compress(self.model)

        # Frequency-ranked speculative drafting
        draft_sequence = self.fr_spec.generate(input_text)

        # Sparse-attention verification of the drafted tokens
        output = self.sparse_attn.verify(draft_sequence)
        return output
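
The `fr_spec` component above relies on drafting over only the most frequent slice of the vocabulary while the target model still verifies over the full vocabulary. A simplified sketch, with illustrative function names and a greedy draft step:

import numpy as np

def build_frequency_subset(token_counts, keep_ratio=0.25):
    """Indices of the most frequent tokens (the draft model only scores these)."""
    k = max(1, int(len(token_counts) * keep_ratio))
    return np.argsort(token_counts)[::-1][:k]

def draft_next_token(draft_logits, frequent_ids):
    """Greedy draft step restricted to the high-frequency vocabulary subset.
    The target model later verifies drafts over the full vocabulary, so
    correctness is preserved even if the true next token is outside the subset."""
    restricted = draft_logits[frequent_ids]
    return int(frequent_ids[int(np.argmax(restricted))])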

Speed Comparison (RTX 4090):

| Model | Context Length | Throughput (tokens/sec) |
|---|---|---|
| Qwen3-8B | 32K | 18.7 |
| GLM4-9B | 32K | 22.4 |
| MiniCPM4 | 32K | 56.3 |
| MiniCPM4 | 128K | 41.9 |

2.2 ArkInfer: Universal Deployment System

Cross-Platform Design:

graph LR
A[Unified Tensor API] --> B[Abstract Executor]
B --> C[Adaptation Layer]
C --> D[NeuroPilot-MTK]
C --> E[Genie-Qualcomm]
C --> F[RK-LLM-Rockchip]
C --> G[llama.cpp-CPU]
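
One way to realize the adaptation layer is a plain adapter pattern: a small abstract executor that each backend implements. This is a design sketch under that assumption, not ArkInfer's actual API:

from abc import ABC, abstractmethod

class Executor(ABC):
    """Abstract executor: one interface, many hardware backends."""
    @abstractmethod
    def load(self, model_path: str) -> None: ...
    @abstractmethod
    def generate(self, prompt: str, max_length: int = 512) -> str: ...

class LlamaCppExecutor(Executor):
    def load(self, model_path: str) -> None:
        ...  # wrap llama.cpp bindings for CPU targets

    def generate(self, prompt: str, max_length: int = 512) -> str:
        ...  # call into the llama.cpp runtime

BACKENDS = {
    "llama.cpp": LlamaCppExecutor,
    # "neuropilot": NeuroPilotExecutor,   # MediaTek NPUs
    # "genie": GenieExecutor,             # Qualcomm
    # "rkllm": RKLLMExecutor,             # Rockchip
}

def get_executor(backend: str) -> Executor:
    return BACKENDS[backend]()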

Key Features:

  • Toolchain Validation: Auto-verify tool functionality (Claude-4 generated test cases)
  • Constrained Decoding: Enforce JSON/SQL output formats
  • Speculative Execution: BiTA algorithm integration
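
Constrained decoding can be approximated by masking out any token that would break the target format before sampling. A simplified greedy sketch, where `is_valid_prefix` stands in for a real JSON/SQL grammar checker:

import numpy as np

def constrained_decode_step(logits, token_strings, partial_output, is_valid_prefix):
    """One greedy decoding step that only considers tokens keeping the output a
    valid prefix of the target format (e.g. JSON)."""
    masked = np.full_like(logits, -np.inf)
    for tok_id, tok_str in enumerate(token_strings):
        if is_valid_prefix(partial_output + tok_str):
            masked[tok_id] = logits[tok_id]
    return int(np.argmax(masked))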

Real-world performance: Snapdragon 8 Gen3 smartphones process 128K context in <8 seconds.


3. Real-World Applications

3.1 MiniCPM4-Survey: Automated Research Assistant

Workflow:

User Query → Outline Planning → Knowledge Retrieval → Multi-pass Writing → RL Optimization

Training Methodology:

  1. Supervised Fine-Tuning: 61,684 academic samples
  2. Chapter-Level RL: 12-dimensional reward (structure, factuality, novelty)
  3. Document-Level RL: Parallel environment interactions
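
The multi-dimensional reward is ultimately collapsed into a scalar for the RL updates. A heavily simplified sketch with made-up weights and only three of the twelve dimensions shown:

# Hypothetical weights for a few of the 12 chapter-level reward dimensions;
# the actual reward model and weighting are defined by the released training code.
REWARD_WEIGHTS = {"structure": 0.1, "factuality": 0.2, "novelty": 0.1}

def chapter_reward(scores: dict[str, float]) -> float:
    """Collapse per-dimension scores (each in [0, 1]) into one scalar RL reward."""
    return sum(REWARD_WEIGHTS[k] * scores.get(k, 0.0) for k in REWARD_WEIGHTS)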

Performance (SurveyEval Benchmark):

| System | Content Quality | FactScore |
|---|---|---|
| Naive RAG | 3.04 | 43.68 |
| OpenAI Deep Research | 3.50 | |
| MiniCPM4-Survey | 3.50 | 68.73 |

3.2 MiniCPM4-MCP: Tool Integration Protocol

Protocol Advantages:

  • Standardized Schema: Unified tool description format
  • Dynamic Discovery: Docker-containerized tool ecosystem
  • Cross-Tool Chaining: Single query triggers multi-tool workflows
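
For concreteness, an MCP-style tool description (JSON Schema input) and a matching call, written as Python dicts; the tool name and field values are invented for illustration:

search_tool = {
    "name": "academic_search",
    "description": "Search scholarly papers by keyword.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "max_results": {"type": "integer", "default": 10},
        },
        "required": ["query"],
    },
}

# A model-issued call against that schema:
tool_call = {"name": "academic_search",
             "arguments": {"query": "sparse attention", "max_results": 5}}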

Accuracy Comparison:

| Tool Category | GPT-4o | Qwen3-8B | MiniCPM4-MCP |
|---|---|---|---|
| Code Execution | 85.0% | 80.0% | 90.0% |
| Academic Search | 85.7% | 81.8% | 57.1% |
| Document Processing | 95.8% | 94.9% | 95.1% |
| Overall Accuracy | 80.2% | 83.5% | 88.3% |

4. Comprehensive Performance Evaluation

4.1 Benchmark Comparisons

| Model | Parameters | Training Data | MMLU | CMMLU | GSM8K | Average |
|---|---|---|---|---|---|---|
| Llama3.2 | 1B | 9T | 46.89 | 23.73 | 39.76 | 34.76 |
| Qwen3 | 0.6B | 36T | 42.95 | 42.05 | 61.71 | 44.93 |
| MiniCPM4 | 0.5B | 8T | 55.55 | 65.22 | 52.08 | 52.99 |
| Phi4 | 14B | 10T | 81.61 | 67.56 | 94.77 | 78.47 |
| MiniCPM4 | 8B | 8T | 75.83 | 80.62 | 91.51 | 81.13 |

📌 Key Insight: The 8B version outperforms Phi4-14B by about 13 points on the Chinese-language CMMLU benchmark while using fewer training tokens (8T vs. 10T) and fewer parameters (8B vs. 14B).

4.2 Long-Context Processing

Needle-in-Haystack Test (128K context):

  • Accuracy: 100% (vs. 99.2% for full-attention baselines)
  • Attention Sparsity: Only 5% context blocks activated
  • Positional Encoding: 32K training → 128K extrapolation (YaRN technique)
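
As a simplified picture of the context extension, stretching positions by 4× (32K → 128K) can be expressed as scaling the rotary (RoPE) frequencies; YaRN, which MiniCPM4 actually uses, additionally interpolates per frequency band and rescales attention:

import numpy as np

def rope_frequencies(dim, base=10000.0, scale=1.0):
    """Standard RoPE inverse frequencies; scale > 1 stretches positions so a model
    trained on 32K positions can address 128K (scale = 4). This is plain position
    interpolation for illustration only."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return inv_freq / scale

# Trained with 32K context, extended to 128K at inference time.
freqs_128k = rope_frequencies(dim=128, scale=128 / 32)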

5. Developer Resources

5.1 Open-Source Assets

# Model Access
- HuggingFace Repository:
  - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B)
  - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B)

# Codebase
- GitHub Project:  
  [https://github.com/openbmb/minicpm](https://github.com/openbmb/minicpm)

# Deployment Platforms
- ArkInfer Supported Hardware:
  - NVIDIA Jetson Series
  - Qualcomm Snapdragon
  - Rockchip RK3588

5.2 Edge Deployment Example

# Install ArkInfer
pip install arkinfer

# Load Model
from arkinfer import Executor
model = Executor.load("MiniCPM4-0.5B", accelerator="rk3588")

# Long-Context Inference
result = model.generate(
   input_text="Summarize this 128K document's key arguments...",
   max_length=4096,
   sparse_attention=True
)

Technical FAQ

❓ How to choose between 0.5B and 8B versions?

Deployment Recommendations:

  • Mobile/IoT Devices → MiniCPM4-0.5B (0.1GB storage)
  • Desktops/Workstations → MiniCPM4-8B (higher accuracy)
  • Extreme Memory Constraints → BitCPM4 (ternary quantization)

❓ What are the advantages of on-device deployment?

Comparative Analysis:

| Factor | Cloud Models | MiniCPM4 On-Device |
|---|---|---|
| Latency | 200-500ms | <50ms |
| Privacy | Data transmission | Complete local processing |
| Offline Usage | Network-dependent | Fully offline |
| Cost | $0.01/1K tokens | One-time deployment |

❓ How to reproduce research results?

Key Steps:

  1. Data Preparation: Apply UltraClean to FineWeb dataset
  2. Hyperparameter Search: Execute ModelTunnel v2 workflow
  3. Four-Stage Training:

    • Stable pretraining (7T tokens)
    • Annealing phase (1T tokens)
    • Long-context extension (20B tokens)
    • SFT + RL fine-tuning
  4. Quantization: Use open-sourced BitCPM4 training scripts

Conclusion: MiniCPM4’s algorithm-data-system co-design enables an LLM that averages above 80 on standard benchmarks to run efficiently on edge devices. This breakthrough unlocks new possibilities for mobile AI, privacy-preserving computing, and real-time decision making, accelerating the transition from cloud-centric to edge-native intelligent systems.