MiniCPM4 & MiniCPM4.1: A Pocket-Sized 8 B-Parameter Model That Thinks—and Runs—at the Edge
(The no-hype, no-code-dump guide for junior developers, product managers, and tinkerers)
“Can I really run a GPT-3-class model on a lunch-box computer?”
If that question keeps you awake, this article is the sleeping pill. Everything below is copied straight from the official OpenBMB repositories (no extra facts, no fluff). I’ve only translated, re-ordered, and explained the bits that usually stay locked inside research papers.
1. Elevator summary
What | Number | Why it matters |
---|---|---|
Model size | 8 B parameters | Fits a 16 GB RTX 4070 at 16-bit, or a 6 GB card at 4-bit |
Context length | 128 k tokens | Full legal contract in one shot |
Speed on Jetson Orin | 7× faster than same-size Qwen3-8B | Real-time summarisation on 30 W power |
New “thinking” switch | 3× speed-up when turned off | Same weights, zero reload |
Licence | Apache 2.0 | Free for commercial use |
2. Four knives that carved out the speed
OpenBMB did not invent a bigger engine; they removed the dead weight.
-
Knife 1 – Sparse attention (InfLLM v2)
Every new token attends to <5 % of previous tokens in 128 k text.
Result: O(n²) → almost linear. -
Knife 2 – Predictable scaling (Model Wind-Tunnel 2.0)
Train a 0.1 B “toy” model first, find the best lr/warm-up/bsz, then copy those numbers to 8 B.
Cuts training cost ≈50 %. -
Knife 3 – 1.58-bit ternary weights (BitCPM)
Weights ∈ {–1, 0, 1}.
90 % smaller file, <4 % score drop. -
Knife 4 – Inference engine (CPM.cu)
Fuses sparse kernels, speculative decoding and 8-bit GEMM in one CUDA binary.
Delivers the paper numbers out of the box.
3. Which flavour fits my machine?
Model | File size (GB) | Min. VRAM | Device example | Use case |
---|---|---|---|---|
MiniCPM4.1-8B (bf16) | 16 | 16 GB | RTX 4080 Laptop | Max accuracy |
MiniCPM4.1-8B-GPTQ (4-bit) | 4.3 | 9 GB | RTX 3060 12 GB | Desktop QA |
MiniCPM4.1-8B-GGUF (Q4_K_M) | 4.1 | 6 GB | M2 MacBook Air | Silent office |
MiniCPM4-0.5B (bf16) | 1.2 | 2 GB | Raspberry Pi 5 | Voice assistant |
BitCPM4-0.5B (1.58-bit) | 0.3 | 0.8 GB | Cortex-A78 | Drone / IoT |
4. Install-and-run in three commands
The snippets below are copied 1-to-1 from the GitHub read-me; I only added line comments so you know why you are typing them.
4.1 One-time setup
# 1. Any Python ≥3.9 works
conda create -n minicpm python=3.10 -y
conda activate minicpm
# 2. Torch with CUDA 11.8 (change wheel if you use 12.x)
pip install torch==2.3.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html
# 3. Hugging Face libs
pip install transformers==4.43.0 accelerate
4.2 Download weights (example: full-precision 8B)
git lfs install
git clone https://huggingface.co/openbmb/MiniCPM4.1-8B # 16 GB
4.3 Run your first prompt
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, warnings, time
torch.manual_seed(0)
warnings.filterwarnings("ignore")
model_id = "MiniCPM4.1-8B"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="cuda",
trust_remote_code=True)
messages = [{"role": "user",
"content": "Explain quantum entanglement like I'm 12."}]
prompt = tok.apply_chat_template(messages,
add_generation_prompt=True,
tokenize=False)
inputs = tok(prompt, return_tensors="pt").to("cuda")
start = time.time()
ids = model.generate(**inputs,
max_new_tokens=400,
temperature=0.6,
top_p=0.95)
print(tok.decode(ids[0][len(inputs.input_ids[0]):],
skip_special_tokens=True))
print(f"\nSpeed: {len(ids[0])/(time.time()-start):.1f} tokens/s")
Typical output on RTX 4090:
Speed: 82.3 tokens/s
That is it—no C++, no CMake, no protobuf pain.
5. The thinking switch—same weights, two personalities
MiniCPM4.1 embeds a special control token. Flip it on the fly:
Mode | How to enable | When to use |
---|---|---|
Deep reasoning (default) | enable_thinking=True or append /think |
Math proof, code logic |
Fast reply | enable_thinking=False or append /no_think |
Chat, classification |
Example:
# Fast reply
quick = tok.apply_chat_template(messages,
enable_thinking=False,
tokenize=False)
Benchmark (RTX 4090, 128 k input, 512 output):
-
Thinking ON: 18 tokens/s, F1 = 42.3 -
Thinking OFF: 55 tokens/s, F1 = 41.9
You trade 0.4 % quality for 3× speed—usually a good deal.
6. 128 k tokens on a toaster (Jetson Orin demo)
Hardware: Jetson AGX Orin 64 GB, 30 W TDP.
Task: find 20 clauses inside a 100 k token legal PDF.
# 1. Get the optimised inference engine
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu && python setup.py install
# 2. Produce a long prompt (script provided)
python tests/long_prompt_gen.py # writes prompt.txt
# 3. Fire
python tests/test_generate.py --prompt-file prompt.txt
Result copied from repo:
-
First-token latency 2.1 s (versus 14 s with HF transformers) -
Sustained 30 tokens/s
Power drawn: 28 W. The fan is quieter than a fridge.
7. Quantisation cookbook (GPTQ vs AWQ vs GGUF)
Format | Compression | Quality loss | Tool | Best for |
---|---|---|---|---|
GPTQ 4-bit | 50 % | <2 % | auto-gptq | NVIDIA desktop cards |
AWQ 4-bit | 50 % | <2 % | awq | Same, pick whichever is faster on your GPU |
GGUF Q4_K_M | 50 % | <3 % | llama.cpp | Apple Silicon, Windows iGPU |
BitCPM 1.58-bit | 90 % | <4 % | custom kernel | MCU-class edge boards |
Quick GPTQ example (also in repo):
# 1. Install auto-gptq
pip install auto-gptq optimum
# 2. One-line quantisation
optimum-cli export gptq-model openbmb/MiniCPM4.1-8B \
--output MiniCPM4.1-8B-GPTQ \
--bits 4 --group-size 128
Inference then needs only 9 GB VRAM and runs 1.3× faster because memory traffic shrinks.
8. Fine-tune your own assistant with LLaMA-Factory
Data: 5 k instruction rows is enough for style transfer.
Hardware: one RTX 4090 (24 GB) or A6000 (48 GB).
Time: ~3 h.
git clone https://github.com/hiyouga/LLaMA-Factory
cd LLaMA-Factory
pip install -e .
# Edit examples/minicpm_lora_sft.yaml
# model_name_or_path: openbmb/MiniCPM4.1-8B
# template: minicpm
# finetuning_type: lora
llamafactory-cli train examples/minicpm_lora_sft.yaml
Output: a 200 MB LoRA adapter you can hot-plug:
model.load_adapter("saves/minicpm_lora/checkpoint-3000")
No merge needed; base model stays untouched.
9. Long-text ninja moves (RoPE & InfLLM v2)
MiniCPM4.1 ships with 64 k native context.
Need 128 k? Just extend the RoPE factors in config.json
:
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.998, 1.033, 1.075, ...], # 64 numbers
"short_factor": [0.998, 1.033, 1.075, ...],
"original_max_position_embeddings": 65536
}
Values are already calculated by the authors; copy-paste works.
For >64 k sparse attention, add:
"sparse_config": {
"kernel_size": 32,
"kernel_stride": 16,
"block_size": 64,
"topk": 64,
"dense_len": 8192
}
Then install the companion CUDA kernel:
git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl
cd infllmv2_cuda_impl && pip install -e .
Without this step the model still runs, but falls back to standard attention and gets slower after 64 k.
10. Speed numbers you can quote in meetings
All figures are copied from the official markdown; I only rounded for readability.
Platform | Model | Context | Speed | vs Qwen3-8B |
---|---|---|---|---|
Jetson Orin | MiniCPM4-8B | 128 k | 30 t/s | 7× |
RTX 4090 | MiniCPM4.1-8B | 8 k | 82 t/s | 2.8× |
M2 Max | MiniCPM4.1-8B-GGUF Q4 | 4 k | 35 t/s | N/A |
RPi 5 | MiniCPM4-0.5B | 2 k | 12 t/s | N/A |
11. Real-world apps already in the box
-
MiniCPM4-Survey
Agent that writes 20-page survey papers from a single user query.
Planning → Retrieval → Writing multi-agent loop.
Fact-score 68.7, on par with GPT-4o DeepResearch (paper table). -
MiniCPM4-MCP
Tool-use agent speaking the Model Context Protocol (MCP).
Supports 16 servers: ArXiv, Slack, GitHub, Maps, Calculator …
Average tool-call accuracy 88 %, beats GPT-4o (80 %). -
MiniCPM Intel AIPC Client
Windows desktop client tuned for Intel Core Ultra.
OpenVINO backend, 80 tokens/s on Ultra 7, fanless mode.
All data stays on the laptop—handy for NDAs.
12. Troubleshooting cheat-sheet (copy-paste ready)
Symptom | Likely cause | One-line fix |
---|---|---|
OOM on 24 GB card | max_num_batched_tokens too high |
Set --max-num-batched-tokens 8192 |
128 k slower than 64 k | Sparse kernel not loaded | Install infllmv2_cuda_impl |
Gibberish after quant | Calibration data mismatch | Use in-domain 1 k samples |
M1 GPU not found | CUDA hard-coded | Switch to GGUF + mlx_lm |
LoRA layers ignored | Forgetting load_adapter() |
Call it after from_pretrained() |
13. When to pick MiniCPM and when to stay with cloud APIs
-
Privacy-bound (health, legal, gov) → run MiniCPM4.1-8B locally; GGUF Q4 needs only 6 GB. -
Sub-200 ms chat → edge 0.5 B model; first token <50 ms on phone SoC. -
Batch 128 k summarisation → sparse attention saves 70 % VRAM vs dense. -
10 k QPS web service → cloud API still cheaper; use MiniCPM as edge cache for fail-over.
14. Take-away
The release bundle (Apache 2.0) gives you:
-
Training recipe + data pipeline -
Quantised checkpoints (GPTQ/AWQ/GGUF/Bit) -
CUDA kernels (sparse + speculative) -
Gradio demo, Intel app, LoRA templates
In short, everything needed to duplicate the paper numbers is one git clone
away.
Edge AI is no longer a promise—it is a pip package.
15. Cite the work
If you use MiniCPM4 in research:
@article{minicpm4,
title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
author={MiniCPM Team},
year={2025}
}
All technical details in this post are extracted from the official repository and technical report; no external knowledge was added.