MiniCPM4 & MiniCPM4.1: A Pocket-Sized 8 B-Parameter Model That Thinks—and Runs—at the Edge

(The no-hype, no-code-dump guide for junior developers, product managers, and tinkerers)

“Can I really run a GPT-3-class model on a lunch-box computer?”
If that question keeps you awake, this article is the sleeping pill. Everything below is copied straight from the official OpenBMB repositories (no extra facts, no fluff). I’ve only translated, re-ordered, and explained the bits that usually stay locked inside research papers.

1. Elevator summary

What	Number	Why it matters
Model size	8 B parameters	Fits a 16 GB RTX 4070 at 16-bit, or a 6 GB card at 4-bit
Context length	128 k tokens	Full legal contract in one shot
Speed on Jetson Orin	7× faster than same-size Qwen3-8B	Real-time summarisation on 30 W power
New “thinking” switch	3× speed-up when turned off	Same weights, zero reload
Licence	Apache 2.0	Free for commercial use

2. Four knives that carved out the speed

OpenBMB did not invent a bigger engine; they removed the dead weight.

Knife 1 – Sparse attention (InfLLM v2)
Every new token attends to <5 % of previous tokens in 128 k text.
Result: O(n²) → almost linear.
Knife 2 – Predictable scaling (Model Wind-Tunnel 2.0)
Train a 0.1 B “toy” model first, find the best lr/warm-up/bsz, then copy those numbers to 8 B.
Cuts training cost ≈50 %.
Knife 3 – 1.58-bit ternary weights (BitCPM)
Weights ∈ {–1, 0, 1}.
90 % smaller file, <4 % score drop.
Knife 4 – Inference engine (CPM.cu)
Fuses sparse kernels, speculative decoding and 8-bit GEMM in one CUDA binary.
Delivers the paper numbers out of the box.

3. Which flavour fits my machine?

Model	File size (GB)	Min. VRAM	Device example	Use case
MiniCPM4.1-8B (bf16)	16	16 GB	RTX 4080 Laptop	Max accuracy
MiniCPM4.1-8B-GPTQ (4-bit)	4.3	9 GB	RTX 3060 12 GB	Desktop QA
MiniCPM4.1-8B-GGUF (Q4_K_M)	4.1	6 GB	M2 MacBook Air	Silent office
MiniCPM4-0.5B (bf16)	1.2	2 GB	Raspberry Pi 5	Voice assistant
BitCPM4-0.5B (1.58-bit)	0.3	0.8 GB	Cortex-A78	Drone / IoT

4. Install-and-run in three commands

The snippets below are copied 1-to-1 from the GitHub read-me; I only added line comments so you know why you are typing them.

4.1 One-time setup

# 1. Any Python ≥3.9 works
conda create -n minicpm python=3.10 -y
conda activate minicpm

# 2. Torch with CUDA 11.8 (change wheel if you use 12.x)
pip install torch==2.3.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html

# 3. Hugging Face libs
pip install transformers==4.43.0 accelerate

4.2 Download weights (example: full-precision 8B)

git lfs install
git clone https://huggingface.co/openbmb/MiniCPM4.1-8B   # 16 GB

4.3 Run your first prompt

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, warnings, time
torch.manual_seed(0)
warnings.filterwarnings("ignore")

model_id = "MiniCPM4.1-8B"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="cuda",
        trust_remote_code=True)

messages = [{"role": "user",
             "content": "Explain quantum entanglement like I'm 12."}]

prompt = tok.apply_chat_template(messages,
                                 add_generation_prompt=True,
                                 tokenize=False)

inputs = tok(prompt, return_tensors="pt").to("cuda")
start = time.time()
ids = model.generate(**inputs,
                     max_new_tokens=400,
                     temperature=0.6,
                     top_p=0.95)
print(tok.decode(ids[0][len(inputs.input_ids[0]):],
                 skip_special_tokens=True))
print(f"\nSpeed: {len(ids[0])/(time.time()-start):.1f} tokens/s")

Typical output on RTX 4090:
Speed: 82.3 tokens/s

That is it—no C++, no CMake, no protobuf pain.

5. The thinking switch—same weights, two personalities

MiniCPM4.1 embeds a special control token. Flip it on the fly:

Mode	How to enable	When to use
Deep reasoning (default)	`enable_thinking=True` or append `/think`	Math proof, code logic
Fast reply	`enable_thinking=False` or append `/no_think`	Chat, classification

Example:

# Fast reply
quick = tok.apply_chat_template(messages,
                                enable_thinking=False,
                                tokenize=False)

Benchmark (RTX 4090, 128 k input, 512 output):

Thinking ON: 18 tokens/s, F1 = 42.3
Thinking OFF: 55 tokens/s, F1 = 41.9

You trade 0.4 % quality for 3× speed—usually a good deal.

6. 128 k tokens on a toaster (Jetson Orin demo)

Hardware: Jetson AGX Orin 64 GB, 30 W TDP.
Task: find 20 clauses inside a 100 k token legal PDF.

# 1. Get the optimised inference engine
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu && python setup.py install

# 2. Produce a long prompt (script provided)
python tests/long_prompt_gen.py      # writes prompt.txt

# 3. Fire
python tests/test_generate.py --prompt-file prompt.txt

Result copied from repo:

First-token latency 2.1 s (versus 14 s with HF transformers)
Sustained 30 tokens/s

Power drawn: 28 W. The fan is quieter than a fridge.

7. Quantisation cookbook (GPTQ vs AWQ vs GGUF)

Format	Compression	Quality loss	Tool	Best for
GPTQ 4-bit	50 %	<2 %	auto-gptq	NVIDIA desktop cards
AWQ 4-bit	50 %	<2 %	awq	Same, pick whichever is faster on your GPU
GGUF Q4_K_M	50 %	<3 %	llama.cpp	Apple Silicon, Windows iGPU
BitCPM 1.58-bit	90 %	<4 %	custom kernel	MCU-class edge boards

Quick GPTQ example (also in repo):

# 1. Install auto-gptq
pip install auto-gptq optimum

# 2. One-line quantisation
optimum-cli export gptq-model openbmb/MiniCPM4.1-8B \
                           --output MiniCPM4.1-8B-GPTQ \
                           --bits 4 --group-size 128

Inference then needs only 9 GB VRAM and runs 1.3× faster because memory traffic shrinks.

8. Fine-tune your own assistant with LLaMA-Factory

Data: 5 k instruction rows is enough for style transfer.
Hardware: one RTX 4090 (24 GB) or A6000 (48 GB).
Time: ~3 h.

git clone https://github.com/hiyouga/LLaMA-Factory
cd LLaMA-Factory
pip install -e .

# Edit examples/minicpm_lora_sft.yaml
# model_name_or_path: openbmb/MiniCPM4.1-8B
# template: minicpm
# finetuning_type: lora

llamafactory-cli train examples/minicpm_lora_sft.yaml

Output: a 200 MB LoRA adapter you can hot-plug:

model.load_adapter("saves/minicpm_lora/checkpoint-3000")

No merge needed; base model stays untouched.

9. Long-text ninja moves (RoPE & InfLLM v2)

MiniCPM4.1 ships with 64 k native context.
Need 128 k? Just extend the RoPE factors in config.json:

"rope_scaling": {
  "rope_type": "longrope",
  "long_factor":  [0.998, 1.033, 1.075, ...],   # 64 numbers
  "short_factor": [0.998, 1.033, 1.075, ...],
  "original_max_position_embeddings": 65536
}

Values are already calculated by the authors; copy-paste works.
For >64 k sparse attention, add:

"sparse_config": {
  "kernel_size": 32,
  "kernel_stride": 16,
  "block_size": 64,
  "topk": 64,
  "dense_len": 8192
}

Then install the companion CUDA kernel:

git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl
cd infllmv2_cuda_impl && pip install -e .

Without this step the model still runs, but falls back to standard attention and gets slower after 64 k.

10. Speed numbers you can quote in meetings

All figures are copied from the official markdown; I only rounded for readability.

Platform	Model	Context	Speed	vs Qwen3-8B
Jetson Orin	MiniCPM4-8B	128 k	30 t/s	7×
RTX 4090	MiniCPM4.1-8B	8 k	82 t/s	2.8×
M2 Max	MiniCPM4.1-8B-GGUF Q4	4 k	35 t/s	N/A
RPi 5	MiniCPM4-0.5B	2 k	12 t/s	N/A

11. Real-world apps already in the box

MiniCPM4-Survey
Agent that writes 20-page survey papers from a single user query.
Planning → Retrieval → Writing multi-agent loop.
Fact-score 68.7, on par with GPT-4o DeepResearch (paper table).
MiniCPM4-MCP
Tool-use agent speaking the Model Context Protocol (MCP).
Supports 16 servers: ArXiv, Slack, GitHub, Maps, Calculator …
Average tool-call accuracy 88 %, beats GPT-4o (80 %).
MiniCPM Intel AIPC Client
Windows desktop client tuned for Intel Core Ultra.
OpenVINO backend, 80 tokens/s on Ultra 7, fanless mode.
All data stays on the laptop—handy for NDAs.

12. Troubleshooting cheat-sheet (copy-paste ready)

Symptom	Likely cause	One-line fix
OOM on 24 GB card	`max_num_batched_tokens` too high	Set `--max-num-batched-tokens 8192`
128 k slower than 64 k	Sparse kernel not loaded	Install `infllmv2_cuda_impl`
Gibberish after quant	Calibration data mismatch	Use in-domain 1 k samples
M1 GPU not found	CUDA hard-coded	Switch to GGUF + `mlx_lm`
LoRA layers ignored	Forgetting `load_adapter()`	Call it after `from_pretrained()`

13. When to pick MiniCPM and when to stay with cloud APIs

Privacy-bound (health, legal, gov) → run MiniCPM4.1-8B locally; GGUF Q4 needs only 6 GB.
Sub-200 ms chat → edge 0.5 B model; first token <50 ms on phone SoC.
Batch 128 k summarisation → sparse attention saves 70 % VRAM vs dense.
10 k QPS web service → cloud API still cheaper; use MiniCPM as edge cache for fail-over.

14. Take-away

The release bundle (Apache 2.0) gives you:

Training recipe + data pipeline
Quantised checkpoints (GPTQ/AWQ/GGUF/Bit)
CUDA kernels (sparse + speculative)
Gradio demo, Intel app, LoRA templates

In short, everything needed to duplicate the paper numbers is one git clone away.
Edge AI is no longer a promise—it is a pip package.

15. Cite the work

If you use MiniCPM4 in research:

@article{minicpm4,
  title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}

All technical details in this post are extracted from the official repository and technical report; no external knowledge was added.

MiniCPM4 Revealed: How Edge Devices Run GPT-3-Class Models at 30W Power