MiniCPM4 & MiniCPM4.1: A Pocket-Sized 8 B-Parameter Model That Thinks—and Runs—at the Edge

(The no-hype, no-code-dump guide for junior developers, product managers, and tinkerers)


“Can I really run a GPT-3-class model on a lunch-box computer?”
If that question keeps you awake, this article is the sleeping pill. Everything below is copied straight from the official OpenBMB repositories (no extra facts, no fluff). I’ve only translated, re-ordered, and explained the bits that usually stay locked inside research papers.


1. Elevator summary

| What | Number | Why it matters |
|---|---|---|
| Model size | 8 B parameters | Fits a 16 GB GPU at 16-bit, or a 6 GB card at 4-bit |
| Context length | 128 k tokens | A full legal contract in one shot |
| Speed on Jetson Orin | 7× faster than the same-size Qwen3-8B | Real-time summarisation at 30 W |
| New “thinking” switch | 3× speed-up when turned off | Same weights, zero reload |
| Licence | Apache 2.0 | Free for commercial use |

2. Four knives that carved out the speed

OpenBMB did not invent a bigger engine; they removed the dead weight.

  1. Knife 1 – Sparse attention (InfLLM v2)
    Every new token attends to <5 % of the previous tokens in 128 k text (see the toy sketch after this list).
    Result: O(n²) → almost linear.

  2. Knife 2 – Predictable scaling (Model Wind-Tunnel 2.0)
    Train a 0.1 B “toy” model first, find the best learning rate, warm-up schedule and batch size, then copy those numbers to 8 B.
    Cuts training cost ≈50 %.

  3. Knife 3 – 1.58-bit ternary weights (BitCPM)
    Weights ∈ {–1, 0, 1}.
    90 % smaller file, <4 % score drop.

  4. Knife 4 – Inference engine (CPM.cu)
    Fuses sparse kernels, speculative decoding and 8-bit GEMM in one CUDA binary.
    Delivers the paper numbers out of the box.
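
To make Knife 1 concrete, here is a toy PyTorch sketch of the “block top-k” idea: score whole blocks of cached keys cheaply, keep only the best blocks, and run exact attention inside them. This is my own illustration, not the InfLLM v2 kernel, and the block size / top-k values are placeholders (the real ones live in sparse_config, see Section 9).

import torch

def block_topk_attention(q, k, v, block_size=64, topk=8):
    """q: (1, d) current query; k, v: (n, d) cached keys / values."""
    n, d = k.shape
    n_blocks = (n + block_size - 1) // block_size
    pad = n_blocks * block_size - n
    k_pad = torch.cat([k, torch.zeros(pad, d)], dim=0)
    # 1. Cheap block scores: query against the mean key of each block
    block_means = k_pad.view(n_blocks, block_size, d).mean(dim=1)   # (n_blocks, d)
    block_scores = block_means @ q.squeeze(0)                       # (n_blocks,)
    keep = block_scores.topk(min(topk, n_blocks)).indices.tolist()
    # 2. Exact attention, but only over tokens inside the selected blocks
    idx = torch.cat([torch.arange(b * block_size, min((b + 1) * block_size, n))
                     for b in keep])
    attn = torch.softmax(q @ k[idx].T / d ** 0.5, dim=-1)           # (1, |idx|)
    return attn @ v[idx]                                            # (1, d)

q, k, v = torch.randn(1, 64), torch.randn(4096, 64), torch.randn(4096, 64)
print(block_topk_attention(q, k, v).shape)   # -> torch.Size([1, 64])

The cost of step 1 grows with the number of blocks, not the number of tokens, which is where the near-linear behaviour comes from.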


3. Which flavour fits my machine?

| Model | File size (GB) | Min. VRAM | Device example | Use case |
|---|---|---|---|---|
| MiniCPM4.1-8B (bf16) | 16 | 16 GB | RTX 4080 Laptop | Max accuracy |
| MiniCPM4.1-8B-GPTQ (4-bit) | 4.3 | 9 GB | RTX 3060 12 GB | Desktop QA |
| MiniCPM4.1-8B-GGUF (Q4_K_M) | 4.1 | 6 GB | M2 MacBook Air | Silent office |
| MiniCPM4-0.5B (bf16) | 1.2 | 2 GB | Raspberry Pi 5 | Voice assistant |
| BitCPM4-0.5B (1.58-bit) | 0.3 | 0.8 GB | Cortex-A78 | Drone / IoT |
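
The file sizes are easy to sanity-check: weight bytes are roughly parameters × bytes per weight. This is my own back-of-the-envelope; real checkpoints are a little larger because of embeddings, quantisation scales and metadata, and the KV cache comes on top at runtime.

# Rough rule of thumb: file size ≈ parameters × bytes per weight
params = 8e9
for name, bytes_per_weight in [("bf16", 2.0), ("4-bit", 0.5)]:
    print(f"MiniCPM4.1-8B @ {name}: ~{params * bytes_per_weight / 1e9:.1f} GB")
# -> ~16.0 GB and ~4.0 GB, matching the table within metadata overhead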

4. Install-and-run in three commands

The snippets below follow the GitHub README; I only added line comments so you know why you are typing them.

4.1 One-time setup

# 1. Any Python ≥3.9 works
conda create -n minicpm python=3.10 -y
conda activate minicpm

# 2. Torch with CUDA 11.8 (change wheel if you use 12.x)
pip install torch==2.3.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html

# 3. Hugging Face libs
pip install transformers==4.43.0 accelerate

4.2 Download weights (example: full-precision 8B)

git lfs install
git clone https://huggingface.co/openbmb/MiniCPM4.1-8B   # 16 GB

4.3 Run your first prompt

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, warnings, time
torch.manual_seed(0)
warnings.filterwarnings("ignore")

model_id = "MiniCPM4.1-8B"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="cuda",
        trust_remote_code=True)

messages = [{"role": "user",
             "content": "Explain quantum entanglement like I'm 12."}]

prompt = tok.apply_chat_template(messages,
                                 add_generation_prompt=True,
                                 tokenize=False)

inputs = tok(prompt, return_tensors="pt").to("cuda")
start = time.time()
ids = model.generate(**inputs,
                     max_new_tokens=400,
                     do_sample=True,        # needed for temperature / top_p to apply
                     temperature=0.6,
                     top_p=0.95)
new_tokens = ids[0][inputs.input_ids.shape[1]:]   # drop the echoed prompt
print(tok.decode(new_tokens, skip_special_tokens=True))
print(f"\nSpeed: {len(new_tokens)/(time.time()-start):.1f} tokens/s")

Typical output on RTX 4090:
Speed: 82.3 tokens/s

That is it—no C++, no CMake, no protobuf pain.


5. The thinking switch—same weights, two personalities

MiniCPM4.1 embeds a special control token. Flip it on the fly:

| Mode | How to enable | When to use |
|---|---|---|
| Deep reasoning (default) | enable_thinking=True or append /think | Math proofs, code logic |
| Fast reply | enable_thinking=False or append /no_think | Chat, classification |

Example:

# Fast reply
quick = tok.apply_chat_template(messages,
                                enable_thinking=False,
                                tokenize=False)

Benchmark (RTX 4090, 128 k input, 512 output):

  • Thinking ON: 18 tokens/s, F1 = 42.3
  • Thinking OFF: 55 tokens/s, F1 = 41.9

You trade 0.4 F1 points (roughly 1 %) for a 3× speed-up, which is usually a good deal.
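
To reproduce the trade-off on your own prompt, here is a minimal sketch reusing model, tok and messages from Section 4.3. It assumes the chat template accepts enable_thinking exactly as shown in the table above; your timings will differ from the repo's.

import time

for thinking in (True, False):
    prompt = tok.apply_chat_template(messages,
                                     add_generation_prompt=True,
                                     enable_thinking=thinking,   # flag from the table above
                                     tokenize=False)
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=256)
    n_new = out.shape[1] - inputs.input_ids.shape[1]
    print(f"thinking={thinking}: {n_new/(time.time()-start):.1f} tokens/s")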


6. 128 k tokens on a toaster (Jetson Orin demo)

Hardware: Jetson AGX Orin 64 GB, 30 W TDP.
Task: find 20 clauses inside a 100 k token legal PDF.

# 1. Get the optimised inference engine
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu && python setup.py install

# 2. Produce a long prompt (script provided)
python tests/long_prompt_gen.py      # writes prompt.txt

# 3. Fire
python tests/test_generate.py --prompt-file prompt.txt

Result copied from repo:

  • First-token latency 2.1 s (versus 14 s with HF transformers)
  • Sustained 30 tokens/s

Power drawn: 28 W. The fan is quieter than a fridge.


7. Quantisation cookbook (GPTQ vs AWQ vs GGUF)

| Format | Compression | Quality loss | Tool | Best for |
|---|---|---|---|---|
| GPTQ 4-bit | 50 % | <2 % | auto-gptq | NVIDIA desktop cards |
| AWQ 4-bit | 50 % | <2 % | awq | Same; pick whichever is faster on your GPU |
| GGUF Q4_K_M | 50 % | <3 % | llama.cpp | Apple Silicon, Windows iGPU |
| BitCPM 1.58-bit | 90 % | <4 % | custom kernel | MCU-class edge boards |

Quick GPTQ sketch (via the Hugging Face GPTQ integration; the repo also publishes ready-made GPTQ checkpoints if you would rather just download one):

# 1. Install the GPTQ backend
pip install auto-gptq optimum

# 2. Quantise while loading, then save the 4-bit checkpoint (Python)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tok = AutoTokenizer.from_pretrained("openbmb/MiniCPM4.1-8B", trust_remote_code=True)
gptq = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tok)
model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM4.1-8B",
                                             quantization_config=gptq,
                                             device_map="auto",
                                             trust_remote_code=True)
model.save_pretrained("MiniCPM4.1-8B-GPTQ")
tok.save_pretrained("MiniCPM4.1-8B-GPTQ")

Inference then needs only 9 GB VRAM and runs 1.3× faster because memory traffic shrinks.
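
On Apple Silicon (the GGUF row above) the model runs through llama.cpp instead of CUDA. A minimal sketch with the llama-cpp-python bindings looks like this; the .gguf filename is a placeholder for whatever the MiniCPM4.1-8B-GGUF repo actually ships.

# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="MiniCPM4.1-8B-GGUF/minicpm4.1-8b-q4_k_m.gguf",  # placeholder filename
    n_ctx=8192,          # raise for long documents, memory permitting
    n_gpu_layers=-1,     # offload all layers to Metal / the iGPU
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise this repo in one line."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])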


8. Fine-tune your own assistant with LLaMA-Factory

Data: 5 k instruction rows is enough for style transfer.
Hardware: one RTX 4090 (24 GB) or A6000 (48 GB).
Time: ~3 h.

git clone https://github.com/hiyouga/LLaMA-Factory
cd LLaMA-Factory
pip install -e .

# Edit examples/minicpm_lora_sft.yaml
# model_name_or_path: openbmb/MiniCPM4.1-8B
# template: minicpm
# finetuning_type: lora

llamafactory-cli train examples/minicpm_lora_sft.yaml

Output: a 200 MB LoRA adapter you can hot-plug:

model.load_adapter("saves/minicpm_lora/checkpoint-3000")

No merge needed; base model stays untouched.
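
load_adapter() is the quickest route; if you prefer the explicit PEFT API, a minimal sketch with the same checkpoint path would be:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM4.1-8B",
                                            torch_dtype=torch.bfloat16,
                                            device_map="cuda",
                                            trust_remote_code=True)
# Attach the LoRA weights on top of the frozen base model
model = PeftModel.from_pretrained(base, "saves/minicpm_lora/checkpoint-3000")
model.eval()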


9. Long-text ninja moves (RoPE & InfLLM v2)

MiniCPM4.1 ships with 64 k native context.
Need 128 k? Just extend the RoPE factors in config.json:

"rope_scaling": {
  "rope_type": "longrope",
  "long_factor":  [0.998, 1.033, 1.075, ...],   # 64 numbers
  "short_factor": [0.998, 1.033, 1.075, ...],
  "original_max_position_embeddings": 65536
}

Values are already calculated by the authors; copy-paste works.
For sparse attention at contexts beyond 64 k, also add:

"sparse_config": {
  "kernel_size": 32,
  "kernel_stride": 16,
  "block_size": 64,
  "topk": 64,
  "dense_len": 8192
}

Then install the companion CUDA kernel:

git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl
cd infllmv2_cuda_impl && pip install -e .

Without this step the model still runs, but falls back to standard attention and gets slower after 64 k.
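
If you would rather script the config edit than hand-edit JSON, a minimal sketch (assuming the local clone from Section 4.2) could look like this; the values are exactly the ones listed above.

# Append the sparse_config block from above to the model's config.json
import json, pathlib

cfg_path = pathlib.Path("MiniCPM4.1-8B/config.json")   # path from Section 4.2
cfg = json.loads(cfg_path.read_text())
cfg["sparse_config"] = {
    "kernel_size": 32,
    "kernel_stride": 16,
    "block_size": 64,
    "topk": 64,
    "dense_len": 8192,
}
cfg_path.write_text(json.dumps(cfg, indent=2))
print("sparse_config written; reload the model to pick it up")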


10. Speed numbers you can quote in meetings

All figures are copied from the official markdown; I only rounded for readability.

| Platform | Model | Context | Speed | vs Qwen3-8B |
|---|---|---|---|---|
| Jetson Orin | MiniCPM4-8B | 128 k | 30 t/s | 7× |
| RTX 4090 | MiniCPM4.1-8B | 8 k | 82 t/s | 2.8× |
| M2 Max | MiniCPM4.1-8B-GGUF Q4 | 4 k | 35 t/s | N/A |
| RPi 5 | MiniCPM4-0.5B | 2 k | 12 t/s | N/A |

11. Real-world apps already in the box

  1. MiniCPM4-Survey
    Agent that writes 20-page survey papers from a single user query.
    Planning → Retrieval → Writing multi-agent loop.
    Fact-score 68.7, on par with GPT-4o DeepResearch (paper table).

  2. MiniCPM4-MCP
    Tool-use agent speaking the Model Context Protocol (MCP).
    Supports 16 servers: ArXiv, Slack, GitHub, Maps, Calculator …
    Average tool-call accuracy 88 %, beats GPT-4o (80 %).

  3. MiniCPM Intel AIPC Client
    Windows desktop client tuned for Intel Core Ultra.
    OpenVINO backend, 80 tokens/s on Ultra 7, fanless mode.
    All data stays on the laptop—handy for NDAs.


12. Troubleshooting cheat-sheet (copy-paste ready)

| Symptom | Likely cause | One-line fix |
|---|---|---|
| OOM on 24 GB card | max_num_batched_tokens too high | Set --max-num-batched-tokens 8192 |
| 128 k slower than 64 k | Sparse kernel not loaded | Install infllmv2_cuda_impl |
| Gibberish after quantisation | Calibration data mismatch | Use 1 k in-domain samples |
| M1 GPU not found | CUDA hard-coded | Switch to GGUF + mlx_lm |
| LoRA layers ignored | load_adapter() not called | Call it after from_pretrained() |
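
The first fix in the table is a serving flag; it matches vLLM's --max-num-batched-tokens option, so assuming you serve MiniCPM through vLLM, the Python-API equivalent is roughly:

# Assumes vLLM serving; max_num_batched_tokens caps how many tokens the
# scheduler processes per step, which is what blows up VRAM on long inputs.
from vllm import LLM, SamplingParams

llm = LLM(model="openbmb/MiniCPM4.1-8B",
          trust_remote_code=True,
          max_num_batched_tokens=8192)     # the value from the table above
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)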

13. When to pick MiniCPM and when to stay with cloud APIs

  • Privacy-bound (health, legal, gov) → run MiniCPM4.1-8B locally; GGUF Q4 needs only 6 GB.
  • Sub-200 ms chat → edge 0.5 B model; first token <50 ms on phone SoC.
  • Batch 128 k summarisation → sparse attention saves 70 % VRAM vs dense.
  • 10 k QPS web service → cloud API still cheaper; use MiniCPM as edge cache for fail-over.

14. Take-away

The release bundle (Apache 2.0) gives you:

  • Training recipe + data pipeline
  • Quantised checkpoints (GPTQ/AWQ/GGUF/Bit)
  • CUDA kernels (sparse + speculative)
  • Gradio demo, Intel app, LoRA templates

In short, everything needed to duplicate the paper numbers is one git clone away.
Edge AI is no longer a promise—it is a pip package.


15. Cite the work

If you use MiniCPM4 in research:

@article{minicpm4,
  title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}

All technical details in this post are extracted from the official repository and technical report; no external knowledge was added.