Xiaomi Open-Sources MiMo-VL-7B: A 7-Billion-Parameter Vision-Language Model That Outperforms 70B+ Giants
"I want my computer to understand images, videos, and even control my desktop, without renting a data center."
If that sounds like you, Xiaomi's freshly released MiMo-VL-7B family might be the sweet spot.
Below is a 20-minute read that turns the 50-page technical report into plain English: what it is, why it matters, how to run it, and what you can build next.
TL;DR Quick Facts
Capability | Score | Benchmark Leader? | What it means for you |
---|---|---|---|
University-level multi-discipline Q&A (MMMU) | 70.6 | #1 among 7B–72B open models | Reads textbooks, charts, slides |
Video understanding (Video-MME) | 70.8 | Top-3 open, any size | Summarizes 2-min clips at 2 FPS |
GUI control (OSWorld-G) | 56.1 | Beats UI-specialized UI-TARS | Automates desktop/mobile apps |
Math Olympiad (OlympiadBench) | 59.4 | Beats Qwen2.5-VL-72B by 22 pts | Solves competition-level problems |
Model size | 7B parameters | Fits an RTX 4090 in int4 | Runs locally & offline |
1. What exactly is MiMo-VL-7B?
Three Lego-like parts:
- Vision Encoder (Qwen2.5-ViT)
  - Native-resolution input: no down-scaling, no square cropping
  - Handles 4096×28×28 visual tokens without distortion
- MLP Projector
  - Learns to map image tokens into the language embedding space
  - Randomly initialized, then trained in stage 1
- Language Backbone (MiMo-7B)
  - 36 layers, 4096 hidden dimensions, optimized for reasoning
  - Released earlier by Xiaomi for pure-text tasks

Think of it as "eyes + translator + brain" in one package; a minimal code sketch of how the pieces fit together follows below.
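To make the "eyes + translator + brain" picture concrete, here is a minimal, schematic PyTorch sketch. The module names, the plain two-layer MLP, and the dimensions are illustrative assumptions, not Xiaomi's actual implementation.

```python
# Schematic sketch of the three-part design; module names, shapes, and the
# two-layer MLP projector are illustrative assumptions, not Xiaomi's code.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim=1280, llm_dim=4096):
        super().__init__()
        self.vit = vit                      # "eyes": Qwen2.5-ViT-style encoder
        self.projector = nn.Sequential(     # "translator": randomly initialized MLP
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm                      # "brain": MiMo-7B language backbone

    def forward(self, pixel_values, text_embeds):
        vis_tokens = self.vit(pixel_values)        # (B, N_vis, vit_dim)
        vis_embeds = self.projector(vis_tokens)    # map into the LLM embedding space
        inputs = torch.cat([vis_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)      # assumes an HF-style LLM interface
```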
2. Training Pipeline in One Glance
Stage | Goal | Tokens | Unfrozen Modules | Key Data Mix |
---|---|---|---|---|
1 Warm-up | Teach projector to speak | 300 B | MLP only | Image-caption pairs |
2 Alignment | Sync ViT & LLM | 167 B | ViT + MLP | Interleaved web pages |
3 General Pre-training | Add OCR, video, GUI | 1.4 T | All | OCR, video, GUI, synthetic reasoning |
4 Long-Context SFT | Stretch to 32K tokens | 550 B | All | Long docs, long reasoning chains |
Total: 2.4 T tokens across 4 stages.
3. Data Diet: How Xiaomi Fed the Model
3.1 Image-Caption Pairs
- Start with open captions → deduplicate → regenerate with a stronger caption model → quality filter
- Result: balanced, high-detail descriptions in both Chinese & English
3.2 OCR & Grounding
- Scanned docs, handwriting, warped text, blurry labels, all with bounding boxes included
- Single-object & multi-object scenes with complex referential phrases
3.3 Video Corpus
- 2 FPS sampling → dense, timestamped captions (see the frame-sampling sketch below)
- Global summaries ("narrative arc, style, intent")
- Synthetic Q&A pairs for dialog-style understanding
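To get a feel for the 2 FPS sampling, here is a small sketch of how you might pre-sample frames from your own clips with OpenCV; the library choice is my assumption, not Xiaomi's pipeline, and the 256-frame cap mirrors the limit mentioned in the FAQ below.

```python
# Minimal 2 FPS frame sampler (illustrative; not Xiaomi's data pipeline).
import cv2

def sample_frames(video_path: str, target_fps: float = 2.0, max_frames: int = 256):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / target_fps)), 1)  # keep every `step`-th frame
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            timestamp = idx / native_fps          # seconds, for timestamped captions
            frames.append((timestamp, frame))
        idx += 1
    cap.release()
    return frames
```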
3.4 GUI Data
- Real screenshots from Android, iOS, Web, Desktop
- Synthetic engine fills gaps: Chinese UI, rare widgets
- Two tasks (illustrative records are sketched below):
  - Element grounding → "Click the 'Settings' icon."
  - Action prediction → "Given before/after screenshots, what tap/scroll/key happened?"
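For intuition, here is roughly what a grounding record and an action-prediction record could look like; the field names and coordinate convention are hypothetical, not the actual MiMo-VL data schema.

```python
# Illustrative GUI-data records; field names and coordinate format are assumptions.
grounding_example = {
    "task": "element_grounding",
    "image": "settings_screen.png",
    "instruction": "Click the 'Settings' icon.",
    "target_box": [412, 96, 508, 192],   # x1, y1, x2, y2 in pixels
}

action_prediction_example = {
    "task": "action_prediction",
    "before": "home_before.png",
    "after": "home_after.png",
    "answer": {"action": "scroll", "direction": "down", "distance_px": 600},
}
```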
3.5 Synthetic Reasoning Chains
- Harvest open QA → regenerate answers with step-by-step reasoning → multi-stage correctness filter
- Inserted directly into pre-training (not just fine-tuning) → deeper reasoning without saturation
4. Post-Training: MORL (Mixed On-Policy RL)
After pre-training, the model enters a single RL phase that blends four objectives:
Objective | Reward Source | Example |
---|---|---|
Verifiable reasoning | Rule scripts | Math problems auto-checked by math-verify |
Human preference | Reward model | Side-by-side human votes |
Visual grounding | GIoU or point-in-box | RefCOCO-style tasks |
Counting / temporal grounding | Exact-match or IoU | “How many muffins?” or “00:23–00:47 clip” |
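To make the visual-grounding rewards in the table concrete, below is a minimal sketch of a GIoU box reward and a point-in-box check; the report names these signals, but the exact shaping and thresholds here are my assumptions.

```python
# Minimal GIoU and point-in-box rewards for grounding rollouts (illustrative).
def giou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). Returns GIoU in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest enclosing box C penalizes predictions far from the target.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area if c_area > 0 else iou

def point_in_box_reward(point, box):
    """1.0 if the predicted click point lands inside the target box, else 0.0."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0
```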
On-policy GRPO variant (a toy sketch of the update follows below):
- Only the newest model generates training samples
- Removes the KL penalty → more stable updates
- Reward-as-a-Service (micro-service) → <1 ms reward latency
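Here is a toy, self-contained sketch of the group-relative update idea (clipped ratio, no KL term) in PyTorch; the shapes, clipping constant, and variable names are illustrative, not MiMo-VL's training code.

```python
# Toy GRPO-style update without a KL penalty (illustrative, not MiMo-VL's code).
import torch

def grpo_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
    """
    logprobs_new / logprobs_old: (groups, samples) summed token log-probs per rollout
    rewards: (groups, samples) scalar rewards from the verifiers / reward model
    """
    # Group-relative advantage: normalize each reward within its group of rollouts.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-6
    advantages = (rewards - mean) / std

    # PPO-style clipped ratio, but with no KL-to-reference penalty added.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```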
5. Benchmark Highlights (Human Translation)
Task Type | Dataset | MiMo-VL-7B-RL | Closest Open Rival | Delta |
---|---|---|---|---|
Multi-discipline QA | MMMU | 70.6 | Qwen2.5-VL-72B 58.6 | +12.0 |
Chart & Doc QA | CharXiv-RQ | 56.5 | InternVL3-78B 37.6 | +18.9 |
Video QA | Video-MME | 70.8 | InternVL3-78B 66.3 | +4.5 |
GUI grounding | OSWorld-G | 56.1 | UI-TARS 49.6 | +6.5 |
Math reasoning | OlympiadBench | 59.4 | Qwen2.5-VL-72B 37.2 | +22.2 |
Elo user rating | Internal Arena | 1131 | Previous best 1093 | +38 |
6. Hands-On Guide: Run It in 10 Minutes
6.1 Hardware
- Minimum: RTX 4090 24 GB (int4)
- Sweet spot: A100 40 GB (fp16)
- CPU-only: not recommended (2+ min per image)
6.2 Install
```bash
# 1. Clone
git clone https://github.com/XiaomiMiMo/MiMo-VL.git
cd MiMo-VL

# 2. Dependencies
pip install -r requirements.txt   # torch>=2.1, transformers>=4.39

# 3. Download checkpoints (choose one)
huggingface-cli download XiaomiMiMo/MiMo-VL-7B-RL-2508 --local-dir ./checkpoints/rl
# or
huggingface-cli download XiaomiMiMo/MiMo-VL-7B-SFT-2508 --local-dir ./checkpoints/sft
```
6.3 Gradio Demo
```bash
python demo/app.py --model_path ./checkpoints/rl --port 7860
```
Open http://localhost:7860, drag and drop an image, and type a question.
6.4 Python API
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets the checkpoint's own chat helper load;
# the exact chat() signature is defined by the model repo's remote code.
model = AutoModelForCausalLM.from_pretrained(
    "./checkpoints/rl", torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/rl", trust_remote_code=True)

image_path = "invoice.jpg"
prompt = "What is the total amount?"
response = model.chat(tokenizer, query=prompt, image=image_path)
print(response)
```
7. Thinking Toggle: When You Do (or Don’t) Want the Brain Dump
Mode | Trigger | Use Case |
---|---|---|
Think mode (default) | No tag | Step-by-step reasoning visible |
No-think mode | Add /no_think to the prompt | Fast answers, no chain-of-thought |
Example:
"Identify the text in the image. /no_think"
→ returns text instantly.
8. FAQ: 10 Real-World Questions Answered
Q1: Can I fine-tune it for my vertical?
A: Yes. Use the SFT checkpoint plus the LoRA scripts in scripts/lora_finetune.py (a minimal sketch of the idea follows below).
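For a feel of what such a run looks like, here is a minimal PEFT-style sketch; the target modules and hyperparameters are assumptions, not the contents of scripts/lora_finetune.py.

```python
# Minimal LoRA fine-tuning sketch with Hugging Face PEFT (illustrative;
# target modules and hyperparameters are assumptions, not Xiaomi's recipe).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "./checkpoints/sft", torch_dtype="auto", device_map="auto", trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a few million params stay trainable
# ...then train with your usual Trainer / SFT loop on domain data.
```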
Q2: License for commercial use?
A: Apache-2.0. Keep attribution and you’re good.
Q3: Chinese language support?
A: Training data is 50 % Chinese, 50 % English. Handles mixed input natively.
Q4: Batch-processing PDFs?
A: See the included example examples/batch_pdf.py, which uses PyMuPDF to convert pages → images → JSON lines (a minimal sketch of the same idea follows below).
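A minimal sketch of that page → image → JSON-lines flow, assuming PyMuPDF is installed (this is not the contents of examples/batch_pdf.py):

```python
# Minimal page -> image -> JSON-lines sketch with PyMuPDF (illustrative).
import json
import fitz  # PyMuPDF

def pdf_to_jsonl(pdf_path: str, out_jsonl: str, dpi: int = 150):
    doc = fitz.open(pdf_path)
    with open(out_jsonl, "w", encoding="utf-8") as f:
        for i, page in enumerate(doc):
            image_path = f"{pdf_path}.page{i:03d}.png"
            page.get_pixmap(dpi=dpi).save(image_path)   # render the page to PNG
            # Each line pairs one page image with the question you want to ask.
            f.write(json.dumps({"image": image_path,
                                "prompt": "Extract all key fields as JSON."}) + "\n")
    doc.close()
```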
Q5: Longest video?
A: 256 frames @ 2 FPS ≈ 2 min. For longer clips, sample key frames every N seconds.
Q6: Does it run on CPU?
A: Technically yes (device_map="cpu"), but expect inference 5–10× slower than on a GPU.
Q7: GUI automation on mobile?
A: Requires Android ADB or iOS XCTest bridge; the model outputs JSON actions ready for injection.
Q8: Int4 vs. fp16 trade-off?
A: Int4 drops VRAM from ~14 GB → 8 GB; negligible quality loss on MMMU (-0.3 pts).
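If you want to try 4-bit yourself, one common route is bitsandbytes quantization at load time; this is a generic sketch, not an official int4 recipe from Xiaomi.

```python
# Generic 4-bit loading sketch via bitsandbytes (illustrative; not an official
# MiMo-VL int4 release).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "./checkpoints/rl",
    quantization_config=bnb_cfg,
    device_map="auto",
    trust_remote_code=True,
)
```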
Q9: Any 13B or 34B plans?
A: Not announced. Community poll is live on GitHub Discussions.
Q10: Privacy of reasoning chains?
A: Fully local. Nothing is sent back to Xiaomi servers.
9. Case Study Walk-throughs
9.1 Turn a Messy Invoice into JSON
Prompt:
"Extract vendor name, total amount, and tax in JSON format."
Output:
```json
{
  "vendor": "ABC Supplies Ltd.",
  "total": 112589.52,
  "tax": 12952.78
}
```
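The model returns that JSON as plain text, so in practice you strip any surrounding prose and parse it; a small helper sketch, assuming the model and tokenizer from section 6.4 are already loaded:

```python
# Parse the model's JSON answer (assumes model/tokenizer from section 6.4).
import json
import re

prompt = "Extract vendor name, total amount, and tax in JSON format."
raw = model.chat(tokenizer, query=prompt, image="invoice.jpg")

# Grab the first {...} block in case the reply includes extra prose.
match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
invoice = json.loads(match.group(0)) if match else {}
print(invoice.get("vendor"), invoice.get("total"), invoice.get("tax"))
```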
9.2 Summarize a 2-Minute Product Video
Upload a Xiaomi SU7 promo clip.
Prompt:
"List the top 3 selling points mentioned visually."
Answer (abridged):
- Sleek aerodynamic silhouette
- Lightning-fast acceleration (0–100 km/h in 2.78 s)
- Full-scene smart cockpit
9.3 GUI Scripting: Add Item to Cart
Prompt:
"Add the red Xiaomi SU7 to the wishlist."
Model returns:
```json
[
  {"action": "click", "start_point": [937, 1181]},
  {"action": "click", "start_point": [1677, 1389]},
  {"action": "click", "start_point": [1728, 1421]}
]
```
Feed these into an automation bridge (Appium/ADB) and the car lands in your wishlist.
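As a concrete example of that bridge, here is a minimal sketch that replays the click actions over Android ADB; it assumes a device is already connected via adb and that the screenshot and screen share the same coordinate system.

```python
# Replay model-emitted click actions over ADB (illustrative bridge;
# assumes a connected Android device and matching screen coordinates).
import json
import subprocess

actions = json.loads("""[
  {"action": "click", "start_point": [937, 1181]},
  {"action": "click", "start_point": [1677, 1389]},
  {"action": "click", "start_point": [1728, 1421]}
]""")

for step in actions:
    if step["action"] == "click":
        x, y = step["start_point"]
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)
```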
10. Roadmap & Contribution Paths
- Weights & Data: already on Hugging Face & ModelScope
- Evaluation Harness: open-sourced at github.com/XiaomiMiMo/lmms-eval
- LoRA Recipes: 8 downstream tasks (chart QA, receipt OCR, GUI agent)
- Community Wishlist:
  - 4-bit GGUF for llama.cpp
  - ONNX export for edge devices
  - iOS Shortcuts action

Pull requests welcome; Apache-2.0 makes it easy.
11. Citation
If you build on MiMo-VL-7B, please cite:
```bibtex
@misc{coreteam2025mimovltechnicalreport,
  title={MiMo-VL Technical Report},
  author={LLM-Core-Team Xiaomi},
  year={2025},
  eprint={2506.03569},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.03569},
}
```
12. Final Takeaway
MiMo-VL-7B proves you don't need a data center full of GPUs to get data-center-grade vision reasoning.
Download the weights, spin up the Gradio demo, and see how far a 7-billion-parameter brain can take you, with no cloud bill attached.