Xiaomi Open-Sources MiMo-VL-7B: A 7-Billion-Parameter Vision-Language Model That Outperforms 70B+ Giants
"I want my computer to understand images, videos, and even control my desktop, without renting a data center."
If that sounds like you, Xiaomi's freshly released MiMo-VL-7B family might be the sweet spot.
Below is a 20-minute read that turns the 50-page technical report into plain English: what it is, why it matters, how to run it, and what you can build next.
TL;DR Quick Facts
Capability | Score | Benchmark Leader? | What it means for you |
---|---|---|---|
University-level multi-discipline Q&A (MMMU) | 70.6 | #1 among 7B–72B open models | Reads textbooks, charts, slides |
Video understanding (Video-MME) | 70.8 | Top-3 open, any size | Summarizes 2-min clips at 2 FPS |
GUI control (OSWorld-G) | 56.1 | Beats UI-specialized UI-TARS | Automates desktop/mobile apps |
Math Olympiad (OlympiadBench) | 59.4 | Beats Qwen2.5-VL-72B by 22 pts | Solves competition-level problems |
Model size | 7B parameters | Fits an RTX 4090 in int4 | Runs locally & offline |
1. What exactly is MiMo-VL-7B?
Three Lego-like parts:
- Vision Encoder (Qwen2.5-ViT)
  - Native-resolution input: no down-scaling, no square cropping
  - Handles 4096×28×28 visual tokens without distortion
- MLP Projector
  - Learns to map image tokens into the language embedding space
  - Randomly initialized, then trained in stage 1
- Language Backbone (MiMo-7B)
  - 36 layers, 4096 hidden dimensions, optimized for reasoning
  - Released earlier by Xiaomi for pure-text tasks

Think of it as "eyes + translator + brain" in one package; a minimal code sketch of how the pieces fit together follows below.
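To make the "eyes + translator + brain" picture concrete, here is a minimal, schematic PyTorch sketch. The module names, the plain two-layer MLP, and the dimensions are illustrative assumptions, not Xiaomi's actual implementation.

```python
# Schematic sketch of the three-part design; module names, shapes, and the
# two-layer MLP projector are illustrative assumptions, not Xiaomi's code.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim=1280, llm_dim=4096):
        super().__init__()
        self.vit = vit                      # "eyes": Qwen2.5-ViT-style encoder
        self.projector = nn.Sequential(     # "translator": randomly initialized MLP
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm                      # "brain": MiMo-7B language backbone

    def forward(self, pixel_values, text_embeds):
        vis_tokens = self.vit(pixel_values)        # (B, N_vis, vit_dim)
        vis_embeds = self.projector(vis_tokens)    # map into the LLM embedding space
        inputs = torch.cat([vis_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)      # assumes an HF-style LLM interface
```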
2. Training Pipeline in One Glance
Stage | Goal | Tokens | Unfrozen Modules | Key Data Mix |
---|---|---|---|---|
1 Warm-up | Teach projector to speak | 300 B | MLP only | Image-caption pairs |
2 Alignment | Sync ViT & LLM | 167 B | ViT + MLP | Interleaved web pages |
3 General Pre-training | Add OCR, video, GUI | 1.4 T | All | OCR, video, GUI, synthetic reasoning |
4 Long-Context SFT | Stretch to 32K tokens | 550 B | All | Long docs, long reasoning chains |
Total: 2.4 T tokens across 4 stages.
3. Data Diet: How Xiaomi Fed the Model
3.1 Image-Caption Pairs
- Start with open captions → deduplicate → regenerate with a stronger caption model → quality filter
- Result: balanced, high-detail descriptions in both Chinese & English
3.2 OCR & Grounding
- Scanned docs, handwriting, warped text, blurry labels, all with bounding boxes included
- Single-object & multi-object scenes with complex referential phrases
3.3 Video Corpus
- 2 FPS sampling → dense, timestamped captions (see the frame-sampling sketch below)
- Global summaries ("narrative arc, style, intent")
- Synthetic Q&A pairs for dialog-style understanding
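To get a feel for the 2 FPS sampling, here is a small sketch of how you might pre-sample frames from your own clips with OpenCV; the library choice is my assumption, not Xiaomi's pipeline, and the 256-frame cap mirrors the limit mentioned in the FAQ below.

```python
# Minimal 2 FPS frame sampler (illustrative; not Xiaomi's data pipeline).
import cv2

def sample_frames(video_path: str, target_fps: float = 2.0, max_frames: int = 256):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / target_fps)), 1)  # keep every `step`-th frame
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            timestamp = idx / native_fps          # seconds, for timestamped captions
            frames.append((timestamp, frame))
        idx += 1
    cap.release()
    return frames
```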
3.4 GUI Data
- Real screenshots from Android, iOS, Web, Desktop
- Synthetic engine fills gaps: Chinese UI, rare widgets
- Two tasks (illustrative records are sketched below):
  - Element grounding → "Click the 'Settings' icon."
  - Action prediction → "Given before/after screenshots, what tap/scroll/key happened?"
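For intuition, here is roughly what a grounding record and an action-prediction record could look like; the field names and coordinate convention are hypothetical, not the actual MiMo-VL data schema.

```python
# Illustrative GUI-data records; field names and coordinate format are assumptions.
grounding_example = {
    "task": "element_grounding",
    "image": "settings_screen.png",
    "instruction": "Click the 'Settings' icon.",
    "target_box": [412, 96, 508, 192],   # x1, y1, x2, y2 in pixels
}

action_prediction_example = {
    "task": "action_prediction",
    "before": "home_before.png",
    "after": "home_after.png",
    "answer": {"action": "scroll", "direction": "down", "distance_px": 600},
}
```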
3.5 Synthetic Reasoning Chains
- Harvest open QA → regenerate answers with step-by-step reasoning → multi-stage correctness filter
- Inserted directly into pre-training (not just fine-tuning) → deeper reasoning without saturation
4. Post-Training: MORL (Mixed On-Policy RL)
After pre-training, the model enters a single RL phase that blends four objectives:
Objective | Reward Source | Example |
---|---|---|
Verifiable reasoning | Rule scripts | Math problems auto-checked by math-verify |
Human preference | Reward model | Side-by-side human votes |
Visual grounding | GIoU or point-in-box | RefCOCO-style tasks |
Counting / temporal grounding | Exact-match or IoU | “How many muffins?” or “00:23–00:47 clip” |
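To make the visual-grounding rewards in the table concrete, below is a minimal sketch of a GIoU box reward and a point-in-box check; the report names these signals, but the exact shaping and thresholds here are my assumptions.

```python
# Minimal GIoU and point-in-box rewards for grounding rollouts (illustrative).
def giou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). Returns GIoU in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest enclosing box C penalizes predictions far from the target.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area if c_area > 0 else iou

def point_in_box_reward(point, box):
    """1.0 if the predicted click point lands inside the target box, else 0.0."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0
```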
On-policy GRPO variant (a toy sketch of the update follows below):
- Only the newest model generates training samples
- Removes the KL penalty → more stable updates
- Reward-as-a-Service (micro-service) → <1 ms reward latency
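Here is a toy, self-contained sketch of the group-relative update idea (clipped ratio, no KL term) in PyTorch; the shapes, clipping constant, and variable names are illustrative, not MiMo-VL's training code.

```python
# Toy GRPO-style update without a KL penalty (illustrative, not MiMo-VL's code).
import torch

def grpo_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
    """
    logprobs_new / logprobs_old: (groups, samples) summed token log-probs per rollout
    rewards: (groups, samples) scalar rewards from the verifiers / reward model
    """
    # Group-relative advantage: normalize each reward within its group of rollouts.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-6
    advantages = (rewards - mean) / std

    # PPO-style clipped ratio, but with no KL-to-reference penalty added.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```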
5. Benchmark Highlights (Human Translation)
Task Type | Dataset | MiMo-VL-7B-RL | Closest Open Rival | Delta |
---|---|---|---|---|
Multi-discipline QA | MMMU | 70.6 | Qwen2.5-VL-72B 58.6 | +12.0 |
Chart & Doc QA | CharXiv-RQ | 56.5 | InternVL3-78B 37.6 | +18.9 |
Video QA | Video-MME | 70.8 | InternVL3-78B 66.3 | +4.5 |
GUI grounding | OSWorld-G | 56.1 | UI-TARS 49.6 | +6.5 |
Math reasoning | OlympiadBench | 59.4 | Qwen2.5-VL-72B 37.2 | +22.2 |
Elo user rating | Internal Arena | 1131 | Previous best 1093 | +38 |
6. Hands-On Guide: Run It in 10 Minutes
6.1 Hardware
- Minimum: RTX 4090 24 GB (int4)
- Sweet spot: A100 40 GB (fp16)
- CPU-only: not recommended (2+ min per image)
6.2 Install
```bash
# 1. Clone
git clone https://github.com/XiaomiMiMo/MiMo-VL.git
cd MiMo-VL

# 2. Dependencies
pip install -r requirements.txt   # torch>=2.1, transformers>=4.39

# 3. Download checkpoints (choose one)
huggingface-cli download XiaomiMiMo/MiMo-VL-7B-RL-2508 --local-dir ./checkpoints/rl
# or
huggingface-cli download XiaomiMiMo/MiMo-VL-7B-SFT-2508 --local-dir ./checkpoints/sft
```
6.3 Gradio Demo
```bash
python demo/app.py --model_path ./checkpoints/rl --port 7860
```
Open http://localhost:7860, drag and drop an image, and type a question.
6.4 Python API
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets the checkpoint's own chat helper load;
# the exact chat() signature is defined by the model repo's remote code.
model = AutoModelForCausalLM.from_pretrained(
    "./checkpoints/rl", torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/rl", trust_remote_code=True)

image_path = "invoice.jpg"
prompt = "What is the total amount?"
response = model.chat(tokenizer, query=prompt, image=image_path)
print(response)
```
7. Thinking Toggle: When You Do (or Don’t) Want the Brain Dump
Mode | Trigger | Use Case |
---|---|---|
Think mode (default) | No tag | Step-by-step reasoning visible |
No-think mode | Add /no_think to the prompt | Fast answers, no chain-of-thought |
Example:
"Identify the text in the image. /no_think"
→ returns text instantly.
8. FAQ: 10 Real-World Questions Answered
Q1: Can I fine-tune it for my vertical?
A: Yes. Use the SFT checkpoint plus the LoRA scripts in scripts/lora_finetune.py (a minimal sketch of the idea follows below).
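For a feel of what such a run looks like, here is a minimal PEFT-style sketch; the target modules and hyperparameters are assumptions, not the contents of scripts/lora_finetune.py.

```python
# Minimal LoRA fine-tuning sketch with Hugging Face PEFT (illustrative;
# target modules and hyperparameters are assumptions, not Xiaomi's recipe).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "./checkpoints/sft", torch_dtype="auto", device_map="auto", trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a few million params stay trainable
# ...then train with your usual Trainer / SFT loop on domain data.
```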
Q2: License for commercial use?
A: Apache-2.0. Keep attribution and you’re good.
Q3: Chinese language support?
A: Training data is 50 % Chinese, 50 % English. Handles mixed input natively.
Q4: Batch-processing PDFs?
A: See the included example examples/batch_pdf.py, which uses PyMuPDF to convert pages → images → JSON lines (a minimal sketch of the same idea follows below).
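A minimal sketch of that page → image → JSON-lines flow, assuming PyMuPDF is installed (this is not the contents of examples/batch_pdf.py):

```python
# Minimal page -> image -> JSON-lines sketch with PyMuPDF (illustrative).
import json
import fitz  # PyMuPDF

def pdf_to_jsonl(pdf_path: str, out_jsonl: str, dpi: int = 150):
    doc = fitz.open(pdf_path)
    with open(out_jsonl, "w", encoding="utf-8") as f:
        for i, page in enumerate(doc):
            image_path = f"{pdf_path}.page{i:03d}.png"
            page.get_pixmap(dpi=dpi).save(image_path)   # render the page to PNG
            # Each line pairs one page image with the question you want to ask.
            f.write(json.dumps({"image": image_path,
                                "prompt": "Extract all key fields as JSON."}) + "\n")
    doc.close()
```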
Q5: Longest video?
A: 256 frames @ 2 FPS ≈ 2 min. For longer clips, sample key frames every N seconds.
Q6: Does it run on CPU?
A: Technically yes (device_map="cpu"), but expect inference 5–10× slower than on a GPU.
Q7: GUI automation on mobile?
A: Requires Android ADB or iOS XCTest bridge; the model outputs JSON actions ready for injection.
Q8: Int4 vs. fp16 trade-off?
A: Int4 drops VRAM from ~14 GB → 8 GB; negligible quality loss on MMMU (-0.3 pts).
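If you want to try 4-bit yourself, one common route is bitsandbytes quantization at load time; this is a generic sketch, not an official int4 recipe from Xiaomi.

```python
# Generic 4-bit loading sketch via bitsandbytes (illustrative; not an official
# MiMo-VL int4 release).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "./checkpoints/rl",
    quantization_config=bnb_cfg,
    device_map="auto",
    trust_remote_code=True,
)
```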
Q9: Any 13B or 34B plans?
A: Not announced. Community poll is live on GitHub Discussions.
Q10: Privacy of reasoning chains?
A: Fully local. Nothing is sent back to Xiaomi servers.
9. Case Study Walk-throughs
9.1 Turn a Messy Invoice into JSON
Prompt:
"Extract vendor name, total amount, and tax in JSON format."
Output:
```json
{
  "vendor": "ABC Supplies Ltd.",
  "total": 112589.52,
  "tax": 12952.78
}
```
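The model returns that JSON as plain text, so in practice you strip any surrounding prose and parse it; a small helper sketch, assuming the model and tokenizer from section 6.4 are already loaded:

```python
# Parse the model's JSON answer (assumes model/tokenizer from section 6.4).
import json
import re

prompt = "Extract vendor name, total amount, and tax in JSON format."
raw = model.chat(tokenizer, query=prompt, image="invoice.jpg")

# Grab the first {...} block in case the reply includes extra prose.
match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
invoice = json.loads(match.group(0)) if match else {}
print(invoice.get("vendor"), invoice.get("total"), invoice.get("tax"))
```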
9.2 Summarize a 2-Minute Product Video
Upload a Xiaomi SU7 promo clip.
Prompt:
"List the top 3 selling points mentioned visually."
Answer (abridged):
- Sleek aerodynamic silhouette
- Lightning-fast acceleration (0–100 km/h in 2.78 s)
- Full-scene smart cockpit
9.3 GUI Scripting: Add Item to Cart
Prompt:
"Add the red Xiaomi SU7 to the wishlist."
Model returns:
```json
[
  {"action": "click", "start_point": [937, 1181]},
  {"action": "click", "start_point": [1677, 1389]},
  {"action": "click", "start_point": [1728, 1421]}
]
```
Feed these into an automation bridge (Appium/ADB) and the car lands in your wishlist.
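As a concrete example of that bridge, here is a minimal sketch that replays the click actions over Android ADB; it assumes a device is already connected via adb and that the screenshot and screen share the same coordinate system.

```python
# Replay model-emitted click actions over ADB (illustrative bridge;
# assumes a connected Android device and matching screen coordinates).
import json
import subprocess

actions = json.loads("""[
  {"action": "click", "start_point": [937, 1181]},
  {"action": "click", "start_point": [1677, 1389]},
  {"action": "click", "start_point": [1728, 1421]}
]""")

for step in actions:
    if step["action"] == "click":
        x, y = step["start_point"]
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)
```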
10. Roadmap & Contribution Paths
- Weights & Data: already on Hugging Face & ModelScope
- Evaluation Harness: open-sourced at github.com/XiaomiMiMo/lmms-eval
- LoRA Recipes: 8 downstream tasks (chart QA, receipt OCR, GUI agent)
- Community Wishlist:
  - 4-bit GGUF for llama.cpp
  - ONNX export for edge devices
  - iOS Shortcuts action

Pull requests welcome; Apache-2.0 makes it easy.
11. Citation
If you build on MiMo-VL-7B, please cite:
```bibtex
@misc{coreteam2025mimovltechnicalreport,
  title={MiMo-VL Technical Report},
  author={LLM-Core-Team Xiaomi},
  year={2025},
  eprint={2506.03569},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.03569},
}
```
12. Final Takeaway
MiMo-VL-7B proves you don't need a data center full of GPUs to get data-center-grade vision reasoning.
Download the weights, spin up the Gradio demo, and see how far a 7-billion-parameter brain can take you, with no cloud bill attached.