
Xiaomi Open-Sources MiMo-VL-7B: A 7-Billion-Parameter Vision-Language Model That Outperforms 70B+ Giants

“I want my computer to understand images, videos, and even control my desktop—without renting a data-center.”
If that sounds like you, Xiaomi’s freshly released MiMo-VL-7B family might be the sweet spot.
Below is a 20-minute read that turns the 50-page technical report into plain English: what it is, why it matters, how to run it, and what you can build next.


TL;DR Quick Facts

| Capability | Score | Benchmark Leader? | What it means for you |
| --- | --- | --- | --- |
| University-level multi-discipline Q&A (MMMU) | 70.6 | #1 among 7B–72B open models | Reads textbooks, charts, slides |
| Video understanding (Video-MME) | 70.8 | Top-3 open, any size | Summarizes 2-min clips at 2 FPS |
| GUI control (OSWorld-G) | 56.1 | Beats UI-specialized UI-TARS | Automates desktop/mobile apps |
| Math Olympiad (OlympiadBench) | 59.4 | Beats 72B Qwen2.5-VL (+22 pts) | Solves competition-level problems |
| Model size | 7 B parameters | Fits RTX 4090 in int4 | Local & offline |

1. What exactly is MiMo-VL-7B?

Three Lego-like parts:

  1. Vision Encoder (Qwen2.5-ViT)
    • Native-resolution input—no down-scaling, no square cropping
    • Handles 4096×28×28 visual tokens without distortion
  2. MLP Projector
    • Learns to map image tokens → language space
    • Randomly initialized, then trained in stage 1
  3. Language Backbone (MiMo-7B)
    • 36 layers, 4096 hidden dims, optimized for reasoning
    • Released earlier by Xiaomi for pure-text tasks

Think of it as “eyes + translator + brain” in one package.
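
For readers who think in code, here is a minimal, purely illustrative sketch of how the three parts compose at inference time. The class, attribute, and dimension names (MiMoVLSketch, vit, projector, llm, vit_dim) are placeholder assumptions, not the actual MiMo-VL implementation.

import torch
import torch.nn as nn

class MiMoVLSketch(nn.Module):
    """Illustrative only: eyes (ViT) -> translator (MLP projector) -> brain (LLM)."""
    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.vit = vit                    # native-resolution vision encoder (Qwen2.5-ViT)
        self.projector = nn.Sequential(   # randomly initialized, trained first in stage 1
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm                    # MiMo-7B backbone (36 layers, hidden size 4096)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        visual_tokens = self.projector(self.vit(pixel_values))   # [n_visual_tokens, llm_dim]
        inputs = torch.cat([visual_tokens, text_embeds], dim=0)  # image tokens come first
        return self.llm(inputs_embeds=inputs.unsqueeze(0))       # add batch dimension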


2. Training Pipeline in One Glance

| Stage | Goal | Tokens | Unfrozen Modules | Key Data Mix |
| --- | --- | --- | --- | --- |
| 1 Warm-up | Teach projector to speak | 300 B | MLP only | Image-caption pairs |
| 2 Alignment | Sync ViT & LLM | 167 B | ViT + MLP | Interleaved web pages |
| 3 General Pre-training | Add OCR, video, GUI | 1.4 T | All | OCR, video, GUI, synthetic reasoning |
| 4 Long-Context SFT | Stretch to 32 k tokens | 550 B | All | Long docs, long reasoning chains |

Total: 2.4 T tokens across 4 stages.
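
A hedged sketch of what that freeze/unfreeze schedule could look like in PyTorch. The attribute names (vit, projector, llm) and the freeze_for_stage helper are illustrative assumptions, not Xiaomi's training code.

# Toggle requires_grad to mirror the "Unfrozen Modules" column above (illustrative only).
STAGE_UNFROZEN = {
    1: {"projector"},                  # warm-up: MLP only
    2: {"vit", "projector"},           # alignment: ViT + MLP
    3: {"vit", "projector", "llm"},    # general pre-training: everything
    4: {"vit", "projector", "llm"},    # long-context stage: everything, 32k sequences
}

def freeze_for_stage(model, stage: int) -> None:
    unfrozen = STAGE_UNFROZEN[stage]
    for name in ("vit", "projector", "llm"):
        for p in getattr(model, name).parameters():
            p.requires_grad = name in unfrozen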


3. Data Diet: How Xiaomi Fed the Model

3.1 Image-Caption Pairs

  • Start with open captions → deduplicate → regenerate with a stronger caption model → quality filter
  • Result: balanced, high-detail descriptions in both Chinese & English

3.2 OCR & Grounding

  • Scanned docs, handwriting, warped text, blurry labels—bounding boxes included
  • Single-object & multi-object scenes with complex referential phrases

3.3 Video Corpus

  • 2 FPS sampling → dense, timestamped captions
  • Global summaries (“narrative arc, style, intent”)
  • Synthetic Q&A pairs for dialog-style understanding
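
If you want to replicate that 2 FPS sampling step on your own clips, a minimal sketch with OpenCV follows; the cv2 dependency and the sample_frames helper are assumptions, not part of the MiMo-VL repo.

import cv2

def sample_frames(video_path: str, target_fps: float = 2.0):
    """Yield (timestamp_in_seconds, frame) pairs at roughly target_fps."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(int(round(native_fps / target_fps)), 1)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / native_fps, frame
        index += 1
    cap.release()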

3.4 GUI Data

  • Real screenshots from Android, iOS, Web, Desktop
  • Synthetic engine fills gaps: Chinese UI, rare widgets
  • Two tasks:
    • Element grounding → “Click the ‘Settings’ icon.”
    • Action prediction → “Given before/after screenshots, what tap/scroll/key happened?”

3.5 Synthetic Reasoning Chains

  • Harvest open QA → regenerate answers with step-by-step reasoning → multi-stage correctness filter
  • Inserted directly into pre-training (not just fine-tuning) → deeper reasoning without saturation
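
As a toy illustration of the correctness-filter idea (not the multi-stage filter the team actually used), one could keep a regenerated chain only when its final answer matches the reference:

def keep_chain(reference_answer: str, generated_chain: str) -> bool:
    """Toy filter: accept a synthetic reasoning chain only if its last line
    contains the reference answer (case-insensitive)."""
    final_line = generated_chain.strip().splitlines()[-1]
    return reference_answer.strip().lower() in final_line.lower()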

4. Post-Training: MORL (Mixed On-Policy RL)

After pre-training, the model enters a single RL phase that blends four objectives:

| Objective | Reward Source | Example |
| --- | --- | --- |
| Verifiable reasoning | Rule scripts | Math problems auto-checked by math-verify |
| Human preference | Reward model | Side-by-side human votes |
| Visual grounding | GIoU or point-in-box | RefCOCO-style tasks |
| Counting / temporal grounding | Exact-match or IoU | “How many muffins?” or “00:23–00:47 clip” |
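
To make the grounding row concrete, here is a textbook GIoU computation for axis-aligned boxes; it is a generic sketch, not the reward script behind Xiaomi's Reward-as-a-Service.

def giou(pred, gt):
    """Generalized IoU for boxes given as (x1, y1, x2, y2); returns a value in [-1, 1]."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    # smallest box enclosing both predictions and ground truth
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return inter / union - (enclose - union) / enclose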

On-policy GRPO variant

  • Only the newest model generates training samples
  • Removes KL penalty → stabler updates
  • Reward-as-a-Service (micro-service) → <1 ms reward latency
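
A minimal sketch of the group-relative advantage at the heart of GRPO, with the KL term simply dropped as the bullets above describe; the function name and the loss comment are illustrative, not the actual MORL implementation.

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape [group_size], scores for several responses sampled from the
    current policy for the same prompt; advantage = within-group normalized reward."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Policy-gradient objective without a KL penalty (the on-policy variant above):
# loss = -(grpo_advantages(rewards).detach()[:, None] * token_log_probs).mean()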

5. Benchmark Highlights (Human Translation)

| Task Type | Dataset | MiMo-VL-7B-RL | Closest Open Rival | Delta |
| --- | --- | --- | --- | --- |
| Multi-discipline QA | MMMU | 70.6 | Qwen2.5-VL-72B (58.6) | +12.0 |
| Chart & Doc QA | CharXiv-RQ | 56.5 | InternVL3-78B (37.6) | +18.9 |
| Video QA | Video-MME | 70.8 | InternVL3-78B (66.3) | +4.5 |
| GUI grounding | OSWorld-G | 56.1 | UI-TARS (49.6) | +6.5 |
| Math reasoning | OlympiadBench | 59.4 | Qwen2.5-VL-72B (37.2) | +22.2 |
| Elo user rating | Internal Arena | 1131 | Previous best (1093) | +38 |

6. Hands-On Guide: Run It in 10 Minutes

6.1 Hardware

  • Minimum: RTX 4090 24 GB (int4)
  • Sweet spot: A100 40 GB (fp16)
  • CPU-only: Not recommended (2+ min / image)

6.2 Install

# 1. Clone
git clone https://github.com/XiaomiMiMo/MiMo-VL.git
cd MiMo-VL

# 2. Dependencies
pip install -r requirements.txt  # torch>=2.1, transformers>=4.39

# 3. Download checkpoints (choose one)
huggingface-cli download XiaomiMiMo/MiMo-VL-7B-RL-2508 --local-dir ./checkpoints/rl
# or
huggingface-cli download XiaomiMiMo/MiMo-VL-7B-SFT-2508 --local-dir ./checkpoints/sft

6.3 Gradio Demo

python demo/app.py --model_path ./checkpoints/rl --port 7860

Open http://localhost:7860, drag-drop an image, type a question.

6.4 Python API

from transformers import AutoModelForImageTextToText, AutoProcessor

# MiMo-VL builds on the Qwen2.5-VL vision stack, so load it with the
# image-text-to-text auto classes (recent transformers release required).
model = AutoModelForImageTextToText.from_pretrained("./checkpoints/rl", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("./checkpoints/rl")

messages = [{"role": "user", "content": [
    {"type": "image", "image": "invoice.jpg"},
    {"type": "text", "text": "What is the total amount?"},
]}]
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])

7. Thinking Toggle: When You Do (or Don’t) Want the Brain Dump

| Mode | Trigger | Use Case |
| --- | --- | --- |
| Think mode (default) | No tag | Step-by-step reasoning visible |
| No-think mode | Add /no_think to prompt | Fast answers, no chain-of-thought |

Example:
"Identify the text in the image. /no_think" → returns text instantly.


8. FAQ: 10 Real-World Questions Answered

Q1: Can I fine-tune it for my vertical?
A: Yes. Use the SFT checkpoint + LoRA scripts in scripts/lora_finetune.py.

Q2: License for commercial use?
A: Apache-2.0. Keep attribution and you’re good.

Q3: Chinese language support?
A: Training data is 50 % Chinese, 50 % English. Handles mixed input natively.

Q4: Batch-processing PDFs?
A: Yes. The bundled examples/batch_pdf.py uses PyMuPDF to convert pages → images → JSON lines.
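
If you just want the core PyMuPDF step, a rough sketch is below; the output naming and DPI are arbitrary choices, and the snippet covers only the page-to-image part, not the JSON-lines export.

import fitz  # PyMuPDF

def pdf_to_images(pdf_path: str, out_dir: str = ".", dpi: int = 150) -> list[str]:
    doc = fitz.open(pdf_path)
    paths = []
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=dpi)            # render the page to a raster image
        path = f"{out_dir}/page_{i:03d}.png"
        pix.save(path)
        paths.append(path)
    return paths                                  # feed each image to the model as in section 6.4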

Q5: Longest video?
A: 256 frames @ 2 FPS ≈ 2 min. For longer clips, sample key frames every N seconds.

Q6: Does it run on CPU?
A: Technically yes (device_map="cpu"), expect 5–10× slower than GPU.

Q7: GUI automation on mobile?
A: Requires Android ADB or iOS XCTest bridge; the model outputs JSON actions ready for injection.

Q8: Int4 vs. fp16 trade-off?
A: Int4 drops VRAM from ~14 GB → 8 GB; negligible quality loss on MMMU (-0.3 pts).
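
For reference, a hedged 4-bit loading example via bitsandbytes; this assumes the checkpoint goes through the standard transformers quantization path and is not an officially documented recipe.

import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# Requires the bitsandbytes package to be installed alongside transformers.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForImageTextToText.from_pretrained("./checkpoints/rl", quantization_config=bnb_config, device_map="auto")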

Q9: Any 13B or 34B plans?
A: Not announced. Community poll is live on GitHub Discussions.

Q10: Privacy of reasoning chains?
A: Fully local. Nothing is sent back to Xiaomi servers.


9. Case Study Walk-throughs

9.1 Turn a Messy Invoice into JSON

[Image: sample invoice]

Prompt:
"Extract vendor name, total amount, and tax in JSON format."

Output:

{
  "vendor": "ABC Supplies Ltd.",
  "total": 112589.52,
  "tax": 12952.78
}

9.2 Summarize a 2-Minute Product Video

Upload a Xiaomi SU7 promo clip.
Prompt:
"List the top 3 selling points mentioned visually."

Answer (abridged):

  1. Sleek aerodynamic silhouette
  2. Lightning-fast acceleration (0-100 km/h in 2.78 s)
  3. Full-scene smart cockpit

9.3 GUI Scripting: Add Item to Cart

[Image: e-commerce GUI screenshot]

Prompt:
"Add the red Xiaomi SU7 to the wishlist."

Model returns:

[
  {"action": "click", "start_point": [937, 1181]},
  {"action": "click", "start_point": [1677, 1389]},
  {"action": "click", "start_point": [1728, 1421]}
]

Feed these into an automation bridge (Appium/ADB) and the car lands in your wishlist.
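
One hedged way to replay those clicks on an Android device over ADB; the JSON schema is taken exactly as the model printed it above, the coordinates are assumed to already be device pixels, and model_output is a placeholder for the raw model response.

import json
import subprocess

actions = json.loads(model_output)   # model_output: the JSON action list shown above
for step in actions:
    if step["action"] == "click":
        x, y = step["start_point"]
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)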


10. Roadmap & Contribution Paths

  • Weights & Data: Already on Hugging Face & ModelScope
  • Evaluation Harness: Open-sourced at github.com/XiaomiMiMo/lmms-eval
  • LoRA Recipes: 8 downstream tasks (chart QA, receipt OCR, GUI agent)
  • Community Wishlist:
    • 4-bit GGUF for llama.cpp
    • ONNX export for edge devices
    • iOS Shortcuts action

Pull requests welcome—Apache-2.0 makes it easy.


11. Citation

If you build on MiMo-VL-7B, please cite:

@misc{coreteam2025mimovltechnicalreport,
  title={MiMo-VL Technical Report},
  author={LLM-Core-Team Xiaomi},
  year={2025},
  eprint={2506.03569},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.03569},
}

12. Final Takeaway

MiMo-VL-7B proves you don’t need a data center’s worth of GPUs to get data-center-grade vision reasoning.
Download the weights, spin up the Gradio demo, and see how far a 7-billion-parameter brain can take you—no cloud bill attached.
