Xiaomi MiMo-V2-Flash: Deep Dive into the 309B Parameter Efficient AI Model

Summary: Xiaomi’s MiMo-V2-Flash is a Mixture-of-Experts language model with 309B total parameters, of which only 15B are active per token. Its 128-token sliding-window attention compresses the KV cache roughly 6×, it reaches a 73.4% resolution rate on SWE-Bench Verified, and multi-token prediction delivers up to a 2.6× inference speedup, positioning it among the most efficient open-source code agent models available today.


Why Are AI Models Getting Slower Despite Growing Larger?

When using ChatGPT or other AI assistants, you might notice an intriguing paradox: models keep getting more powerful, yet response times don’t seem to improve proportionally. What’s behind this phenomenon? Xiaomi’s newly released MiMo-V2-Flash offers a different answer—instead of making models bigger, make them smarter.

The model is genuinely both fast and capable. It excels at practical tasks like code debugging, mathematical reasoning, and multilingual programming, while decoding up to roughly 2.6× faster than standard token-by-token generation. How is this possible?

What Makes MiMo-V2-Flash Fundamentally Different?

Hybrid Attention Architecture: Focusing Like Humans Do

Traditional AI models processing long texts need to “remember” the relationship between every word and every other word. Imagine reading a 300-page book while remembering how each word on page 1 relates to every word on page 300—clearly unrealistic and unnecessary.

MiMo-V2-Flash adopts an approach closer to human cognition:

Sliding Window Attention (SWA): The model focuses only on the most recent 128 tokens (approximately 100 Chinese characters or 50-70 English words). This mirrors how you primarily focus on the current paragraph while reading.

Global Attention (GA): After every 5 sliding window layers, 1 global attention layer is inserted, allowing the model to “review” key information from the entire context.

This 5:1 hybrid ratio delivers significant efficiency gains:

  • Nearly 6× reduction in KV cache storage: Dramatically lower memory footprint for long texts
  • Nearly 6× reduction in attention computation: Noticeably faster processing
  • Enhanced long-context understanding: Retrieval accuracy approaching 100% across 32K to 256K context lengths
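
As a minimal sketch of how the 5:1 interleaving and the 128-token window could be laid out (the layer placement and mask construction below are illustrative assumptions, not the released implementation):

```python
import torch

WINDOW = 128        # sliding-window size from the article
SWA_PER_GA = 5      # 5 sliding-window layers for every global-attention layer
NUM_LAYERS = 48     # total depth

# Illustrative schedule: every sixth layer is global attention. This simple
# rule gives 40 SWA + 8 GA layers; the released model reports 39 + 9, so the
# real placement differs slightly.
layer_types = ["GA" if (i + 1) % (SWA_PER_GA + 1) == 0 else "SWA"
               for i in range(NUM_LAYERS)]

def attention_mask(seq_len: int, layer_type: str) -> torch.Tensor:
    """Boolean mask: True where a query position may attend to a key position."""
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]              # standard causal constraint
    if layer_type == "GA":
        return causal                                  # global layers see the full prefix
    in_window = (pos[:, None] - pos[None, :]) < WINDOW
    return causal & in_window                          # SWA layers see only the last 128 keys

print(layer_types.count("SWA"), layer_types.count("GA"))   # 40 8 in this sketch
print(attention_mask(5, "SWA"))
```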

Attention Sink Bias: Teaching Models to Ignore

More ingeniously, MiMo-V2-Flash introduces learnable attention sink bias mechanisms. This enables the model to autonomously decide which information can be “safely ignored,” rather than being forced to treat all information equally.

Technically, the model adds a learnable parameter sink to the softmax denominator, transforming the attention weight calculation to:

Attention Weight = exp(attention_score - max_value) / [exp(sink - max_value) + Σexp(other_scores - max_value)]

This simple improvement not only prevents performance degradation in the 128-token window model but actually surpasses full-attention baseline models on certain tasks.
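
As a minimal sketch of the formula above (tensor shapes and whether the sink is shared or per-head are assumptions for illustration):

```python
import torch

def softmax_with_sink(scores: torch.Tensor, sink: torch.Tensor) -> torch.Tensor:
    """Attention weights where a learnable `sink` logit is added to the
    denominator only, letting the head assign probability mass to "nothing".

    scores: (..., num_keys) raw attention logits
    sink:   scalar (or per-head) learnable parameter
    """
    max_val = torch.maximum(scores.max(dim=-1, keepdim=True).values, sink)
    exp_scores = torch.exp(scores - max_val)
    exp_sink = torch.exp(sink - max_val)
    return exp_scores / (exp_sink + exp_scores.sum(dim=-1, keepdim=True))

scores = torch.randn(2, 4, 16)                 # (batch, heads, keys), toy sizes
sink = torch.nn.Parameter(torch.zeros(1))      # learned during training
weights = softmax_with_sink(scores, sink)
print(weights.sum(-1))                         # sums to < 1: the remainder "sinks"
```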

Mixture-of-Experts Architecture: 256 Experts Working in Harmony

MiMo-V2-Flash employs a sparse Mixture-of-Experts (MoE) architecture, comparable to a team of 256 “experts” of which only 8 are consulted for each token.

Detailed Configuration Parameters

Main Model Structure:

  • Total layers: 48 (39 sliding window + 9 global attention)
  • MoE layers: 256 experts per layer, 8 activated per token
  • Query heads: 64 (identical for SWA and GA)
  • Key-value heads: 8 for SWA, 4 for GA
  • Head dimensions: 192 for queries/keys, 128 for values
  • Total parameters: 309B
  • Active parameters: 15B

This design delivers two critical advantages:

  1. High parameter efficiency: Despite 309B total parameters, only 15B activate during inference—achieving large-model capabilities at small-model computational costs

  2. Specialized division of labor: Different experts specialize in different task types (mathematical reasoning, code generation, creative writing) and are called on precisely when needed
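
A simplified sketch of top-8-of-256 routing; the gating details (where the softmax sits, load-balancing bias, expert sizes) are assumptions for illustration:

```python
import torch

NUM_EXPERTS, TOP_K, HIDDEN = 256, 8, 64   # HIDDEN is a toy size, not the real model dimension

router = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Sequential(
        torch.nn.Linear(HIDDEN, 4 * HIDDEN),
        torch.nn.GELU(),
        torch.nn.Linear(4 * HIDDEN, HIDDEN),
    )
    for _ in range(NUM_EXPERTS)
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, HIDDEN). Each token is processed by only its top-8 experts."""
    gate_logits = router(x)                               # (tokens, 256)
    weights, idx = gate_logits.topk(TOP_K, dim=-1)        # choose 8 experts per token
    weights = torch.softmax(weights, dim=-1)              # normalize over the chosen 8
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                           # naive loop, for clarity only
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])
    return out

print(moe_forward(torch.randn(4, HIDDEN)).shape)          # torch.Size([4, 64])
```

Real implementations batch tokens per expert rather than looping token by token, but the routing logic is the same.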

Multi-Token Prediction: Thinking Three Steps Ahead

Traditional AI models resemble novice chess players, thinking only one move ahead. MiMo-V2-Flash operates like experienced grandmasters, simultaneously predicting the next 3 moves.

How Multi-Token Prediction (MTP) Works

The model features 3 lightweight MTP modules, each containing only 0.33B parameters, utilizing:

  • Sliding window attention (window size 128)
  • Dense feed-forward networks (not MoE)
  • Shared embedding and output layers

In practical applications, MTP acceleration varies by task:

| Task Type | Next-Token Cross-Entropy | Average Accept Length |
| --- | --- | --- |
| Web Development | 0.05 | 3.6 |
| Code Generation | 0.12 | 3.2 |
| Mathematical Reasoning | 0.18 | 3.0 |
| Scientific Programming | 0.20 | 2.9 |

Higher predictive certainty (lower entropy) produces longer acceptance lengths and more pronounced acceleration. Fitting accept length y against next-token cross-entropy x gives y = 4(1 - 0.58x^0.58), with R² = 0.995.

In standard tests with batch size 64, 16K input, and 1K output, 3-layer MTP achieves:

  • Up to 2.67× decoding speedup (accept length 3.8)
  • Average 2.39× decoding speedup (accept length 3.4)
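
A quick back-of-envelope check: under a simple assumed cost model where each decoding step costs one main-model pass plus a fixed relative drafting overhead c, speedup ≈ accept length / (1 + c). Both reported figures above are consistent with c ≈ 0.42; this cost model is an illustration, not the paper's analysis.

```python
def speedup(accept_length: float, overhead: float = 0.42) -> float:
    """Tokens emitted per step divided by the relative cost of that step,
    assuming drafting + verification together cost (1 + overhead) main-model passes."""
    return accept_length / (1.0 + overhead)

# Reproduces the reported figures reasonably closely:
print(round(speedup(3.8), 2))   # ~2.68 vs the reported 2.67x
print(round(speedup(3.4), 2))   # ~2.39 vs the reported 2.39x
```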

Pre-Training: 27 Trillion Tokens of Knowledge Accumulation

MiMo-V2-Flash’s pre-training progressed through three meticulously designed stages:

Stage One: Foundation Building (0-22T tokens)

During this phase, the model encountered diverse high-quality data:

  • Public web content
  • Books
  • Academic papers
  • Programming code
  • Mathematical materials
  • STEM domain content

Training configuration:

  • Context length: 32,768 tokens
  • Learning rate: Linear warmup from 0 to 3.2×10⁻⁴, constant for 12T tokens, cosine decay to 1.0×10⁻⁴
  • Batch size: Linear warmup to 2,048 (first 500B tokens), then constant
  • MTP loss weight: 0.3
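
As a rough illustration of the Stage One schedule above (linear warmup, a long constant plateau, then cosine decay); the warmup length is not stated here, so the value below is a placeholder assumption:

```python
import math

PEAK_LR, FINAL_LR = 3.2e-4, 1.0e-4
WARMUP_T, CONSTANT_UNTIL_T, STAGE_END_T = 0.1, 12.0, 22.0   # trillions of tokens; warmup length is a guess

def stage_one_lr(tokens_t: float) -> float:
    """Learning rate as a function of tokens seen (in trillions of tokens)."""
    if tokens_t < WARMUP_T:                       # linear warmup from 0 to the peak
        return PEAK_LR * tokens_t / WARMUP_T
    if tokens_t < CONSTANT_UNTIL_T:               # constant plateau
        return PEAK_LR
    # cosine decay from PEAK_LR down to FINAL_LR over the remainder of Stage One
    progress = (tokens_t - CONSTANT_UNTIL_T) / (STAGE_END_T - CONSTANT_UNTIL_T)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))

for t in (0.05, 6.0, 17.0, 22.0):
    print(t, f"{stage_one_lr(t):.2e}")
```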

Stage Two: Mid-Training Enhancement (22-26T tokens)

This stage emphasized reasoning capabilities:

  • Upsampled code-related data
  • Introduced approximately 5% synthetic reasoning data
  • Learning rate: Cosine decay from 1.0×10⁻⁴ to 3.0×10⁻⁵
  • MTP loss weight: 0.1

Stage Three: Long-Context Extension (26-27T tokens)

The final stage extended the context window to 256K:

  • Upsampled data with long-range dependencies
  • RoPE base frequency adjustment: GA from 640,000 to 5,000,000
  • Learning rate: Decay from 3.0×10⁻⁵ to 1.0×10⁻⁵
  • Batch size: 256

After these three stages, MiMo-V2-Flash-Base demonstrated outstanding performance across multiple benchmarks:

Mathematical Reasoning:

  • GSM8K: 92.3%
  • MATH: 71.0%
  • AIME 2024&2025: 35.3%

Coding Capabilities:

  • HumanEval+: 70.7%
  • MBPP+: 71.4%
  • BigCodeBench: 70.1%

Long-Context Retrieval:

  • 32K: 99.3% accuracy
  • 64K: 99.9% accuracy
  • 128K: 98.6% accuracy
  • 256K: 96.7% accuracy

Multi-Teacher On-Policy Distillation: A New Post-Training Paradigm

If pre-training builds the foundation, post-training provides refinement. MiMo-V2-Flash employs the innovative MOPD (Multi-Teacher On-Policy Distillation) paradigm, a three-stage refinement process.

Stage One: Supervised Fine-Tuning (SFT)

Establishing foundational instruction-following capabilities using millions of high-quality training samples covering:

  • General conversation
  • Reasoning tasks
  • Programming
  • Agent tasks
  • Thinking and non-thinking modes

Key training parameters:

  • Learning rate: Cosine decay from 5.0×10⁻⁵ to 5.0×10⁻⁶
  • Batch size: 128
  • AdamW ε: 1.0×10⁻⁸
  • MoE expert bias update rate: 1.0×10⁻⁴

Stage Two: Domain-Specialized Training

Training multiple domain-expert teacher models through reinforcement learning:

Non-Agentic RL:

  • Mathematical reasoning
  • Logical reasoning
  • Safety alignment
  • Code generation

Agentic RL:

  • Code debugging agents (90K real tasks + 30K synthetic tasks)
  • Search agents (150K synthetic tasks)
  • General tool agents (50K synthetic tasks)

The scale of code agent training is particularly impressive: on-policy rollouts and updates run across 120K environments, supported by infrastructure that includes:

  • Automated environment setup pipeline with 70% success rate
  • Support for 8 programming languages
  • Large-scale Kubernetes cluster running 10,000+ concurrent pods
  • Lightweight agent scaffold providing only 3 atomic tools (bash, str_replace, finish)
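
As a hedged sketch of how such a minimal scaffold could be described to the model: the tool names below come from the article, while the argument fields are illustrative guesses rather than the released interface.

```python
# Hypothetical tool schemas for a minimal three-tool code-agent scaffold.
TOOLS = [
    {
        "name": "bash",
        "description": "Run a shell command in the repository and return stdout/stderr.",
        "parameters": {"command": "string"},
    },
    {
        "name": "str_replace",
        "description": "Replace an exact string in a file with a new string.",
        "parameters": {"path": "string", "old_str": "string", "new_str": "string"},
    },
    {
        "name": "finish",
        "description": "Declare the task complete and submit the current state.",
        "parameters": {},
    },
]
```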

Stage Three: MOPD Distillation

In this critical stage, the student model samples from its own distribution while receiving:

  1. Token-level rewards: Dense supervision from domain-expert teachers
  2. Outcome-level rewards: Verification from Outcome Reward Models (ORM)

Technically, MOPD implements reverse KL divergence loss:

Advantage = log[Teacher_Probability / Student_Probability] + α × ORM_Advantage
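
In code, the token-level signal above might look like the following minimal sketch; tensor shapes, the broadcasting of the outcome reward, and the value of α are assumptions for illustration rather than the released training recipe.

```python
import torch

def mopd_token_advantage(teacher_logprobs: torch.Tensor,
                         student_logprobs: torch.Tensor,
                         orm_advantage: torch.Tensor,
                         alpha: float = 1.0) -> torch.Tensor:
    """teacher_logprobs, student_logprobs: (batch, seq) log-probs of the tokens
    the *student* actually sampled; orm_advantage: (batch,) outcome-level
    advantage broadcast over the sequence."""
    dense = teacher_logprobs - student_logprobs          # log(p_teacher / p_student)
    return dense + alpha * orm_advantage[:, None]

# Toy usage: 2 sampled trajectories of 5 tokens each.
t = torch.log(torch.rand(2, 5))
s = torch.log(torch.rand(2, 5))
adv = mopd_token_advantage(t, s, orm_advantage=torch.tensor([1.0, -0.5]))
print(adv.shape)   # torch.Size([2, 5])
```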

This approach offers distinct advantages:

  • Effective and efficient: Preserves peak capabilities of each teacher while avoiding capability trade-offs
  • Modular and scalable: Teachers can be RL models, SFT models, or even the student itself
  • Iterative co-evolution: Distilled students can be retrained as stronger teachers

Experimental data validates MOPD effectiveness:

| Benchmark | Student (Pre-MOPD) | Best Teacher | Student (Post-MOPD) | Gain over Best Teacher |
| --- | --- | --- | --- | --- |
| AIME 2025 | 89.3% | 93.9% (RL) | 94.1% | +0.2% |
| HMMT 2025 | 76.9% | 82.6% (RL) | 84.4% | +1.8% |
| LiveCodeBench | 77.5% | 82.6% (RL) | 83.2% | +0.6% |
| HLE (No Tools) | 21.2% | 21.2% (Self) | 22.8% | +1.6% |
| Arena-Hard (Hard Prompt) | 50.0% | 50.0% (Self) | 54.1% | +4.1% |

Real-World Performance: Head-to-Head with Top Models

Let the data speak. MiMo-V2-Flash’s performance across key benchmarks:

Code Agent Tasks (Strongest Domain)

SWE-Bench Verified: 73.4%

  • DeepSeek-V3.2: 73.1%
  • Kimi-K2: 71.3%
  • Claude Sonnet 4.5: 77.2%
  • GPT-5 High: 74.9%

SWE-Bench Multilingual: 71.7%

  • DeepSeek-V3.2: 70.2%
  • Kimi-K2: 61.1%
  • Claude Sonnet 4.5: 68.0%

This means MiMo-V2-Flash resolves over 70% of the real GitHub issues in these benchmarks, leading the listed open-source models on multilingual code tasks.

Mathematical Reasoning Capabilities

AIME 2025: 94.1%

  • Approaching human math competition performance
  • Solving 94% of high-school competition problems

HMMT Feb. 2025: 84.4%

  • Harvard-MIT Mathematics Tournament
  • Demonstrating competition-grade mathematical prowess

Complex Reasoning Tasks

GPQA-Diamond: 83.7%

  • Graduate-level scientific questions
  • Requiring deep domain knowledge

LiveCodeBench-v6: 80.6%

  • Real-time code generation evaluation
  • Avoiding training data contamination

Long-Context Understanding (Significant Advantage)

LongBench V2: 60.6%

  • Surpassing Kimi-K2 (45.1%)
  • Exceeding DeepSeek-V3.2 (58.4%)

MRCR (up to 128K context): 45.7%

  • Multi-needle retrieval tasks
  • Validating hybrid sliding window architecture effectiveness

General Capabilities

MMLU-Pro: 84.9%
Arena-Hard (Creative Writing): 86.2%
τ²-Bench (Tool Use): 80.3%

Breakthrough in Reinforcement Learning Training Scale

MiMo-V2-Flash’s code agent training reveals a stunning discovery: large-scale agent reinforcement learning training not only improves agent performance but generalizes to other task types.

Training Curve Analysis

After training across 120K environments:

  • SWE-Bench Verified: Improved from ~60% to 73.4%
  • SWE-Bench Multilingual: Improved from ~50% to 71.7%

More importantly, this training brought cross-domain capability improvements:

Mathematical Abilities:

  • AIME 2025: Improved from ~80% to 83%
  • HMMT Feb. 2025: Improved from ~64% to 68%

Other Code Tasks:

  • LiveCodeBench: Improved from ~71% to 74%

Reasoning Capabilities:

  • GPQA-Diamond: Improved from ~74% to 77%

General Tasks:

  • Arena-Hard (Hard Prompt): Improved from ~48% to 52%
  • τ²-Bench: Improved from ~72% to 76%

This demonstrates that agent training cultivates broadly transferable problem-solving capabilities, not merely task-specific optimization.

Training Infrastructure Innovation

To support such massive training, the development team built three key modules:

1. Rollout Routing Replay (R3)
Solves expert routing inconsistency between rollout and training in MoE models, making overhead negligible through optimized data types and communication overlapping.
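
The description above is high-level; conceptually, R3 amounts to recording which experts each token was routed to during rollout and replaying those choices at training time. The sketch below is a hypothetical illustration of that idea, not the released implementation.

```python
import torch

class RoutingRecorder:
    """Record the expert indices chosen for each MoE layer during rollout so
    the training forward pass can reuse them instead of re-running a router
    whose parameters may have drifted."""

    def __init__(self):
        self.cache = {}                       # layer index -> (tokens, top_k) expert ids

    def record(self, layer: int, topk_idx: torch.Tensor) -> torch.Tensor:
        self.cache[layer] = topk_idx.detach().cpu()
        return topk_idx

    def replay(self, layer: int) -> torch.Tensor:
        return self.cache[layer]

def route(gate_logits: torch.Tensor, recorder: RoutingRecorder,
          layer: int, replay: bool) -> torch.Tensor:
    """During rollout, pick and record top-8 experts; during training, replay them."""
    if replay:
        return recorder.replay(layer)
    _, idx = gate_logits.topk(8, dim=-1)
    return recorder.record(layer, idx)
```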

2. Data Scheduler

  • Implements fine-grained sequence scheduling instead of micro-batch scheduling
  • Dynamically assigns new prompts based on historical pass rates
  • Supports partial rollout, partitioning overlong trajectories across steps
  • Employs staleness-aware truncated importance sampling

3. Toolbox and Tool Manager

  • Efficient scheduling based on Ray
  • Centralized resource allocation enforcing quota and QPS limits
  • Fault-tolerant actor pools eliminating cold-start delays
  • Environment pre-warming and sequence-level asynchronous reward computation
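
The staleness-aware truncated importance sampling mentioned under the data scheduler is not spelled out here. As a generic illustration only: truncated importance sampling clips the policy ratio at a cap, and a staleness-aware variant might tighten that cap for older rollouts (the decay schedule below is a pure assumption).

```python
import torch

def truncated_is_weight(logp_current: torch.Tensor,
                        logp_rollout: torch.Tensor,
                        staleness: int,
                        base_cap: float = 2.0,
                        decay: float = 0.9) -> torch.Tensor:
    """Importance ratio exp(logp_current - logp_rollout), clipped at a cap that
    shrinks toward 1 for more stale rollouts. Illustrative only."""
    cap = max(1.0, base_cap * decay ** staleness)
    ratio = torch.exp(logp_current - logp_rollout)
    return torch.clamp(ratio, max=cap)

weights = truncated_is_weight(torch.tensor([-1.0, -0.2]),
                              torch.tensor([-1.5, -2.0]), staleness=3)
print(weights)   # ratios above the staleness-adjusted cap are clipped
```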

Practical Applications: From Theory to Practice

Having covered the technical principles, let’s examine what MiMo-V2-Flash can do in real scenarios.

Software Development Assistant

Imagine you’re a developer facing a complex GitHub issue:

Problem: In a Python project, a feature occasionally experiences data races in multi-threaded environments.

MiMo-V2-Flash’s Workflow:

  1. Use bash tool to read relevant code files
  2. Analyze root cause (likely unprotected shared state access)
  3. Use str_replace tool to modify code, adding thread locks
  4. Execute test cases to verify the fix
  5. If tests fail, iterate adjustments until passing

After training across 120K environments (90K built from real GitHub tasks plus 30K synthetic ones), the model successfully resolves 73.4% of such practical issues.

Mathematics Competition Tutoring

Question Type: AIME-level math competition problems

Model Capabilities:

  • Understanding complex mathematical problem statements
  • Selecting appropriate solution strategies
  • Performing multi-step reasoning
  • Verifying answer correctness

The 94.1% AIME 2025 score means correctly solving approximately 14 out of 15 competition problems.

Long Document Analysis

Scenario: Analyzing a 128K token (~100,000 word) technical document

Traditional Model Challenges:

  • Massive memory consumption
  • Slow processing speed
  • Potentially missing critical information

MiMo-V2-Flash Advantages:

  • Through sliding window attention, KV cache memory is only about 1/6 that of a full-attention model
  • Global attention layers ensure no critical information is missed
  • In MRCR tests, maintains 45.7% accuracy even with multiple “needles” (information to retrieve) embedded in documents

Multilingual Code Migration

Task: Migrating Python codebase to Java

Model Requirements:

  • Understanding Python code logic
  • Knowledge of Java language features
  • Handling language differences (memory management, type systems)
  • Maintaining functional consistency

The 71.7% score on SWE-Bench Multilingual demonstrates MiMo-V2-Flash’s excellence in cross-language code tasks.

Inference Speed Optimization: Theory and Practice

Speed isn’t just a number—it directly impacts user experience and application costs.

Batch Size Impact on Speedup

In standard tests with 16K input and 1K output:

| Batch Size | Accept Length 2.8 | Accept Length 3.2 | Accept Length 3.6 |
| --- | --- | --- | --- |
| 32 | 1.86× | 2.12× | 2.39× |
| 64 | 1.97× | 2.25× | 2.53× |
| 96 | 1.99× | 2.28× | 2.56× |
| 128 | 1.82× | 2.07× | 2.33× |

Key Findings:

  • Optimal acceleration at batch sizes 64-96
  • Speedup approximately linear with accept length
  • In actual deployment, tune batch size and MTP layers based on hardware roofline models

Acceleration Variance Across Tasks

High-Certainty Tasks (web development, template code generation):

  • Next token entropy: ~0.05
  • Average accept length: 3.6
  • Theoretical speedup: 2.6×

Medium-Certainty Tasks (general code generation):

  • Next token entropy: ~0.15
  • Average accept length: 3.2
  • Theoretical speedup: 2.3×

Low-Certainty Tasks (open-ended Q&A, creative writing):

  • Next token entropy: ~0.25
  • Average accept length: 2.9
  • Theoretical speedup: 2.0×

Pushing Long-Context Limits

MiMo-V2-Flash’s long-context performance deserves special attention, as it directly challenges theoretical limitations of sliding window attention.

GSM-Infinite: Extreme Long-Context Reasoning

This extreme stress test benchmark performs mathematical reasoning in ultra-long text with massive noise.

Test Setup:

  • Hard operands: {2, 4, 6, 8, 10}
  • 5-shot setting

Performance Results:

| Context Length | MiMo-V2-Flash | DeepSeek-V3.2-Exp |
| --- | --- | --- |
| 16K | 37.7% | 50.4% |
| 32K | 33.7% | 45.2% |
| 64K | 31.5% | 32.6% |
| 128K | 29.0% | 25.7% |

Key Findings:

  • MiMo-V2-Flash exhibits more gradual performance degradation
  • Roughly matches DeepSeek-V3.2-Exp at 64K and surpasses it at 128K
  • Proves robustness of hybrid sliding window architecture in noisy environments

NIAH-Multi: Multi-Needle Retrieval Test

Retrieving multiple information points in long texts:

| Context Length | Retrieval Success Rate |
| --- | --- |
| 32K | 99.3% |
| 64K | 99.9% |
| 128K | 98.6% |
| 256K | 96.7% |

Even at 256K ultra-long context, retrieval accuracy remains near 97%—an extraordinary achievement for sliding window attention models.

Training Stability Key: Zero-Gradient Parameter Monitoring

During MoE model supervised fine-tuning, the team discovered a crucial stability indicator: number of parameters with zero gradients (num-zeros).

Indicator Meaning:

  • Increasing num-zeros: Deteriorating load balance among experts, unstable training
  • Decreasing num-zeros: Model overfitting to training data
  • Stable num-zeros: Healthy training, good convergence
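
A simple way to log such an indicator in PyTorch might look like the following generic sketch (not the team's actual tooling):

```python
import torch

def count_zero_grad_params(model: torch.nn.Module) -> int:
    """Count individual parameters whose gradient is exactly zero after
    backward(); a rising count can signal deteriorating expert load balance."""
    total = 0
    for p in model.parameters():
        if p.grad is not None:
            total += int((p.grad == 0).sum())
    return total

# Toy usage with a stand-in model:
model = torch.nn.Linear(8, 4)
model(torch.randn(2, 8)).sum().backward()
print(count_zero_grad_params(model))   # typically 0 for this dense toy model
```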

Stability Control Parameters:

  1. MoE expert bias update rate: 1.0×10⁻⁴
  2. AdamW ε parameter: 1.0×10⁻⁸
  3. Sequence auxiliary loss coefficient: 1.0×10⁻⁶

This monitoring and control mechanism ensures robustness and convergence in subsequent reinforcement learning stages.

Open Source Commitment and Community Contribution

The Xiaomi team fully open-sourced MiMo-V2-Flash, including:

  • Main model weights (309B parameters)
  • 3-layer MTP weights (0.33B parameters per layer)
  • Detailed technical report

Open Source Repository: https://github.com/XiaomiMiMo/MiMo-V2-Flash

This provides the AI research community with:

  1. A reproducible efficient architecture reference
  2. Practical experience in large-scale reinforcement learning training
  3. Complete implementation of multi-teacher on-policy distillation

Frequently Asked Questions

What real-world applications suit MiMo-V2-Flash?

Code Development: Particularly excels at software engineering tasks, handling real GitHub issues, supporting 8 programming languages, suitable as IDE plugins or code review assistants.

Mathematics Education: Achieving 94.1% accuracy on AIME-level math competition problems, applicable for math tutoring, problem analysis, and solution strategy teaching.

Long Document Analysis: Supporting up to 256K context length, suitable for processing legal documents, technical documentation, research papers, and other long-form materials.

Multilingual Tasks: Excellent performance on SWE-Bench Multilingual, suitable for cross-language code migration and internationalization project development.

What advantages does MiMo-V2-Flash have over GPT-4 or Claude?

Open Source Transparency: Fully open-source, deployable locally, ensuring data privacy and security.

Inference Efficiency: 2.6× acceleration through hybrid attention and MTP, lower operational costs.

Parameter Efficiency: 309B total parameters activating only 15B, compared to Kimi-K2’s 1043B total parameters, dramatically lower storage and loading costs.

Code Task Specialization: 73.4% on SWE-Bench Verified, best performance among open-source models.

Long-Context Robustness: Superior to some larger models in noisy long-context reasoning environments.

How to deploy MiMo-V2-Flash in actual projects?

Hardware Requirements:

  • Recommended: Multiple A100 or H100 GPUs
  • Minimum: a multi-GPU node with quantization (even at 4-bit precision, the 309B weights alone occupy roughly 155GB, more than a single 80GB GPU can hold)
  • Memory: At least 512GB system RAM

Inference Engine Selection:

  • Recommend using SGLang, optimized for MTP and hybrid attention
  • Supports request-level prefix caching
  • Implements Rollout Routing Replay (R3) mechanism

Batch Size Tuning:

  • Adjust between 32-96 based on GPU model and task type
  • Monitor GPU utilization and throughput
  • Use hardware roofline models to guide optimization

MTP Layer Configuration:

  • Default: 3 layers (the full set of released MTP modules, balancing draft cost and accept length)
  • Memory- or compute-constrained deployments: reduce to 1-2 layers
  • Output quality is preserved either way, since the main model verifies every drafted token

Won’t sliding window attention limit model capabilities?

Theoretically: Sliding windows do constrain the context range visible at once—this is indeed a limitation.

Actual Performance:

  • Retrieval accuracy approaches 100% in 32K-256K tasks
  • Performance degradation more gradual than full-attention models in GSM-Infinite long-context reasoning
  • Across multiple benchmarks, even surpasses baseline models using global attention

Reasoning:

  1. Attention sink bias teaches models when to “ignore” information
  2. 5:1 hybrid ratio finds optimal balance between efficiency and capability
  3. Global attention layers ensure capture of critical long-range dependencies
  4. Smaller windows may provide regularization effects, reducing overfitting

What advantages does MOPD have over traditional methods?

Traditional Method Problems:

  • Parameter merging: Simple weight averaging of multiple models often causes capability compromises (“see-saw effect”)
  • Offline distillation: Using static datasets generated by teacher models creates distribution mismatch
  • Sequential training: Training A then B may cause forgetting of A’s capabilities

MOPD Advantages:

  • On-policy sampling: Student samples from its own distribution, avoiding distribution mismatch
  • Token-level supervision: Dense reward signals enable more efficient learning
  • Multi-teacher synergy: Simultaneously preserves peak capabilities of all teachers
  • Modular design: Flexibly add new teachers, supports iterative improvement

Experimental Validation: On AIME 2025, MOPD not only preserves the RL teacher’s 93.9% performance but improves to 94.1%.

How does the model perform on Chinese tasks?

Chinese Benchmarks:

  • C-Eval: 87.9%
  • CMMLU: 87.4%
  • C-SimpleQA: 61.5%

Comparative Analysis:

  • On C-Eval and CMMLU, performs well but slightly below models specifically optimized for Chinese (like Kimi-K2’s 92.5% and 90.9%)
  • On C-SimpleQA, scores 61.5%, with a gap from Kimi-K2’s 77.6%

Reasoning:

  • Chinese content likely represents a relatively smaller proportion in MiMo-V2-Flash’s pre-training corpus
  • Knowledge-intensive tasks (like SimpleQA) require higher knowledge capacity; 309B total parameters face gaps compared to Kimi-K2’s 1043B

Applicable Scenarios: Despite room for improvement in Chinese knowledge tasks, still performs excellently in Chinese code and mathematical reasoning tasks.

How does MTP acceleration work?

Traditional Decoding Bottleneck:

  • Memory bandwidth limited, not compute limited
  • Each token generation requires complete forward pass
  • Batch parallelism only improves FFN efficiency, not attention efficiency

MTP Solution:

  • Multiple tokens per forward pass: Trading additional computation for fewer memory accesses
  • Token-level parallelism: Main model verifies MTP-generated candidate tokens in parallel
  • Speculative decoding: When most candidates are accepted, throughput significantly improves

Specific Process:

  1. Main model generates hidden state
  2. MTP module predicts next 3 tokens
  3. Main model verifies these 3 tokens in parallel
  4. Accepts verified tokens, rejects subsequent ones

Efficiency Analysis:

  • If all 3 drafted tokens are accepted, a single step emits 4 tokens (the 3 drafts plus the main model's bonus token)
  • Actual accept length ~2.9-3.6, achieving 2.0-2.6× speedup
  • Additional computational overhead <10% (lightweight MTP module)
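
To make the draft-and-verify flow concrete, here is a minimal greedy-acceptance sketch; the function names are hypothetical, and a real engine verifies all drafts in a single batched forward pass rather than a Python loop.

```python
from typing import Callable, List

def speculative_decode_step(prefix: List[int],
                            draft_next: Callable[[List[int], int], List[int]],
                            verify_next: Callable[[List[int]], int],
                            num_draft: int = 3) -> List[int]:
    """One decoding step: an MTP-style drafter proposes `num_draft` tokens, the
    main model checks them in order, and we keep the longest matching prefix
    plus one token from the main model itself."""
    drafted = draft_next(prefix, num_draft)        # cheap guesses from the draft head
    accepted: List[int] = []
    context = list(prefix)
    for token in drafted:
        expected = verify_next(context)            # what the main model would emit here
        if expected != token:                      # first mismatch ends acceptance
            accepted.append(expected)              # keep the main model's own token
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(verify_next(context))          # bonus token after full acceptance
    return accepted
```

With three drafts, each step therefore yields between 1 and 4 tokens, which is why the reported accept lengths can exceed 3.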

What are the model’s limitations and future improvements?

Current Limitations:

  1. Knowledge capacity: Gap with larger models on knowledge-intensive tasks like SimpleQA (20.6% vs 35.3%)
  2. Creative writing: Arena-Hard creative writing score 86.2%, slightly below GPT-5’s 92.2%
  3. Search agents: BrowseComp score 45.4% (improved to 58.3% with context management), room for improvement
  4. Architecture exploration: Current architectural design remains preliminary with limited design trade-off analysis

Future Directions:

  1. Expand model scale: Increase parameters and training compute to narrow gap with top closed-source models
  2. Architecture research: More systematic exploration of efficient agent-oriented architecture designs
  3. MOPD iteration: Expand compute scale for teacher-student co-evolution
  4. Knowledge enhancement: Improve knowledge acquisition and storage mechanisms
  5. Multimodal expansion: Integrate visual, audio, and other modalities

What’s the team size and engineering investment?

Core Contributors: 61 people (listed alphabetically by first name)

Additional Contributors: 65 people

Infrastructure Teams:

  • Xiaomi Data Platform Team
  • CloudML Team
  • NGK Team
  • MiChat Team
  • Mify Team
  • LLM-Plus Team

Training Resources:

  • Large-scale GPU clusters
  • Kubernetes clusters (running 10,000+ concurrent pods for code agent training)
  • High-performance storage systems

Development Cycle: From pre-training to post-training to open-sourcing, the entire project spanned several months

This large-scale engineering investment ensured model quality and stability.


Conclusion: Perfect Balance of Efficiency and Capability

MiMo-V2-Flash represents a new direction in large language model development: instead of pursuing ever-larger parameter counts, achieve significant efficiency improvements through innovative architectural design, efficient training paradigms, and meticulous engineering optimization while maintaining robust capabilities.

309B total parameters, 15B active parameters, 128-token sliding windows, 3-layer MTP, MOPD post-training—behind these technical details lies relentless pursuit of the “fast and capable” goal. In critical tasks like code agents, mathematical reasoning, and long-context understanding, MiMo-V2-Flash has proven that smaller, refined models can compete with behemoths.

More importantly, the complete open-source commitment allows the entire AI community to benefit from these innovations. Whether researchers exploring new architectural designs or engineers deploying practical applications, MiMo-V2-Flash provides a high-quality starting point.

This isn’t the end of the efficiency revolution—it’s a new beginning.