Explore how Huawei’s MindVL achieves state-of-the-art performance while using 90% less training data than comparable models.
Introduction to Multimodal AI Challenges
Multimodal Large Language Models (MLLMs) like Qwen2.5-VL and GPT-4V have transformed how machines understand visual and textual information. However, two persistent challenges remain:
- Hardware Limitations: Most MLLMs rely on NVIDIA GPUs, creating barriers for environments using alternative accelerators like Huawei’s Ascend NPUs.
- Data Efficiency: Training these models typically requires massive datasets (often exceeding 4 trillion tokens), raising cost and carbon-footprint concerns.
MindVL emerges as a breakthrough solution, demonstrating that high performance can be achieved with:
- 10x less training data than Qwen2.5-VL
- Optimized training on Ascend NPUs
- Superior OCR capabilities while matching general multimodal understanding benchmarks

Core Innovation: Native-Resolution Processing
Traditional MLLMs resize images to fixed dimensions, losing critical details in the process. MindVL adopts a vision transformer architecture that:
- Accepts Variable Resolutions: Maintains original image dimensions (resized only to multiples of 28)
- Uses 2D Rotary Position Embedding (RoPE): Preserves spatial relationships in images (see the sketch below)
- Dynamic Feature Grouping: Reduces computational load while retaining fine details
Real-World Impact: This architecture excels at interpreting dense visual content like technical diagrams, financial charts, and medical imaging where precision matters.
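The exact rotary formulation lives in the paper rather than this article; as a minimal sketch of the underlying idea, the (row, column) indices that a 2D RoPE rotates over can be derived per patch of a variable-resolution image roughly as below (the function name and the 28-pixel patch stride are illustrative assumptions):

```python
import torch

def patch_position_ids(height: int, width: int, patch: int = 28) -> torch.Tensor:
    """(row, col) index for every patch of a variable-resolution image.

    These 2D coordinates are what a 2D RoPE rotates query/key vectors by,
    so spatial layout is preserved without resizing images to a fixed square.
    """
    rows, cols = height // patch, width // patch
    grid_y, grid_x = torch.meshgrid(torch.arange(rows), torch.arange(cols), indexing="ij")
    return torch.stack([grid_y.flatten(), grid_x.flatten()], dim=-1)

# Example: a 448x672 image (multiples of 28) yields a 16x24 patch grid.
print(patch_position_ids(448, 672).shape)  # torch.Size([384, 2])
```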
Hardware-Specific Optimization: MindSpeed-MLLM
Training large models on non-NVIDIA hardware requires overcoming ecosystem limitations, which led the team to build the MindSpeed-MLLM training framework.
Key Framework Features:
- Distributed Data Loading: Avoids bottlenecks when scaling across multiple NPUs
- Operator Fusion: Replaces computationally expensive operations with hardware-optimized equivalents
- Core Binding: Reduces latency by pinning tasks to specific CPU cores (sketched below)
- Pipeline Parallelism: Overlaps computation and communication tasks
Performance Note: Achieved 40% Model FLOPS Utilization (MFU) on 910B NPU clusters – comparable to NVIDIA A100 performance.
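MindSpeed-MLLM’s source is not shown in the article; purely as a generic illustration of the core-binding idea, a data-loading worker on a Linux host can pin itself to its own slice of CPU cores with Python’s standard affinity call (the per-worker core layout below is an assumption, not the framework’s actual policy):

```python
import os

def bind_worker_to_cores(worker_id: int, cores_per_worker: int = 4) -> None:
    """Pin the current process to a dedicated slice of CPU cores (Linux only).

    Keeping each data-loading worker on fixed cores avoids scheduler migration
    and cache thrashing, which is the latency problem core binding targets.
    """
    start = worker_id * cores_per_worker
    os.sched_setaffinity(0, set(range(start, start + cores_per_worker)))

# Example: worker 2 is pinned to cores {8, 9, 10, 11}.
bind_worker_to_cores(2)
print(os.sched_getaffinity(0))
```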
Three-Phase Training Strategy
MindVL’s training follows a progressive approach:
| Phase | Data Volume | Key Focus | Technical Details |
|---|---|---|---|
| Warm-Up | 256B tokens | Visual-Language Alignment | Trains only the MLP adapter connecting the vision encoder to the LLM |
| Multitask | 179B tokens | Complex Reasoning | Full-parameter tuning with interleaved image-text data |
| SFT | 12B tokens | Instruction Following | Mixed sequence lengths (2K/4K/8K tokens) |
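The paper’s training code is not reproduced here; as a rough illustration of the warm-up row above (adapter-only training versus later full-parameter tuning), phase-dependent freezing in PyTorch could look like this (the module name `mlp_adapter` is an assumed placeholder):

```python
import torch

def set_trainable(model: torch.nn.Module, phase: str) -> None:
    """Freeze or unfreeze parameters to match the current training phase."""
    for name, param in model.named_parameters():
        if phase == "warm_up":
            # Phase 1: only the MLP adapter bridging vision encoder and LLM is updated.
            param.requires_grad = name.startswith("mlp_adapter")
        else:
            # Phases 2-3 (multitask / SFT): full-parameter tuning.
            param.requires_grad = True
```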
Data Composition:
- 300B tokens: Image-text pairs (filtered via CLIP scoring + clustering; see the sketch below)
- 140B tokens: Multitask data (OCR, tables, STEM content, GUI interactions)
- Language Preservation: 20% of SFT data uses DeepSeek R1-generated text to maintain linguistic capabilities
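The paper’s actual filtering pipeline is more involved; a minimal sketch of CLIP-score filtering, using a public Hugging Face CLIP checkpoint as a stand-in model (the 0.25 threshold is an arbitrary assumption), might look like this:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between image and caption embeddings (higher = better match)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# Keep only pairs whose caption actually describes the image (threshold assumed).
# pairs = [(img, cap) for img, cap in pairs if clip_score(img, cap) > 0.25]
```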
Performance Benchmarks
MindVL (8B parameters), trained on 447B tokens in total, was evaluated against leading models:
| Model | Training Data | MME | MMBench | OCRBench | Overall |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 4.1T+ tokens | 83.8 | 82.6 | 86.4 | 86.4 |
| MindVL-8B | 447B tokens | 84.1 | 84.3 | 87.6 | 86.5 |
Key Advantages:
- OCR Superiority: Outperforms Qwen2.5-VL by 1.2 points on OCRBench (87.6 vs. 86.4)
- Data Efficiency: Achieves parity with 90% less training data
Model Optimization Techniques
1. Weight Averaging
- Combines models trained with different sequence lengths (2K/4K/8K)
- Improves the MMBench score by 1.9% compared to the individual models (a minimal sketch follows)
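A minimal sketch of the idea in PyTorch, assuming plain uniform averaging of the three SFT checkpoints (the checkpoint paths are placeholders):

```python
import torch

def average_checkpoints(paths):
    """Uniformly average parameter tensors from several checkpoint files."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# merged = average_checkpoints(["sft_2k.pt", "sft_4k.pt", "sft_8k.pt"])
# model.load_state_dict(merged)
```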
2. Test-Time Resolution Search
The team developed a grid search strategy to find the optimal inference resolution per dataset (a sketch follows the table):
| Dataset | Recommended Settings |
|---|---|
| OCRBench | min_pixels = 16×28×28 |
| Document QA | max_pixels = 3072×28×28 |
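A minimal sketch of such a search (the candidate grid, the `eval_fn` hook, and the dataset handle are placeholders, not the paper’s protocol):

```python
from itertools import product

def search_resolution(eval_fn, dataset):
    """Grid-search pixel budgets; eval_fn(dataset, min_pixels, max_pixels) -> accuracy."""
    min_grid = [4*28*28, 16*28*28, 64*28*28]
    max_grid = [1024*28*28, 2048*28*28, 3072*28*28]
    best = max(
        ((lo, hi, eval_fn(dataset, lo, hi)) for lo, hi in product(min_grid, max_grid)),
        key=lambda t: t[2],
    )
    return {"min_pixels": best[0], "max_pixels": best[1], "score": best[2]}
```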
Limitations and Future Work
Current constraints include:
- Model Size: Only an 8B parameter version has been released
- Data Scale: Training limited to 447B tokens vs. trillion-plus-token models
- Evaluation Scope: Focused on mainstream benchmarks, not specialized tasks
Promising Development:
A 67B parameter version using DeepSeek-V3 as its backbone achieved:
- Training on only 26B tokens (roughly 1/100 of Qwen2.5-VL’s data)
- Superior performance on certain tasks versus Qwen2.5-VL 72B
Technical Implementation Guide
Hardware Requirements
- Minimum: Huawei Ascend 910B NPU
- Recommended: 8+ NPU cluster for training
Inference Code Snippet
```python
# Resolution optimization example (the resize logic below is a sketch, not the released code).
import math

def dynamic_resize(image, min_pixels=16*28*28, max_pixels=3072*28*28):
    # Defaults: min_pixels suits OCR tasks, max_pixels suits Document QA (see table above).
    w, h = image.size
    scale = math.sqrt(min(max(w * h, min_pixels), max_pixels) / (w * h))
    # Keep the aspect ratio and snap both sides to multiples of 28.
    return image.resize((max(28, round(w * scale / 28) * 28),
                         max(28, round(h * scale / 28) * 28)))
```
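For example, assuming a Pillow image (the file name is a placeholder):

```python
from PIL import Image

img = Image.open("invoice_scan.png")  # placeholder input
resized = dynamic_resize(img)         # uses the OCR / Document QA defaults above
print(resized.size)                   # both dimensions are multiples of 28
```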
FAQ
Q: How does MindVL handle different image sizes?
A: Uses 2D RoPE encoding to process variable resolutions while maintaining spatial relationships.
Q: What languages does it support?
A: Primarily optimized for Chinese but performs equally well in English benchmarks.
Q: Where can I access the model?
A: Currently available through Huawei Cloud’s ModelArts platform.
Q: Why does OCR performance excel?
A: Combination of native resolution processing + specialized document OCR data + resolution search optimization.
Conclusion
MindVL demonstrates that high-performance MLLMs can be trained efficiently on alternative hardware. By focusing on architecture optimization and data quality over quantity, it achieves state-of-the-art results while using significantly fewer resources. The success on Ascend NPUs opens possibilities for more diverse AI hardware ecosystems.

This article synthesizes technical details from “MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs” (2025), retaining all core specifications while optimizing for readability.