M2-CODER: The First Multilingual, Multimodal Code Generator That Actually Reads Diagrams
❝
“Imagine handing an AI a flowchart instead of a wall of text—and getting clean, working code in return.”
— Research Team, Beihang University & Alibaba Group❞
Table of Contents
- The Gap No One Talked About
- Meet M2-CODER in Plain English
- Inside the 13.1-Million-Pair Training Set
- M2EVAL: A New Benchmark for “Look-&-Code”
- What 25+ Models Achieved—and Where They Failed
- Step-by-Step: Re-creating M2-CODER on Your Machine
- Real-World Use Cases
- Limitations & Ethical Notes
- Key Takeaways for Developers, Students, and Managers
The Gap No One Talked About
Most code-generation models are text-only.
In real software shops, however, requirements arrive as 「UML class diagrams」, 「sequence diagrams」, or 「flowcharts」 pinned to Jira tickets or whiteboards. Bridging that gap is exactly what M2-CODER was built to do.
Meet M2-CODER in Plain English
| Component | What It Does |
|---|---|
| 「M2-CODER 7 B」 | A fine-tuned vision-language model that takes both 「text instructions」 and 「diagram images」 as input, then emits code in 10+ languages. |
| 「M2C-INSTRUCT」 | A two-stage dataset totaling 「13.1 million」 multilingual samples: 12.9 M rendered-code images (Stage 1) plus 168 K diagram-based problems (Stage 2). |
| 「M2EVAL」 | A 300-problem benchmark spanning 10 programming languages, each problem paired with a required diagram. |
Inside the 13.1-Million-Pair Training Set
Two-Stage Recipe
| Stage | Source | Samples | Languages | Image Focus |
|---|---|---|---|---|
| 「Stage 1」 | GitHub code snapshots | 12.9 M | 50+ | Code screenshots rendered with Pygments |
| 「Stage 2」 | Curated tasks | 168 K | 20+ | PlantUML & Mermaid diagrams |
How Each Diagram Problem Is Born
1. 「Prototype」
   - A Python task is drafted by an LLM, then hand-polished.
   - Includes the prompt, a reference solution, and unit tests.
2. 「Diagramming」
   - The solution is fed back into an LLM to generate PlantUML/Mermaid.
   - Human reviewers correct layout, remove text redundancy, and embed crucial details 「only」 in the diagram.
3. 「Translation」
   - Nine volunteer engineers (Master’s/PhD level) translate each problem into nine additional languages (C#, Java, Kotlin, PHP, JavaScript, TypeScript, Ruby, Swift, Scala).
   - LLMs assist, but every line is manually executed against unit tests before acceptance.
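The acceptance gate in step 3 ("every line is manually executed against unit tests before acceptance") can be sketched as a small harness. This is an illustrative sketch only, not the project's released tooling; `passes_unit_tests` and the file layout are hypothetical names.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def passes_unit_tests(solution: str, tests: str, timeout: int = 30) -> bool:
    """Run a candidate solution plus its unit tests in a fresh Python
    subprocess; acceptance requires a clean exit (every assert passes)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests)
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0

# Toy translated problem: the correct solution passes, the buggy one is rejected.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
```

The same pattern generalizes to the compiled target languages by swapping the interpreter invocation for a compile-and-run step.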
M2EVAL: A New Benchmark for “Look-&-Code”
Task Definition (Plain English)
「Input」

- A short natural-language prompt.
- One diagram image (PNG/SVG).

「Output」

- Source code that compiles and passes 「all」 hidden unit tests.
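Concretely, one benchmark item pairs a prompt, a diagram, and withheld tests. The field names below are assumptions made for illustration, not the dataset's actual schema:

```python
import json

# Hypothetical M2EVAL-style record; all field names are illustrative only.
problem = {
    "task_id": "m2eval/python/001",
    "language": "python",
    "prompt": "Implement the stack class shown in the diagram.",
    "image": "diagrams/python_001.png",           # the required diagram image
    "hidden_tests": "assert Stack().size() == 0"  # withheld from the model
}

record = json.dumps(problem)
```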
Benchmark Snapshot
| Stat | Value |
|---|---|
| Total Problems | 300 |
| Languages Covered | 10 |
| Average Prompt Length | 89 tokens |
| Average Solution Length | 326 tokens |
| Average Test Cases | 9 |
What 25+ Models Achieved—and Where They Failed
Overall Pass@1 Scores (Top 4 + Ours)

| Rank | Model | Size | Avg. Score |
|---|---|---|---|
| 1 | GPT-4o | — | 49.7 % |
| 2 | Gemini-2.5-Flash | — | 48.7 % |
| 3 | Doubao-1.5-thinking | — | 44.3 % |
| 4 | Claude-3.5-Sonnet | — | 42.3 % |
| 9 | 「M2-CODER (ours)」 | 「7 B」 | 「25.3 %」 |
❝
Even the best model tops out below 50 %, confirming that 「reading diagrams correctly is still hard」.
❞
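The scores above are pass@1: a problem counts as solved only if a single generated program passes every hidden test. With n samples per problem (c of them correct), the standard unbiased estimator popularized by the HumanEval benchmark is easy to compute:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n generations
    of which c pass all unit tests, solves the problem."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With one sample per problem (n = k = 1), pass@1 reduces to the raw fraction of solved problems.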
Language-by-Language Heat Map
| Language | Best Score | Worst Score |
|---|---|---|
| Python | 66.7 % (Doubao-1.5-thinking) | 3.3 % (MiniCPM-V-2.6) |
| JavaScript | 53.3 % (multiple) | 6.7 % (MiniCPM-V-2.6) |
| Swift | 50.0 % (Claude-3.5-Sonnet) | 0.0 % (DeepSeek-V3 text-only) |
| Scala | 40.0 % (GPT-4o) | 3.3 % (Llama-3.2-vision) |

Takeaway: 「Scripting languages dominate; statically typed languages expose every tiny naming or visibility error.」
Step-by-Step: Re-creating M2-CODER on Your Machine
1. Hardware & OS
- Linux Ubuntu 22.04
- 8× NVIDIA A800 80 GB (or any Ampere/Hopper GPUs totaling ≥ 640 GB)
- Docker with 「nvidia-container-toolkit」
2. One-Command Environment
```bash
git clone https://github.com/MCEVAL/MMCoder.git
cd MMCoder/docker
docker build -t m2-coder .
docker run --gpus all -v $(pwd)/data:/workspace/data -it m2-coder
```
3. Data Download
```bash
# Full dataset (~1 TB)
cd /workspace/data
wget https://huggingface.co/datasets/MCEVAL/M2C-INSTRUCT/resolve/main/stage1.tar
wget https://huggingface.co/datasets/MCEVAL/M2C-INSTRUCT/resolve/main/stage2.tar
tar -xf stage1.tar && tar -xf stage2.tar

# Lightweight subset (~200 GB) for quick experiments
wget https://huggingface.co/datasets/MCEVAL/M2C-INSTRUCT/resolve/main/stage1_light.tar
```
4. Training (Two-Stage)
「Stage 1: Full-Parameter Fine-Tuning」
```bash
accelerate launch --config_file ds_config_zero2.yaml train.py \
  --model_name_or_path Qwen2-VL-7B-base \
  --dataset_name /workspace/data/stage1 \
  --output_dir checkpoints/stage1 \
  --per_device_train_batch_size 128 \
  --learning_rate 5e-5 \
  --num_train_epochs 1 \
  --max_seq_length 2048
```
「Stage 2: LLM-Only Fine-Tuning」
```bash
accelerate launch --config_file ds_config_zero2.yaml train.py \
  --model_name_or_path checkpoints/stage1 \
  --dataset_name /workspace/data/stage2 \
  --output_dir checkpoints/stage2 \
  --freeze_vision_tower true \
  --num_train_epochs 2 \
  --max_seq_length 6000
```
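Stage 2 keeps the vision encoder fixed and updates only the language-model weights. Below is a minimal sketch of what a flag like `--freeze_vision_tower` plausibly does; the `visual.` parameter prefix and the stand-in parameter class are assumptions for illustration, not the training script's actual internals:

```python
def freeze_vision_tower(named_parameters, prefix="visual."):
    """Disable gradients for every parameter under the vision-tower prefix;
    returns (frozen, trainable) counts as a sanity check."""
    frozen = trainable = 0
    for name, param in named_parameters:
        if name.startswith(prefix):
            param.requires_grad = False  # excluded from optimizer updates
            frozen += 1
        else:
            trainable += 1
    return frozen, trainable

# Stand-in parameter objects so the sketch runs without a deep-learning stack.
class _Param:
    def __init__(self):
        self.requires_grad = True

params = [("visual.patch_embed.weight", _Param()), ("lm_head.weight", _Param())]
counts = freeze_vision_tower(params)
```

In a real run the same loop would iterate over the model's `named_parameters()` before the optimizer is constructed.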
5. Local Evaluation
```bash
python evaluate.py \
  --model_path checkpoints/stage2 \
  --benchmark_path /workspace/data/M2EVAL \
  --languages python java javascript csharp
```
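Per-language pass rates can then be rolled up into the macro average reported in the leaderboard above. A stdlib sketch, assuming the evaluator yields one pass/fail boolean per problem (the result format is an assumption, not the script's actual output):

```python
def summarize(results):
    """results maps language -> list of booleans (did the generated program
    pass all unit tests?). Returns per-language pass@1 (%) and their mean."""
    per_lang = {
        lang: 100.0 * sum(outcomes) / len(outcomes)
        for lang, outcomes in results.items()
    }
    macro_avg = sum(per_lang.values()) / len(per_lang)
    return per_lang, macro_avg

results = {
    "python": [True, True, False],  # 2 of 3 problems solved
    "java": [True, False, False],   # 1 of 3
}
per_lang, macro_avg = summarize(results)
```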
Real-World Use Cases
Scenario 1: Agile Sprint Planning

- 「Input」: A hand-drawn sequence diagram showing user-login flow.
- 「Output」: Java Spring controllers + service classes with all endpoints and DTOs.

Scenario 2: Legacy System Migration

- 「Input」: 20-year-old PowerPoint with class diagrams of a COBOL module.
- 「Output」: Python dataclasses + unit tests replicating the original behavior.

Scenario 3: CS Classroom

- 「Input」: Professor sketches a binary-tree deletion algorithm on the whiteboard.
- 「Output」: C++ implementation ready for students to compile and step through.
Limitations & Ethical Notes
| Limitation | Mitigation Plan |
|---|---|
| 「Language Coverage」 | Currently 10; roadmap adds 20 more by Q4. |
| 「Task Scope」 | Only code generation; debugging & refactoring are future work. |
| 「Data License」 | All GitHub sources filtered via OSI-approved licenses; commercial use allowed with attribution. |
| 「Human Oversight」 | Every diagram reviewed by two annotators; 100 % unit-test pass required. |
Key Takeaways for Developers, Students, and Managers
For Developers

- 「Immediate」: Use M2-CODER to bootstrap boilerplate from whiteboard photos.
- 「Medium-term」: Contribute diagrams to enlarge the open dataset.

For Students

- 「Learning Aid」: Paste your UML homework into the demo UI and compare the generated code with your own.
- 「Skill Bridge」: Understand how high-level design translates into idiomatic syntax across languages.

For Engineering Managers

- 「ROI」: Early adopters report 「30 % faster prototyping」 for green-field features.
- 「Compliance」: On-prem Docker image keeps code and diagrams inside your firewall.
Quick Links
- 「Code & Weights」: GitHub
- 「Dataset (Full)」: Hugging Face – M2C-INSTRUCT
- 「Benchmark」: Hugging Face – M2EVAL
- 「Paper」: arXiv 2507.08719
❝
“The gap between napkin sketches and production code just got a lot smaller.”
❞