M2-CODER: The First Multilingual, Multimodal Code Generator That Actually Reads Diagrams

“Imagine handing an AI a flowchart instead of a wall of text—and getting clean, working code in return.”
— Research Team, Beihang University & Alibaba Group


Table of Contents

  1. The Gap No One Talked About
  2. Meet M2-CODER in Plain English
  3. Inside the 13.1-Million-Pair Training Set
  4. M2EVAL: A New Benchmark for “Look-&-Code”
  5. What 25+ Models Achieved—and Where They Failed
  6. Step-by-Step: Re-creating M2-CODER on Your Machine
  7. Real-World Use Cases
  8. Limitations & Ethical Notes
  9. Key Takeaways for Developers, Students, and Managers

The Gap No One Talked About

Most code-generation models are text-only.
In real software shops, however, requirements arrive as 「UML class diagrams」, 「sequence diagrams」, or 「flowcharts」 pinned to Jira tickets or whiteboards. Bridging that gap is exactly what M2-CODER was built to do.


Meet M2-CODER in Plain English

  • 「M2-CODER 7B」: A fine-tuned vision-language model that takes both 「text instructions」 and 「diagram images」 as input, then emits code in 10+ languages.
  • 「M2C-INSTRUCT」: A two-stage dataset totaling 「13.1 million」 multilingual samples—half rendered code images, half diagram-based problems.
  • 「M2EVAL」: A 300-problem benchmark spanning 10 programming languages, each problem paired with a required diagram.
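To make the input/output contract concrete, here is a minimal sketch of how a text instruction and a diagram image might be packaged into one multimodal request. The field names (「model」, 「messages」, 「target_language」) are illustrative assumptions, not M2-CODER's actual serving API:

```python
import base64

def build_request(instruction: str, diagram_png: bytes, language: str) -> dict:
    """Assemble a hypothetical multimodal request: one text prompt + one diagram image."""
    return {
        "model": "m2-coder-7b",          # assumed model identifier
        "target_language": language,      # one of the supported output languages
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                # diagram image travels alongside the text, base64-encoded
                {"type": "image", "data": base64.b64encode(diagram_png).decode("ascii")},
            ]},
        ],
    }

req = build_request("Implement the class shown in the diagram.", b"\x89PNG", "java")
```

The key point is structural: the diagram is a first-class input next to the prompt, not an OCR'd text substitute.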

Inside the 13.1-Million-Pair Training Set

Two-Stage Recipe

  • 「Stage 1」: 12.9 M samples from GitHub code snapshots, covering 50+ languages; images are code screenshots rendered with Pygments.
  • 「Stage 2」: 168 K curated task samples, covering 20+ languages; images are PlantUML & Mermaid diagrams.

How Each Diagram Problem Is Born

  1. 「Prototype」

    • A Python task is drafted by an LLM, then hand-polished.
    • Includes prompt, solution, and unit tests.
  2. 「Diagramming」

    • The solution is fed back into an LLM to generate PlantUML/Mermaid.
    • Human reviewers correct layout, remove text redundancy, and embed crucial details 「only」 in the diagram.
  3. 「Translation」

    • Nine volunteer engineers (Master’s/PhD level) translate each problem into nine additional languages (C#, Java, Kotlin, PHP, JavaScript, TypeScript, Ruby, Swift, Scala).
    • LLMs assist, but every line is manually executed against unit tests before acceptance.
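The acceptance rule in step 3—a translated problem is admitted only if its solution passes every unit test—can be sketched as a small gating harness. This is illustrative only (the authors' actual multilingual tooling runs each language's own toolchain, not Python `exec`):

```python
def passes_unit_tests(solution_src: str, test_src: str) -> bool:
    """Run a candidate solution, then its unit tests, in a shared namespace.
    Returns True only if every assertion passes (the acceptance criterion)."""
    ns: dict = {}
    try:
        exec(solution_src, ns)   # define the solution's functions/classes
        exec(test_src, ns)       # execute the assertions against them
        return True
    except Exception:            # any failure (syntax, runtime, assertion) rejects it
        return False

good = "def add(a, b):\n    return a + b\n"
bad  = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
```

Here `passes_unit_tests(good, tests)` accepts the correct translation while `passes_unit_tests(bad, tests)` rejects the buggy one—exactly the gate each of the nine language versions must clear.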

M2EVAL: A New Benchmark for “Look-&-Code”

Task Definition (Plain English)

「Input」

  • A short natural-language prompt.
  • One diagram image (PNG/SVG).

「Output」

  • Source code that compiles and passes 「all」 hidden unit tests.

Benchmark Snapshot

  • Total problems: 300
  • Languages covered: 10
  • Average prompt length: 89 tokens
  • Average solution length: 326 tokens
  • Average test cases per problem: 9

What 25+ Models Achieved—and Where They Failed

Overall Pass@1 Scores (Top 5)

  1. GPT-4o: 49.7 %
  2. Gemini-2.5-Flash: 48.7 %
  3. Doubao-1.5-thinking: 44.3 %
  4. Claude-3.5-Sonnet: 42.3 %
  9. 「M2-CODER (ours), 7 B」: 「25.3 %」

Even the best model tops out below 50 %, confirming that 「reading diagrams correctly is still hard」.
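Pass@1 here follows the standard convention for code benchmarks: generate n samples per problem, count the c that pass all tests, and estimate the chance that at least one of k drawn samples succeeds. A minimal sketch of the usual unbiased estimator (assuming M2EVAL uses the standard formulation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn from n generated samples of which c are correct, passes all tests."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single greedy sample per problem (n = k = 1), this collapses to the plain pass rate: `pass_at_k(1, 1, 1)` is 1.0 and `pass_at_k(1, 0, 1)` is 0.0, and the benchmark average is just the fraction of problems solved.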

Language-by-Language Heat Map

  • Python: best 66.7 % (Doubao-1.5-thinking); worst 3.3 % (MiniCPM-V-2.6)
  • JavaScript: best 53.3 % (multiple models); worst 6.7 % (MiniCPM-V-2.6)
  • Swift: best 50.0 % (Claude-3.5-Sonnet); worst 0.0 % (DeepSeek-V3, text-only)
  • Scala: best 40.0 % (GPT-4o); worst 3.3 % (Llama-3.2-vision)

Takeaway: 「Scripting languages dominate; strongly-typed languages expose every tiny naming or visibility error.」


Step-by-Step: Re-creating M2-CODER on Your Machine

1. Hardware & OS

  • Linux Ubuntu 22.04
  • 8×NVIDIA A800 80 GB (or any Ampere/Hopper GPUs totaling ≥ 640 GB)
  • Docker with 「nvidia-container-toolkit」

2. Docker Environment Setup

git clone https://github.com/MCEVAL/MMCoder.git
cd MMCoder/docker
docker build -t m2-coder .
docker run --gpus all -v $(pwd)/data:/workspace/data -it m2-coder

3. Data Download

# Full dataset (~1 TB)
cd /workspace/data
wget https://huggingface.co/datasets/MCEVAL/M2C-INSTRUCT/resolve/main/stage1.tar
wget https://huggingface.co/datasets/MCEVAL/M2C-INSTRUCT/resolve/main/stage2.tar
tar -xf stage1.tar && tar -xf stage2.tar

# Lightweight subset (~200 GB) for quick experiments
wget https://huggingface.co/datasets/MCEVAL/M2C-INSTRUCT/resolve/main/stage1_light.tar

4. Training (Two-Stage)

「Stage 1: Full-Parameter Fine-Tuning」

accelerate launch --config_file ds_config_zero2.yaml train.py \
  --model_name_or_path Qwen2-VL-7B-base \
  --dataset_name /workspace/data/stage1 \
  --output_dir checkpoints/stage1 \
  --per_device_train_batch_size 128 \
  --learning_rate 5e-5 \
  --num_train_epochs 1 \
  --max_seq_length 2048

「Stage 2: LLM-Only Fine-Tuning」

accelerate launch --config_file ds_config_zero2.yaml train.py \
  --model_name_or_path checkpoints/stage1 \
  --dataset_name /workspace/data/stage2 \
  --output_dir checkpoints/stage2 \
  --freeze_vision_tower true \
  --num_train_epochs 2 \
  --max_seq_length 6000

5. Local Evaluation

python evaluate.py \
  --model_path checkpoints/stage2 \
  --benchmark_path /workspace/data/M2EVAL \
  --languages python java javascript csharp
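Once `evaluate.py` finishes, per-language aggregation is straightforward. A hedged sketch, assuming results arrive as a list of per-problem records with `language` and `passed` fields (the script's actual output format may differ):

```python
from collections import defaultdict

def per_language_pass_rate(results: list) -> dict:
    """Aggregate pass@1 per language from individual problem outcomes."""
    totals = defaultdict(lambda: [0, 0])  # language -> [passed, total]
    for r in results:
        totals[r["language"]][1] += 1
        totals[r["language"]][0] += int(r["passed"])
    return {lang: passed / total for lang, (passed, total) in totals.items()}

results = [
    {"language": "python", "passed": True},
    {"language": "python", "passed": False},
    {"language": "java", "passed": True},
]
rates = per_language_pass_rate(results)  # python -> 0.5, java -> 1.0
```

This is how you would reproduce the language-by-language table above from your own checkpoint's raw outcomes.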

Real-World Use Cases

Scenario 1: Agile Sprint Planning

  • 「Input」: A hand-drawn sequence diagram showing user-login flow.
  • 「Output」: Java Spring controllers + service classes with all endpoints and DTOs.

Scenario 2: Legacy System Migration

  • 「Input」: 20-year-old PowerPoint with class diagrams of a COBOL module.
  • 「Output」: Python dataclasses + unit tests replicating the original behavior.

Scenario 3: CS Classroom

  • 「Input」: Professor sketches a binary-tree deletion algorithm on the whiteboard.
  • 「Output」: C++ implementation ready for students to compile and step through.

Limitations & Ethical Notes

  • 「Language Coverage」: currently 10 languages; the roadmap adds 20 more by Q4.
  • 「Task Scope」: only code generation for now; debugging & refactoring are future work.
  • 「Data License」: all GitHub sources filtered for OSI-approved licenses; commercial use allowed with attribution.
  • 「Human Oversight」: every diagram reviewed by two annotators; a 100 % unit-test pass is required.

Key Takeaways for Developers, Students, and Managers

For Developers

  • 「Immediate」: Use M2-CODER to bootstrap boilerplate from whiteboard photos.
  • 「Medium-term」: Contribute diagrams to enlarge the open dataset.

For Students

  • 「Learning Aid」: Paste your UML homework into the demo UI and compare the generated code with your own.
  • 「Skill Bridge」: Understand how high-level design translates into idiomatic syntax across languages.

For Engineering Managers

  • 「ROI」: Early adopters report 「30 % faster prototyping」 for green-field features.
  • 「Compliance」: On-prem Docker image keeps code and diagrams inside your firewall.

“The gap between napkin sketches and production code just got a lot smaller.”