M2-CODER: The First Multilingual, Multimodal Code Generator That Actually Reads Diagrams

“Imagine handing an AI a flowchart instead of a wall of text—and getting clean, working code in return.”
— Research Team, Beihang University & Alibaba Group


Table of Contents

  1. The Gap No One Talked About
  2. Meet M2-CODER in Plain English
  3. Inside the 13.1-Million-Pair Training Set
  4. M2EVAL: A New Benchmark for “Look-&-Code”
  5. What 25+ Models Achieved—and Where They Failed
  6. Step-by-Step: Re-creating M2-CODER on Your Machine
  7. Real-World Use Cases
  8. Limitations & Ethical Notes
  9. Key Takeaways for Developers, Students, and Managers

The Gap No One Talked About

Most code-generation models are text-only.
In real software shops, however, requirements arrive as 「UML class diagrams」, 「sequence diagrams」, or 「flowcharts」 pinned to Jira tickets or whiteboards. Bridging that gap is exactly what M2-CODER was built to do.


Meet M2-CODER in Plain English

  • 「M2-CODER 7B」: A fine-tuned vision-language model that takes both 「text instructions」 and 「diagram images」 as input, then emits code in 10+ languages.
  • 「M2C-INSTRUCT」: A two-stage dataset totaling 「13.1 million」 multilingual samples—half rendered code images, half diagram-based problems.
  • 「M2EVAL」: A 300-problem benchmark spanning 10 programming languages, each problem paired with a required diagram.
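To make the input/output contract concrete, here is a minimal sketch of how a text instruction and a diagram image might be packaged into one multimodal request. The field names (「model」, 「messages」, 「target_language」) are illustrative assumptions, not M2-CODER's actual serving API:

```python
import base64

def build_request(instruction: str, diagram_png: bytes, language: str) -> dict:
    """Assemble a hypothetical multimodal request: one text prompt + one diagram image."""
    return {
        "model": "m2-coder-7b",          # assumed model identifier
        "target_language": language,      # one of the supported output languages
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                # diagram image travels alongside the text, base64-encoded
                {"type": "image", "data": base64.b64encode(diagram_png).decode("ascii")},
            ]},
        ],
    }

req = build_request("Implement the class shown in the diagram.", b"\x89PNG", "java")
```

The key point is structural: the diagram is a first-class input next to the prompt, not an OCR'd text substitute.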

Inside the 13.1-Million-Pair Training Set

Two-Stage Recipe

  • 「Stage 1」: 12.9 M samples from GitHub code snapshots, covering 50+ languages; images are code screenshots rendered with Pygments.
  • 「Stage 2」: 168 K curated task samples, covering 20+ languages; images are PlantUML & Mermaid diagrams.

How Each Diagram Problem Is Born

  1. 「Prototype」

    • A Python task is drafted by an LLM, then hand-polished.
    • Includes prompt, solution, and unit tests.
  2. 「Diagramming」

    • The solution is fed back into an LLM to generate PlantUML/Mermaid.
    • Human reviewers correct layout, remove text redundancy, and embed crucial details 「only」 in the diagram.
  3. 「Translation」

    • Nine volunteer engineers (Master’s/PhD level) translate each problem into nine additional languages (C#, Java, Kotlin, PHP, JavaScript, TypeScript, Ruby, Swift, Scala).
    • LLMs assist, but every line is manually executed against unit tests before acceptance.
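The acceptance rule in step 3—a translated problem is admitted only if its solution passes every unit test—can be sketched as a small gating harness. This is illustrative only (the authors' actual multilingual tooling runs each language's own toolchain, not Python `exec`):

```python
def passes_unit_tests(solution_src: str, test_src: str) -> bool:
    """Run a candidate solution, then its unit tests, in a shared namespace.
    Returns True only if every assertion passes (the acceptance criterion)."""
    ns: dict = {}
    try:
        exec(solution_src, ns)   # define the solution's functions/classes
        exec(test_src, ns)       # execute the assertions against them
        return True
    except Exception:            # any failure (syntax, runtime, assertion) rejects it
        return False

good = "def add(a, b):\n    return a + b\n"
bad  = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
```

Here `passes_unit_tests(good, tests)` accepts the correct translation while `passes_unit_tests(bad, tests)` rejects the buggy one—exactly the gate each of the nine language versions must clear.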

M2EVAL: A New Benchmark for “Look-&-Code”

Task Definition (Plain English)

「Input」

  • A short natural-language prompt.
  • One diagram image (PNG/SVG).

「Output」

  • Source code that compiles and passes 「all」 hidden unit tests.

Benchmark Snapshot

  • Total problems: 300
  • Languages covered: 10
  • Average prompt length: 89 tokens
  • Average solution length: 326 tokens
  • Average test cases per problem: 9

What 25+ Models Achieved—and Where They Failed

Overall Pass@1 Scores (Top 5)

  1. GPT-4o: 49.7 %
  2. Gemini-2.5-Flash: 48.7 %
  3. Doubao-1.5-thinking: 44.3 %
  4. Claude-3.5-Sonnet: 42.3 %
  9. 「M2-CODER (ours), 7 B」: 「25.3 %」

Even the best model tops out below 50 %, confirming that 「reading diagrams correctly is still hard」.
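Pass@1 here follows the standard convention for code benchmarks: generate n samples per problem, count the c that pass all tests, and estimate the chance that at least one of k drawn samples succeeds. A minimal sketch of the usual unbiased estimator (assuming M2EVAL uses the standard formulation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn from n generated samples of which c are correct, passes all tests."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single greedy sample per problem (n = k = 1), this collapses to the plain pass rate: `pass_at_k(1, 1, 1)` is 1.0 and `pass_at_k(1, 0, 1)` is 0.0, and the benchmark average is just the fraction of problems solved.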

Language-by-Language Heat Map

  • Python: best 66.7 % (Doubao-1.5-thinking); worst 3.3 % (MiniCPM-V-2.6)
  • JavaScript: best 53.3 % (multiple models); worst 6.7 % (MiniCPM-V-2.6)
  • Swift: best 50.0 % (Claude-3.5-Sonnet); worst 0.0 % (DeepSeek-V3, text-only)
  • Scala: best 40.0 % (GPT-4o); worst 3.3 % (Llama-3.2-vision)

Takeaway: 「Scripting languages dominate; strongly-typed languages expose every tiny naming or visibility error.」


Step-by-Step: Re-creating M2-CODER on Your Machine

1. Hardware & OS

  • Linux Ubuntu 22.04
  • 8×NVIDIA A800 80 GB (or any Ampere/Hopper GPUs totaling ≥ 640 GB)
  • Docker with 「nvidia-container-toolkit」

2. Docker Environment Setup

git clone https://github.com/MCEVAL/MMCoder.git
cd MMCoder/docker
docker build -t m2-coder .
docker run --gpus all -v $(pwd)/data:/workspace/data -it m2-coder

3. Data Download

# Full dataset (~1 TB)
cd /workspace/data
wget https://huggingface.co/datasets/MCEVAL/M2C-INSTRUCT/resolve/main/stage1.tar
wget https://huggingface.co/datasets/MCEVAL/M2C-INSTRUCT/resolve/main/stage2.tar
tar -xf stage1.tar && tar -xf stage2.tar

# Lightweight subset (~200 GB) for quick experiments
wget https://huggingface.co/datasets/MCEVAL/M2C-INSTRUCT/resolve/main/stage1_light.tar

4. Training (Two-Stage)

「Stage 1: Full-Parameter Fine-Tuning」

accelerate launch --config_file ds_config_zero2.yaml train.py \
  --model_name_or_path Qwen2-VL-7B-base \
  --dataset_name /workspace/data/stage1 \
  --output_dir checkpoints/stage1 \
  --per_device_train_batch_size 128 \
  --learning_rate 5e-5 \
  --num_train_epochs 1 \
  --max_seq_length 2048

「Stage 2: LLM-Only Fine-Tuning」

accelerate launch --config_file ds_config_zero2.yaml train.py \
  --model_name_or_path checkpoints/stage1 \
  --dataset_name /workspace/data/stage2 \
  --output_dir checkpoints/stage2 \
  --freeze_vision_tower true \
  --num_train_epochs 2 \
  --max_seq_length 6000

5. Local Evaluation

python evaluate.py \
  --model_path checkpoints/stage2 \
  --benchmark_path /workspace/data/M2EVAL \
  --languages python java javascript csharp
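Once `evaluate.py` finishes, per-language aggregation is straightforward. A hedged sketch, assuming results arrive as a list of per-problem records with `language` and `passed` fields (the script's actual output format may differ):

```python
from collections import defaultdict

def per_language_pass_rate(results: list) -> dict:
    """Aggregate pass@1 per language from individual problem outcomes."""
    totals = defaultdict(lambda: [0, 0])  # language -> [passed, total]
    for r in results:
        totals[r["language"]][1] += 1
        totals[r["language"]][0] += int(r["passed"])
    return {lang: passed / total for lang, (passed, total) in totals.items()}

results = [
    {"language": "python", "passed": True},
    {"language": "python", "passed": False},
    {"language": "java", "passed": True},
]
rates = per_language_pass_rate(results)  # python -> 0.5, java -> 1.0
```

This is how you would reproduce the language-by-language table above from your own checkpoint's raw outcomes.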

Real-World Use Cases

Scenario 1: Agile Sprint Planning

  • 「Input」: A hand-drawn sequence diagram showing user-login flow.
  • 「Output」: Java Spring controllers + service classes with all endpoints and DTOs.

Scenario 2: Legacy System Migration

  • 「Input」: 20-year-old PowerPoint with class diagrams of a COBOL module.
  • 「Output」: Python dataclasses + unit tests replicating the original behavior.

Scenario 3: CS Classroom

  • 「Input」: Professor sketches a binary-tree deletion algorithm on the whiteboard.
  • 「Output」: C++ implementation ready for students to compile and step through.

Limitations & Ethical Notes

  • 「Language Coverage」: currently 10 languages; the roadmap adds 20 more by Q4.
  • 「Task Scope」: only code generation for now; debugging & refactoring are future work.
  • 「Data License」: all GitHub sources filtered for OSI-approved licenses; commercial use allowed with attribution.
  • 「Human Oversight」: every diagram reviewed by two annotators; a 100 % unit-test pass is required.

Key Takeaways for Developers, Students, and Managers

For Developers

  • 「Immediate」: Use M2-CODER to bootstrap boilerplate from whiteboard photos.
  • 「Medium-term」: Contribute diagrams to enlarge the open dataset.

For Students

  • 「Learning Aid」: Paste your UML homework into the demo UI and compare the generated code with your own.
  • 「Skill Bridge」: Understand how high-level design translates into idiomatic syntax across languages.

For Engineering Managers

  • 「ROI」: Early adopters report 「30 % faster prototyping」 for green-field features.
  • 「Compliance」: On-prem Docker image keeps code and diagrams inside your firewall.

“The gap between napkin sketches and production code just got a lot smaller.”