EMMA: The Most Impressive Unified Multimodal Model of 2025 (And It’s Only 4B Parameters)

Every week in 2025, someone drops a new “unified vision-generation” model and claims the throne. Most of them are 7–13B behemoths that eat 4–8k visual tokens per image and still struggle with basic image editing.

Then Huawei Noah’s Ark Lab quietly uploaded a 4B-parameter model called EMMA that beats almost every public 7B unified model across understanding, text-to-image generation, and image editing — while using only 20% of the visual tokens of its competitors.

This isn’t marketing fluff. These are head-to-head numbers from the paper.

What Makes EMMA So Ridiculously Efficient?

1. A 32× Compression Autoencoder (DCAE) That Changes Everything

Most unified models today (BAGEL, Janus, UniWorld, etc.) use the classic 8× VAE from Stable Diffusion XL + 2×2 token merging → maximum 16× compression.
A single 1024×1024 image ends up as ~4096 visual tokens. That kills context length and VRAM.

EMMA throws that playbook away.

They use a brand-new Deep Compression Autoencoder (DCAE) with a native 32× compression ratio.
The same 1024×1024 image becomes only 1024 visual tokens.
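
Quick back-of-envelope check of those token counts:

```python
# Visual-token arithmetic for a 1024x1024 input image.
side = 1024
tokens_16x = (side // 16) ** 2   # 8x VAE + 2x2 token merging -> 64 * 64 = 4096 tokens
tokens_32x = (side // 32) ** 2   # 32x DCAE                   -> 32 * 32 = 1024 tokens
print(tokens_16x, tokens_32x)    # 4096 1024
```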

Even better: they force the understanding branch (SigLIP2 + pixel-shuffle) to use the exact same 32× ratio.
Result? The tokens from the understanding encoder and the generation encoder have identical spatial dimensions → they can be fused via channel-wise concatenation instead of wasteful sequence concatenation.
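
To make that concrete, here's a minimal PyTorch sketch of the two fusion options. The feature dimensions and layer names below are illustrative assumptions, not the paper's exact values:

```python
import torch
import torch.nn as nn

# Both branches compress 1024x1024 -> a 32x32 grid, i.e. 1024 tokens each.
B, H, W = 1, 32, 32
d_und, d_gen, d_model = 1152, 256, 2048   # assumed dims for illustration

und_tokens = torch.randn(B, H * W, d_und)  # understanding branch (SigLIP2 + pixel-shuffle)
gen_tokens = torch.randn(B, H * W, d_gen)  # generation branch (32x DCAE latents)

# Option A: sequence concatenation -- the sequence length doubles to 2048 tokens.
proj_u, proj_g = nn.Linear(d_und, d_model), nn.Linear(d_gen, d_model)
seq_fused = torch.cat([proj_u(und_tokens), proj_g(gen_tokens)], dim=1)
print(seq_fused.shape)   # torch.Size([1, 2048, 2048])

# Option B: channel-wise concatenation -- only possible because the two grids match,
# and the token count stays at 1024.
fuse = nn.Linear(d_und + d_gen, d_model)
chan_fused = fuse(torch.cat([und_tokens, gen_tokens], dim=-1))
print(chan_fused.shape)  # torch.Size([1, 1024, 2048])
```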

In image editing tasks, EMMA uses ~1/5 the visual tokens of BAGEL-7B yet produces cleaner, more consistent results.

2. Shared-and-Decoupled Architecture (Best of Both Worlds)

Understanding needs strong semantics. Generation needs semantics + high-frequency details. Forcing total parameter sharing hurts both sides.

EMMA’s clever compromise:

  • Shallow layers → fully shared (cross-task knowledge transfer)
  • Deep layers → fully decoupled (separate Transformers for understanding vs generation)
  • Even in shared shallow layers, the Value projection stays task-specific

Think of it as “dating in the first half of the network, living separately in the second half.”
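
Here's a rough PyTorch sketch of the idea: shallow blocks shared across tasks with only the Value projection kept per-task, then fully separate deep blocks. Layer counts, dimensions, and names are my own illustrative assumptions, not EMMA's released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBlockWithTaskValue(nn.Module):
    """Shallow block: Q/K/output/MLP shared across tasks, Value projection per-task.
    An illustration of the idea, not EMMA's actual module."""
    def __init__(self, d_model: int, n_heads: int, tasks=("und", "gen")):
        super().__init__()
        self.n_heads = n_heads
        self.q = nn.Linear(d_model, d_model)   # shared
        self.k = nn.Linear(d_model, d_model)   # shared
        self.v = nn.ModuleDict({t: nn.Linear(d_model, d_model) for t in tasks})  # task-specific
        self.o = nn.Linear(d_model, d_model)   # shared
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))                # shared
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        h = self.norm1(x)
        B, N, D = h.shape
        split = lambda t: t.view(B, N, self.n_heads, D // self.n_heads).transpose(1, 2)
        attn = F.scaled_dot_product_attention(split(self.q(h)), split(self.k(h)),
                                              split(self.v[task](h)))
        x = x + self.o(attn.transpose(1, 2).reshape(B, N, D))
        return x + self.mlp(self.norm2(x))

class SharedAndDecoupled(nn.Module):
    """Shallow layers fully shared (with per-task V), deep layers fully decoupled."""
    def __init__(self, d_model=512, n_heads=8, n_shared=4, n_deep=4):
        super().__init__()
        self.shared = nn.ModuleList([SharedBlockWithTaskValue(d_model, n_heads)
                                     for _ in range(n_shared)])
        self.deep = nn.ModuleDict({
            task: nn.ModuleList([nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                                 for _ in range(n_deep)])
            for task in ("und", "gen")})

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        for blk in self.shared:
            x = blk(x, task)         # cross-task knowledge transfer
        for blk in self.deep[task]:
            x = blk(x)               # task-specific specialization
        return x

model = SharedAndDecoupled()
tokens = torch.randn(2, 1024, 512)
print(model(tokens, task="und").shape, model(tokens, task="gen").shape)
```

The point of keeping V task-specific even in shared layers: both tasks read the same attention pattern from shared Q/K, but each gets to write its own representation.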

3. Mixture-of-Experts Inside the Vision Encoder

STEM images (charts, equations, diagrams) have always been the Achilles’ heel of general vision encoders.

EMMA adds a tiny ~50M-parameter STEM expert on top of SigLIP2. A lightweight router decides on-the-fly whether to use the general expert or the STEM expert.
The STEM expert is only trained in the final stage → almost zero cost, massive gain on MMMU, MathVista, ScienceQA, etc.
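
A minimal sketch of what such a two-expert setup could look like, assuming a soft per-token router and made-up dimensions (the paper's actual routing scheme and sizes may differ):

```python
import torch
import torch.nn as nn

class VisionMoE(nn.Module):
    """Two-expert vision MoE sketch: a lightweight router mixes a general expert
    and a small STEM expert per token. Names and dims are illustrative assumptions."""
    def __init__(self, d_model: int = 1152, d_stem: int = 512):
        super().__init__()
        self.general_expert = nn.Sequential(   # stands in for the general SigLIP2 path
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.stem_expert = nn.Sequential(      # small extra expert in the spirit of the ~50M add-on
            nn.Linear(d_model, d_stem), nn.GELU(), nn.Linear(d_stem, d_model))
        self.router = nn.Linear(d_model, 2)    # per-token gate over the two experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.router(x), dim=-1)            # (B, N, 2)
        return (gates[..., :1] * self.general_expert(x)
                + gates[..., 1:] * self.stem_expert(x))

moe = VisionMoE()
patch_tokens = torch.randn(1, 1024, 1152)     # e.g. tokens from a chart-heavy image
print(moe(patch_tokens).shape)                # torch.Size([1, 1024, 1152])
```

Because the router and STEM expert sit on top of the frozen general encoder, training them in the final stage leaves everything else untouched.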

Cold Hard Numbers (All from the Paper)

| Model | Params | MMBench | MMMU | MMVet | GenEval (no rewriting) | DPG-Bench | GEdit-Bench-EN |
|---|---|---|---|---|---|---|---|
| BAGEL | 7B | 85.0 | 55.3 | 67.2 | 0.88 | 85.07 | 6.52 |
| Mogao | 7B | 75.0 | 44.2 | — | 0.89 | 84.33 | — |
| UniWorld-V1 | 7B | 83.5 | 58.6 | 67.1 | 0.84 | 81.38 | 4.85 |
| OmniGen2 | ~3B | 79.1 | 53.1 | 61.8 | 0.86 | 83.57 | 6.42 |
| EMMA | 4B | 85.8 | 62.5 | 73.0 | 0.93 | 85.63 | 6.53 |
  • MMVet: 73.0 (BAGEL-7B only reached 67.2)
  • GenEval: 0.93 without any prompt rewriting or RL (Qwen-Image: 0.91 with rewriting)
  • Image editing: 5× fewer visual tokens than BAGEL yet slightly higher score

A 4B model is embarrassing 7B models in 2025. Let that sink in.

Training Data Breakdown (Transparency FTW)

| Category | Alignment | Pre-training | SFT | Quality Tuning (QT) | STEM Expert | Approx. Total |
|---|---|---|---|---|---|---|
| Understanding (I2T) | 0.56M | ~520M | ~120M | 1M | 15M | ~540M |
| Text-to-Image (T2I) | — | ~705M | ~105M | 0.15M | — | ~705M |
| Image-to-Image Editing | — | ~12M | 0.35M (synthetic) | — | — | ~12.35M |

Important note: the authors deliberately refused to use the popular GPT-Image-Edit-1.5M dataset because it destroys subject consistency even though it pumps GEdit scores. They chose real editing quality over leaderboard gaming.

Emergent Abilities That Blew My Mind

  1. Zero-shot Chinese generation & editing
    No Chinese text-to-image or editing data was used at all — yet EMMA perfectly understands and executes Chinese prompts because the understanding data contained Chinese captions.

  2. Complex multi-step editing from single-step training
    Trained only on single-sentence edits, but handles complex chained instructions flawlessly thanks to Chain-of-Thought data in the understanding stage.

(Examples in Figure 5 of the paper are genuinely impressive)

Frequently Asked Questions (FAQ)

Q: Is EMMA open-source yet?
Not yet. The project page (https://emma-umm.github.io/emma/) is live and the authors say “code & weights coming soon”. Fingers crossed for Apache 2.0.

Q: How fast is inference on consumer GPUs?
Thanks to only ~1024 visual tokens, a single RTX 4090 can run text-to-image and editing comfortably. The paper claims ~3–5× faster than BAGEL-7B on the same hardware.

Q: How does raw image quality compare to FLUX or Qwen-Image?
Pure text-to-image aesthetics are slightly behind top-tier diffusion specialists (32× VAE is aggressive), but the gap is surprisingly small, and the unified understanding + editing ability is unmatched.

Q: Will there be 8B/13B versions?
The entire paper is framed as “proof of efficiency-first design”. Since 4B already outperforms 7B competitors, larger versions are almost certainly in the pipeline.

Final Thoughts

EMMA proves you don’t need 13B parameters, 8k visual tokens, or billions of synthetic editing pairs to dominate multimodal benchmarks.

Sometimes the biggest leaps come from asking the right question:
“How little can we get away with while still winning?”

The answer, apparently, is “a lot less than everyone thought.”

If you’re building agents, creative tools, or any product that needs to see, reason, draw, and edit — keep an eye on this project. When the weights drop, the entire open-source multimodal landscape is going to feel it.

Project page (highly recommended): https://emma-umm.github.io/emma/
Paper: https://arxiv.org/abs/2512.04810

2025 just got a lot more interesting.
