EMMA: The 4B Multimodal AI That Outperforms 7B Rivals in Vision & Generation

1 months ago 高效码农

EMMA: The Most Impressive Unified Multimodal Model of 2025 (And It’s Only 4B Parameters) Every week in 2025, someone drops a new “unified vision-generation” model and claims the throne. Most of them are 7–13B behemoths that eat 4–8k visual tokens per image and still struggle with basic image editing. Then Huawei Noah’s Ark Lab quietly uploaded a 4B-parameter model called EMMA that beats almost every public 7B unified model across understanding, text-to-image generation, and image editing — while using only 20% of the visual tokens of its competitors. This isn’t marketing fluff. These are head-to-head numbers from the paper. What …