Emu3.5 in Plain English: One Autoregressive Model for Images, Text, and World Simulation

What’s the big deal?
Emu3.5 treats images, text, and video frames as one long token stream and learns to predict the next token—nothing else. The result is a single checkpoint that can chat, draw, edit, tell stories, give step-by-step visual tutorials, explore imaginary worlds, and even plan robot actions—without any task-specific heads.


Table of Contents

  1. Quick Glance
  2. Why “Next Token” Works for Pictures
  3. Training Diet: 13 Trillion Multimodal Tokens
  4. Post-Training Magic: RL That Knows Beauty, OCR, Physics
  5. DiDA: Waiting 10 s Instead of 200 s for a 1024×1024 Image
  6. Native Abilities in Action
  7. Local Inference: 30-Minute Walk-Through
  8. Benchmarks at a Glance
  9. Limitations & What’s Next
  10. Action Checklist
  11. One-Page Overview
  12. FAQ

1. Quick Glance

Spec | Value
Parameters | 34.1 B (31.2 B transformer, 2.9 B embeddings)
Visual tokens | 131 k codebook, 16× down-sampling
Text tokens | 151 k (Qwen base)
Context length | 32 768
Pre-train data | 13 T tokens (10 T stage-1 + 3 T stage-2)
RL data | 100 k prompts across 7 task families
Inference speed | 20× faster with DiDA (≈ 10 s for 1024×1024)
Output resolution | Up to 2048×2048 px, any aspect ratio
Code & weights | Fully open-source, MIT-style licence

2. Why “Next Token” Works for Pictures

Core question: “How can predicting the next discrete token create a coherent image?”

One-sentence answer: Every image is first quantized into a short sequence of visual tokens; once the model learns to predict the next visual token just like the next word, generation becomes a unified streaming problem.

  • Tokenizer: IBQ framework, 455 M params, 16× down-sampling → 512×512 image ≈ 1 k tokens (see the sketch below).
  • Compression vs. quality: the vanilla decoder already beats earlier 4×-larger codes; an optional diffusion decoder doubles resolution and restores fine text.
  • Attention mask: causal for text and for inter-image relations, but bidirectional inside one image during DiDA, so parallel refinement never breaks the autoregressive property across modalities.
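
To make the quantization step concrete, here is a minimal, self-contained PyTorch sketch. It is not the real IBQ tokenizer: the tiny codebook, the patch-averaging "encoder", and the random projection are stand-ins chosen only to show how a 512×512 image turns into roughly 1 k discrete indices that can be concatenated with text ids into one stream.

import torch

# Toy stand-in for the IBQ tokenizer (the real one has 455 M params and ~131 k codes).
codebook_size, dim = 4096, 64
codebook = torch.randn(codebook_size, dim)
proj = torch.randn(3, dim)                                 # toy patch-feature projection

def image_to_tokens(image: torch.Tensor) -> torch.Tensor:
    """Map a (3, H, W) image to discrete codebook indices, 16x down-sampled."""
    _, h, w = image.shape
    gh, gw = h // 16, w // 16
    patches = image.unfold(1, 16, 16).unfold(2, 16, 16)    # (3, gh, gw, 16, 16)
    feats = patches.reshape(3, gh, gw, -1).mean(-1)        # crude per-patch feature
    feats = feats.permute(1, 2, 0).reshape(gh * gw, 3) @ proj
    return torch.cdist(feats, codebook).argmin(dim=-1)     # one token per 16x16 patch

visual_tokens = image_to_tokens(torch.rand(3, 512, 512))   # 32*32 = 1,024 tokens
text_tokens = torch.tensor([101, 2023, 2003, 102])         # placeholder text ids
stream = torch.cat([text_tokens, visual_tokens])           # one sequence, two modalities
print(stream.shape)                                        # torch.Size([1028])

Once images are just another run of ids in the stream, the training objective for a picture is literally the same next-token cross-entropy used for the words around it.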

Scenario:
I give the model a 256×256 crop of a busy kitchen and the text:
“Continue the comic in vintage style: the chef flips pancakes, flame jumps up.”
Emu3.5 streams 2 k new tokens; 8 s later I receive a 4-panel comic, chef silhouette and flame hue perfectly style-consistent. No ControlNet, no U-Net, just next-token.

Author’s reflection: The first time I saw a single loss curve drop for both Shakespeare and 4 k images, I finally believed modality is just an illusion of format.


3. Training Diet: 13 Trillion Multimodal Tokens

Core question: “What exactly was fed into the model?”

One-sentence answer: 63 M long videos, 500 M image-text pairs, 27 M any-to-image samples, and 3 T plain text tokens—heavily filtered, re-captioned, and packed into 32 k-token windows.

3.1 Video-Interleaved Corpus (≈ 55 % of tokens)

  • Average 6.5 min clips, 0.27 keyframes/s, Whisper-v2 transcripts aligned to 0.1 s.
  • Basic filters: remove < 480 p, talking-head clips, multi-lingual or silent outliers.
  • Advanced filters: DeQA perceptual score, DINO feature de-duplication, LLM text grade.
  • Second-stage enrichment: scene summaries, visual captions, multimodal abstracts.

Scenario:
A 3-min YouTube cooking excerpt becomes 48 keyframes + 640 text tokens. During training the model sees:
[TEXT]"add olive oil" → [IMAGE: pan] → [TEXT]”swirl to coat” → [IMAGE: shiny surface]…
It learns temporal causality—useful later for Visual Guidance and robot key-frame planning.
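
The packing step can be sketched in a few lines. Everything below is illustrative: the Segment class, the BOI/EOI special ids, and the token counts are invented for the example; the article only tells us that keyframes and transcript text are interleaved chronologically into windows of up to 32 768 tokens.

from dataclasses import dataclass

# Illustrative only: invented segment type and special ids; the real corpus format
# is not published in this article.
@dataclass
class Segment:
    kind: str        # "text" or "image"
    tokens: list     # discrete ids from the text or vision tokenizer

BOI, EOI = 151650, 151651    # hypothetical begin/end-of-image markers

def pack_window(segments, max_len=32768):
    """Flatten alternating text/keyframe segments into one training window."""
    window = []
    for seg in segments:
        piece = seg.tokens if seg.kind == "text" else [BOI, *seg.tokens, EOI]
        if len(window) + len(piece) > max_len:
            break                     # a real pipeline would start a new window here
        window.extend(piece)
    return window

clip = [
    Segment("text",  [11, 12, 13]),       # "add olive oil"
    Segment("image", list(range(1024))),  # ~1 k visual tokens for the pan keyframe
    Segment("text",  [14, 15, 16]),       # "swirl to coat"
    Segment("image", list(range(1024))),  # shiny-surface keyframe
]
print(len(pack_window(clip)))             # 2058 tokens in this toy clip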

3.2 Paired & Synthetic Data

  • Image-text pairs re-captioned by Qwen2.5-VL for denser nouns and attributes.
  • Synthetic T2I outputs from open-source diffusion models augment artistic styles.
  • Video-text pairs motion-scored to keep dynamic clips; packed chronologically to mimic long sequences.

3.3 Text-Only (≈ 18 %)

Keeps linguistic reasoning alive while visual tokens dominate compute; loss re-weighted 0.5 : 1 for vision : text.
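
A minimal sketch of what that 0.5 : 1 re-weighting could look like in code, assuming a per-token cross-entropy and a boolean mask marking visual positions (both assumptions; the training code itself is not part of this release):

import torch
import torch.nn.functional as F

def weighted_next_token_loss(logits, targets, is_visual, vision_weight=0.5):
    """Standard next-token cross-entropy with visual positions down-weighted 0.5:1.

    logits:    (seq_len, vocab_size) model outputs
    targets:   (seq_len,) next-token ids
    is_visual: (seq_len,) bool mask, True where the target is a visual token
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(is_visual,
                          torch.full_like(per_token, vision_weight),
                          torch.ones_like(per_token))
    return (weights * per_token).sum() / weights.sum()

# Toy usage: 6 positions, the last 4 predict visual tokens.
logits = torch.randn(6, 1000)
targets = torch.randint(0, 1000, (6,))
is_visual = torch.tensor([False, False, True, True, True, True])
print(weighted_next_token_loss(logits, targets, is_visual).item())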


4. Post-Training Magic: RL That Knows Beauty, OCR, Physics

Core question: “How do you refine a generic generative model into a polished product?”

One-sentence answer: Large-scale reinforcement learning with a multi-component reward signal—general aesthetics, task-specific metrics (OCR accuracy, face similarity, layout fidelity), all normalized and combined into one scalar reward.

  • RL algorithm: Group Relative Policy Optimisation (GRPO), batch 640, lr 1×10⁻⁶.
  • Prompt pool: 100 k high-quality, 58 k X2I, 50 k T2I, plus 1 k human preference samples.
  • Reward components:
    • CLIP & SigLIP alignment
    • Aesthetic scorer
    • OCR & layout reward for text-rich generation
    • Face identity for human edits
    • Physics & object consistency judged by VLM probes

Scenario:
Prompt: “Replace the coffee mug with a glass of iced tea, keep hand pose.”
Without RL, fingers often mutate; with RL the hand IoU reward term pushes finger shape fidelity from 0.71 → 0.87 in internal tests.

Author’s reflection: Reward engineering felt like tuning a 10-band equaliser—boost OCR too much and landscapes over-saturate; crank face similarity alone and colours drift. The sweet spot sat at equal weight after min-max normalisation.
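
As a concrete illustration of "equal weight after min-max normalisation", here is a small sketch; the component names and scores are made up, and the production reward model is certainly more involved than a plain mean.

def combine_rewards(raw_scores):
    """Min-max normalize each reward component across the batch, then average them."""
    n = len(next(iter(raw_scores.values())))
    normalized = {}
    for name, scores in raw_scores.items():
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0                    # guard against a constant component
        normalized[name] = [(s - lo) / span for s in scores]
    # Equal weights after normalisation: a plain mean across components.
    return [sum(vals[i] for vals in normalized.values()) / len(normalized)
            for i in range(n)]

batch = {
    "aesthetics":   [0.62, 0.81, 0.55, 0.90],      # toy scores for 4 rollouts
    "ocr_accuracy": [0.95, 0.40, 0.88, 0.71],
    "face_id":      [0.77, 0.79, 0.60, 0.83],
}
print(combine_rewards(batch))                      # one scalar reward per rollout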


5. DiDA: Waiting 10 s Instead of 200 s for a 1024×1024 Image

Core question: “Can autoregressive ever be fast enough for production?”

One-sentence answer: Discrete Diffusion Adaptation keeps the pretrained weights frozen, adds lightweight adapter layers, and turns sequential visual decoding into parallel denoising—20× faster with no drop in benchmark scores.

  • Training data: self-distillation on 3 B image-text tokens; noisy image tokens attend bidirectionally, text stays causal (see the mask sketch below).
  • Scheduler: a finite-state machine inside FlagScale manages text-phase ↔ image-phase kernel launches and memory pre-allocation.
  • Quantisation: FP8-ready, 50 % kernel-latency cut on a 4-GPU node.
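
A toy version of that hybrid mask, assuming we can label each position as text or image; the kernels, block layout, and special tokens of the real FlagScale implementation are not described in this article.

import torch

def dida_attention_mask(kinds):
    """Boolean mask (True = may attend): causal everywhere, except that positions
    inside the image block being denoised attend to each other bidirectionally."""
    n = len(kinds)
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))   # causal baseline
    image_pos = [i for i, k in enumerate(kinds) if k == "image"]
    for i in image_pos:
        for j in image_pos:
            mask[i, j] = True            # full attention among image tokens
    return mask

# A short text prompt followed by an image being refined in parallel.
print(dida_attention_mask(["text", "text", "text", "image", "image", "image"]).int())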

Scenario:
A marketing team needs 500 banner variants (product + slogan) by end-of-day. DiDA lets one 8-GPU node finish the job in under 45 min; vanilla AR would take overnight.


6. Native Abilities in Action

6.1 Text-to-Image & Text-Rich Generation

  • TIIF-Bench mini: best average score among open and closed models.
  • LeX-Bench (hard): 0.87 vs Gemini 2.5 Flash 0.74.
  • Outputs accurate Chinese poems, mathematical formulas, dense signage.

6.2 Any-to-Image (X2I) Editing

Single model handles: add/remove object, material swap, pose change, style transfer, background replacement, multi-subject consistency.
Benchmark highlights:
ImgEdit overall 4.41, GEdit-Bench 7.59, OmniContext 8.82, ICE-Bench 0.637—leading or on par with Gemini 2.5 Flash & GPT-image-1.

Use-case:
E-commerce vendor uploads dress photo + text “put the same dress on a model walking in Paris at sunset” → one prompt, one model, 12 s, ready for listing.

6.3 Visual Narrative

Zero-shot creation of 5–12 panel stories, characters locked, style free (anime, vintage photo, 3-D render). Automatic global & local chain-of-thought annotations keep plot coherent.

6.4 Visual Guidance

From a single reference image the model writes and illustrates a step-by-step tutorial: cooking, DIY, software workflows. Evaluated on 960 k real-world instruction clips; wins 51.5 % against Gemini in blind preference test.

6.5 World Exploration

Two modes:

  • User-Interactive: each text command moves the camera; the model returns a new view + narration.
  • Free-Exploration: the model auto-generates a continuous walk-through.

Spatial consistency is maintained over 40+ steps; win-rate 65.5 % vs Gemini.

6.6 Embodied Manipulation

Decomposes long-horizon tasks into language instructions + key-frame images. Tested on 331 synthetic and real robot scenes; beats Gemini on physical plausibility and background consistency.


7. Local Inference: 30-Minute Walk-Through

Step 1 — Clone & Install

git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation

Step 2 — Edit config

# configs/config.py
model_path = "BAAI/Emu3.5"
vq_path    = "BAAI/Emu3.5-VisionTokenizer"
task_type  = "t2i"        # or x2i, story, howto, explore, vla
use_image  = False        # True if you supply reference images
sampling_params = dict(
    temperature=0.7,
    cfg=3.0,
    top_p=0.9
)

Step 3 — Run

python inference.py --cfg configs/config.py
# outputs/<exp_name>/proto/*.proto

Step 4 — Visualise

python src/utils/vis_proto.py \
  --input outputs/demo/proto/00001.proto \
  --output viz/

Hardware note: ≥2 GPUs recommended; FlagScale handles tensor & context parallelism automatically.
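
For reference, a config variant for an editing (X2I) run might look like the sketch below. The first five keys mirror the config shown in Step 2; the reference_image key is a guess for illustration only, so check the example configs shipped with the repo for the exact field name your checkout expects.

# configs/config.py — hypothetical X2I variant (field names beyond Step 2 are guesses)
model_path = "BAAI/Emu3.5"
vq_path    = "BAAI/Emu3.5-VisionTokenizer"
task_type  = "x2i"                    # editing instead of plain text-to-image
use_image  = True                     # we supply a reference image
reference_image = "assets/dress.jpg"  # hypothetical key; see the repo's example configs
sampling_params = dict(
    temperature=0.7,
    cfg=3.0,                          # classifier-free guidance; 2–5 is the suggested range
    top_p=0.9,
)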


8. Benchmarks at a Glance

Task | Dataset | Metric | Emu3.5 | Best competitor | Competitor score | Gap
T2I | GenEval | overall | 0.86 | FLUX.1 dev | 0.71 | +21 %
Text render | LeX hard | recall | 0.87 | Gemini 2.5 | 0.74 | +18 %
Editing | ImgEdit | overall | 4.41 | Gemini 2.5 | 4.28 | +3 %
Subject-driven | OmniContext | avg | 8.82 | GPT-4o | 8.80 | +0.02
Story | OpenING | win % | 49.2 | Gemini 2.5 | 40.5 | +8.7
Guidance | internal | win % | 51.5 | Gemini 2.5 | 39.0 | +12.5
World exploration | internal | win % | 65.5 | Gemini 2.5 | 34.5 | +31
Robot plan | internal | win % | 67.1 | Gemini 2.5 | 30.5 | +36.6

9. Limitations & What’s Next

  • Token compression still ~1 k for 512×512; target 256.
  • Video clips limited to 5 s; longer narratives need sparser key-frame scheduling.
  • Outdoor robot data scarce; sim-to-real gap remains.
  • Chinese long-text rendering slightly behind English; more OCR reward data planned.

10. Action Checklist

  1. Check GPU count (≥2 recommended).
  2. git clone repo, install wheels including flash-attn.
  3. Download weights from Hugging Face (BAAI/Emu3.5, BAAI/Emu3.5-VisionTokenizer).
  4. Pick task in config: t2i, x2i, story, howto, explore, vla.
  5. Set use_image=True if you supply a reference image.
  6. Launch inference.py; monitor protobuf outputs.
  7. Convert proto to jpg/png with vis_proto.py.
  8. Tweak cfg 2–5 and temperature 0.5–1.0 for quality vs diversity.
  9. For speed-critical prod, request DiDA checkpoint and enable FP8.
  10. Keep prompts < 800 tokens; leave 2 k tokens for visual output to avoid truncation.

11. One-Page Overview

  • Emu3.5 = one 34 B autoregressive model that eats and produces mixed image-text sequences.
  • Trained on 13 T tokens (video, image, text) with standard next-token loss; vision loss weighted 0.5.
  • RL stage uses blended rewards (aesthetic, OCR, face, physics) to polish generation.
  • DiDA adapter delivers a 20× speed-up; a 1024×1024 image in ~10 s on 4×A100.
  • Zero-shot abilities: T2I, X2I editing, visual stories, step-by-step tutorials, world exploration, robot task plans.
  • Open-source: weights, tokenizer, inference stack under a permissive licence.
  • Hardware: FP16 needs 2×40 GB; FP8 fits 2×24 GB.
  • Leaderboards: leads or matches Gemini 2.5 Flash, FLUX.1, GPT-image-1 on core image tasks and surpasses them on interleaved generation.

12. FAQ

Q1. Do I need to retrain for my domain images?
No. X2I and story tasks are zero-shot; provide reference images and text prompts.

Q2. Is commercial use allowed?
Yes, the licence is MIT-compatible, but ensure your input data complies with local rules.

Q3. How big is the download?
~70 GB for the 34 B model, ~2 GB for the tokenizer, ~1 GB code.

Q4. Can DiDA run on a single GPU?
Technically yes, but 20× speed-up shows best with ≥4 GPUs and FlagScale hybrid parallelism.

Q5. What’s the max output resolution?
2048×2048 px tested; 4096 px works but needs >8 k visual tokens—slow on AR, okay on DiDA.

Q6. Does it do video generation?
Up to 5 s clips at 720 p via diffusion-based video decoder conditioned on key-frame tokens—still research stage.

Q7. How do I control randomness?
Use temperature 0.3–1.0 and top_p 0.9; classifier-free guidance cfg 2–5 for image fidelity.

Q8. Is there a smaller model?
Not yet; the team plans 8 B and 3 B distillations later this year.
