Emu3.5 in Plain English: One Autoregressive Model for Images, Text, and World Simulation

What’s the big deal?
Emu3.5 treats images, text, and video frames as one long token stream and learns to predict the next token—nothing else. The result is a single checkpoint that can chat, draw, edit, tell stories, give step-by-step visual tutorials, explore imaginary worlds, and even plan robot actions—without any task-specific heads.


Table of Contents

  1. Quick Glance
  2. Why “Next Token” Works for Pictures
  3. Training Diet: 13 Trillion Multimodal Tokens
  4. Post-Training Magic: RL That Knows Beauty, OCR, Physics
  5. DiDA: Waiting 10 s Instead of 200 s for a 1024×1024 Image
  6. Native Abilities in Action
  7. Local Inference: 30-Minute Walk-Through
  8. Benchmarks at a Glance
  9. Limitations & What’s Next
  10. Action Checklist
  11. One-Page Overview
  12. FAQ

1. Quick Glance

Spec | Value
Parameters | 34.1 B (31.2 B transformer, 2.9 B embeddings)
Visual tokens | 131 k codebook, 16× down-sampling
Text tokens | 151 k (Qwen base)
Context length | 32 768
Pre-train data | 13 T tokens (10 T stage-1 + 3 T stage-2)
RL data | 100 k prompts across 7 task families
Inference speed | 20× faster with DiDA (≈ 10 s for 1024×1024)
Output resolution | Up to 2048×2048 px, any aspect ratio
Code & weights | Fully open-source, MIT-style licence

2. Why “Next Token” Works for Pictures

Core question: “How can predicting the next discrete token create a coherent image?”

One-sentence answer: Every image is first quantized into a short sequence of visual tokens; once the model learns to predict the next visual token just like the next word, generation becomes a unified streaming problem.

  • Tokenizer: IBQ framework, 455 M params, 16× down-sampling → 512×512 image ≈ 1 k tokens (see the sketch below).
  • Compression vs. quality: the vanilla decoder already beats earlier 4×-larger codes; an optional diffusion decoder doubles resolution and restores fine text.
  • Attention mask: causal for text and for inter-image relations, but bidirectional inside one image during DiDA, so parallel refinement never breaks the autoregressive property across modalities.
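
To make the quantization step concrete, here is a minimal, self-contained PyTorch sketch. It is not the real IBQ tokenizer: the tiny codebook, the patch-averaging "encoder", and the random projection are stand-ins chosen only to show how a 512×512 image turns into roughly 1 k discrete indices that can be concatenated with text ids into one stream.

import torch

# Toy stand-in for the IBQ tokenizer (the real one has 455 M params and ~131 k codes).
codebook_size, dim = 4096, 64
codebook = torch.randn(codebook_size, dim)
proj = torch.randn(3, dim)                                 # toy patch-feature projection

def image_to_tokens(image: torch.Tensor) -> torch.Tensor:
    """Map a (3, H, W) image to discrete codebook indices, 16x down-sampled."""
    _, h, w = image.shape
    gh, gw = h // 16, w // 16
    patches = image.unfold(1, 16, 16).unfold(2, 16, 16)    # (3, gh, gw, 16, 16)
    feats = patches.reshape(3, gh, gw, -1).mean(-1)        # crude per-patch feature
    feats = feats.permute(1, 2, 0).reshape(gh * gw, 3) @ proj
    return torch.cdist(feats, codebook).argmin(dim=-1)     # one token per 16x16 patch

visual_tokens = image_to_tokens(torch.rand(3, 512, 512))   # 32*32 = 1,024 tokens
text_tokens = torch.tensor([101, 2023, 2003, 102])         # placeholder text ids
stream = torch.cat([text_tokens, visual_tokens])           # one sequence, two modalities
print(stream.shape)                                        # torch.Size([1028])

Once images are just another run of ids in the stream, the training objective for a picture is literally the same next-token cross-entropy used for the words around it.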

Scenario:
I give the model a 256×256 crop of a busy kitchen and the text:
“Continue the comic in vintage style: the chef flips pancakes, flame jumps up.”
Emu3.5 streams 2 k new tokens; 8 s later I receive a 4-panel comic, chef silhouette and flame hue perfectly style-consistent. No ControlNet, no U-Net, just next-token.

Author’s reflection: The first time I saw a single loss curve drop for both Shakespeare and 4 k images, I finally believed modality is just an illusion of format.


3. Training Diet: 13 Trillion Multimodal Tokens

Core question: “What exactly was fed into the model?”

One-sentence answer: 63 M long videos, 500 M image-text pairs, 27 M any-to-image samples, and 3 T plain text tokens—heavily filtered, re-captioned, and packed into 32 k-token windows.

3.1 Video-Interleaved Corpus (≈ 55 % of tokens)

  • Average 6.5 min clips, 0.27 keyframes/s, Whisper-v2 transcripts aligned to 0.1 s.
  • Basic filters: remove < 480 p, talking-head clips, multi-lingual or silent outliers.
  • Advanced filters: DeQA perceptual score, DINO feature de-duplication, LLM text grade.
  • Second-stage enrichment: scene summaries, visual captions, multimodal abstracts.

Scenario:
A 3-min YouTube cooking excerpt becomes 48 keyframes + 640 text tokens. During training the model sees:
[TEXT]"add olive oil" → [IMAGE: pan] → [TEXT]”swirl to coat” → [IMAGE: shiny surface]…
It learns temporal causality—useful later for Visual Guidance and robot key-frame planning.
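
The packing step can be sketched in a few lines. Everything below is illustrative: the Segment class, the BOI/EOI special ids, and the token counts are invented for the example; the article only tells us that keyframes and transcript text are interleaved chronologically into windows of up to 32 768 tokens.

from dataclasses import dataclass

# Illustrative only: invented segment type and special ids; the real corpus format
# is not published in this article.
@dataclass
class Segment:
    kind: str        # "text" or "image"
    tokens: list     # discrete ids from the text or vision tokenizer

BOI, EOI = 151650, 151651    # hypothetical begin/end-of-image markers

def pack_window(segments, max_len=32768):
    """Flatten alternating text/keyframe segments into one training window."""
    window = []
    for seg in segments:
        piece = seg.tokens if seg.kind == "text" else [BOI, *seg.tokens, EOI]
        if len(window) + len(piece) > max_len:
            break                     # a real pipeline would start a new window here
        window.extend(piece)
    return window

clip = [
    Segment("text",  [11, 12, 13]),       # "add olive oil"
    Segment("image", list(range(1024))),  # ~1 k visual tokens for the pan keyframe
    Segment("text",  [14, 15, 16]),       # "swirl to coat"
    Segment("image", list(range(1024))),  # shiny-surface keyframe
]
print(len(pack_window(clip)))             # 2058 tokens in this toy clip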

3.2 Paired & Synthetic Data

  • Image-text pairs re-captioned by Qwen2.5-VL for denser nouns and attributes.
  • Synthetic T2I outputs from open-source diffusion models augment artistic styles.
  • Video-text pairs motion-scored to keep dynamic clips; packed chronologically to mimic long sequences.

3.3 Text-Only (≈ 18 %)

Keeps linguistic reasoning alive while visual tokens dominate compute; loss re-weighted 0.5 : 1 for vision : text.
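
A minimal sketch of what that 0.5 : 1 re-weighting could look like in code, assuming a per-token cross-entropy and a boolean mask marking visual positions (both assumptions; the training code itself is not part of this release):

import torch
import torch.nn.functional as F

def weighted_next_token_loss(logits, targets, is_visual, vision_weight=0.5):
    """Standard next-token cross-entropy with visual positions down-weighted 0.5:1.

    logits:    (seq_len, vocab_size) model outputs
    targets:   (seq_len,) next-token ids
    is_visual: (seq_len,) bool mask, True where the target is a visual token
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(is_visual,
                          torch.full_like(per_token, vision_weight),
                          torch.ones_like(per_token))
    return (weights * per_token).sum() / weights.sum()

# Toy usage: 6 positions, the last 4 predict visual tokens.
logits = torch.randn(6, 1000)
targets = torch.randint(0, 1000, (6,))
is_visual = torch.tensor([False, False, True, True, True, True])
print(weighted_next_token_loss(logits, targets, is_visual).item())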


4. Post-Training Magic: RL That Knows Beauty, OCR, Physics

Core question: “How do you refine a generic generative model into a polished product?”

One-sentence answer: Large-scale reinforcement learning with a multi-component reward signal—general aesthetics, task-specific metrics (OCR accuracy, face similarity, layout fidelity), all normalized and combined into one scalar reward.

  • RL algorithm: Group Relative Policy Optimisation (GRPO), batch 640, lr 1×10⁻⁶.
  • Prompt pool: 100 k high-quality, 58 k X2I, 50 k T2I, plus 1 k human preference samples.
  • Reward components:
    • CLIP & SigLIP alignment
    • Aesthetic scorer
    • OCR & layout reward for text-rich generation
    • Face identity for human edits
    • Physics & object consistency judged by VLM probes

Scenario:
Prompt: “Replace the coffee mug with a glass of iced tea, keep hand pose.”
Without RL, fingers often mutate; with RL the hand IoU reward term pushes finger shape fidelity from 0.71 → 0.87 in internal tests.

Author’s reflection: Reward engineering felt like tuning a 10-band equaliser—boost OCR too much and landscapes over-saturate; crank face similarity alone and colours drift. The sweet spot sat at equal weight after min-max normalisation.
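
As a concrete illustration of "equal weight after min-max normalisation", here is a small sketch; the component names and scores are made up, and the production reward model is certainly more involved than a plain mean.

def combine_rewards(raw_scores):
    """Min-max normalize each reward component across the batch, then average them."""
    n = len(next(iter(raw_scores.values())))
    normalized = {}
    for name, scores in raw_scores.items():
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0                    # guard against a constant component
        normalized[name] = [(s - lo) / span for s in scores]
    # Equal weights after normalisation: a plain mean across components.
    return [sum(vals[i] for vals in normalized.values()) / len(normalized)
            for i in range(n)]

batch = {
    "aesthetics":   [0.62, 0.81, 0.55, 0.90],      # toy scores for 4 rollouts
    "ocr_accuracy": [0.95, 0.40, 0.88, 0.71],
    "face_id":      [0.77, 0.79, 0.60, 0.83],
}
print(combine_rewards(batch))                      # one scalar reward per rollout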


5. DiDA: Waiting 10 s Instead of 200 s for a 1024×1024 Image

Core question: “Can autoregressive ever be fast enough for production?”

One-sentence answer: Discrete Diffusion Adaptation keeps the pretrained weights frozen, adds lightweight adapter layers, and turns sequential visual decoding into parallel denoising—20× faster with no drop in benchmark scores.

  • Training data: self-distillation on 3 B image-text tokens; noisy image tokens attend bidirectionally, text stays causal (see the mask sketch below).
  • Scheduler: a finite-state machine inside FlagScale manages text-phase ↔ image-phase kernel launches and memory pre-allocation.
  • Quantisation: FP8-ready, 50 % kernel-latency cut on a 4-GPU node.
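
A toy version of that hybrid mask, assuming we can label each position as text or image; the kernels, block layout, and special tokens of the real FlagScale implementation are not described in this article.

import torch

def dida_attention_mask(kinds):
    """Boolean mask (True = may attend): causal everywhere, except that positions
    inside the image block being denoised attend to each other bidirectionally."""
    n = len(kinds)
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))   # causal baseline
    image_pos = [i for i, k in enumerate(kinds) if k == "image"]
    for i in image_pos:
        for j in image_pos:
            mask[i, j] = True            # full attention among image tokens
    return mask

# A short text prompt followed by an image being refined in parallel.
print(dida_attention_mask(["text", "text", "text", "image", "image", "image"]).int())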

Scenario:
A marketing team needs 500 banner variants (product + slogan) by end-of-day. DiDA lets one 8-GPU node finish the job in under 45 min; vanilla AR would take overnight.


6. Native Abilities in Action

6.1 Text-to-Image & Text-Rich Generation

  • TIIF-Bench mini: best average score among open and closed models.
  • LeX-Bench (hard): 0.87 vs Gemini 2.5 Flash 0.74.
  • Outputs accurate Chinese poems, mathematical formulas, dense signage.

6.2 Any-to-Image (X2I) Editing

Single model handles: add/remove object, material swap, pose change, style transfer, background replacement, multi-subject consistency.
Benchmark highlights:
ImgEdit overall 4.41, GEdit-Bench 7.59, OmniContext 8.82, ICE-Bench 0.637—leading or on par with Gemini 2.5 Flash & GPT-image-1.

Use-case:
E-commerce vendor uploads dress photo + text “put the same dress on a model walking in Paris at sunset” → one prompt, one model, 12 s, ready for listing.

6.3 Visual Narrative

Zero-shot creation of 5–12 panel stories, characters locked, style free (anime, vintage photo, 3-D render). Automatic global & local chain-of-thought annotations keep plot coherent.

6.4 Visual Guidance

From a single reference image the model writes and illustrates a step-by-step tutorial: cooking, DIY, software workflows. Evaluated on 960 k real-world instruction clips; wins 51.5 % against Gemini in blind preference test.

6.5 World Exploration

Two modes:

  • User-Interactive: each text command moves the camera; the model returns a new view + narration.
  • Free-Exploration: the model auto-generates a continuous walk-through.

Spatial consistency is maintained over 40+ steps; win-rate 65.5 % vs Gemini.

6.6 Embodied Manipulation

Decomposes long-horizon tasks into language instructions + key-frame images. Tested on 331 synthetic and real robot scenes; beats Gemini on physical plausibility and background consistency.


7. Local Inference: 30-Minute Walk-Through

Step 1 — Clone & Install

git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation

Step 2 — Edit config

# configs/config.py
model_path = "BAAI/Emu3.5"
vq_path    = "BAAI/Emu3.5-VisionTokenizer"
task_type  = "t2i"        # or x2i, story, howto, explore, vla
use_image  = False        # True if you supply reference images
sampling_params = dict(
    temperature=0.7,
    cfg=3.0,
    top_p=0.9
)

Step 3 — Run

python inference.py --cfg configs/config.py
# outputs/<exp_name>/proto/*.proto

Step 4 — Visualise

python src/utils/vis_proto.py \
  --input outputs/demo/proto/00001.proto \
  --output viz/

Hardware note: ≥2 GPUs recommended; FlagScale handles tensor & context parallelism automatically.
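
For reference, a config variant for an editing (X2I) run might look like the sketch below. The first five keys mirror the config shown in Step 2; the reference_image key is a guess for illustration only, so check the example configs shipped with the repo for the exact field name your checkout expects.

# configs/config.py — hypothetical X2I variant (field names beyond Step 2 are guesses)
model_path = "BAAI/Emu3.5"
vq_path    = "BAAI/Emu3.5-VisionTokenizer"
task_type  = "x2i"                    # editing instead of plain text-to-image
use_image  = True                     # we supply a reference image
reference_image = "assets/dress.jpg"  # hypothetical key; see the repo's example configs
sampling_params = dict(
    temperature=0.7,
    cfg=3.0,                          # classifier-free guidance; 2–5 is the suggested range
    top_p=0.9,
)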


8. Benchmarks at a Glance

Task | Dataset | Metric | Emu3.5 | Best competitor | Competitor score | Gap
T2I | GenEval | overall | 0.86 | FLUX.1 dev | 0.71 | +21 %
Text render | LeX hard | recall | 0.87 | Gemini 2.5 | 0.74 | +18 %
Editing | ImgEdit | overall | 4.41 | Gemini 2.5 | 4.28 | +3 %
Subject-driven | OmniContext | avg | 8.82 | GPT-4o | 8.80 | +0.02
Story | OpenING | win % | 49.2 | Gemini 2.5 | 40.5 | +8.7
Guidance | internal | win % | 51.5 | Gemini 2.5 | 39.0 | +12.5
World exploration | internal | win % | 65.5 | Gemini 2.5 | 34.5 | +31
Robot plan | internal | win % | 67.1 | Gemini 2.5 | 30.5 | +36.6

9. Limitations & What’s Next

  • Token compression still ~1 k for 512×512; target 256.
  • Video clips limited to 5 s; longer narratives need sparser key-frame scheduling.
  • Outdoor robot data scarce; sim-to-real gap remains.
  • Chinese long-text rendering slightly behind English; more OCR reward data planned.

10. Action Checklist

  1. Check GPU count (≥2 recommended).
  2. git clone repo, install wheels including flash-attn.
  3. Download weights from Hugging Face (BAAI/Emu3.5, BAAI/Emu3.5-VisionTokenizer).
  4. Pick task in config: t2i, x2i, story, howto, explore, vla.
  5. Set use_image=True if you supply a reference image.
  6. Launch inference.py; monitor protobuf outputs.
  7. Convert proto to jpg/png with vis_proto.py.
  8. Tweak cfg 2–5 and temperature 0.5–1.0 for quality vs diversity.
  9. For speed-critical prod, request DiDA checkpoint and enable FP8.
  10. Keep prompts < 800 tokens; leave 2 k tokens for visual output to avoid truncation.

11. One-Page Overview

  • Emu3.5 = one 34 B autoregressive model that eats and produces mixed image-text sequences.
  • Trained on 13 T tokens (video, image, text) with standard next-token loss; vision loss weighted 0.5.
  • RL stage uses blended rewards (aesthetic, OCR, face, physics) to polish generation.
  • DiDA adapter delivers a 20× speed-up; a 1024×1024 image in ~10 s on 4×A100.
  • Zero-shot abilities: T2I, X2I editing, visual stories, step-by-step tutorials, world exploration, robot task plans.
  • Open-source: weights, tokenizer, inference stack under a permissive licence.
  • Hardware: FP16 needs 2×40 GB; FP8 fits 2×24 GB.
  • Leaderboards: leads or matches Gemini 2.5 Flash, FLUX.1, GPT-image-1 on core image tasks and surpasses them on interleaved generation.

12. FAQ

Q1. Do I need to retrain for my domain images?
No. X2I and story tasks are zero-shot; provide reference images and text prompts.

Q2. Is commercial use allowed?
Yes, the licence is MIT-compatible, but ensure your input data complies with local rules.

Q3. How big is the download?
~70 GB for the 34 B model, ~2 GB for the tokenizer, ~1 GB code.

Q4. Can DiDA run on a single GPU?
Technically yes, but 20× speed-up shows best with ≥4 GPUs and FlagScale hybrid parallelism.

Q5. What’s the max output resolution?
2048×2048 px tested; 4096 px works but needs >8 k visual tokens—slow on AR, okay on DiDA.

Q6. Does it do video generation?
Up to 5 s clips at 720 p via diffusion-based video decoder conditioned on key-frame tokens—still research stage.

Q7. How do I control randomness?
Use temperature 0.3–1.0 and top_p 0.9; classifier-free guidance cfg 2–5 for image fidelity.

Q8. Is there a smaller model?
Not yet; the team plans 8 B and 3 B distillations later this year.
