Emu3.5 in Plain English: One Autoregressive Model for Images, Text, and World Simulation
What’s the big deal?
Emu3.5 treats images, text, and video frames as one long token stream and learns to predict the next token—nothing else. The result is a single checkpoint that can chat, draw, edit, tell stories, give step-by-step visual tutorials, explore imaginary worlds, and even plan robot actions—without any task-specific heads.
Table of Contents
- Quick Glance
- Why “Next Token” Works for Pictures
- Training Diet: 13 Trillion Multimodal Tokens
- Post-Training Magic: RL That Knows Beauty, OCR, Physics
- DiDA: Waiting 10 s Instead of 200 s for a 1024×1024 Image
- Native Abilities in Action
- Local Inference: 30-Minute Walk-Through
- Benchmarks at a Glance
- Limitations & What’s Next
- Action Checklist
- One-Page Overview
- FAQ
1. Quick Glance
| Spec | Value | 
|---|---|
| Parameters | 34.1 B (31.2 B transformer, 2.9 B embeddings) | 
| Visual tokens | 131 k code-book, 16× down-sampling | 
| Text tokens | 151 k (Qwen base) | 
| Context length | 32 768 | 
| Pre-train data | 13 T tokens (10 T stage-1 + 3 T stage-2) | 
| RL data | 100 k prompts across 7 task families | 
| Inference speed | 20× faster with DiDA (≈ 10 s for 1024×1024) | 
| Output resolution | Up to 2048×2048 px, any aspect ratio | 
| Code & weights | Fully open-source, MIT-style licence | 
2. Why “Next Token” Works for Pictures
Core question: “How can predicting the next discrete token create a coherent image?”
One-sentence answer: Every image is first quantized into a short sequence of visual tokens; once the model learns to predict the next visual token just like the next word, generation becomes a unified streaming problem.
- Tokenizer: IBQ framework, 455 M params, 16× down-sampling → a 512×512 image ≈ 1 k tokens (worked out below).
- Compression vs. quality: the vanilla decoder already beats earlier 4×-larger codes; an optional diffusion decoder doubles resolution and restores fine text.
- Attention mask: causal for text and for inter-image relations, but bidirectional inside one image during DiDA—parallel refinement without breaking the autoregressive property across modalities.
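To make the compression figures concrete, here is a quick back-of-the-envelope calculation of how many visual tokens an image costs at 16× spatial down-sampling. It assumes one token per latent cell and ignores any special or row-break tokens the real sequence format may add.

```python
# Back-of-the-envelope: visual token count at 16x spatial down-sampling.
# Assumes one token per latent cell; special / row-break tokens are ignored.
def visual_token_count(width: int, height: int, downsample: int = 16) -> int:
    return (width // downsample) * (height // downsample)

for side in (512, 1024, 2048):
    print(f"{side}x{side} -> ~{visual_token_count(side, side):,} visual tokens")
# 512x512   -> ~1,024  visual tokens  (the "≈ 1 k tokens" quoted above)
# 1024x1024 -> ~4,096  visual tokens
# 2048x2048 -> ~16,384 visual tokens
```

Even a 2048×2048 output stays well under the 32 768-token context alongside a prompt, which helps explain why full-resolution images can be emitted without windowing tricks.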
Scenario:
I give the model a 256×256 crop of a busy kitchen and the text:
“Continue the comic in vintage style: the chef flips pancakes, flame jumps up.”
Emu3.5 streams 2 k new tokens; 8 s later I receive a 4-panel comic, chef silhouette and flame hue perfectly style-consistent. No ControlNet, no U-Net, just next-token.
Author’s reflection: The first time I saw a single loss curve drop for both Shakespeare and 4 k images, I finally believed modality is just an illusion of format.
3. Training Diet: 13 Trillion Multimodal Tokens
Core question: “What exactly was fed into the model?”
One-sentence answer: 63 M long videos, 500 M image-text pairs, 27 M any-to-image samples, and 3 T plain text tokens—heavily filtered, re-captioned, and packed into 32 k-token windows.
3.1 Video-Interleaved Corpus (≈ 55 % of tokens)
- Average 6.5 min clips, 0.27 keyframes/s, Whisper-v2 transcripts aligned to 0.1 s.
- Basic filters: remove clips below 480 p, talking-head clips, and multilingual or silent outliers.
- Advanced filters: DeQA perceptual score, DINO feature de-duplication, LLM text grade.
- Second-stage enrichment: scene summaries, visual captions, multimodal abstracts.
Scenario:
A 3-min YouTube cooking excerpt becomes 48 keyframes + 640 text tokens. During training the model sees:
[TEXT]"add olive oil" → [IMAGE: pan] → [TEXT]”swirl to coat” → [IMAGE: shiny surface]…
It learns temporal causality—useful later for Visual Guidance and robot key-frame planning.
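As a rough illustration of how such a clip could be flattened into one 32 k-token training window, here is a minimal packing sketch. The segment layout, truncation policy, and toy token counts are assumptions made for illustration, not the actual Emu3.5 data pipeline.

```python
# Minimal sketch: pack an interleaved clip (text + image segments, in temporal
# order) into one fixed-length window. Truncation policy is an assumption.
from typing import List, Tuple

WINDOW = 32_768  # model context length

def pack_clip(segments: List[Tuple[str, List[int]]], window: int = WINDOW) -> List[int]:
    """segments: [('text', token_ids), ('image', token_ids), ...]"""
    packed: List[int] = []
    for _kind, tokens in segments:
        if len(packed) + len(tokens) > window:
            break  # stop rather than split a segment mid-image
        packed.extend(tokens)
    return packed

# Toy clip: caption, keyframe, caption, keyframe (~1 k tokens per keyframe).
clip = [("text", [1] * 12), ("image", [2] * 1024),
        ("text", [3] * 8),  ("image", [4] * 1024)]
print(len(pack_clip(clip)))  # 2068 tokens in this toy example
```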
3.2 Paired & Synthetic Data
- Image-text pairs re-captioned by Qwen2.5-VL for denser nouns and attributes.
- Synthetic T2I outputs from open-source diffusion models augment artistic styles.
- Video-text pairs motion-scored to keep dynamic clips; packed chronologically to mimic long sequences.
3.3 Text-Only (≈ 18 %)
Keeps linguistic reasoning alive while visual tokens dominate compute; the next-token loss is re-weighted 0.5 : 1 for vision : text.
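A minimal sketch of that 0.5 : 1 re-weighting applied to an ordinary next-token cross-entropy loss, written in PyTorch purely for illustration (the actual training code is not reproduced here):

```python
import torch
import torch.nn.functional as F

def reweighted_next_token_loss(logits, targets, is_vision_token,
                               vision_weight=0.5, text_weight=1.0):
    """logits: [B, T, V]; targets: [B, T]; is_vision_token: [B, T] bool mask."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # [B*T, V]
        targets.reshape(-1),                   # [B*T]
        reduction="none",
    ).reshape(targets.shape)                   # back to [B, T]
    weights = torch.where(is_vision_token,
                          torch.full_like(per_token, vision_weight),
                          torch.full_like(per_token, text_weight))
    return (weights * per_token).sum() / weights.sum()
```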
4. Post-Training Magic: RL That Knows Beauty, OCR, Physics
Core question: “How do you refine a generic generative model into a polished product?”
One-sentence answer: Large-scale reinforcement learning with a multi-component reward signal—general aesthetics, task-specific metrics (OCR accuracy, face similarity, layout fidelity), all normalized and combined into one scalar reward.
- RL algorithm: Group Relative Policy Optimisation (GRPO), batch 640, lr 1×10⁻⁶.
- Prompt pool: 100 k high-quality prompts, 58 k X2I, 50 k T2I, plus 1 k human preference samples.
- Reward components:
  - CLIP & SigLIP alignment
  - Aesthetic scorer
  - OCR & layout reward for text-rich generation
  - Face identity for human edits
  - Physics & object consistency judged by VLM probes
Scenario:
Prompt: “Replace the coffee mug with a glass of iced tea, keep hand pose.”
Without RL, fingers often mutate; with RL the hand IoU reward term pushes finger shape fidelity from 0.71 → 0.87 in internal tests.
Author’s reflection: Reward engineering felt like tuning a 10-band equaliser—boost OCR too much and landscapes over-saturate; crank face similarity alone and colours drift. The sweet spot sat at equal weight after min-max normalisation.
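For intuition, here is a minimal sketch of the "equal weight after min-max normalisation" recipe: each reward component is normalised over a batch of rollouts and the normalised scores are averaged into one scalar per sample. The component keys and the toy numbers are illustrative assumptions.

```python
# Combine heterogeneous reward components into one scalar per rollout.
# Min-max statistics are taken over the batch; weights are equal, matching
# the "sweet spot" described above. Component keys are illustrative.
from typing import Dict, List

def min_max(values: List[float]) -> List[float]:
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # degenerate batch: no preference signal
    return [(v - lo) / (hi - lo) for v in values]

def combine_rewards(batch: List[Dict[str, float]]) -> List[float]:
    """batch: one dict of raw component scores per rollout."""
    keys = list(batch[0])
    normalized = {k: min_max([sample[k] for sample in batch]) for k in keys}
    return [sum(normalized[k][i] for k in keys) / len(keys)
            for i in range(len(batch))]

scores = [
    {"clip": 0.31, "aesthetic": 5.2, "ocr": 0.90, "face": 0.74},
    {"clip": 0.28, "aesthetic": 6.1, "ocr": 0.55, "face": 0.81},
    {"clip": 0.35, "aesthetic": 5.8, "ocr": 0.78, "face": 0.69},
]
print(combine_rewards(scores))  # one scalar reward per rollout
```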
5. DiDA: Waiting 10 s Instead of 200 s for a 1024×1024 Image
Core question: “Can autoregressive ever be fast enough for production?”
One-sentence answer: Discrete Diffusion Adaptation keeps the pretrained weights frozen, adds lightweight adapter layers, and turns sequential visual decoding into parallel denoising—20× faster with no drop in benchmark scores.
- Training data: self-distillation on 3 B image-text tokens; noisy image tokens attend bidirectionally while text stays causal (see the mask sketch after this list).
- Scheduler: a finite-state machine inside FlagScale manages text-phase ↔ image-phase kernel launches and memory pre-allocation.
- Quantisation: FP8-ready, with a 50 % kernel latency cut on a 4-GPU node.
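Below is a minimal sketch of the hybrid attention pattern implied by the first bullet: text positions stay causal, while tokens inside one image block attend to each other bidirectionally. The mask construction is an illustration of the idea, not the actual FlagScale kernel.

```python
import torch

def hybrid_mask(segment_ids: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
    """
    segment_ids: [T] int tensor, one id per interleaved segment (text or image).
    is_image:    [T] bool tensor, True where the token belongs to an image.
    Returns a [T, T] bool mask where True means "may attend".
    """
    T = segment_ids.size(0)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    same_image = (segment_ids[:, None] == segment_ids[None, :]) \
                 & is_image[:, None] & is_image[None, :]
    # Causal everywhere (text, and between different images),
    # plus full bidirectional attention within a single image block.
    return causal | same_image

# Example layout: [text text | image image image | text]
seg = torch.tensor([0, 0, 1, 1, 1, 2])
img = torch.tensor([False, False, True, True, True, False])
print(hybrid_mask(seg, img).int())
```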
Scenario:
A marketing team needs 500 banner variants (product + slogan) by end-of-day. DiDA lets one 8-GPU node finish the job in under 45 min; vanilla AR would take overnight.
6. Native Abilities in Action
6.1 Text-to-Image & Text-Rich Generation
- TIIF-Bench mini: best average score among open and closed models.
- LeX-Bench (hard): 0.87 vs Gemini 2.5 Flash 0.74.
- Outputs accurate Chinese poems, mathematical formulas, and dense signage.
6.2 Any-to-Image (X2I) Editing
Single model handles: add/remove object, material swap, pose change, style transfer, background replacement, multi-subject consistency.
Benchmark highlights:
ImgEdit overall 4.41, GEdit-Bench 7.59, OmniContext 8.82, ICE-Bench 0.637—leading or on par with Gemini 2.5 Flash & GPT-image-1.
Use-case:
E-commerce vendor uploads dress photo + text “put the same dress on a model walking in Paris at sunset” → one prompt, one model, 12 s, ready for listing.
6.3 Visual Narrative
Zero-shot creation of 5–12 panel stories, characters locked, style free (anime, vintage photo, 3-D render). Automatic global & local chain-of-thought annotations keep plot coherent.
6.4 Visual Guidance
From a single reference image the model writes and illustrates a step-by-step tutorial: cooking, DIY, software workflows. Evaluated on 960 k real-world instruction clips; wins 51.5 % against Gemini in blind preference test.
6.5 World Exploration
Two modes:
- User-Interactive: each text command moves the camera; the model returns a new view plus narration.
- Free-Exploration: the model auto-generates a continuous walk-through.
Maintains spatial consistency over 40+ steps; win-rate 65.5 % vs Gemini.
6.6 Embodied Manipulation
Decomposes long-horizon tasks into language instructions + key-frame images. Tested on 331 synthetic and real robot scenes; beats Gemini on physical plausibility and background consistency.
7. Local Inference: 30-Minute Walk-Through
Step 1 — Clone & Install
```bash
git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
```
Step 2 — Edit config
```python
# configs/config.py
model_path = "BAAI/Emu3.5"
vq_path    = "BAAI/Emu3.5-VisionTokenizer"
task_type  = "t2i"        # or x2i, story, howto, explore, vla
use_image  = False        # True if you supply reference images
sampling_params = dict(
    temperature=0.7,
    cfg=3.0,
    top_p=0.9
)
```
Step 3 — Run
```bash
python inference.py --cfg configs/config.py
# outputs/<exp_name>/proto/*.proto
```
Step 4 — Visualise
```bash
python src/utils/vis_proto.py \
  --input outputs/demo/proto/00001.proto \
  --output viz/
```
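If a run produces many .proto files, a small wrapper can convert them all at once. The script below simply invokes vis_proto.py once per file with the --input/--output flags shown in Step 4; it does not assume any built-in batch mode exists.

```python
# batch_vis.py -- convert every .proto produced by a run into images.
# Assumes vis_proto.py accepts --input/--output exactly as in Step 4.
import glob
import subprocess
import sys

run_dir = sys.argv[1] if len(sys.argv) > 1 else "outputs/demo"
for proto in sorted(glob.glob(f"{run_dir}/proto/*.proto")):
    subprocess.run(
        ["python", "src/utils/vis_proto.py", "--input", proto, "--output", "viz/"],
        check=True,
    )
```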
Hardware note: ≥2 GPUs recommended; FlagScale handles tensor & context parallelism automatically.
8. Benchmarks at a Glance
| Task | Dataset | Metric | Emu3.5 | Best Competitor | Gap | 
|---|---|---|---|---|---|
| T2I | GenEval | overall | 0.86 | FLUX.1 dev 0.71 | +21 % | 
| Text render | LeX hard | recall | 0.87 | Gemini 2.5 0.74 | +18 % | 
| Editing | ImgEdit | overall | 4.41 | Gemini 2.5 4.28 | +3 % | 
| Subject-driven | OmniContext | avg | 8.82 | GPT-4o 8.80 | +0.02 | 
| Story | OpenING | win % | 49.2 | Gemini 2.5 40.5 | +8.7 | 
| Guidance | internal | win % | 51.5 | Gemini 2.5 39.0 | +12.5 | 
| World exploration | internal | win % | 65.5 | Gemini 2.5 34.5 | +31 | 
| Robot plan | internal | win % | 67.1 | Gemini 2.5 30.5 | +36.6 | 
9. Limitations & What’s Next
- Token compression is still ~1 k tokens for a 512×512 image; the target is 256.
- Video clips are limited to 5 s; longer narratives need sparser key-frame scheduling.
- Outdoor robot data is scarce; a sim-to-real gap remains.
- Chinese long-text rendering is slightly behind English—more OCR reward data is planned.
10. Action Checklist
- Check GPU count (≥2 recommended).
- git clone the repo, install wheels including flash-attn.
- Download weights from Hugging Face (BAAI/Emu3.5, BAAI/Emu3.5-VisionTokenizer).
- Pick a task in the config: t2i, x2i, story, howto, explore, vla.
- Set use_image=True if you supply a reference picture.
- Launch inference.py; monitor the protobuf outputs.
- Convert proto files to jpg/png with vis_proto.py.
- Tweak cfg 2–5 and temperature 0.5–1.0 for quality vs. diversity (see the sweep sketch after this list).
- For speed-critical production, request the DiDA checkpoint and enable FP8.
- Keep prompts < 800 tokens; leave 2 k tokens for visual output to avoid truncation.
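If you want to map the quality-versus-diversity trade-off systematically, a simple grid over the cfg and temperature ranges from the checklist might look like the sketch below. The dict layout mirrors sampling_params in configs/config.py; how each setting is fed into inference.py (editing the config, a wrapper script, etc.) is left to your launcher.

```python
# Grid over the cfg / temperature ranges suggested in the checklist.
# Lower temperature and higher cfg favour fidelity; the opposite favours diversity.
from itertools import product

cfgs = [2.0, 3.0, 4.0, 5.0]
temperatures = [0.5, 0.7, 0.9]

sweep = [dict(temperature=t, cfg=c, top_p=0.9) for c, t in product(cfgs, temperatures)]
for params in sweep:
    print(params)
```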
11. One-Page Overview
- Emu3.5 = one 34 B autoregressive model that eats and produces mixed image-text sequences.
- Trained on 13 T tokens (video, image, text) with the standard next-token loss; vision loss weighted 0.5.
- The RL stage uses blended rewards (aesthetic, OCR, face, physics) to polish generation.
- The DiDA adapter delivers a 20× speed-up; a 1024×1024 image in ~10 s on 4×A100.
- Zero-shot abilities: T2I, X2I editing, visual stories, step-by-step tutorials, world exploration, robot task plans.
- Open-source: weights, tokenizer, and inference stack under a permissive licence.
- Hardware: fp16 needs 2×40 GB; fp8 fits 2×24 GB.
- Leader-boards: leads or matches Gemini 2.5 Flash, FLUX.1, and GPT-image-1 on core image tasks and surpasses them on interleaved generation.
12. FAQ
Q1. Do I need to retrain for my domain images?
No. X2I and story tasks are zero-shot; provide reference images and text prompts.
Q2. Is commercial use allowed?
Yes, the licence is MIT-compatible, but ensure your input data complies with local rules.
Q3. How big is the download?
~70 GB for the 34 B model, ~2 GB for the tokenizer, ~1 GB code.
Q4. Can DiDA run on a single GPU?
Technically yes, but 20× speed-up shows best with ≥4 GPUs and FlagScale hybrid parallelism.
Q5. What’s the max output resolution?
2048×2048 px tested; 4096 px works but needs >8 k visual tokens—slow on AR, okay on DiDA.
Q6. Does it do video generation?
Up to 5 s clips at 720 p via diffusion-based video decoder conditioned on key-frame tokens—still research stage.
Q7. How do I control randomness?
Use temperature 0.3–1.0 and top_p 0.9; classifier-free guidance cfg 2–5 for image fidelity.
Q8. Is there a smaller model?
Not yet; the team plans 8 B and 3 B distillations later this year.
