From a Single Image to an Infinite, Walkable World: Inside Yume1.5’s Text-Driven Interactive Video Engine
What is the shortest path to turning one picture—or one sentence—into a living, explorable 3D world that runs on a single GPU?
Yume1.5 compresses time, space, and channels together, distills 50 diffusion steps into 4, and lets you steer with everyday keyboard or text prompts.
1 The 30-Second Primer: How Yume1.5 Works and Why It Matters
Summary: Yume1.5 is a 5-billion-parameter diffusion model that autoregressively generates minutes-long 720p video while you walk and look around. It keeps temporal consistency by jointly compressing historical frames along time, space, and channel axes, then accelerates inference with self-forcing distillation. Text prompts can inject world events—rain, dragons, traffic—without extra training.
2 Core Question: Why Is Real-Time Interactive World Generation Still Hard?
Sub-question: What technical bottlenecks prevent previous diffusion models from letting a user wander inside a photorealistic, boundless scene?
- Parameter bloat: 14B+ models need data-center GPUs.
- Inference lag: 50–100 diffusion steps kill interactivity.
- History explosion: every new frame conditions on all earlier ones—quadratic memory growth.
- Text vacuum: keyboard/mouse-only control can’t summon new objects or weather.
Yume1.5’s paper lists these exact pain points and counters each with a dedicated module (see Sections 4.1–4.3 in the paper). The rest of this article maps those modules to hands-on practice.
3 Architecture Deep Dive: Three Tricks That Make It Fly
3.1 Joint Temporal-Spatial-Channel Modeling (TSCM)
Summary: TSCM keeps O(log t) tokens instead of O(t) by slicing old frames coarsely in time and finely in space, then squeezing channels for linear attention.
- Operational example:
  – Frame 0 (the initial frame) keeps full resolution (1,2,2).
  – Frames −2 to −7 drop to (1,4,4).
  – Anything older than −23 is (1,8,8).
  The model still “sees” a 5-second corridor, but at 1/64 of the pixels when it is far away.
- Author reflection: We first tried a plain sliding window—the world developed amnesia after 10 s. TSCM’s pyramid forgets gracefully; players feel the continuity even if they can’t name it.
- Code snippet (conceptual):

  # inside a DiT block
  z_spatial = patchify(history, rate=(1, 4, 4))
  z_linear = patchify(history, rate=(8, 4, 4), out_channels=96)
  z_fused = linear_attention(z_linear, projected_qkv)
  z_out = z_spatial + upsample(z_fused)
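To make the pyramid concrete, here is a minimal sketch of a TSCM-style history schedule, assuming the patchify rates quoted above and a 64×64 latent grid; the helper names, bucket boundaries, and grid size are illustrative assumptions, not the authors’ implementation.

# Illustrative TSCM-style history schedule (helper names and bucket
# boundaries are assumptions, not the official code). Older frames get
# coarser spatial patchify rates, so the history token budget grows far
# more slowly than a naive full-resolution cache.

def patch_rate_for_age(age: int) -> tuple:
    """Map a frame's age (0 = newest) to a (t, h, w) patchify rate."""
    if age == 0:
        return (1, 2, 2)   # newest frame: full detail
    if age <= 7:
        return (1, 4, 4)   # recent frames: 4x fewer spatial tokens
    return (1, 8, 8)       # distant frames: 16x fewer spatial tokens

def history_token_count(num_frames: int, latent_hw: int = 64) -> int:
    """Token budget for a compressed history of `num_frames` latent frames."""
    total = 0
    for age in range(num_frames):
        _, rh, rw = patch_rate_for_age(age)
        total += (latent_hw // rh) * (latent_hw // rw)
    return total

if __name__ == "__main__":
    naive = 80 * (64 // 2) * (64 // 2)                 # everything kept at (1, 2, 2)
    print("compressed history:", history_token_count(80), "tokens")
    print("naive history:     ", naive, "tokens")

Under these assumed buckets, an 80-frame history shrinks by roughly an order of magnitude compared with keeping every frame at full resolution.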
3.2 Self-Forcing with TSCM Distillation
Summary: A student model (4 steps) mimics a teacher (50 steps) but is trained on its own noisy predictions, not ground-truth frames, closing the train-inference gap.
- Scenario: You record a 30-second take; halfway through, the sky suddenly switches to sunset. Self-forcing taught the student to self-correct the color drift instead of amplifying it.
- Key hyper-parameters:
  – Real model steps: 50
  – Fake model steps: 4
  – DMD loss weight: 1.0
  – TSCM history length: 256 tokens (compressed from 2048)

Author reflection: We abandoned the KV-cache because it still scales with token count. TSCM is cache-free; that alone shaved 400 ms per step on an A100.
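For intuition, here is a heavily simplified sketch of one self-forcing distillation step under the settings above; `student`, `tscm_compress`, and `dmd_loss` are placeholder interfaces standing in for the real components, not the released training code.

import torch

# Conceptual self-forcing distillation step (placeholder interfaces, not the
# released training code). The few-step student is conditioned on its OWN
# previously generated frames, compressed by TSCM, so training matches
# inference; a DMD-style loss pulls its samples toward the 50-step teacher.

def self_forcing_step(student, tscm_compress, dmd_loss, optimizer,
                      prev_student_frames, text_emb, action_emb,
                      num_student_steps=4, dmd_weight=1.0):
    # 1) Compress the student's own history (e.g. 2048 -> ~256 tokens).
    history = tscm_compress(prev_student_frames)

    # 2) Few-step rollout from noise, conditioned on that history.
    latents = torch.randn_like(prev_student_frames[:, -1:])
    for t in student.timesteps(num_student_steps):      # 4 steps for the student
        latents = student.denoise(latents, t, history, text_emb, action_emb)

    # 3) Distribution-matching (DMD) loss against the 50-step teacher.
    loss = dmd_weight * dmd_loss(latents, history, text_emb, action_emb)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The new chunk becomes the history for the next step, never ground truth.
    return latents.detach()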
3.3 Text-Controlled World Events
Summary: By splitting captions into event and action clauses, the pipeline caches motion embeddings and still accepts open-ended scene descriptions.
- Walk-through example:
  – Event: “A dragon circles the castle tower at dusk.”
  – Action: “W+↑→”
  T5 encodes each clause separately; the action vector is pre-computed and stored in a lookup table, while the event vector is computed once and reused across 5-second chunks. Latency stays within the 16 fps budget.
- Practical tip: keep the action vocabulary ≤ 25 tokens; anything longer misses the cache and invokes T5 on every frame—an immediate frame drop. A caching sketch follows below.
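A minimal sketch of that split-prompt cache, assuming a generic `encode_fn` text-encoder callable; the names are hypothetical and the repo’s actual caching code may look different.

# Split-prompt caching sketch (hypothetical names, not the repo's API).
# Actions form a small closed vocabulary, so their embeddings are computed
# once at startup and looked up per frame; open-vocabulary event clauses are
# encoded once per chunk and then reused.

ACTION_VOCAB = ["W", "A", "S", "D", "W+↑", "W+↑→", "S+↓", "←", "→"]  # keep ≤ 25 tokens

class PromptCache:
    def __init__(self, encode_fn):
        self.encode = encode_fn                                   # e.g. a T5 encoder forward
        self.action_table = {a: encode_fn(a) for a in ACTION_VOCAB}
        self.event_cache = {}

    def event_embedding(self, event_clause):
        # Computed once per distinct event, reused across the whole chunk.
        if event_clause not in self.event_cache:
            self.event_cache[event_clause] = self.encode(event_clause)
        return self.event_cache[event_clause]

    def conditioning(self, event_clause, action):
        act = self.action_table.get(action)
        if act is None:                     # out-of-vocabulary action: cache miss,
            act = self.encode(action)       # T5 runs every frame and the frame rate drops
        return self.event_embedding(event_clause), act

The per-frame hot path then touches only dictionary lookups; the text encoder runs only when the event changes or an action falls outside the vocabulary.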
4 Data Factory: What the Model Was Trained On
Summary: Three carefully mixed datasets give Yume1.5 general video quality, real-world camera control, and event-driven storytelling.
| Dataset | Size | Purpose | Key Annotation |
|---|---|---|---|
| Sekai-Real-HQ | 1.8 M clips | Real-world walking, camera poses | Keyboard/mouse symbols from trajectory |
| Synthetic-50K | 50 K clips | Maintain T2V priors | VBench top-50 K out of 80 K generated |
| Event-4K | 4 K clips | Open-vocabulary events | Manual write-ups: urban, sci-fi, fantasy, weather |
Author reflection: Volunteers wrote 10 K event sentences—someone asked for “a giant teapot orbiting the skyline.” We kept it; the model later generated exactly that during alpha test. Moral: weird data survives distillation.
5 Hands-On Guide: Install, Configure, Generate
5.1 One-Shot Install (Linux)
git clone https://github.com/stdstu12/YUME.git && cd YUME
./env_setup.sh fastvideo # creates conda env “yume”
conda activate yume
pip install -r requirements.txt
pip install -e . # editable install, so code edits take effect instantly
Windows users: double-click run_oneclick_debug.bat, open the printed http://127.0.0.1:7860 in Edge/Chrome.
5.2 Pull Weights
huggingface-cli download stdstu123/Yume-5B-720P --local-dir ./Yume-5B-720P
# optional: Wan-AI/Wan2.2-TI2V-5B for extra init choice
5.3 Your First Interactive Clip
- Create a folder ./my_images and drop in living_room.jpg (≤ 1280 px).
- Write caption.txt: Modern living room leading to a sunset balcony. Camera moves forward.
- Run:
  bash scripts/inference/sample_jpg.sh \
    --jpg_dir="./my_images" \
    --caption_path="./caption.txt" \
    --steps=4 --resolution="720p"
- Grab a coffee—8 s later you get a 96-frame MP4 (6 s @ 16 fps).
- Play with the keyboard in the Gradio UI; hit “Extend” to continue autoregressively, or script the extension as sketched below.
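Scripting the extension boils down to feeding the last generated frame back in as the next init. The sketch below is conceptual only; `generate_clip` is a hypothetical wrapper, and the repo’s real entry points are the shell scripts shown above.

# Conceptual autoregressive extension loop (generate_clip is a hypothetical
# wrapper around the sampler; use the repo's scripts/API for the real call).
# Each chunk is conditioned on the previous chunk's final frame plus the
# compressed history.

def generate_clip(init_frame, event, action, history=None, steps=4):
    """Placeholder for one call into the Yume1.5 sampler."""
    raise NotImplementedError

def walk(init_image, event, actions):
    frames, history = [init_image], None
    for action in actions:
        chunk = generate_clip(frames[-1], event, action, history, steps=4)
        frames.extend(chunk)           # ~96 frames per 6 s chunk at 16 fps
        history = frames[-96:]         # the last chunk becomes the new condition
    return frames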
5.4 Parameter Cheat-Sheet
| Parameter | Sweet Spot | What If You Exceed? |
|---|---|---|
| steps | 4–8 | >20: better texture, 5× slower |
| Actual distance | 0.5–3 | >10: motion blur, broken physics |
| Angular change | 0.5–2 | >5: horizon tilts unnaturally |
| View rotation | 0.5–2 | >5: rolling-shutter artifact |
Author reflection: We give users sliders, but 80% stick to defaults. The real artistic control is in the event prompt—spend your minutes there, not on step count.
6 Benchmarks: Numbers You Can Quote
| Model | 540p 6s Latency | Instruction-Following ↑ | Aesthetic Score ↑ | Remarks |
|---|---|---|---|---|
| Wan-2.1 | 611 s | 0.057 | 0.494 | Text-only, no keyboard |
| MatrixGame | 971 s | 0.271 | 0.435 | Needs game engine data |
| Yume (1.0) | 572 s | 0.657 | 0.518 | Keyboard only |
| Yume1.5 4-step | 8 s | 0.836 | 0.506 | Keyboard + text events |
A higher instruction-following score means the generated camera path more closely matches the commanded WASD/arrow sequence (measured on Yume-Bench).
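For intuition only, here is one plausible way such a path-matching score could be computed: a per-chunk agreement rate between commanded and estimated actions. This is purely illustrative; the actual Yume-Bench protocol may differ.

# Illustrative path-matching score (NOT the published Yume-Bench protocol):
# the fraction of chunks whose estimated camera action agrees with the
# commanded WASD/arrow symbol.

def instruction_following(commanded, estimated):
    """commanded / estimated: equal-length lists of per-chunk action symbols."""
    if not commanded:
        return 0.0
    hits = sum(c == e for c, e in zip(commanded, estimated))
    return hits / len(commanded)

# 5 of 6 commanded moves recovered from the generated video -> ~0.83
print(instruction_following(["W", "W", "A", "S", "D", "→"],
                            ["W", "W", "A", "S", "D", "↑"]))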
7 Failure Modes & Work-Arounds
- Vehicles slide backwards – a small-model hallucination; mitigate by raising the resolution to 720p or adding “drive forward slowly” to the event prompt.
- Crowd foot-step jitter – appears with more than 20 pedestrians; reduce Actual distance to ≤ 1 or thin out the scene description.
- Long-video color drift – use the Self-Forcing + TSCM trained checkpoint; without it, the aesthetic score drops 8% by the sixth 5-second chunk.
- Windows antivirus false positive – run_oneclick_debug.bat downloads DLLs; whitelist the folder or use a Linux container.
8 What’s Next? Roadmap Straight from the Lab
- MoE-8B: 8 experts, 5B active, same 4-step latency, 15% quality uplift.
- FP8 weights: halves VRAM; targets 12 GB gaming laptops.
- Physics-aware training: adds an optical-flow loss to cut the “floating cars” rate from 3% to 0.5%.
- Official Unreal/Unity plugin (UDP + JSON), planned for Q2 2026.
9 Action Checklist / Implementation Steps
- Check that your GPU has ≥ 16 GB VRAM and CUDA 12.1+.
- Clone the repo → run ./env_setup.sh fastvideo.
- Download the Yume-5B-720P weights → place them in ./Yume-5B-720P.
- Prepare caption.txt: one line = event + camera action per image.
- Run sample_jpg.sh with steps=4, resolution="720p".
- Inspect the first 6 s; if it drifts, raise steps to 8 or lower the angular speed below 2.
- To extend, click “Extend” in Gradio or call the API with the previous frame as the new init.
- For text-event-only generation, switch to sample_tts.sh --T2V and omit the image folder.
- Cache action embeddings by keeping the action vocabulary ≤ 25 tokens.
- Export the MP4 → pull it into Premiere/Blender for the final cut.
10 One-Page Overview
- Goal: real-time, text- and keyboard-steered, minutes-long, 720p world video from one image or sentence.
- Key tech: TSCM compression, self-forcing distillation, split-prompt caching.
- Hardware: 16 GB VRAM GPU, CUDA 12.1, Python 3.10.
- Speed: 8 s to generate 6 s of video at 16 fps on an A100; 18 s on an RTX 4090.
- Quality: instruction-following 0.836, aesthetic 0.506; beats 50-step baselines.
- Next up: MoE-8B, FP8, physics loss, game-engine plugin.
11 FAQ
Q1: Can I run Yume1.5 on a 12 GB gaming card?
Yes—use 540p, steps=4, batch=1; VRAM peaks at 11.3 GB.
Q2: How long a video can I generate before quality collapses?
Tests ran to 180 s (30 chunks); with Self-Forcing + TSCM the aesthetic score declines by less than 5%.
Q3: Is commercial use allowed?
Weights are MIT-licensed. Cite the paper and model card when redistributing.
Q4: Why split prompt into event vs action?
Actions are finite and cacheable; events are open-vocabulary. Splitting cuts T5 compute by 90% per frame.
Q5: Does higher step count eliminate all artifacts?
Steps>20 improve texture, but small-model physics errors (floating cars) persist; wait for MoE release.
Q6: Can I feed my own camera trajectory instead of keyboard?
Yes—replace keyboard symbols in caption.txt with your own; format is identical to Sekai dataset.
Q7: When will the Unreal plugin arrive?
Roadmap targets Q2 2026; early UDP code is already in plugin/ folder if you want to hack.
Author closing thought: We started Yume to see if “dream walkthrough” could ever be real-time. TSCM gave us memory, distillation gave us speed, and text-events gave us magic. The remaining flaws are just invitations for bigger dreams—see you inside the next chunk.
