“Here’s my passport photo—turn it into a 4-second Tokyo night-rain scene, 24 fps, no budget.”
If that request sounds familiar, the engineering story below is worth frame-by-frame inspection.
The Identity Problem No One Has Solved (Yet)
Text-to-video models have gotten stunningly good at motion, yet one stubborn element refuses to behave: the human face.
- DreamBooth fans fine-tune 10 GB of weights; motion turns to PowerPoint.
- Frame-by-frame stylists melt GPUs and still twitch the chin.
- Copy-paste crews swap backgrounds, but the first head-turn shatters the illusion.
Lynx’s take? Keep the giant frozen, clip on two tiny cheat-sheets.
An ID-Adapter memorizes the facial features, a Ref-Adapter memorizes pores and lighting. Zero base-model tuning, yet the actor can stroll from Shibuya Crossing to a Sahara sunset without dropping a frame.
Architecture at a Glance: 14 B Frozen + 2.8 % Learnable
| Module | Job | Params | Trainable |
|---|---|---|---|
| Wan2.1-DiT 14 B | Spatio-temporal generation | 14 B | ❄️ Frozen |
| ID-Adapter | 512-D face embedding → 16 identity tokens | 180 M | ✅ |
| Ref-Adapter | Per-layer VAE detail injection | 220 M | ✅ |
Mental formula:
Frozen backbone = generic motion; plug-in adapters = face stays un-morphed.
Dual adapters slide into every DiT block, cross-attending on par with text tokens.
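As a mental model, the block-level wiring might look like the following PyTorch sketch; the class names, dimensions, and call signatures are illustrative assumptions, not the released Wan2.1 or Lynx code.

```python
import torch
import torch.nn as nn

class AdapterCrossAttention(nn.Module):
    """Learnable cross-attention from video tokens to adapter tokens
    (identity tokens or reference features). Illustrative layout."""
    def __init__(self, dim: int = 5120, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, adapter_tokens):
        # Residual add: if the adapter contributes nothing, the frozen backbone is untouched.
        out, _ = self.attn(self.norm(video_tokens), adapter_tokens, adapter_tokens)
        return video_tokens + out

class DiTBlockWithAdapters(nn.Module):
    """A frozen backbone block plus two trainable adapter attentions (hypothetical)."""
    def __init__(self, frozen_block: nn.Module, dim: int = 5120):
        super().__init__()
        self.frozen_block = frozen_block.requires_grad_(False)  # backbone stays frozen
        self.id_attn = AdapterCrossAttention(dim)   # attends to the identity/register tokens
        self.ref_attn = AdapterCrossAttention(dim)  # attends to reference detail features

    def forward(self, video_tokens, text_tokens, id_tokens, ref_feats):
        x = self.frozen_block(video_tokens, text_tokens)  # original motion + text attention
        x = self.id_attn(x, id_tokens)    # identity semantics ("who")
        x = self.ref_attn(x, ref_feats)   # pore-level detail ("how it looks")
        return x
```

Only the two adapter attentions carry gradients, which is how the trainable fraction stays at roughly 2.8 % of the full model.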
How 16 Tokens Remember a Whole Face
- Passport photo → ArcFace → 512-D vector.
- A Perceiver Resampler (a.k.a. Q-Former) learns 16 queries → a 16 × 5120 matrix.
- Append 16 register tokens (an anti-overfit trick) → 32 × 5120.
- Cross-attend with the video patches in every block.
Why compress?
Cross-attention cost scales as O(L·N) in the context length L, so a shorter L means cheaper VRAM. Sixteen tokens carry the semantics of a roughly 50 k-pixel face and add only about 4 % latency at inference.
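Under those assumptions, the compression step might look roughly like this; the 512/5120/16/16 numbers come from the text above, while the depth, head count, and initialisation are made up for the sketch.

```python
import torch
import torch.nn as nn

class IDResampler(nn.Module):
    """Perceiver-style resampler: 512-D ArcFace embedding -> 16 identity tokens,
    plus 16 register tokens. Illustrative, not the released Lynx module."""
    def __init__(self, face_dim=512, token_dim=5120, n_queries=16, n_registers=16, depth=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, token_dim) * 0.02)
        self.registers = nn.Parameter(torch.randn(n_registers, token_dim) * 0.02)
        self.face_proj = nn.Linear(face_dim, token_dim)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(token_dim, 8, batch_first=True) for _ in range(depth)
        )

    def forward(self, face_emb):                         # face_emb: (B, 512) from ArcFace
        B = face_emb.size(0)
        kv = self.face_proj(face_emb).unsqueeze(1)       # (B, 1, 5120)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)  # learned queries pull identity info
        for attn in self.layers:
            out, _ = attn(q, kv, kv)
            q = q + out
        regs = self.registers.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([q, regs], dim=1)               # (B, 32, 5120), fed to every block
```

Those 32 tokens are the entire extra context L in the O(L·N) cross-attention, which is where the modest latency overhead comes from.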
Ref-Adapter: Shipping Pore-Level Detail
- Reference image → VAE encoder → 8× downsampled latent.
- Feed the latent into a frozen DiT copy (noise = 0, text = “face”) → multi-scale features.
- The main branch cross-attends to those maps, pixel-aligning freckles, stubble, and specular highlights.
Everything stays in latent space; VRAM footprint ¼ of pixel-based schemes.
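A latent-space sketch of that pathway is below, assuming a diffusers-style VAE (`encode(...).latent_dist`); the frozen DiT and text-encoder call signatures are placeholders, not the actual repository API.

```python
import torch

@torch.no_grad()
def extract_reference_features(ref_image, vae, frozen_dit, text_encoder):
    """Illustrative Ref-Adapter reference pathway; call signatures are assumptions."""
    # 1. Reference image -> VAE encoder -> 8x-downsampled latent (never leaves latent space).
    ref_latent = vae.encode(ref_image).latent_dist.sample()

    # 2. Run a frozen copy of the DiT at noise level zero with a minimal "face" prompt,
    #    collecting the intermediate activations of every block via forward hooks.
    text_emb = text_encoder("face")
    features = []
    hooks = [blk.register_forward_hook(lambda mod, inp, out: features.append(out))
             for blk in frozen_dit.blocks]
    frozen_dit(ref_latent, timestep=torch.zeros(1), text_emb=text_emb)
    for h in hooks:
        h.remove()

    # 3. The main branch later cross-attends to these multi-scale maps, which is
    #    what pixel-aligns freckles, stubble, and specular highlights.
    return features
```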
Training Recipe: First Learn to Look, Then Learn to Move
| Stage | Data | Mission | Iters | LR | Trick |
|---|---|---|---|---|---|
| ① Image pre-train | 21.5 M stills | Make the face recognizable | 40 k | 1e-4 | Warm-start the Resampler from InstantID |
| ② Video fine-tune | 28.7 M clips | Add motion | 60 k | 5e-5 | NaViT zero-padding bucketing |
| ③ Augment | 50 M pairs | De-bias lighting & expression | — | — | X-Nemo expression + LBM relighting |
NaViT in one sentence:
Concatenate variable-length, variable-resolution videos into one long token sequence, then apply 3D-RoPE so each clip knows its own space-time coordinates. Zero cropping, zero black bars.
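A toy version of that packing step, assuming each clip arrives as a (frames, height, width, channels) grid of patch tokens; the real bucketing and RoPE code will look different.

```python
import torch

def pack_clips(clips):
    """Concatenate variable-size clips into one token sequence and attach per-token
    (t, h, w) coordinates for 3D-RoPE. Illustrative only."""
    tokens, coords, clip_ids = [], [], []
    for idx, clip in enumerate(clips):                  # clip: (T, H, W, D) patch tokens
        T, H, W, D = clip.shape
        t, h, w = torch.meshgrid(
            torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
        )
        tokens.append(clip.reshape(-1, D))
        coords.append(torch.stack([t, h, w], dim=-1).reshape(-1, 3))
        clip_ids.append(torch.full((T * H * W,), idx))
    # One long sequence: coords give every token its own space-time position,
    # and clip_ids allow a block-diagonal mask so clips never attend across each other.
    return torch.cat(tokens), torch.cat(coords), torch.cat(clip_ids)
```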
Hands-On: Install → Generate in 30 min
1. Environment
conda create -n lynx python=3.10 -y && conda activate lynx
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1 # cuts VRAM by 30 %
2. Pull Models
git lfs install
# base DiT
git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers \
models/Wan2.1-T2V-14B-Diffusers
# adapters
git clone https://huggingface.co/ByteDance/lynx models/lynx_full
3. One-liner Cinematic Shot
python infer.py \
--subject_image me.png \
--prompt "A street chef grilling squid at a rainy night market, smoke swirling, camera pulls back from close-up to full body" \
--seed 42 \
--out_len 121 # 5 s @ 24 fps
RTX 4090: 12 GB VRAM, ≈2 minutes for a 720p, 5-second clip.
Numbers That Matter (Benchmark vs. SOTA)
| Model | Face Sim. ↑ | Prompt ↑ | Motion ↑ | Overall Quality ↑ |
|---|---|---|---|---|
| SkyReels-A2 | 0.706 | 0.471 | 0.824 | 0.870 |
| VACE | 0.586 | 0.691 | 0.851 | 0.935 |
| Phantom | 0.671 | 0.690 | 0.828 | 0.888 |
| Lynx (ours) | 0.753 | 0.722 | 0.837 | 0.956 |
Face similarity is averaged across facexlib, insightface, and an in-house encoder; prompt adherence and overall quality are scored by Gemini-2.5-Pro.
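One plausible way to reproduce the face-similarity column is mean cosine similarity between the reference photo and sampled output frames, averaged over the three encoders; the exact evaluation protocol is an assumption here.

```python
import numpy as np

def face_similarity(ref_embs: dict, frame_embs: dict) -> float:
    """ref_embs / frame_embs: encoder name -> embedding(s) from facexlib, insightface, etc.
    Returns cosine similarity averaged over frames, then over encoders (illustrative)."""
    per_encoder = []
    for name, ref in ref_embs.items():
        frames = frame_embs[name]                                   # (num_frames, dim)
        ref = ref / np.linalg.norm(ref)
        frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
        per_encoder.append(float((frames @ ref).mean()))            # per-encoder mean cosine sim
    return float(np.mean(per_encoder))
```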
FAQ: Everything Engineers Ask First
Q: Will an angled selfie break the system?
A: ArcFace tolerates ±45° pitch, ±60° yaw. Extreme profile (>90°) drops accuracy—use frontal or slight-angle shots.
Q: Multi-character shots?
A: Not yet. Single-ID only. ByteDance roadmap lists multi-ID branch for Q4 2025; alpha already internal.
Q: Commercial licence?
A: Weights are Apache 2.0—commercial use OK. You still need portrait rights for the input face; ByteDance offers no legal shield.
Q: VRAM starvation?
A: Use lynx_lite (no Ref-Adapter). 6 GB VRAM, 24 fps, ~6 % lower face fidelity, still beats SkyReels-A2.
Closing Thought: When Faces Stop Being Bottlenecks
Lynx turns personalized video from a fine-tuning marathon into an inference snack. Creators can worry about story, not GPU thermals.
Next stops—lip-synced audio, multi-actor scenes, real-time distilled models—could finally usher in the one-person production studio.
So when your boss next drops a passport photo and a wild prompt, you can casually reply:
“Give me two minutes and a sentence—VFX included.”