“Here’s my passport photo—turn it into a 4-second Tokyo night-rain scene, 24 fps, no budget.”
If that request sounds familiar, the engineering story below is worth frame-by-frame inspection.


The Identity Problem No One Has Solved (Yet)

Text-to-video models have gotten stunningly good at motion, yet one stubborn artifact refuses to behave: the human face.

  • DreamBooth fans fine-tune 10 GB weights—motion turns to PowerPoint.
  • Frame-by-frame stylists melt GPUs and still twitch the chin.
  • Copy-paste crews swap backgrounds, but the first head-turn shatters the illusion.

Lynx’s take? Keep the giant frozen, clip on two tiny cheat-sheets.
An ID-Adapter memorizes the facial features, a Ref-Adapter memorizes pores and lighting. Zero base-model tuning, yet the actor can stroll from Shibuya Crossing to a Sahara sunset without a frame drop.


Architecture at a Glance: 14 B Frozen + 2.8 % Learnable

Module       | Job                              | Params | Trainable
Wan2.1-DiT   | Spatio-temporal generation       | 14 B   | ❄️ Frozen
ID-Adapter   | 512-D face → 16 identity tokens  | 180 M  | ✅ Trainable
Ref-Adapter  | VAE detail injection per layer   | 220 M  | ✅ Trainable
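Quick sanity check on the headline number: 180 M + 220 M ≈ 0.4 B trainable parameters, and 0.4 B / (14 B + 0.4 B) ≈ 2.8 %, which is where the "2.8 % learnable" figure comes from.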

Mental formula:
Frozen backbone = generic motion; plug-in adapters = face stays un-morphed.

[Figure: Lynx architecture. Dual adapters slide into every DiT block, cross-attending on par with text tokens.]
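To make the wiring concrete, here is a minimal PyTorch sketch of a DiT block that keeps its original attention layers frozen and adds one trainable cross-attention path over the identity tokens. Class and argument names are illustrative, not Lynx's actual modules.

import torch
import torch.nn as nn

class DiTBlockWithIDAdapter(nn.Module):
    def __init__(self, dim=5120, heads=40):
        super().__init__()
        # frozen pieces of the original Wan2.1 block
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # trainable adapter path: identity tokens enter on par with text tokens
        self.id_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, text_tokens, id_tokens):
        # x: (B, N, dim) video patch tokens; text_tokens: (B, T, dim); id_tokens: (B, 32, dim)
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.text_cross_attn(x, text_tokens, text_tokens)[0]  # original text conditioning
        x = x + self.id_cross_attn(x, id_tokens, id_tokens)[0]        # identity conditioning (adapter)
        return x + self.mlp(x)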


How 16 Tokens Remember a Whole Face

  1. Passport photo → ArcFace → 512-D vector.
  2. Perceiver Resampler (a Q-Former-style module) learns 16 queries → 16 × 5120 matrix.
  3. Append 16 register tokens (anti-overfit trick) → 32 × 5120.
  4. Cross-attend with video patches every block.

Why compress?
Cross-attention cost scales as O(L·N), where L is the number of condition tokens and N the number of video tokens, so a shorter L means cheaper VRAM. Sixteen tokens carry the semantics of a roughly 50 k-pixel face crop and add only about 4 % latency at inference.
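A compact sketch of that pipeline, assuming a single-layer cross-attention resampler (the real module may stack several layers): a 512-D ArcFace embedding becomes 16 identity tokens, and 16 learned register tokens are appended to form the 32 × 5120 conditioning matrix.

import torch
import torch.nn as nn

class IDResampler(nn.Module):
    def __init__(self, face_dim=512, dim=5120, n_queries=16, n_registers=16, heads=8):
        super().__init__()
        self.proj = nn.Linear(face_dim, dim)                           # lift the ArcFace vector to DiT width
        self.queries = nn.Parameter(torch.randn(n_queries, dim))       # 16 learned queries
        self.registers = nn.Parameter(torch.randn(n_registers, dim))   # 16 register tokens (anti-overfit trick)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, face_embed):
        # face_embed: (B, 512) from ArcFace
        B = face_embed.size(0)
        kv = self.proj(face_embed).unsqueeze(1)                        # (B, 1, 5120)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)                # (B, 16, 5120)
        id_tokens, _ = self.attn(q, kv, kv)                            # (B, 16, 5120)
        reg = self.registers.unsqueeze(0).expand(B, -1, -1)            # (B, 16, 5120)
        return torch.cat([id_tokens, reg], dim=1)                      # (B, 32, 5120), cross-attended by every block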


Ref-Adapter: Shipping Pore-Level Detail

  • Reference image → VAE encoder → 8× downsampled latent.
  • Feed latent into a frozen DiT copy (noise = 0, text = “face”) → multi-scale features.
  • Main branch cross-attends to those maps, pixel-aligning freckles, stubble, specular highlights.

Everything stays in latent space; the VRAM footprint is roughly a quarter of pixel-based schemes.
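A rough sketch of that path, with every object name hypothetical (vae, ref_dit, text_encoder and their methods stand in for whatever the released code actually calls them): encode the reference once, push it through a frozen DiT copy at zero noise, and cache per-block features for the main branch to cross-attend against.

import torch

@torch.no_grad()
def extract_reference_features(ref_image, vae, ref_dit, text_encoder):
    """One-time pass over the reference image; everything stays in latent space."""
    latent = vae.encode(ref_image)                    # 8x downsampled latent
    face_cond = text_encoder("face")                  # fixed minimal prompt
    hidden = ref_dit.patchify(latent)                 # latent -> patch tokens
    feats = []
    for block in ref_dit.blocks:
        hidden = block(hidden, text_tokens=face_cond, timestep=0)  # noise level = 0
        feats.append(hidden)                          # one multi-scale feature map per DiT block
    return feats                                      # the main branch cross-attends to these, layer by layer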


Training Recipe: First Learn to Look, Then Learn to Move

Stage              | Data          | Mission                  | Iters | LR   | Trick
① Image pre-train  | 21.5 M stills | Make face recognizable   | 40 k  | 1e-4 | Warm-start Resampler from InstantID
② Video fine-tune  | 28.7 M clips  | Add motion               | 60 k  | 5e-5 | NaViT zero-padding bucketing
③ Augment          | 50 M pairs    | De-bias lighting & expr. |       |      | X-Nemo expression + LBM relighting

NaViT in one sentence:
Concatenate variable-length, variable-resolution videos into one long token sequence; apply 3D-RoPE so each clip knows its own space-time coordinates—zero crop, zero black bars.
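A toy sketch of that packing scheme, with shapes and helper names purely illustrative: clips of different lengths and resolutions are flattened into one token sequence, each token keeping its own (t, h, w) coordinate for 3D-RoPE plus a clip id so attention can be masked per clip.

import torch

def pack_clips(clips):
    """clips: list of patchified latents, each shaped (T, H, W, C) with T, H, W varying per clip."""
    tokens, coords, clip_ids = [], [], []
    for i, clip in enumerate(clips):
        T, H, W, C = clip.shape
        t, h, w = torch.meshgrid(torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij")
        tokens.append(clip.reshape(-1, C))                              # (T*H*W, C) flattened patch tokens
        coords.append(torch.stack([t, h, w], dim=-1).reshape(-1, 3))    # per-token space-time position for 3D-RoPE
        clip_ids.append(torch.full((T * H * W,), i))                    # clip boundaries for per-clip attention masks
    return torch.cat(tokens), torch.cat(coords), torch.cat(clip_ids)

# e.g. a 9-frame 16x16 clip and a 17-frame 8x8 clip share one packed sequence: no cropping, no black bars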


Hands-On: Install → Generate in 30 min

1. Environment

conda create -n lynx python=3.10 -y && conda activate lynx
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1  # cuts VRAM by 30 %

2. Pull Models

git lfs install
# base DiT
git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers \
        models/Wan2.1-T2V-14B-Diffusers
# adapters
git clone https://huggingface.co/ByteDance/lynx models/lynx_full

3. One-liner Cinematic Shot

python infer.py \
  --subject_image me.png \
  --prompt "A street chef grilling squid at a rainy night market, smoke swirling, camera pulls back from close-up to full body" \
  --seed 42 \
  --out_len 121   # 5 s @ 24 fps

On an RTX 4090: ~12 GB VRAM, ≈2 min for a 720p, 5-second clip.


Numbers That Matter (Benchmark vs. SOTA)

Model        | Face Sim. ↑ | Prompt ↑ | Motion ↑ | Overall Quality ↑
SkyReels-A2  | 0.706       | 0.471    | 0.824    | 0.870
VACE         | 0.586       | 0.691    | 0.851    | 0.935
Phantom      | 0.671       | 0.690    | 0.828    | 0.888
Lynx (ours)  | 0.753       | 0.722    | 0.837    | 0.956

Face similarity averaged across facexlib, insightface & in-house encoder; prompt & quality scored by Gemini-2.5-Pro.


FAQ: Everything Engineers Ask First

Q: Will an angled selfie break the system?
A: ArcFace tolerates roughly ±45° pitch and ±60° yaw. Extreme profile shots (approaching 90° yaw) drop accuracy, so use frontal or slightly angled photos.

Q: Multi-character shots?
A: Not yet. Single-ID only. ByteDance's roadmap lists a multi-ID branch for Q4 2025; an alpha is already running internally.

Q: Commercial licence?
A: Weights are Apache 2.0—commercial use OK. You still need portrait rights for the input face; ByteDance offers no legal shield.

Q: VRAM starvation?
A: Use lynx_lite (no Ref-Adapter). 6 GB VRAM, 24 fps, ~6 % lower face fidelity, still beats SkyReels-A2.


Closing Thought: When Faces Stop Being Bottlenecks

Lynx turns personalized video from a fine-tuning marathon into an inference snack. Creators can worry about story, not GPU thermals.
Next stops—lip-synced audio, multi-actor scenes, real-time distilled models—could finally usher in the one-person production studio.

So when your boss next drops a passport photo and a wild prompt, you can casually reply:
“Give me two minutes and a sentence—VFX included.”