LiveAvatar under the hood: how a 14-billion-parameter diffusion model now runs live, lip-synced avatars at 20 FPS on five GPUs
A plain-language walk-through of the paper, code and benchmarks—no hype, no hidden plugs.
“We want an avatar that can talk forever, look like the reference photo, and run in real time.”
—Authors’ opening line, arXiv:2512.04677
1. The problem in one sentence
Big diffusion models give great faces, but they are slow (0.25 FPS) and drift away from the reference look after a few hundred frames. LiveAvatar keeps the quality, removes the lag, and stops the drift, so you can stream an avatar for three hours without the face turning into soup.
2. Quick glance at the results
| What we care about | Old best (Wan-S2V) | LiveAvatar |
|---|---|---|
| Speed | 0.25 FPS | 20.88 FPS |
| Longest clean run | ~4 min | >10 000 s |
| GPUs needed | 1×A100 | 5×H800 |
| Sampling steps | 80 | 4 |
| Resolution | 720×400 | same |
| Lip-sync (Sync-C) | 5.89 | 5.69 (close) |
| Face quality (IQA) | 4.29 | 4.35 |
3. Why is real-time hard?
- Diffusion is serial: step 2 needs step 1 finished first.
- Big models are memory hogs: 14 B weights take roughly 28 GB just to store at 16-bit precision.
- Long videos forget the first face: this is called identity drift.
LiveAvatar attacks all three with one design idea: make the maths pipeline-shaped, not ladder-shaped.
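To see why this matters, here is a minimal sketch of the standard "ladder-shaped" sampling loop (toy code, not the release code; the dummy model and shapes are placeholders): each step has to wait for the previous one, so latency grows linearly with the step count.

```python
import torch

class DummyDenoiser(torch.nn.Module):
    """Placeholder for the 14 B video model; one denoising step per call."""
    def forward(self, x, t):
        return x * 0.9  # pretend some noise was removed

def denoise_serial(model, noise, timesteps):
    # Step k cannot start until step k-1 has finished, so end-to-end latency
    # is (number of steps) x (single-step time): the "ladder" shape.
    x = noise
    for t in timesteps:
        x = model(x, t)
    return x

latent = denoise_serial(DummyDenoiser(), torch.randn(1, 16, 50, 90), [4, 3, 2, 1])
```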
4. The three new tricks (in pictures)
4.1 Timestep-Forcing Pipeline Parallelism (TPP)
Think of a car assembly line:
| GPU | Permanent job | Data in | Data out |
|---|---|---|---|
| 0 | t4→t3 | noise | partly clean |
| 1 | t3→t2 | partly clean | cleaner |
| 2 | t2→t1 | cleaner | almost clean |
| 3 | t1→t0 | almost clean | clean latent |
| 4 | VAE decode | latent | video frame |
- Each card repeats the same tiny task every 50 ms.
- No card waits for the whole chain; throughput equals the single-step time.
- The KV-cache stays local, so PCIe traffic is only a 720×400×4-byte latent, which is negligible.
Outcome: 4-step diffusion, 20 FPS, 5 cards.
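The assembly line can be emulated on one machine with plain threads and queues. This is only a toy sketch of the scheduling idea (the real system pins each stage to its own H800 and ships latents between cards), but it shows why steady-state throughput equals the single-step time rather than the full-chain time.

```python
import threading, queue, torch

class DummyStage(torch.nn.Module):
    """Placeholder for one fixed job: a single denoising step or the VAE decode."""
    def forward(self, x):
        return x * 0.9

def stage_worker(stage, inbox, outbox):
    # Each "card" repeats the same small task forever; it waits only for the
    # next item from the stage above, never for the whole 4-step chain.
    while True:
        x = inbox.get()
        if x is None:            # shutdown signal
            outbox.put(None)
            break
        outbox.put(stage(x))

# Five stages: t4->t3, t3->t2, t2->t1, t1->t0, then VAE decode.
queues = [queue.Queue() for _ in range(6)]
for i in range(5):
    threading.Thread(target=stage_worker,
                     args=(DummyStage(), queues[i], queues[i + 1]),
                     daemon=True).start()

# Feed a stream of noisy latent blocks into the head of the pipeline.
for _ in range(8):
    queues[0].put(torch.randn(1, 16, 50, 90))
queues[0].put(None)

# In steady state, one finished block pops out per single-step time.
while (block := queues[-1].get()) is not None:
    print("decoded block", tuple(block.shape))
```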
4.2 Rolling Sink Frame Mechanism (RSFM)
Two drift cures inside one moving photo:
- Adaptive Attention Sink (AAS)
  - Frame 0 is the real photo only for the first block.
  - After that we swap in the model's own first generated frame (in latent space).
  - Result: the appearance cue always lives inside the model's own distribution, so there is no colour shift.
- Rolling RoPE
  - The transformer position code is recalculated each block so the sink's "distance" to the current tokens stays the same as in training.
  - Face geometry stays locked even after 40 k RoPE positions (≈10 000 s).
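A rough way to picture Rolling RoPE (an illustration of the indexing idea, not the repository code): every new block re-indexes positions from zero, so the sink frame's relative distance to the current tokens never grows, no matter how long the stream has been running.

```python
def rolling_rope_positions(global_frame_idx, cache_len, sink_len=1):
    """Return the position indices fed to RoPE for one new block.

    `global_frame_idx` is deliberately ignored: absolute time never enters the
    position code, so block 3 and block 40 000 see the same relative layout
    (sink at position 0, cached frames right behind it), matching training."""
    sink_pos = list(range(sink_len))
    window_pos = list(range(sink_len, sink_len + cache_len))
    return sink_pos, window_pos

# Same layout early and ~10 000 s into the stream:
print(rolling_rope_positions(global_frame_idx=3,      cache_len=12))
print(rolling_rope_positions(global_frame_idx=40_000, cache_len=12))
```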
4.3 History-Corrupt Noise Injection
During training we randomly spoil 10 % of the KV cache channels with Gaussian noise.
- Forces the net to trust the sink frame more than the distant past.
- Stops errors snowballing when you stream for hours.
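A minimal sketch of what "spoiling 10 % of the KV-cache channels" could look like in code; the tensor shape and noise scale here are placeholders, not values from the paper.

```python
import torch

def corrupt_kv_cache(kv, corrupt_ratio=0.1, noise_std=1.0):
    """Training-time trick (toy version): add Gaussian noise to a random ~10 %
    of the cached history channels so the model learns to lean on the sink
    frame instead of on a possibly drifting distant past.
    `kv` holds cached features with shape (blocks, tokens, channels)."""
    mask = torch.rand(kv.shape[-1]) < corrupt_ratio
    noisy = kv.clone()
    noisy[..., mask] += noise_std * torch.randn_like(noisy[..., mask])
    return noisy

cache = torch.randn(4, 1350, 1024)     # 4 cached blocks of made-up features
cache = corrupt_kv_cache(cache)        # applied only during training
```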
5. Training recipe (no secrets)
| Stage | Goal | Steps | GPUs | Days |
|---|---|---|---|---|
| 1. Diffusion-Forcing pre-training | Stable initial weights | 25 k | 128 H800 | ~10 |
| 2. Self-Forcing DMD distillation | 4-step student | 2.5 k | same | ~2 |
| Total | | 27.5 k | | ≈500 GPU-days |
- Resolution fixed at 720×400, 84 frames per clip.
- Block size = 3 frames; the KV-cache holds 4 blocks; one sink frame.
- LoRA r=128, α=64 keeps memory polite (a minimal example follows this list).
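If the LoRA line above is new to you, here is a minimal hand-rolled version of the idea with r=128 and α=64. In the real recipe this wraps the 14 B model's projection layers; the single linear layer below is only a stand-in.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze the base weight W and train only a low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r=128, alpha=64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # base stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r
    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad), "trainable params")
```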
Data: 400 k clean clips from AVSpeech (>10 s each).
Test: new GenBench-Short (100 clips) and GenBench-Long (15 clips >5 min) with humans, cartoons and half-body shots.
6. How good is it really?
6.1 Short clips (~10 s)
LiveAvatar trades 0.08 Sync-C points for +20 FPS—a bargain for live uses.
6.2 Long clips (~7 min)
| Metric | OmniAvatar | Wan-S2V | LiveAvatar |
|---|---|---|---|
| ASE (aesthetic) | 2.36 | 2.63 | 3.38 |
| IQA (quality) | 2.86 | 3.99 | 4.73 |
| Dino-S (identity) | 0.66 | 0.80 | 0.94 |
Visual side-by-sides in the paper show the competitors starting to blur or yellow, while the LiveAvatar face stays sharp.
6.3 10 000-second stress test
We looped the 7-min audio to 166 min and kept generating.
ASE/IQA numbers within ±0.02 across 0 s, 100 s, 1 000 s, 10 000 s segments—no measurable drift.
7. Ablation: keep part, drop part
| Variant | FPS | TTFF (s) | ASE (7 min) |
|---|---|---|---|
| Remove TPP | 4.3 | 3.88 | 3.13 |
| Remove DMD (80 steps) | 0.29 | 45.5 | 3.40 |
| Remove VAE parallel | 10.2 | 4.73 | 3.44 |
| Full system | 20.88 | 2.89 | 3.38 |
TTFF = time-to-first-frame; it includes the VAE decode and random frame alignment.
8. User study (20 volunteers, double blind)
| Model | Naturalness | Sync (holistic) | Consistency |
|---|---|---|---|
| OmniAvatar | 71.1 | 78.5 | 90.8 |
| Wan-S2V | 84.3 | 85.2 | 92.0 |
| LiveAvatar | 86.3 | 80.6 | 91.1 |
Humans preferred LiveAvatar's overall natural feel even though its lip-sync metric was slightly lower, a sign that chasing Sync-C alone can overcook mouth motion.
9. Install and run (copy-paste ready)
Hardware: 5×NVIDIA H800 (80 GB) or wait for the upcoming 4090 branch.
Software: Ubuntu 22.04, CUDA 12.4, Python 3.10.
```bash
# 1. Create env
conda create -n liveavatar python=3.10 -y
conda activate liveavatar

# 2. CUDA toolkit
conda install nvidia/label/cuda-12.4.1::cuda -y
conda install -c nvidia/label/cuda-12.4.1 cudatoolkit -y

# 3. PyTorch + FlashAttention
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn==2.8.3 --no-build-isolation

# 4. Other deps
pip install -r requirements.txt

# 5. Download weights (~30 GB)
git lfs install
git clone https://huggingface.co/Quark-Vision/Live-Avatar checkpoints/

# 6. Run infinite stream
bash scripts/infinite_inference.sh \
  --audio my_voice.wav \
  --image my_face.jpg \
  --text "Welcome to the live stream." \
  --outdir ./out \
  --fps 20
```
- The first 20 frames fill the pipeline; after that the player receives live 720×400 MP4 chunks.
- Press Ctrl-C to stop; the output is already playable.
10. FAQs from early testers
Q1: Can I run this on two RTX 4090 cards today?
A: Not yet. The repo's 4-step model is optimised for the H800. A 3-step + INT8 build is scheduled for late December; it targets 15 FPS on 2×4090. Watch the GitHub release page.
Q2: Why only 720×400?
A: VAE decode time grows linearly with pixel count. The team is testing SVD-quantised VAE and tiled decoding to hit 1080p without losing 20 FPS.
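For context, this is roughly how tiled decoding is usually done (a generic sketch, not the team's implementation): decode overlapping latent tiles one at a time and blend them, so per-call cost and memory stay bounded as resolution grows.

```python
import torch

def tiled_decode(vae_decode, latent, tile=64, overlap=8, scale=8):
    """Decode a (C, H, W) latent in overlapping spatial tiles and average the
    overlaps. `vae_decode` maps a latent tile to a pixel tile `scale`x larger."""
    C, H, W = latent.shape
    out = torch.zeros(3, H * scale, W * scale)
    weight = torch.zeros_like(out)
    step = tile - overlap
    for y in range(0, H, step):
        for x in range(0, W, step):
            pix = vae_decode(latent[:, y:y + tile, x:x + tile])
            oy, ox = y * scale, x * scale
            out[:, oy:oy + pix.shape[1], ox:ox + pix.shape[2]] += pix
            weight[:, oy:oy + pix.shape[1], ox:ox + pix.shape[2]] += 1
    return out / weight.clamp(min=1)

# Toy "VAE": nearest-neighbour upsampling standing in for the real decoder.
fake_vae = lambda z: torch.nn.functional.interpolate(z[:3].unsqueeze(0), scale_factor=8)[0]
frame = tiled_decode(fake_vae, torch.randn(16, 50, 90))
print(frame.shape)   # torch.Size([3, 400, 720])
```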
Q3: Does it work for singing or cartoon characters?
A: Yes. GenBench includes both. Cartoon faces actually score higher on ASE because colour blocks are easier to keep consistent.
Q4: Is the code trainable or inference-only?
A: Inference is open now. Training scripts (FSDP + DMD) will be released next month; the paper gives exact hyper-parameters if you cannot wait.
Q5: What about misuse?
A: The licence bans impersonation without consent; each frame carries an invisible watermark; hash checksums are logged. See Ethics section in the paper.
11. Roadmap (copied from repo)
Early December (now)
- ✅ Paper & project page
- ✅ Demo site
- ⬜ Gradio demo
- ⬜ 4090 / A100 3-step build
- ⬜ ComfyUI node
Q1 2026
- 1.3B student model (mobile target)
- TTS integration (CosyVoice 2)
- 1080p tiled VAE
12. Key terms glossary (plain English)
- Diffusion model: a noise-to-image network that removes grain step by step.
- DMD (Distribution-Matching Distillation): a teacher-student trick that shrinks 80 steps into 4.
- KV-cache: past frame features the model re-uses so it doesn't recompute everything.
- RoPE: a position code that tells the transformer where a token sits in time; Rolling RoPE keeps the anchor always "next door".
- Sink frame: the single reference photo (or its latent) the model keeps looking back at to remember the face.
- TTFF: the time from pressing "start" to the first complete picture; critical for live apps.
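To make the KV-cache entry concrete, here is a toy attention step that re-uses cached keys/values and keeps only the last four blocks (the four-block window matches the recipe above; everything else is illustrative).

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache, max_blocks=4):
    """Append the new block's keys/values, evict the oldest block once the
    window is full, and attend over the cached window instead of recomputing
    features for every past frame."""
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    if len(cache["k"]) > max_blocks:
        cache["k"].pop(0)
        cache["v"].pop(0)
    k = torch.cat(cache["k"], dim=0)
    v = torch.cat(cache["v"], dim=0)
    attn = F.softmax(q_new @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v

cache = {"k": [], "v": []}
for _ in range(6):                      # stream six blocks of toy tokens
    toks = torch.randn(3, 64)           # 3 tokens per block, 64 channels
    out = attend_with_cache(toks, toks, toks, cache)
print(len(cache["k"]), "blocks kept in the cache")
```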
13. Take-away in one breath
LiveAvatar shows that if you:
- chop the diffusion chain into fixed GPU chores,
- keep the face reference inside the model's own distribution, and
- add a dash of noise to the memory,
then a 14-billion-parameter model can stream real-time, lip-synced, infinitely long avatars on five GPUs without breaking a sweat. The code is already on GitHub; the hard numbers above should help you judge whether it fits your next product or research step.
