LiveAvatar under the hood: how a 14-billion-parameter diffusion model now runs live, lip-synced avatars at 20 FPS on five GPUs
A plain-language walk-through of the paper, code and benchmarks—no hype, no hidden plugs.
“We want an avatar that can talk forever, look like the reference photo, and run in real time.”
—Authors’ opening line, arXiv:2512.04677
1. The problem in one sentence
Big diffusion models give great faces, but they are slow (0.25 FPS) and drift away from the reference look after a few hundred frames. LiveAvatar keeps the quality, removes the lag, and stops the drift, so you can stream an avatar for three hours without the face turning into soup.
2. Quick glance at the results
| What we care about | Old best (Wan-S2V) | LiveAvatar |
|---|---|---|
| Speed | 0.25 FPS | 20.88 FPS |
| Longest clean run | ~4 min | >10 000 s |
| GPUs needed | 1×A100 | 5×H800 |
| Sampling steps | 80 | 4 |
| Resolution | 720×400 | same |
| Lip-sync (Sync-C) | 5.89 | 5.69 (close) |
| Face quality (IQA) | 4.29 | 4.35 |
3. Why is real-time hard?
- Diffusion is serial: step 2 needs step 1 finished first.
- Big models are memory hogs: 14 B weights take roughly 28 GB just to store at 16-bit precision.
- Long videos forget the first face: this is called identity drift.
LiveAvatar attacks all three with one design idea: make the maths pipeline-shaped, not ladder-shaped.
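To see why this matters, here is a minimal sketch of the standard "ladder-shaped" sampling loop (toy code, not the release code; the dummy model and shapes are placeholders): each step has to wait for the previous one, so latency grows linearly with the step count.

```python
import torch

class DummyDenoiser(torch.nn.Module):
    """Placeholder for the 14 B video model; one denoising step per call."""
    def forward(self, x, t):
        return x * 0.9  # pretend some noise was removed

def denoise_serial(model, noise, timesteps):
    # Step k cannot start until step k-1 has finished, so end-to-end latency
    # is (number of steps) x (single-step time): the "ladder" shape.
    x = noise
    for t in timesteps:
        x = model(x, t)
    return x

latent = denoise_serial(DummyDenoiser(), torch.randn(1, 16, 50, 90), [4, 3, 2, 1])
```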
4. The three new tricks (in pictures)
4.1 Timestep-Forcing Pipeline Parallelism (TPP)
Think of a car assembly line:
| GPU | Permanent job | Data in | Data out |
|---|---|---|---|
| 0 | t4→t3 | noise | partly clean |
| 1 | t3→t2 | partly clean | cleaner |
| 2 | t2→t1 | cleaner | almost clean |
| 3 | t1→t0 | almost clean | clean latent |
| 4 | VAE decode | latent | video frame |
- Each card repeats the same tiny task every 50 ms.
- No card waits for the whole chain; throughput equals the single-step time.
- The KV-cache stays local, so PCIe traffic is only a 720×400×4-byte latent, which is negligible.
Outcome: 4-step diffusion, 20 FPS, 5 cards.
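The assembly line can be emulated on one machine with plain threads and queues. This is only a toy sketch of the scheduling idea (the real system pins each stage to its own H800 and ships latents between cards), but it shows why steady-state throughput equals the single-step time rather than the full-chain time.

```python
import threading, queue, torch

class DummyStage(torch.nn.Module):
    """Placeholder for one fixed job: a single denoising step or the VAE decode."""
    def forward(self, x):
        return x * 0.9

def stage_worker(stage, inbox, outbox):
    # Each "card" repeats the same small task forever; it waits only for the
    # next item from the stage above, never for the whole 4-step chain.
    while True:
        x = inbox.get()
        if x is None:            # shutdown signal
            outbox.put(None)
            break
        outbox.put(stage(x))

# Five stages: t4->t3, t3->t2, t2->t1, t1->t0, then VAE decode.
queues = [queue.Queue() for _ in range(6)]
for i in range(5):
    threading.Thread(target=stage_worker,
                     args=(DummyStage(), queues[i], queues[i + 1]),
                     daemon=True).start()

# Feed a stream of noisy latent blocks into the head of the pipeline.
for _ in range(8):
    queues[0].put(torch.randn(1, 16, 50, 90))
queues[0].put(None)

# In steady state, one finished block pops out per single-step time.
while (block := queues[-1].get()) is not None:
    print("decoded block", tuple(block.shape))
```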
4.2 Rolling Sink Frame Mechanism (RSFM)
Two drift cures inside one moving photo:
- Adaptive Attention Sink (AAS)
  - Frame 0 is the real photo only for the first block.
  - After that we swap in the model's own first generated frame (in latent space).
  - Result: the appearance cue always lives inside the model's own distribution, so there is no colour shift.
- Rolling RoPE
  - The transformer position code is recalculated each block so the sink's "distance" to the current tokens stays the same as in training.
  - Face geometry stays locked even after 40 k RoPE positions (≈10 000 s).
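A rough way to picture Rolling RoPE (an illustration of the indexing idea, not the repository code): every new block re-indexes positions from zero, so the sink frame's relative distance to the current tokens never grows, no matter how long the stream has been running.

```python
def rolling_rope_positions(global_frame_idx, cache_len, sink_len=1):
    """Return the position indices fed to RoPE for one new block.

    `global_frame_idx` is deliberately ignored: absolute time never enters the
    position code, so block 3 and block 40 000 see the same relative layout
    (sink at position 0, cached frames right behind it), matching training."""
    sink_pos = list(range(sink_len))
    window_pos = list(range(sink_len, sink_len + cache_len))
    return sink_pos, window_pos

# Same layout early and ~10 000 s into the stream:
print(rolling_rope_positions(global_frame_idx=3,      cache_len=12))
print(rolling_rope_positions(global_frame_idx=40_000, cache_len=12))
```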
4.3 History-Corrupt Noise Injection
During training we randomly spoil 10 % of the KV cache channels with Gaussian noise.
- Forces the net to trust the sink frame more than the distant past.
- Stops errors snowballing when you stream for hours.
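A minimal sketch of what "spoiling 10 % of the KV-cache channels" could look like in code; the tensor shape and noise scale here are placeholders, not values from the paper.

```python
import torch

def corrupt_kv_cache(kv, corrupt_ratio=0.1, noise_std=1.0):
    """Training-time trick (toy version): add Gaussian noise to a random ~10 %
    of the cached history channels so the model learns to lean on the sink
    frame instead of on a possibly drifting distant past.
    `kv` holds cached features with shape (blocks, tokens, channels)."""
    mask = torch.rand(kv.shape[-1]) < corrupt_ratio
    noisy = kv.clone()
    noisy[..., mask] += noise_std * torch.randn_like(noisy[..., mask])
    return noisy

cache = torch.randn(4, 1350, 1024)     # 4 cached blocks of made-up features
cache = corrupt_kv_cache(cache)        # applied only during training
```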
5. Training recipe (no secrets)
| Stage | Goal | Steps | GPUs | Days |
|---|---|---|---|---|
| 1. Diffusion-Forcing pre-training | Stable initial weights | 25 k | 128 H800 | ~10 |
| 2. Self-Forcing DMD distillation | 4-step student | 2.5 k | same | ~2 |
| Total | | 27.5 k | | ≈500 GPU-days |
- Resolution fixed at 720×400, 84 frames per clip.
- Block size = 3 frames; the KV-cache holds 4 blocks; one sink frame.
- LoRA r=128, α=64 keeps memory polite (a minimal example follows this list).
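If the LoRA line above is new to you, here is a minimal hand-rolled version of the idea with r=128 and α=64. In the real recipe this wraps the 14 B model's projection layers; the single linear layer below is only a stand-in.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze the base weight W and train only a low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r=128, alpha=64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # base stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r
    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad), "trainable params")
```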
Data: 400 k clean clips from AVSpeech (>10 s each).
Test: new GenBench-Short (100 clips) and GenBench-Long (15 clips >5 min) with humans, cartoons and half-body shots.
6. How good is it really?
6.1 Short clips (~10 s)
LiveAvatar trades 0.08 Sync-C points for +20 FPS—a bargain for live uses.
6.2 Long clips (~7 min)
| Metric | OmniAvatar | Wan-S2V | LiveAvatar |
|---|---|---|---|
| ASE (aesthetic) | 2.36 | 2.63 | 3.38 |
| IQA (quality) | 2.86 | 3.99 | 4.73 |
| Dino-S (identity) | 0.66 | 0.80 | 0.94 |
Visual side-by-sides in the paper show the competitors starting to blur or yellow, while the LiveAvatar face stays sharp.
6.3 10 000-second stress test
We looped the 7-min audio to 166 min and kept generating.
ASE/IQA numbers within ±0.02 across 0 s, 100 s, 1 000 s, 10 000 s segments—no measurable drift.
7. Ablation: keep part, drop part
| Variant | FPS | TTFF (s) | ASE (7 min) |
|---|---|---|---|
| Remove TPP | 4.3 | 3.88 | 3.13 |
| Remove DMD (80 steps) | 0.29 | 45.5 | 3.40 |
| Remove VAE parallel | 10.2 | 4.73 | 3.44 |
| Full system | 20.88 | 2.89 | 3.38 |
TTFF = time-to-first-frame; it includes the VAE decode and random frame alignment.
8. User study (20 volunteers, double blind)
| Model | Naturalness | Sync (holistic) | Consistency |
|---|---|---|---|
| OmniAvatar | 71.1 | 78.5 | 90.8 |
| Wan-S2V | 84.3 | 85.2 | 92.0 |
| LiveAvatar | 86.3 | 80.6 | 91.1 |
Humans preferred LiveAvatar's overall natural feel even though its lip-sync metric was slightly lower, a sign that chasing Sync-C alone can overcook mouth motion.
9. Install and run (copy-paste ready)
Hardware: 5×NVIDIA H800 (80 GB) or wait for the upcoming 4090 branch.
Software: Ubuntu 22.04, CUDA 12.4, Python 3.10.
```bash
# 1. Create env
conda create -n liveavatar python=3.10 -y
conda activate liveavatar

# 2. CUDA toolkit
conda install nvidia/label/cuda-12.4.1::cuda -y
conda install -c nvidia/label/cuda-12.4.1 cudatoolkit -y

# 3. PyTorch + FlashAttention
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn==2.8.3 --no-build-isolation

# 4. Other deps
pip install -r requirements.txt

# 5. Download weights (~30 GB)
git lfs install
git clone https://huggingface.co/Quark-Vision/Live-Avatar checkpoints/

# 6. Run infinite stream
bash scripts/infinite_inference.sh \
  --audio my_voice.wav \
  --image my_face.jpg \
  --text "Welcome to the live stream." \
  --outdir ./out \
  --fps 20
```
- The first 20 frames fill the pipeline; after that the player receives live 720×400 MP4 chunks.
- Press Ctrl-C to stop; the output is already playable.
10. FAQs from early testers
Q1: Can I run this on two RTX 4090 cards today?
A: Not yet. The repo's 4-step model is optimised for the H800. A 3-step + INT8 build is scheduled for late December; it targets 15 FPS on 2×4090. Watch the GitHub release page.
Q2: Why only 720×400?
A: VAE decode time grows linearly with pixel count. The team is testing SVD-quantised VAE and tiled decoding to hit 1080p without losing 20 FPS.
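For context, this is roughly how tiled decoding is usually done (a generic sketch, not the team's implementation): decode overlapping latent tiles one at a time and blend them, so per-call cost and memory stay bounded as resolution grows.

```python
import torch

def tiled_decode(vae_decode, latent, tile=64, overlap=8, scale=8):
    """Decode a (C, H, W) latent in overlapping spatial tiles and average the
    overlaps. `vae_decode` maps a latent tile to a pixel tile `scale`x larger."""
    C, H, W = latent.shape
    out = torch.zeros(3, H * scale, W * scale)
    weight = torch.zeros_like(out)
    step = tile - overlap
    for y in range(0, H, step):
        for x in range(0, W, step):
            pix = vae_decode(latent[:, y:y + tile, x:x + tile])
            oy, ox = y * scale, x * scale
            out[:, oy:oy + pix.shape[1], ox:ox + pix.shape[2]] += pix
            weight[:, oy:oy + pix.shape[1], ox:ox + pix.shape[2]] += 1
    return out / weight.clamp(min=1)

# Toy "VAE": nearest-neighbour upsampling standing in for the real decoder.
fake_vae = lambda z: torch.nn.functional.interpolate(z[:3].unsqueeze(0), scale_factor=8)[0]
frame = tiled_decode(fake_vae, torch.randn(16, 50, 90))
print(frame.shape)   # torch.Size([3, 400, 720])
```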
Q3: Does it work for singing or cartoon characters?
A: Yes. GenBench includes both. Cartoon faces actually score higher on ASE because colour blocks are easier to keep consistent.
Q4: Is the code trainable or inference-only?
A: Inference is open now. Training scripts (FSDP + DMD) will be released next month; the paper gives exact hyper-parameters if you cannot wait.
Q5: What about misuse?
A: The licence bans impersonation without consent; each frame carries an invisible watermark; hash checksums are logged. See Ethics section in the paper.
11. Roadmap (copied from repo)
Early December (now)
- ✅ Paper & project page
- ✅ Demo site
- ⬜ Gradio demo
- ⬜ 4090 / A100 3-step build
- ⬜ ComfyUI node
Q1 2026
- 1.3B student model (mobile target)
- TTS integration (CosyVoice 2)
- 1080p tiled VAE
12. Key terms glossary (plain English)
- Diffusion model: a noise-to-image network that removes grain step by step.
- DMD (Distribution-Matching Distillation): a teacher-student trick that shrinks 80 steps into 4.
- KV-cache: past frame features the model re-uses so it doesn't recompute everything.
- RoPE: a position code that tells the transformer where a token sits in time; Rolling RoPE keeps the anchor always "next door".
- Sink frame: the single reference photo (or its latent) the model keeps looking back at to remember the face.
- TTFF: the time from pressing "start" to the first complete picture; critical for live apps.
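To make the KV-cache entry concrete, here is a toy attention step that re-uses cached keys/values and keeps only the last four blocks (the four-block window matches the recipe above; everything else is illustrative).

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache, max_blocks=4):
    """Append the new block's keys/values, evict the oldest block once the
    window is full, and attend over the cached window instead of recomputing
    features for every past frame."""
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    if len(cache["k"]) > max_blocks:
        cache["k"].pop(0)
        cache["v"].pop(0)
    k = torch.cat(cache["k"], dim=0)
    v = torch.cat(cache["v"], dim=0)
    attn = F.softmax(q_new @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v

cache = {"k": [], "v": []}
for _ in range(6):                      # stream six blocks of toy tokens
    toks = torch.randn(3, 64)           # 3 tokens per block, 64 channels
    out = attend_with_cache(toks, toks, toks, cache)
print(len(cache["k"]), "blocks kept in the cache")
```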
13. Take-away in one breath
LiveAvatar shows that if you:
- chop the diffusion chain into fixed GPU chores,
- keep the face reference inside the model's own distribution, and
- add a dash of noise to the memory,
then a 14-billion-parameter model can stream real-time, lip-synced, infinitely long avatars on five GPUs without breaking a sweat. The code is already on GitHub; the hard numbers above should help you judge whether it fits your next product or research step.
