Turn Chat into a Real Face: Inside RealVideo, the WebSocket Video-Calling Engine That Speaks Back
A plain-language walkthrough for college-level readers: how to install, tune, and deploy a live text → speech → lip-sync pipeline on two 80 GB GPUs, without writing a single line of extra code.
1. What Exactly Does RealVideo Do?
RealVideo is an open-source stack that lets you:
- Type a sentence in a browser.
- Hear an AI voice answer instantly.
- Watch a real photograph speak the answer with perfectly synced lip motion.
All three events happen in <500 ms inside one browser tab—no plug-ins, no After Effects, no video editor.
2. Quick Glance: How Data Flows
Browser ⇄ WebSocket (JSON)
Server side
① GLM-4.5-AirX → semantic tokens
② GLM-TTS → audio waveform
③ DiT diffusion → face motion latent
④ VAE → pixel video
⑤ Back to browser → <video> tag plays
The round-trip budget stays under half a second by dedicating one GPU to the VAE while the remaining GPUs run DiT denoising in parallel.
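To make that split concrete, here is a minimal sketch of the device layout; load_vae and load_dit are placeholder loaders, not functions from the repo, which does this wiring for you inside run_app.sh and the service scripts.

```python
# Hypothetical sketch of the GPU split described above.
def place_models(load_vae, load_dit, num_gpus: int):
    # One card is reserved for the VAE decoder (latents -> pixels)...
    vae = load_vae().to("cuda:0").eval()
    # ...and every remaining card hosts its own DiT denoiser replica,
    # so 16-frame chunks can be diffused in parallel.
    dit_workers = [load_dit().to(f"cuda:{i}").eval() for i in range(1, num_gpus)]
    return vae, dit_workers
```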
3. Can My Machine Handle It? Checklist Before You Start
| Item | Minimum | Sweet Spot | Notes |
|---|---|---|---|
| GPUs | 2 × 40 GB | 2 × 80 GB (H100 / H200 / A100) | One card is monopolised by VAE |
| CUDA | 11.8 | 12.1 | Matches PyTorch 2.1+ |
| Python | 3.10 | 3.10–3.12 | 3.9 fails during dependency resolution |
| Browser | Chrome 108 | Latest stable | Needs Web Audio API |
| Bandwidth | 20 Mbps up | 50 Mbps up | For a 720p30 stream inside a LAN |
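Before installing anything, it is worth checking your box against this table. The snippet below uses only standard PyTorch calls; relax the 40 GB threshold at your own risk if you are targeting the minimum rather than the sweet spot.

```python
import sys
import torch

# Pre-flight check against the hardware/software table above.
assert sys.version_info >= (3, 10), "Python 3.10+ required"
assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"
assert torch.cuda.device_count() >= 2, "Two GPUs required"
print("CUDA runtime:", torch.version.cuda)

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {vram_gb:.0f} GB")
    assert vram_gb >= 40, "Each card needs at least 40 GB of VRAM"
```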
4. Installation: Copy-and-Paste Level
Tested on Ubuntu 22.04; identical steps work in WSL2.
4.1 Clone and build environment
git clone https://huggingface.co/zai-org/RealVideo
cd RealVideo
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
4.2 Fetch the big diffusion weights (≈28 GB)
huggingface-cli download Wan-AI/Wan2.2-S2V-14B \
--local-dir-use-symlinks False \
--local-dir wan_models/Wan2.2-S2V-14B
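If the CLI is inconvenient (behind a proxy, inside a notebook), the same download can be driven from Python via huggingface_hub, the library the CLI is built on; the target directory simply mirrors the command above.

```python
from huggingface_hub import snapshot_download

# Equivalent of the huggingface-cli command above: pulls the ~28 GB
# Wan2.2-S2V-14B weights into the folder that config.py will point at.
snapshot_download(
    repo_id="Wan-AI/Wan2.2-S2V-14B",
    local_dir="wan_models/Wan2.2-S2V-14B",
)
```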
4.3 API key (free tier works)
export ZAI_API_KEY="paste_your_key_here"
Get the key from the z.ai console in under a minute.
4.4 Tell the code where the model lives
Open config/config.py and change only this line:
PATH_TO_YOUR_MODEL = "wan_models/Wan2.2-S2V-14B/model.pt"
4.5 Fire it up
CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/run_app.sh
You will see
WebSocket server listening on 0.0.0.0:8003
Open http://localhost:8003 in your browser.
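If you want to poke the endpoint without a browser, a tiny client works too. The payload below is only a guess at the message shape; the real schema lives in static/js/webrtc.js and websocket_server.py, so adjust the JSON before relying on it.

```python
import asyncio
import json

import websockets  # pip install websockets


async def smoke_test():
    # Connect to the port printed by run_app.sh.
    async with websockets.connect("ws://localhost:8003") as ws:
        # Hypothetical payload; check static/js/webrtc.js for the real schema.
        await ws.send(json.dumps({"type": "chat", "text": "Hello"}))
        reply = await ws.recv()
        print("First message back from the server:", reply[:80])


asyncio.run(smoke_test())
```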
5. First Run in 60 Seconds
| Step | What to Click | What You Should See |
|---|---|---|
| 1. Set avatar | “Upload Image” | A crop box around your photo |
| 2. Clone voice (optional) | “Upload Audio” ≥3 s | Waveform + “Voice Registered” |
| 3. Connect | Blue “Connect” button | Button turns green |
| 4. Chat | Type “Hello” → Enter | Left pane plays live video, lip-sync correct |
6. Speed Benchmarks: Numbers You Can Quote
Official DiT timing per 16-frame chunk (ms):
| DiT parallelism / denoise steps | 2 steps | 4 steps |
|---|---|---|
| 1 GPU | 563 (442 compiled) | 943 |
| 2 GPU | 384 | 655 |
| 4 GPU | 306 | 513 (480 compiled) |
Rule of thumb: stay below 500 ms → choose 2 GPUs × 2 denoise steps or 4 GPUs × 2 steps.
Torch-compile gives a free 20 % boost and is already enabled in run_app.sh.
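The 500 ms line matters because each chunk carries 16 frames, so chunk latency translates directly into the highest frame rate the pipeline can sustain. A quick back-of-the-envelope check using the table's numbers:

```python
# Effective frame rate = frames per chunk / chunk latency.
FRAMES_PER_CHUNK = 16

timings_ms = {
    "1 GPU, 2 steps": 563,
    "2 GPU, 2 steps": 384,
    "4 GPU, 2 steps": 306,
    "4 GPU, 4 steps (compiled)": 480,
}

for label, latency_ms in timings_ms.items():
    fps = FRAMES_PER_CHUNK / (latency_ms / 1000)
    print(f"{label}: {fps:.1f} fps sustained")

# Anything under ~533 ms keeps you at or above 30 fps (16 / 0.533 ≈ 30).
```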
7. Folder Map: Where to Hack Safely
RealVideo
├── scripts
│ ├── run_app.sh # entry point
│ └── dit_service.py # DiT workers (parallel)
├── config
│ └── config.py # ONLY file you edit by hand
├── vae_server.py # decodes latents → pixels
├── websocket_server.py # signalling + orchestration
└── static
├── index.html # browser UI
└── js/webrtc.js # WebSocket + Web Audio glue
Typical tweaks
- Swap voice: replace the GLM-TTS call in websocket_server.py with your own checkpoint.
- Higher resolution: change out_size in vae_server.py to (768, 768) and raise steps to 4.
- Add auth: validate a JWT inside handler() before upgrading to WebSocket (see the sketch below).
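Here is one way the auth tweak could look, using PyJWT. The handler() signature and the idea of sending the token as the first message are assumptions, not the repo's actual contract; the bullet above suggests rejecting even earlier, during the HTTP upgrade, so adapt the hook point to whatever websocket_server.py exposes.

```python
import jwt  # pip install PyJWT

SECRET = "change-me"  # load from an environment variable in production


def verify_token(token: str) -> bool:
    """Accept only tokens signed with our secret and not yet expired."""
    try:
        jwt.decode(token, SECRET, algorithms=["HS256"])
        return True
    except jwt.PyJWTError:
        return False


async def handler(ws):
    # Hypothetical hook: treat the first client message as the JWT and
    # close the socket before any avatar/TTS work if it does not verify.
    token = await ws.recv()
    if not verify_token(token):
        await ws.close(code=4401, reason="invalid token")
        return
    # ...continue with the normal chat/orchestration loop here...
```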
8. Troubleshooting FAQ
Q1 No audio in browser?
→ Autoplay blocked. Click the loud-speaker icon in the address bar and allow.
Q2 Out-of-memory at 81 GB?
→ Lower max_batch from 4 → 1 in config.py, or
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
Q3 Garbled or ghosting video?
→ Latent shape mismatch. Make sure the commit you checked out matches the release date of the downloaded weights.
Q4 Corporate proxy kills HF download?
→ export HF_ENDPOINT=https://hf-mirror.com and retry.
9. Pushing Faster: Three Production Tricks
- Place the VAE on a separate A10 and DiT on the H100; saves ~40 ms of PCIe shuffling.
- 8-bit quantise GLM-TTS → half the VRAM, no audible loss (see the sketch after this list).
- Feed YUV directly through the WebCodecs API and skip the JPEG encode → another 30 ms off.
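The quantisation trick can be prototyped with bitsandbytes through transformers, assuming your TTS checkpoint loads via AutoModel; the repo id below is a placeholder, and whether GLM-TTS survives 8-bit without audible artefacts is something to confirm on your own samples.

```python
from transformers import AutoModel, BitsAndBytesConfig

# Placeholder repo id -- substitute the TTS checkpoint you actually serve.
TTS_CHECKPOINT = "your-org/your-tts-model"

# load_in_8bit swaps Linear layers for 8-bit kernels, roughly halving VRAM.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
tts_model = AutoModel.from_pretrained(
    TTS_CHECKPOINT,
    quantization_config=quant_config,
    device_map="auto",
)
```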
10. How RealVideo Compares to Other Routes
| Metric | RealVideo | Pre-record WebRTC | 3-D Avatar (audio only) |
|---|---|---|---|
| Realism | Photo + diffusion lip-sync | High (but fixed) | Cartoony |
| Latency | <500 ms end-to-end | 0 ms playback, not interactive | 200 ms audio only |
| Custom face | Swap any JPG | Re-shoot whole video | Re-rig bones |
| Hardware | 2 × 80 GB GPUs | 0 | 1 × 24 GB |
11. Licence & Credits
- Model weights: z.ai academic + commercial licence (attribution required).
- Code: MIT.
- Includes the Self-Forcing library (MIT) by guandeh17. Thanks!
12. What Will You Build Next?
RealVideo turns the old idea of a “digital human” into a shell one-liner: two GPUs, one export, one bash script.
From here you can bolt on long-term memory, multi-camera streaming, or drop the WebSocket URL inside a Shopify chat widget—your photo, your voice, answering buyers 24/7.
Clone the repo, run the script, and you will see yourself say something you never recorded—a moment that makes most first-time users laugh out loud.
After that, the only limit is what you decide to type.
