Turn Chat into a Real Face: Inside RealVideo, the WebSocket Video-Calling Engine That Speaks Back

A plain-language walkthrough for college-level readers: how to install, tune, and deploy a live text → speech → lip-sync pipeline on two 80 GB GPUs, without writing a single line of extra code.


1. What Exactly Does RealVideo Do?

RealVideo is an open-source stack that lets you:

  1. Type a sentence in a browser.
  2. Hear an AI voice answer instantly.
  3. Watch a real photograph speak the answer with perfectly synced lip motion.

All three events happen in <500 ms inside one browser tab—no plug-ins, no After Effects, no video editor.


2. Quick Glance: How Data Flows

Browser ⇄ WebSocket (JSON)  
Server side  
① GLM-4.5-AirX → semantic tokens  
② GLM-TTS → audio waveform  
③ DiT diffusion → face motion latent  
④ VAE → pixel video  
⑤ Back to browser → <video> tag plays

The round-trip budget stays under half a second by dedicating one GPU to the VAE and running the DiT in parallel on the remaining cards.
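If you prefer to read the flow as code, here is a minimal sketch of one chat turn. The four stage functions are stand-ins for the real services (GLM-4.5-AirX, GLM-TTS, the DiT workers, the VAE server), not the actual RealVideo API:

import asyncio
import json

async def llm_stage(text: str) -> str:        # ① reply text / semantic tokens
    return f"echo: {text}"

async def tts_stage(reply: str) -> bytes:     # ② audio waveform
    return reply.encode()

async def dit_stage(audio: bytes) -> list:    # ③ face-motion latents
    return [audio]

async def vae_stage(latents: list) -> list:   # ④ pixel frames
    return latents

async def handle_turn(send, text: str) -> None:
    reply = await llm_stage(text)
    audio = await tts_stage(reply)
    latents = await dit_stage(audio)
    frames = await vae_stage(latents)
    # ⑤ one JSON message back to the browser over the same WebSocket
    await send(json.dumps({"type": "av_chunk", "n_frames": len(frames)}))

if __name__ == "__main__":
    async def fake_send(msg): print(msg)
    asyncio.run(handle_turn(fake_send, "Hello"))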


3. Can My Machine Handle It? Checklist Before You Start

| Item      | Minimum    | Sweet Spot                     | Notes                                |
|-----------|------------|--------------------------------|--------------------------------------|
| GPUs      | 2 × 40 GB  | 2 × 80 GB (H100 / H200 / A100) | One card is monopolised by the VAE   |
| CUDA      | 11.8       | 12.1                           | Matches PyTorch 2.1+                 |
| Python    | 3.10       | 3.10–3.12                      | 3.9 fails on the dependency resolver |
| Browser   | Chrome 108 | Latest stable                  | Needs the Web Audio API              |
| Bandwidth | 20 Mbps up | 50 Mbps up                     | For a 720p30 stream inside a LAN     |
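A quick way to verify the table above before installing anything heavier is the short PyTorch script below. It is a sketch of my own, not part of the repo, and only checks the GPU count, VRAM, and Python version:

import sys
import torch

# Python 3.10+ because 3.9 breaks the dependency resolver (see table)
assert sys.version_info >= (3, 10), "Python 3.10 or newer required"
assert torch.cuda.is_available(), "No CUDA device visible"
assert torch.cuda.device_count() >= 2, "RealVideo expects at least two GPUs"

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {vram_gb:.0f} GB")
    # 40 GB cards report slightly under 40 GiB, so allow some slack
    assert vram_gb >= 39, f"GPU {i} is below the 40 GB minimum"

print("CUDA runtime bundled with PyTorch:", torch.version.cuda)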

4. Installation: Copy-and-Paste Level

Tested on Ubuntu 22.04; identical steps work in WSL2.

4.1 Clone and build environment

git clone https://huggingface.co/zai-org/RealVideo
cd RealVideo
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

4.2 Fetch the big diffusion weights (≈28 GB)

huggingface-cli download Wan-AI/Wan2.2-S2V-14B \
            --local-dir-use-symlinks False \
            --local-dir wan_models/Wan2.2-S2V-14B
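Newer huggingface_hub releases deprecate the --local-dir-use-symlinks flag; if the CLI complains, the same download works through the Python API:

from huggingface_hub import snapshot_download

# Pulls the ~28 GB checkpoint into the folder config.py expects
snapshot_download(
    repo_id="Wan-AI/Wan2.2-S2V-14B",
    local_dir="wan_models/Wan2.2-S2V-14B",
)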

4.3 API key (free tier works)

export ZAI_API_KEY="paste_your_key_here"

Get the key from the z.ai console; it takes under a minute.
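A two-line sanity check (my own, not part of the repo) saves a confusing crash later if the variable was exported in the wrong shell:

import os

if not os.environ.get("ZAI_API_KEY"):
    raise SystemExit("ZAI_API_KEY is not set - export it before launching run_app.sh")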

4.4 Tell the code where the model lives

Open config/config.py and change only this line:

PATH_TO_YOUR_MODEL = "wan_models/Wan2.2-S2V-14B/model.pt"

4.5 Fire it up

CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/run_app.sh

Once it is up you will see
WebSocket server listening on 0.0.0.0:8003
Then open http://localhost:8003 in your browser.
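For a headless smoke test of the WebSocket endpoint you can skip the browser entirely. The snippet assumes pip install websockets; the URL path and JSON payload are only illustrative, so check static/js/webrtc.js for the real field names:

import asyncio
import json
import websockets   # pip install websockets

async def main() -> None:
    async with websockets.connect("ws://localhost:8003") as ws:
        # Field names are a guess; mirror whatever static/js/webrtc.js sends
        await ws.send(json.dumps({"type": "chat", "text": "Hello"}))
        reply = await ws.recv()
        print("server replied:", str(reply)[:200])

asyncio.run(main())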


5. First Run in 60 Seconds

| Step                      | What to Click              | What You Should See                              |
|---------------------------|----------------------------|--------------------------------------------------|
| 1. Set avatar             | “Upload Image”             | A crop box around your photo                     |
| 2. Clone voice (optional) | “Upload Audio” (≥3 s clip) | Waveform + “Voice Registered”                    |
| 3. Connect                | Blue “Connect” button      | Button turns green                               |
| 4. Chat                   | Type “Hello” → Enter       | Left pane plays live video with correct lip-sync |

6. Speed Benchmarks: Numbers You Can Quote

Official DiT timing per 16-frame chunk (ms):

| DiT parallelism \ denoise steps | 2 steps            | 4 steps            |
|---------------------------------|--------------------|--------------------|
| 1 GPU                           | 563 (442 compiled) | 943                |
| 2 GPUs                          | 384                | 655                |
| 4 GPUs                          | 306                | 513 (480 compiled) |

Rule of thumb: to stay below 500 ms per chunk, run 2 denoise steps on 2 or 4 GPUs; 4 denoise steps only fits the budget with 4 GPUs plus torch-compile (480 ms).
Torch-compile gives a free ~20 % boost and is already enabled in run_app.sh.
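To reproduce these numbers on your own hardware, time the DiT call the same way: synchronise the GPU before and after, and average over many chunks. run_dit_chunk below is a placeholder for whatever callable dit_service.py actually exposes:

import time
import torch

def ms_per_chunk(run_dit_chunk, warmup: int = 3, iters: int = 20) -> float:
    for _ in range(warmup):              # let torch.compile / cuDNN settle
        run_dit_chunk()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        run_dit_chunk()
    torch.cuda.synchronize()             # include all queued GPU work
    return (time.perf_counter() - start) / iters * 1000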


7. Folder Map: Where to Hack Safely

RealVideo
├── scripts
│   ├── run_app.sh          # entry point
│   └── dit_service.py      # DiT workers (parallel)
├── config
│   └── config.py           # ONLY file you edit by hand
├── vae_server.py           # decodes latents → pixels
├── websocket_server.py     # signalling + orchestration
└── static
    ├── index.html          # browser UI
    └── js/webrtc.js        # WebSocket + Web Audio glue

Typical tweaks

  • Swap voice: replace GLM-TTS call in websocket_server.py with your checkpoint.
  • Higher resolution: change out_size in vae_server.py to (768,768) and raise steps to 4.
  • Add auth: validate JWT inside handler() before upgrading to WebSocket.
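For the auth tweak, one possible shape of the check, assuming the server is built on the websockets package and the browser appends ?token=... to the URL. The query-parameter name, the secret, and the handler signature are assumptions, not the repo's actual API (pip install pyjwt):

import jwt                                   # pip install pyjwt
from urllib.parse import parse_qs, urlparse

SECRET = "change-me"

async def handler(websocket):
    # Older websockets versions expose .path, newer ones .request.path
    path = getattr(websocket, "path", None) or websocket.request.path
    token = (parse_qs(urlparse(path).query).get("token") or [""])[0]
    try:
        jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        await websocket.close(code=4401, reason="invalid token")
        return
    # ...fall through to the normal chat loop...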

8. Troubleshooting FAQ

Q1 No audio in browser?
→ Autoplay is blocked. Click the speaker icon in the address bar and allow sound for the site.

Q2 Out-of-memory at 81 GB?
→ Lower max_batch from 4 → 1 in config.py, or
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

Q3 Garbled or ghosting video?
→ Latent shape mismatch. Make sure the code checkout and the downloaded weights come from matching release dates.

Q4 Corporate proxy kills the HF download?
→ export HF_ENDPOINT=https://hf-mirror.com, then retry.


9. Pushing Faster: Three Production Tricks

  1. Place the VAE on a separate A10 and the DiT on the H100s; this saves roughly 40 ms of PCIe shuffling (see the sketch after this list).
  2. Quantise GLM-TTS to 8-bit → half the VRAM, no audible loss.
  3. Feed YUV frames straight through the WebCodecs API and skip the JPEG encode → roughly another 30 ms off.
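A toy version of trick 1, with stand-in nn.Linear modules instead of the real DiT and VAE (which live in scripts/dit_service.py and vae_server.py): pin each model to its own device so decoding never competes with denoising, and pay only one asynchronous PCIe copy per chunk.

import torch
import torch.nn as nn

DIT_DEVICE = torch.device("cuda:0")          # e.g. the H100
VAE_DEVICE = torch.device("cuda:1")          # e.g. the separate A10

dit = nn.Linear(64, 64).to(DIT_DEVICE)       # stand-in for the DiT
vae = nn.Linear(64, 64).to(VAE_DEVICE)       # stand-in for the VAE decoder

@torch.inference_mode()
def denoise_and_decode(latent: torch.Tensor) -> torch.Tensor:
    motion = dit(latent.to(DIT_DEVICE))
    # one non-blocking PCIe copy per chunk, then decode on the other card
    return vae(motion.to(VAE_DEVICE, non_blocking=True))

print(denoise_and_decode(torch.randn(1, 64)).device)   # -> cuda:1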

10. How RealVideo Compares to Other Routes

| Metric      | RealVideo                  | Pre-recorded WebRTC            | 3-D Avatar (audio only) |
|-------------|----------------------------|--------------------------------|-------------------------|
| Realism     | Photo + diffusion lip-sync | High (but fixed)               | Cartoony                |
| Latency     | <500 ms end-to-end         | 0 ms playback, not interactive | ~200 ms, audio only     |
| Custom face | Swap any JPG               | Re-shoot the whole video       | Re-rig bones            |
| Hardware    | 2 × 80 GB GPUs             | None                           | 1 × 24 GB GPU           |

11. Licence & Credits

  • Model weights: z.ai academic + commercial licence (attribution required).
  • Code: MIT.
  • Includes Self-Forcing library (MIT) by guandeh17—thanks!

12. What Will You Build Next?

RealVideo turns the old idea of a “digital human” into a shell one-liner: two GPUs, one exported API key, one bash script.
From here you can bolt on long-term memory, multi-camera streaming, or drop the WebSocket URL inside a Shopify chat widget—your photo, your voice, answering buyers 24/7.

Clone the repo, run the script, and you will see yourself say something you never recorded—a moment that makes most first-time users laugh out loud.
After that, the only limit is what you decide to type.