Turn Chat into a Real Face: Inside RealVideo, the WebSocket Video-Calling Engine That Speaks Back

A plain-language walkthrough for college-level readers: how to install, tune, and deploy a live text → speech → lip-sync pipeline on two 80 GB GPUs, without writing a single line of extra code.


1. What Exactly Does RealVideo Do?

RealVideo is an open-source stack that lets you:

  1. Type a sentence in a browser.
  2. Hear an AI voice answer instantly.
  3. Watch a real photograph speak the answer with perfectly synced lip motion.

All three events happen in <500 ms inside one browser tab—no plug-ins, no After Effects, no video editor.


2. Quick Glance: How Data Flows

Browser ⇄ WebSocket (JSON)  
Server side  
① GLM-4.5-AirX → semantic tokens  
② GLM-TTS → audio waveform  
③ DiT diffusion → face motion latent  
④ VAE → pixel video  
⑤ Back to browser → <video> tag plays

The round-trip budget stays under half a second by dedicating one GPU to the VAE and running the DiT in parallel on the remaining cards.
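If you prefer to read the flow as code, here is a minimal sketch of one chat turn. The four stage functions are stand-ins for the real services (GLM-4.5-AirX, GLM-TTS, the DiT workers, the VAE server), not the actual RealVideo API:

import asyncio
import json

async def llm_stage(text: str) -> str:        # ① reply text / semantic tokens
    return f"echo: {text}"

async def tts_stage(reply: str) -> bytes:     # ② audio waveform
    return reply.encode()

async def dit_stage(audio: bytes) -> list:    # ③ face-motion latents
    return [audio]

async def vae_stage(latents: list) -> list:   # ④ pixel frames
    return latents

async def handle_turn(send, text: str) -> None:
    reply = await llm_stage(text)
    audio = await tts_stage(reply)
    latents = await dit_stage(audio)
    frames = await vae_stage(latents)
    # ⑤ one JSON message back to the browser over the same WebSocket
    await send(json.dumps({"type": "av_chunk", "n_frames": len(frames)}))

if __name__ == "__main__":
    async def fake_send(msg): print(msg)
    asyncio.run(handle_turn(fake_send, "Hello"))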


3. Can My Machine Handle It? Checklist Before You Start

| Item      | Minimum    | Sweet Spot                     | Notes                                |
|-----------|------------|--------------------------------|--------------------------------------|
| GPUs      | 2 × 40 GB  | 2 × 80 GB (H100 / H200 / A100) | One card is monopolised by the VAE   |
| CUDA      | 11.8       | 12.1                           | Matches PyTorch 2.1+                 |
| Python    | 3.10       | 3.10–3.12                      | 3.9 fails on the dependency resolver |
| Browser   | Chrome 108 | Latest stable                  | Needs the Web Audio API              |
| Bandwidth | 20 Mbps up | 50 Mbps up                     | For a 720p30 stream inside a LAN     |
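A quick way to verify the table above before installing anything heavier is the short PyTorch script below. It is a sketch of my own, not part of the repo, and only checks the GPU count, VRAM, and Python version:

import sys
import torch

# Python 3.10+ because 3.9 breaks the dependency resolver (see table)
assert sys.version_info >= (3, 10), "Python 3.10 or newer required"
assert torch.cuda.is_available(), "No CUDA device visible"
assert torch.cuda.device_count() >= 2, "RealVideo expects at least two GPUs"

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {vram_gb:.0f} GB")
    # 40 GB cards report slightly under 40 GiB, so allow some slack
    assert vram_gb >= 39, f"GPU {i} is below the 40 GB minimum"

print("CUDA runtime bundled with PyTorch:", torch.version.cuda)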

4. Installation: Copy-and-Paste Level

Tested on Ubuntu 22.04; identical steps work in WSL2.

4.1 Clone and build environment

git clone https://huggingface.co/zai-org/RealVideo
cd RealVideo
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

4.2 Fetch the big diffusion weights (≈28 GB)

huggingface-cli download Wan-AI/Wan2.2-S2V-14B \
            --local-dir-use-symlinks False \
            --local-dir wan_models/Wan2.2-S2V-14B
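Newer huggingface_hub releases deprecate the --local-dir-use-symlinks flag; if the CLI complains, the same download works through the Python API:

from huggingface_hub import snapshot_download

# Pulls the ~28 GB checkpoint into the folder config.py expects
snapshot_download(
    repo_id="Wan-AI/Wan2.2-S2V-14B",
    local_dir="wan_models/Wan2.2-S2V-14B",
)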

4.3 API key (free tier works)

export ZAI_API_KEY="paste_your_key_here"

Get the key from the z.ai console; it takes under a minute.
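A two-line sanity check (my own, not part of the repo) saves a confusing crash later if the variable was exported in the wrong shell:

import os

if not os.environ.get("ZAI_API_KEY"):
    raise SystemExit("ZAI_API_KEY is not set - export it before launching run_app.sh")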

4.4 Tell the code where the model lives

Open config/config.py and change only this line:

PATH_TO_YOUR_MODEL = "wan_models/Wan2.2-S2V-14B/model.pt"

4.5 Fire it up

CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/run_app.sh

Once it is up you will see
WebSocket server listening on 0.0.0.0:8003
Then open http://localhost:8003 in your browser.
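For a headless smoke test of the WebSocket endpoint you can skip the browser entirely. The snippet assumes pip install websockets; the URL path and JSON payload are only illustrative, so check static/js/webrtc.js for the real field names:

import asyncio
import json
import websockets   # pip install websockets

async def main() -> None:
    async with websockets.connect("ws://localhost:8003") as ws:
        # Field names are a guess; mirror whatever static/js/webrtc.js sends
        await ws.send(json.dumps({"type": "chat", "text": "Hello"}))
        reply = await ws.recv()
        print("server replied:", str(reply)[:200])

asyncio.run(main())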


5. First Run in 60 Seconds

| Step                      | What to Click              | What You Should See                              |
|---------------------------|----------------------------|--------------------------------------------------|
| 1. Set avatar             | “Upload Image”             | A crop box around your photo                     |
| 2. Clone voice (optional) | “Upload Audio” (≥3 s clip) | Waveform + “Voice Registered”                    |
| 3. Connect                | Blue “Connect” button      | Button turns green                               |
| 4. Chat                   | Type “Hello” → Enter       | Left pane plays live video with correct lip-sync |

6. Speed Benchmarks: Numbers You Can Quote

Official DiT timing per 16-frame chunk (ms):

| DiT parallelism \ denoise steps | 2 steps            | 4 steps            |
|---------------------------------|--------------------|--------------------|
| 1 GPU                           | 563 (442 compiled) | 943                |
| 2 GPUs                          | 384                | 655                |
| 4 GPUs                          | 306                | 513 (480 compiled) |

Rule of thumb: to stay below 500 ms per chunk, run 2 denoise steps on 2 or 4 GPUs; 4 denoise steps only fits the budget with 4 GPUs plus torch-compile (480 ms).
Torch-compile gives a free ~20 % boost and is already enabled in run_app.sh.
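To reproduce these numbers on your own hardware, time the DiT call the same way: synchronise the GPU before and after, and average over many chunks. run_dit_chunk below is a placeholder for whatever callable dit_service.py actually exposes:

import time
import torch

def ms_per_chunk(run_dit_chunk, warmup: int = 3, iters: int = 20) -> float:
    for _ in range(warmup):              # let torch.compile / cuDNN settle
        run_dit_chunk()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        run_dit_chunk()
    torch.cuda.synchronize()             # include all queued GPU work
    return (time.perf_counter() - start) / iters * 1000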


7. Folder Map: Where to Hack Safely

RealVideo
├── scripts
│   ├── run_app.sh          # entry point
│   └── dit_service.py      # DiT workers (parallel)
├── config
│   └── config.py           # ONLY file you edit by hand
├── vae_server.py           # decodes latents → pixels
├── websocket_server.py     # signalling + orchestration
└── static
    ├── index.html          # browser UI
    └── js/webrtc.js        # WebSocket + Web Audio glue

Typical tweaks

  • Swap voice: replace GLM-TTS call in websocket_server.py with your checkpoint.
  • Higher resolution: change out_size in vae_server.py to (768,768) and raise steps to 4.
  • Add auth: validate JWT inside handler() before upgrading to WebSocket.
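For the auth tweak, one possible shape of the check, assuming the server is built on the websockets package and the browser appends ?token=... to the URL. The query-parameter name, the secret, and the handler signature are assumptions, not the repo's actual API (pip install pyjwt):

import jwt                                   # pip install pyjwt
from urllib.parse import parse_qs, urlparse

SECRET = "change-me"

async def handler(websocket):
    # Older websockets versions expose .path, newer ones .request.path
    path = getattr(websocket, "path", None) or websocket.request.path
    token = (parse_qs(urlparse(path).query).get("token") or [""])[0]
    try:
        jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        await websocket.close(code=4401, reason="invalid token")
        return
    # ...fall through to the normal chat loop...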

8. Troubleshooting FAQ

Q1 No audio in browser?
→ Autoplay is blocked. Click the speaker icon in the address bar and allow sound for the site.

Q2 Out-of-memory at 81 GB?
→ Lower max_batch from 4 → 1 in config.py, or
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

Q3 Garbled or ghosting video?
→ Latent shape mismatch. Make sure the code checkout and the downloaded weights come from matching release dates.

Q4 Corporate proxy kills the HF download?
→ export HF_ENDPOINT=https://hf-mirror.com, then retry.


9. Pushing Faster: Three Production Tricks

  1. Place the VAE on a separate A10 and the DiT on the H100s; this saves roughly 40 ms of PCIe shuffling (see the sketch after this list).
  2. Quantise GLM-TTS to 8-bit → half the VRAM, no audible loss.
  3. Feed YUV frames straight through the WebCodecs API and skip the JPEG encode → roughly another 30 ms off.
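A toy version of trick 1, with stand-in nn.Linear modules instead of the real DiT and VAE (which live in scripts/dit_service.py and vae_server.py): pin each model to its own device so decoding never competes with denoising, and pay only one asynchronous PCIe copy per chunk.

import torch
import torch.nn as nn

DIT_DEVICE = torch.device("cuda:0")          # e.g. the H100
VAE_DEVICE = torch.device("cuda:1")          # e.g. the separate A10

dit = nn.Linear(64, 64).to(DIT_DEVICE)       # stand-in for the DiT
vae = nn.Linear(64, 64).to(VAE_DEVICE)       # stand-in for the VAE decoder

@torch.inference_mode()
def denoise_and_decode(latent: torch.Tensor) -> torch.Tensor:
    motion = dit(latent.to(DIT_DEVICE))
    # one non-blocking PCIe copy per chunk, then decode on the other card
    return vae(motion.to(VAE_DEVICE, non_blocking=True))

print(denoise_and_decode(torch.randn(1, 64)).device)   # -> cuda:1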

10. How RealVideo Compares to Other Routes

| Metric      | RealVideo                  | Pre-recorded WebRTC            | 3-D Avatar (audio only) |
|-------------|----------------------------|--------------------------------|-------------------------|
| Realism     | Photo + diffusion lip-sync | High (but fixed)               | Cartoony                |
| Latency     | <500 ms end-to-end         | 0 ms playback, not interactive | ~200 ms, audio only     |
| Custom face | Swap any JPG               | Re-shoot the whole video       | Re-rig bones            |
| Hardware    | 2 × 80 GB GPUs             | None                           | 1 × 24 GB GPU           |

11. Licence & Credits

  • Model weights: z.ai academic + commercial licence (attribution required).
  • Code: MIT.
  • Includes Self-Forcing library (MIT) by guandeh17—thanks!

12. What Will You Build Next?

RealVideo turns the old idea of a “digital human” into a shell one-liner: two GPUs, one exported API key, one bash script.
From here you can bolt on long-term memory, multi-camera streaming, or drop the WebSocket URL inside a Shopify chat widget—your photo, your voice, answering buyers 24/7.

Clone the repo, run the script, and you will see yourself say something you never recorded—a moment that makes most first-time users laugh out loud.
After that, the only limit is what you decide to type.