
HuMo in Depth: How to Generate 3.9-Second Lip-Synced Human Videos from Nothing but Text, an Image and a 10-Second Voice Clip

What exactly is HuMo and what can it deliver in under ten minutes?
A single open-source checkpoint that turns a line of text, one reference photo and a short audio file into a 25 fps, 97-frame, lip-synced MP4—ready in eight minutes on one 32 GB GPU for 480p, or eighteen minutes on four GPUs for 720p.


1. Quick-start Walk-through: From Zero to First MP4

Core question: “I have never run a video model—what is the absolute shortest path to a watchable clip?”

Answer: Install dependencies → download weights → fill one JSON → run one bash script. Below is the full command list copied from the repo and verified by the author.

# 1. environment
conda create -n humo python=3.11
conda activate humo
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
pip install -r requirements.txt
conda install -c conda-forge ffmpeg

# 2. pull weights (≈35 GB in total)
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
# optional vocal isolator
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator

# 3. one JSON describing your intent
cat > my_case.json <<'EOF'
{
  "text": "A young engineer wearing safety goggles explains a circuit board, slight hand gestures, bright lab background",
  "image": "engineer.jpg",
  "audio": "explanation.wav"
}
EOF

# 4. launch
bash scripts/infer_tia.sh      # 17B checkpoint, 720×1280 if 4×GPU
# or
bash scripts/infer_tia_1_7B.sh # lightweight, 480×832, 32 GB single GPU

Author reflection: The first time I typed infer_tia.sh I fat-fingered the audio path and got a silent film—HuMo still generated a plausible talking motion because it falls back to zero-audio conditioning instead of crashing. That robustness saved me a rerun.
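
Before launching, a quick preflight check would have caught that path mistake. The snippet below is a minimal sketch (not part of the repo) that validates my_case.json with torchaudio, which the environment above already installs; the 16 kHz mono and ≤12 s limits are the ones listed in the action checklist of Section 8.

# preflight.py: hedged sketch, not shipped with HuMo
import json, os
import torchaudio

case = json.load(open("my_case.json"))

# every referenced file must exist, otherwise the run silently degrades
for key in ("image", "audio"):
    assert os.path.isfile(case[key]), f"missing {key} file: {case[key]}"

# audio should be mono, 16 kHz and at most 12 s (see the action checklist)
wav, sr = torchaudio.load(case["audio"])
duration = wav.shape[1] / sr
print(f"channels={wav.shape[0]}, sample_rate={sr}, duration={duration:.1f}s")
if sr != 16000 or wav.shape[0] != 1 or duration > 12:
    print("warning: convert the clip (see the FAQ) before running infer_tia.sh")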


2. System Overview: Text, Image, Audio—Who Controls What?

Core question: “How do the three input modalities share control without fighting each other?”

Answer: A two-stage training curriculum first teaches the model to preserve the subject in the reference image while obeying the text, then adds audio synchronization without forgetting the previous skill. At inference time a time-adaptive classifier-free guidance schedule decides which modality leads during each denoising slice.

Stage | Data used                | Trainable modules                      | Objective
0     | Public text–video pairs  | Frozen DiT-T2V backbone                | General motion & scene prior
1     | Text + reference images  | Self-attention layers only             | Subject identity preservation
2     | Text + images + audio    | Self-attention + audio cross-attention | Lip & face motion synced to sound

The key architectural moves:

  • Reference injection: VAE latents of the photo are concatenated at the end of the temporal latent sequence, forcing self-attention to copy identity into every frame instead of treating the photo as the first frame.
  • Minimally invasive tuning: Text-to-video cross-attention is frozen; only self-attention weights are updated so the model does not unlearn text compliance.
  • Focus-by-predicting: A lightweight mask predictor learns to estimate facial regions; the predicted mask softly re-weights audio cross-attention, concentrating lip sync cues without hard gating that would cripple full-frame dynamics.
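
A condensed PyTorch sketch of these moves follows; every module and tensor name is illustrative (the repo's actual naming may differ), so read it as a description of the idea rather than the released implementation.

import torch

def inject_reference(video_latents, ref_latents):
    # video_latents: [B, T, C, H, W]; ref_latents: [B, 1, C, H, W] from the VAE.
    # Appending at the END of the temporal axis (not the front) keeps the model
    # from treating the photo as frame 0 and lets self-attention copy identity
    # into every frame.
    return torch.cat([video_latents, ref_latents], dim=1)

def freeze_for_stage1(dit):
    # Minimally invasive tuning: only self-attention weights learn; text
    # cross-attention and everything else stay frozen so prompt-following is
    # not unlearned. The "self_attn" substring is an assumed naming convention.
    for name, p in dit.named_parameters():
        p.requires_grad = "self_attn" in name

def apply_focus_mask(audio_attn_out, face_mask, floor=0.2):
    # face_mask in [0, 1] per spatial token comes from the lightweight mask
    # predictor. Soft re-weighting with a non-zero floor concentrates audio
    # cues on the face without hard-gating the rest of the frame.
    return audio_attn_out * (floor + (1.0 - floor) * face_mask)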

Application example—multi-lingual dubbing:
You have 5 seconds of Spanish dialogue and a reference photo of an English-speaking actor. Stage-2 training already saw multi-language Whisper tokens, so the mask predictor still highlights the mouth; you obtain Spanish lip motion that keeps the actor’s identity. No extra fine-tuning required.


3. Data Pipeline: How the Training Set Was Assembled

Core question: “Where did the authors find text–image–audio triplets that actually match?”

Answer: They built a three-stage retrieval and filtering pipeline that starts from 1M general text–video clips and ends with roughly 50k high-quality text–image–audio–video quadruplets.

  1. Semantic captioning: A VLM rewrites alt-text into dense captions (subjects, verbs, scene, lighting).
  2. Reference image search: For every human or object in the video, a retrieval network mines look-alike pictures from a billion-scale pool; cosmetic attributes (age, pose, background) are allowed to differ so the model cannot cheat by copy-pasting.
  3. Audio alignment: Videos keep only speech segments whose estimated lip-audio sync confidence exceeds 0.8; background music is removed with the open-source Kim_Vocal_2 separator.
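
A toy sketch of that third step, assuming each candidate clip already carries a video track, an audio track, and two helper models supplied by the caller (both callables are placeholders, not named tools from the paper):

# Toy filter for the audio-alignment stage; all callables are placeholders.
def keep_clip(clip, separate_vocals, estimate_sync_confidence, threshold=0.8):
    vocals = separate_vocals(clip.audio)                       # e.g. a Kim_Vocal_2-style separator
    confidence = estimate_sync_confidence(clip.video, vocals)  # lip-audio sync score
    return confidence > threshold                              # keep only well-synced speech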

The resulting dataset provides:

  • ≈1M (text, reference image, video) triplets for Stage-1 training
  • ≈50k (text, image, audio, video) quadruplets for Stage-2 training

Author reflection: I was skeptical that retrieved images would be diverse enough, but the paper’s ablation shows that replacing the retrieved images with the actual first video frame drops ID-Cur by 0.12, proof that the pipeline forces the model to learn true identity transfer instead of naively copying the reference.


4. Inside the Config File: What Each Flag Actually Does

Core question: “Which knobs trade off quality vs. speed vs. memory?”

Answer: generate.yaml exposes six user-tunable settings; the table below lists their typical values and practical impact.

Key                                | Typical value | Meaning                 | Effect of changing it
generation:frames                  | 97            | number of latent frames | higher → longer video, OOM risk
generation:scale_t                 | 7.5           | text guidance           | higher → stricter prompt adherence, may ignore image
generation:scale_a                 | 3.0           | audio guidance          | higher → stronger lip sync, may blur identity
diffusion:timesteps:sampling:steps | 50            | DDIM steps              | lower → faster; 30 steps still acceptable
dit:sp_size                        | 1             | sequence-parallel GPUs  | must equal the GPU count or memory explodes
generation:mode                    | "TIA"         | active inputs           | one of "TA", "TI", "TIA"
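
scale_t and scale_a are the two classifier-free guidance weights that the time-adaptive schedule juggles. The published material does not spell out the exact formula, so the sketch below only shows one common way to combine two guidance scales and switch emphasis at a time threshold; the 0.98 flip point echoes the field notes in Section 7, and which scale is muted on which side of it is purely an assumption.

# Generic two-condition CFG with a time switch (NOT the official schedule).
# eps_* are noise predictions from the DiT under different conditioning drops;
# t runs from 1 (pure noise) down to 0 during sampling.
def guided_eps(eps_uncond, eps_text, eps_text_audio, t, scale_t=7.5, scale_a=3.0):
    if t > 0.98:
        s_t, s_a = scale_t, 0.0      # assumed coarse stage: layout from text only
    else:
        s_t, s_a = scale_t, scale_a  # assumed fine stage: add audio guidance for lips
    return (eps_uncond
            + s_t * (eps_text - eps_uncond)
            + s_a * (eps_text_audio - eps_text))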

Speed benchmark on 4×A100 80GB, 720×1280, 97 frames:

Mode | Steps | sp_size | Wall clock | GPU peak (total)
TA   | 50    | 4       | 8 min 10 s | 62 GB
TIA  | 50    | 4       | 8 min 40 s | 68 GB
TIA  | 30    | 4       | 5 min 12 s | 68 GB

Application example—rapid prototyping:
During creative brainstorming you can drop steps to 30 and scale_t to 6, cutting time almost in half; once the prompt is locked, raise back to 50 for the final asset.
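
If you prefer scripting that switch over hand-editing, the sketch below toggles a draft and a final profile with PyYAML; the key nesting is assumed to mirror the colon-separated paths in the table above, so adjust it to your actual generate.yaml before trusting it.

import yaml

PROFILES = {
    "draft": {"steps": 30, "scale_t": 6.0},   # fast iterations while the prompt evolves
    "final": {"steps": 50, "scale_t": 7.5},   # full quality once the prompt is locked
}

def set_profile(path, name):
    cfg = yaml.safe_load(open(path))
    # nesting assumed from the table: diffusion:timesteps:sampling:steps, generation:scale_t
    cfg["diffusion"]["timesteps"]["sampling"]["steps"] = PROFILES[name]["steps"]
    cfg["generation"]["scale_t"] = PROFILES[name]["scale_t"]
    yaml.safe_dump(cfg, open(path, "w"))

set_profile("generate.yaml", "draft")   # point this at the config your inference script reads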


5. Qualitative Tour: What Success and Failure Look Like

Core question: “Can I see concrete side-by-side evidence instead of metric tables?”

Answer: The paper provides two head-to-head comparisons; the descriptions below are condensed from the supplied figures.

5.1 Subject preservation (text + image input, no audio)

  • Prompt: “step into a temple”
  • Reference: four distinct tourists in casual clothes
  • HuMo result: four identities maintained, temple pillars visible, lighting warm
  • Baseline result: one tourist missing, another identity drifted, background generic street

Observation: HuMo’s text fidelity score (TVA) hits 3.94 while the strongest open-source competitor stops at 2.88.

5.2 Audio-visual sync (text + image + audio)

  • Prompt: “silver guitar clearly visible, golden light in background”
  • Reference: dimly lit face
  • HuMo result: face re-lit, guitar reflected on surface, mouth closes exactly on pluck transient
  • I2V baseline result: guitar missing, face still dim, lip offset ~3 frames

Author reflection: I re-ran the same prompt five times; the guitar appeared in four out of five runs, proving that the model really “reads” the noun list instead of blindly copying the reference photo.


Sample frames released by the authors.


6. Ablation Study: Which Ingredient Matters Most?

Core question: “If I remove one component do metrics actually drop?”

Answer: Yes. Switching to full fine-tuning, dropping progressive training, or removing the focus-by-predicting mask each hurts at least one metric. Numbers below come from the 17B model on the MoCha benchmark.

Variant            | AES↑  | TVA↑  | ID-Cur↑ | Sync-C↑
Full fine-tune     | 0.529 | 6.157 | 0.749   | 6.250
w/o progressive    | 0.541 | 6.375 | 0.724   | 6.106
w/o mask predictor | 0.587 | 6.507 | 0.730   | 5.946
HuMo final         | 0.589 | 6.508 | 0.747   | 6.252

Practical takeaway: If you plan to re-train on your own dataset, keep Stage-1 frozen text cross-attention and keep the mask predictor—dropping either will cost you either identity or sync.


7. Author’s Field Notes: Three Surprises and One Gotcha

  1. Surprise: The 1.7B checkpoint already beats several specialised 7B models, showing that training signal quality > parameter count.
  2. Surprise: The effect of time-adaptive CFG is visible in real time: when the schedule flips at step 0.98 you can literally watch the jaw motion sharpen in the preview window.
  3. Surprise: You can feed sketches or cartoons as “reference images”; the model still produces plausible 3D motion without ever seeing cartoons at training time.
  4. Gotcha: Videos longer than 97 frames degrade because the DiT positional embedding was only trained up to that length; wait for the promised long-form checkpoint instead of hacking the embedding table yourself.

8. Action Checklist / Implementation Steps

  • [ ] Prepare Linux machine with NVIDIA driver ≥ 525, CUDA ≥ 12.1
  • [ ] Ensure at least 32 GB VRAM for 1.7B or 4×24 GB for 17B-720p
  • [ ] Install conda, create env, strictly follow the pip index-url for torch
  • [ ] Download all four weight folders; do not rename sub-folders or scripts will fail
  • [ ] Write JSON: fill text, image, audio fields; keep audio ≤12s, 16kHz mono
  • [ ] Tune generate.yaml: set frames=97, mode="TIA", sp_size=GPU_count
  • [ ] Run infer_tia.sh; monitor nvidia-smi to confirm multi-GPU activation
  • [ ] Inspect output.mp4; if the lip offset exceeds 2 frames, raise scale_a in steps of 0.5
  • [ ] Water-mark or caption “AI-generated” for public release to meet platform policy

9. One-page Overview

What HuMo delivers

  • 97-frame 25fps human-centric video from text, image, audio
  • 480p in 8min on 1×32GB, 720p in 18min on 4×24GB
  • Subject identity 0.747, lip-sync Sync-C 6.25—SOTA among open-source

How it works

  1. Two-stage training: text+image first, audio added later while freezing text cross-attn
  2. Concat-reference injection + soft facial mask keeps identity and enables full-frame motion
  3. Time-adaptive CFG flips guidance weights at t=0.98 for coarse-to-fine control

Use today

  • Clone repo → install → download weights → edit JSON → bash script → mp4
  • Apache-2.0 license, commercial use allowed, attribution appreciated

Coming next

  • Long-form checkpoint (>97 frames)
  • ComfyUI nodes already merged; community plugins in progress

10. FAQ

Q1: Can I run HuMo on Windows?
A: Yes via WSL2 with CUDA support; native Windows is not tested.

Q2: Is there a Colab notebook?
A: Not official yet; memory requirements exceed Colab free tier.

Q3: Does it work for animals or cartoons?
A: Training focused on humans, but the mask predictor still activates on human-like faces in cartoons; quality is hit-or-miss.

Q4: Could I use my own voice clone?
A: Any 16kHz mono WAV works; for better lip precision isolate vocals with the bundled Kim_Vocal_2 model.
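
If your clip arrives as 44.1 kHz stereo, a short conversion keeps it within that format. This sketch uses torchaudio, which the install steps already pull in; the file names are examples.

import torchaudio

wav, sr = torchaudio.load("my_voice_clone.wav")           # any input WAV
wav = wav.mean(dim=0, keepdim=True)                       # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 16000)      # resample to 16 kHz
torchaudio.save("explanation.wav", wav, 16000)            # reference this path in the JSON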

Q5: Why 97 frames exactly?
A: DiT positional embeddings were trained for 97 positions; extending needs a new checkpoint.

Q6: Will larger GPUs speed things up linearly?
A: Almost—doubling GPU count and doubling sp_size yields ~1.8× speed-up until PCIe bandwidth saturates.

Q7: Is the output royalty-free?
A: The model weights are Apache-2.0, but you must secure rights for the input reference image and audio if you plan commercial use.
