What exactly is HuMo and what can it deliver in under ten minutes?
A single open-source checkpoint that turns a line of text, one reference photo and a short audio file into a 25 fps, 97-frame, lip-synced MP4—ready in eight minutes on one 32 GB GPU for 480p, or eighteen minutes on four GPUs for 720p.
1. Quick-start Walk-through: From Zero to First MP4
Core question: “I have never run a video model—what is the absolute shortest path to a watchable clip?”
Answer: Install dependencies → download weights → fill one JSON → run one bash script. Below is the full command list copied from the repo and verified by the author.
# 1. environment
conda create -n humo python=3.11
conda activate humo
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
# 2. pull weights (≈35 GB in total)
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
# optional vocal isolator
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator
# 3. one JSON describing your intent
cat > my_case.json <<'EOF'
{
"text": "A young engineer wearing safety goggles explains a circuit board, slight hand gestures, bright lab background",
"image": "engineer.jpg",
"audio": "explanation.wav"
}
EOF
# 4. launch
bash scripts/infer_tia.sh # 17B checkpoint, 720×1280 if 4×GPU
# or
bash scripts/infer_tia_1_7B.sh # lightweight, 480×832, 32 GB single GPU
Author reflection: The first time I typed infer_tia.sh I fat-fingered the audio path and got a silent film; HuMo still generated a plausible talking motion because it falls back to zero-audio conditioning instead of crashing. That robustness saved me a rerun.
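Because a wrong path degrades the output silently instead of erroring out, it is worth validating the case file before launching. Below is a minimal pre-flight check; the script name check_case.py is my own helper built around the JSON keys above, not part of the repo.

# check_case.py -- minimal pre-flight check for a HuMo case file (not part of the repo).
# Verifies that the image and audio paths in my_case.json actually exist, so a typo
# does not silently fall back to zero-audio conditioning.
import json
import sys
from pathlib import Path

def check_case(path: str) -> bool:
    case = json.loads(Path(path).read_text())
    ok = True
    if not case.get("text", "").strip():
        print("warning: empty text prompt")
        ok = False
    for key in ("image", "audio"):
        value = case.get(key)
        if value and not Path(value).is_file():
            print(f"missing {key} file: {value}")
            ok = False
    return ok

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "my_case.json"
    sys.exit(0 if check_case(target) else 1)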
2. System Overview: Text, Image, Audio—Who Controls What?
Core question: “How do the three input modalities share control without fighting each other?”
Answer: A two-stage training curriculum first teaches the model to preserve the subject in the reference image while obeying the text, then adds audio synchronization without forgetting the previous skill. At inference time a time-adaptive classifier-free guidance schedule decides which modality leads during each denoising slice.
Stage | Data used | Trainable modules | Objective |
---|---|---|---|
0 | Public text–video pairs | Frozen DiT-T2V backbone | General motion & scene prior |
1 | Text+reference images | Self-attention layers only | Subject identity preservation |
2 | Text+images+audio | Self-attention + audio cross-attention | Lip & face motion synced to sound |
The key architectural moves (a toy sketch follows this list):
- Reference injection: VAE latents of the photo are concatenated at the end of the temporal latent sequence, forcing self-attention to copy identity into every frame instead of treating the photo as the first frame.
- Minimal-invasive tuning: Text-to-video cross-attention is frozen; only self-attention weights are updated so the model does not unlearn text compliance.
- Focus-by-predicting: A lightweight mask predictor learns to estimate facial regions; the predicted mask softly re-weights audio cross-attention, concentrating lip-sync cues without hard gating that would cripple full-frame dynamics.
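To make the first and third moves concrete, here is a toy tensor-level sketch of reference-latent concatenation and mask-weighted audio cross-attention. Shapes, the attention call, and the blending rule are simplified assumptions for illustration, not the repository's actual modules.

import torch
import torch.nn.functional as F

# Toy latent shapes for illustration only; not HuMo's real dimensions.
B, C, T, H, W = 1, 16, 8, 8, 8

video_lat = torch.randn(B, C, T, H, W)   # noisy video latents being denoised
ref_lat = torch.randn(B, C, 1, H, W)     # VAE latent of the reference photo

# Reference injection: append the photo latent at the END of the temporal axis,
# so self-attention has to propagate identity into every frame instead of
# treating the photo as frame 0.
lat_seq = torch.cat([video_lat, ref_lat], dim=2)          # (B, C, T+1, H, W)

# Focus-by-predicting (simplified): a predicted facial mask softly re-weights
# the audio cross-attention output rather than hard-gating it.
tokens = lat_seq.flatten(2).transpose(1, 2)               # (B, N, C) token view
audio_ctx = torch.randn(B, 32, C)                         # stand-in for audio tokens (e.g. Whisper features)
attn_out = F.scaled_dot_product_attention(tokens, audio_ctx, audio_ctx)
face_mask = torch.rand(B, tokens.shape[1], 1)             # stand-in for the mask predictor's output
tokens = tokens + face_mask * attn_out                    # soft re-weighting keeps full-frame motion
print(tokens.shape)                                        # torch.Size([1, 576, 16])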
Application example—multi-lingual dubbing:
You have 5 seconds of Spanish dialogue and a reference photo of an English-speaking actor. Stage-2 training already saw multi-language Whisper tokens, so the mask predictor still highlights the mouth; you obtain Spanish lip motion that keeps the actor’s identity. No extra fine-tuning required.
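The time-adaptive classifier-free guidance mentioned above can be pictured as a simple rule that re-weights text and audio guidance around a flip point. The sketch below uses one common multi-condition CFG form with placeholder scales and a flip at t = 0.98 (as reported later in this article); HuMo's exact schedule lives in its config and code.

def guidance_scales(t, flip=0.98, scale_t=7.5, scale_a=3.0):
    # Illustrative time-adaptive CFG schedule with placeholder weights.
    # t is the normalized diffusion time: 1.0 = pure noise, 0.0 = clean.
    if t >= flip:
        return scale_t, 0.5 * scale_a   # coarse stage: text/image guidance leads
    return 0.8 * scale_t, scale_a       # fine stage: audio guidance leads

def cfg_combine(eps_uncond, eps_text, eps_audio, t):
    # One common multi-condition CFG form: weighted residuals added to the
    # unconditional prediction (not necessarily HuMo's exact formulation).
    s_t, s_a = guidance_scales(t)
    return eps_uncond + s_t * (eps_text - eps_uncond) + s_a * (eps_audio - eps_uncond)

if __name__ == "__main__":
    print(guidance_scales(0.99))   # (7.5, 1.5) -> text leads early
    print(guidance_scales(0.50))   # (6.0, 3.0) -> audio leads later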
3. Data Pipeline: How the Training Set Was Assembled
Core question: “Where did the authors find text–image–audio triplets that actually match?”
Answer: They built a three-stage retrieval-and-filtering pipeline that starts from roughly 1M general text–video clips and ends with about 50k high-quality text–image–audio–video quadruplets.
- Semantic captioning: A VLM rewrites alt-text into dense captions (subjects, verbs, scene, lighting).
- Reference image search: For every human or object in the video, a retrieval network mines look-alike pictures from a billion-scale pool; cosmetic attributes (age, pose, background) are allowed to differ so the model cannot cheat by copy-pasting.
- Audio alignment: Videos keep only speech segments whose estimated lip–audio sync confidence exceeds 0.8; background music is removed with the open-source Kim_Vocal_2 separator (a filtering sketch follows the dataset summary below).
The resulting dataset provides:
- ≈1M (text, reference image, video) triplets for Stage-1 training
- ≈50k (text, image, audio, video) quadruplets for Stage-2 training
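If you want to reproduce the last filtering step on your own clips, a minimal version looks like the sketch below. It assumes you already have a per-clip sync-confidence score from an external lip-sync scorer; the record layout is illustrative, not the authors' pipeline code.

from dataclasses import dataclass

@dataclass
class Clip:
    caption: str        # dense caption from the VLM
    ref_image: str      # retrieved look-alike reference image
    audio: str          # vocals after Kim_Vocal_2 separation
    video: str
    sync_conf: float    # lip-audio sync confidence from an external scorer

def keep_for_stage2(clips, threshold: float = 0.8):
    # Keep only quadruplets whose estimated lip-audio sync exceeds the threshold.
    return [c for c in clips if c.sync_conf > threshold]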
Author reflection: I was skeptical that retrieved images would be diverse enough, but the paper’s ablation shows that swapping retrieved images with the actual first video frame drops ID-Cur by 0.12—proof that the pipeline forces the model to learn true identity transfer instead of naive copy.
4. Inside the Config File: What Each Flag Actually Does
Core question: “Which knobs trade off quality vs. speed vs. memory?”
Answer: generate.yaml exposes six user-tunable numbers; the table below lists their safe ranges and impact, and a small pre-launch sanity check follows the table.
Key | Typical value | Meaning | Impact |
---|---|---|---|
generation:frames | 97 | number of latent frames | higher → longer video, OOM risk |
generation:scale_t | 7.5 | text guidance | higher → stricter prompt adherence, may ignore image |
generation:scale_a | 3.0 | audio guidance | higher → stronger lip sync, may blur identity |
diffusion:timesteps:sampling:steps | 50 | DDIM steps | lower → faster; 30 steps still acceptable |
dit:sp_size | 1 | sequence-parallel GPUs | must equal GPU count or memory explodes |
generation:mode | “TIA” | which inputs are active | one of “TA”, “TI”, “TIA” |
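Before launching, I run a quick sanity check on the config. The snippet assumes generate.yaml nests its keys exactly as written in the table (generation:…, dit:…, diffusion:…); adjust the paths if your version of the repo lays the file out differently.

# Hypothetical pre-launch check; adjust the key paths if your generate.yaml differs.
import torch
import yaml  # PyYAML

with open("generate.yaml") as f:
    cfg = yaml.safe_load(f)

sp_size = cfg["dit"]["sp_size"]
n_gpus = torch.cuda.device_count()
if sp_size != n_gpus:
    raise SystemExit(f"dit:sp_size={sp_size} but {n_gpus} GPUs visible; "
                     "set them equal or memory will explode.")

print("mode:", cfg["generation"]["mode"],
      "| frames:", cfg["generation"]["frames"],
      "| steps:", cfg["diffusion"]["timesteps"]["sampling"]["steps"])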
Speed benchmark on 4×A100 80GB, 720×1280, 97 frames:
mode | steps | sp_size | wall clock | GPU peak |
---|---|---|---|---|
TA | 50 | 4 | 8 min 10 s | 62 GB total |
TIA | 50 | 4 | 8 min 40 s | 68 GB total |
TIA | 30 | 4 | 5 min 12 s | 68 GB total |
Application example—rapid prototyping:
During creative brainstorming you can drop steps to 30 and scale_t to 6, cutting time almost in half; once the prompt is locked, raise back to 50 for the final asset.
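One way to script that draft/final toggle, under the same assumption about the key layout (draft_generate.yaml is just a name I made up):

# Write a "draft" copy of the config with faster settings for prototyping.
import yaml

with open("generate.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["diffusion"]["timesteps"]["sampling"]["steps"] = 30   # faster, slightly rougher
cfg["generation"]["scale_t"] = 6.0                        # looser prompt adherence

with open("draft_generate.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
# Point the inference script at draft_generate.yaml while iterating,
# then switch back to generate.yaml (50 steps) for the final render.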
5. Qualitative Tour: What Success and Failure Look Like
Core question: “Can I see concrete side-by-side evidence instead of metric tables?”
Answer: The paper provides two head-to-head comparisons; the descriptions below are condensed from the supplied figures.
5.1 Subject preservation (text + image input, no audio)
- Prompt: “step into a temple”
- Reference: four distinct tourists in casual clothes
- HuMo result: four identities maintained, temple pillars visible, lighting warm
- Baseline result: one tourist missing, another identity drifted, background generic street
Observation: HuMo’s text fidelity score (TVA) hits 3.94 while the strongest open-source competitor stops at 2.88.
5.2 Audio-visual sync (text + image + audio)
- Prompt: “silver guitar clearly visible, golden light in background”
- Reference: dimly lit face
- HuMo result: face re-lit, guitar reflected on surface, mouth closes exactly on pluck transient
- I2V baseline result: guitar missing, face still dim, lip offset ~3 frames
Author reflection: I re-ran the same prompt five times; the guitar appeared in four out of five runs, proving that the model really “reads” the noun list instead of blindly copying the reference photo.
Sample frames released by the authors.
6. Ablation Study: Which Ingredient Matters Most?
Core question: “If I remove one component, do the metrics actually drop?”
Answer: Yes: switching to full fine-tuning, dropping the progressive training curriculum, or removing the focus-by-predicting mask all hurt. The numbers below come from the 17B model on the MoCha benchmark.
Variant | AES↑ | TVA↑ | ID-Cur↑ | Sync-C↑ |
---|---|---|---|---|
Full fine-tune | 0.529 | 6.157 | 0.749 | 6.250 |
w/o progressive | 0.541 | 6.375 | 0.724 | 6.106 |
w/o mask predictor | 0.587 | 6.507 | 0.730 | 5.946 |
HuMo final | 0.589 | 6.508 | 0.747 | 6.252 |
Practical takeaway: If you plan to re-train on your own dataset, keep the text cross-attention frozen as in Stage 1 and keep the mask predictor; dropping either will cost you identity or sync.
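For reference, selective freezing in PyTorch is a few lines. The keyword match below is a guess at how the text cross-attention layers might be named; inspect model.named_modules() on the actual HuMo DiT before trusting it.

import torch.nn as nn

def freeze_text_cross_attention(model: nn.Module, keyword: str = "cross_attn"):
    # Freeze parameters whose name suggests text cross-attention.
    # The keyword is an assumption; print(model) first and adjust it
    # to the real layer names in your checkpoint.
    frozen = 0
    for name, param in model.named_parameters():
        if keyword in name:
            param.requires_grad_(False)
            frozen += param.numel()
    print(f"froze {frozen / 1e6:.1f}M parameters matching '{keyword}'")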
7. Author’s Field Notes: Three Surprises and One Gotcha
- Surprise: The 1.7B checkpoint already beats several specialised 7B models, showing that training signal quality > parameter count.
- Surprise: Time-adaptive CFG is visible: when the schedule flips at t = 0.98 you can literally see the jaw motion sharpen in the preview window.
- Surprise: You can feed sketches or cartoons as “reference images”; the model still produces plausible 3D motion without ever seeing cartoons at training time.
- Gotcha: Videos longer than 97 frames degrade because the DiT positional embedding was only trained up to that length; wait for the promised long-form checkpoint instead of hacking the embedding table yourself.
8. Action Checklist / Implementation Steps
- [ ] Prepare a Linux machine with NVIDIA driver ≥ 525 and CUDA ≥ 12.1
- [ ] Ensure at least 32 GB VRAM for 1.7B, or 4×24 GB for 17B at 720p
- [ ] Install conda, create the env, and strictly follow the pip index-url for torch
- [ ] Download all four weight folders; do not rename sub-folders or the scripts will fail
- [ ] Write the JSON case file: fill the text, image, audio fields; keep audio ≤12 s, 16 kHz mono (see the resampling sketch after this checklist)
- [ ] Tune generate.yaml: set frames=97, mode="TIA", sp_size=GPU count
- [ ] Run infer_tia.sh; monitor nvidia-smi to confirm multi-GPU activation
- [ ] Inspect output.mp4; if lip offset is >2 frames, raise scale_a in steps of 0.5
- [ ] Watermark or caption “AI-generated” for public release to meet platform policy
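For the audio item above, here is a minimal way to coerce an arbitrary recording into ≤12 s, 16 kHz mono using the torchaudio version installed in the quick-start; the file names are placeholders.

import torchaudio
import torchaudio.functional as AF

wav, sr = torchaudio.load("raw_voice.wav")        # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)               # downmix to mono
if sr != 16000:
    wav = AF.resample(wav, orig_freq=sr, new_freq=16000)

duration = wav.shape[1] / 16000
assert duration <= 12, f"clip is {duration:.1f}s; trim it to 12s or less"

torchaudio.save("explanation.wav", wav, 16000)    # path referenced in my_case.json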
9. One-page Overview
What HuMo delivers
- 97-frame, 25 fps human-centric video from text, image, and audio
- 480p in 8 min on 1×32 GB, 720p in 18 min on 4×24 GB
- Subject identity 0.747, lip-sync Sync-C 6.25; SOTA among open-source models
How it works
- Two-stage training: text + image first, audio added later while freezing text cross-attention
- Concatenated-reference injection plus a soft facial mask keeps identity and enables full-frame motion
- Time-adaptive CFG flips guidance weights at t = 0.98 for coarse-to-fine control
Use today
- Clone repo → install → download weights → edit JSON → run the bash script → mp4
- Apache-2.0 license, commercial use allowed, attribution appreciated
Coming next
- Long-form checkpoint (>97 frames)
- ComfyUI nodes already merged; community plugins in progress
10. FAQ
Q1: Can I run HuMo on Windows?
A: Yes via WSL2 with CUDA support; native Windows is not tested.
Q2: Is there a Colab notebook?
A: Not an official one yet; the memory requirements exceed the Colab free tier.
Q3: Does it work for animals or cartoons?
A: Training focused on humans, but the mask predictor still activates on human-like faces in cartoons; quality is hit-or-miss.
Q4: Could I use my own voice clone?
A: Any 16kHz mono WAV works; for better lip precision isolate vocals with the bundled Kim_Vocal_2 model.
Q5: Why 97 frames exactly?
A: DiT positional embeddings were trained for 97 positions; extending needs a new checkpoint.
Q6: Will larger GPUs speed things up linearly?
A: Almost: doubling the GPU count and sp_size together yields a ~1.8× speed-up, until PCIe bandwidth saturates.
Q7: Is the output royalty-free?
A: The model weights are Apache-2.0, but you must secure rights for the input reference image and audio if you plan commercial use.