From $500: How the New Pusa V1.0 Video Model Slashes Training Costs Without Cutting Corners
A plain-language guide for developers, artists, and small teams who want high-quality video generation on a tight budget.
TL;DR
- Problem: Training a state-of-the-art image-to-video (I2V) model usually costs ≥ $100 k and needs ≥ 10 million clips.
- Solution: Pusa V1.0 uses vectorized timesteps—a tiny change in how noise is handled—so you can reach the same quality with $500 and 4 000 clips.
- Outcome: One checkpoint runs text-to-video, image-to-video, start-to-end frames, video extension, and transition tasks without extra training.
- Time to first clip: 30 minutes on an 8-GPU node (or 2 hours if you rent cloud GPUs).
1. What Makes Pusa Different?
Traditional video diffusion models treat the whole clip as one big block: the same noise level is added or removed across every frame at each step. That is simple, but it forces the model to relearn “keep the first frame unchanged” every time you ask it to animate a still picture. Pusa breaks this block into individual frame clocks—each frame gets its own noise level.
| Traditional Scalar Timestep | Pusa Vectorized Timestep |
|---|---|
| One slider controls all 16 frames | 16 independent sliders |
| Needs giant datasets to learn frame locking | Frame locking is handled by code, not data |
| Fine-tuning can overwrite text-to-video skills | Original skills stay intact |
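To make the “independent sliders” idea concrete, here is a minimal PyTorch sketch; the tensor shapes and the simple linear noising rule are illustrative assumptions, not Pusa's actual code:

```python
import torch

# Illustrative latent shape: (batch, frames, channels, height, width)
latents = torch.randn(1, 16, 4, 60, 104)
noise = torch.randn_like(latents)

# Traditional scalar timestep: one noise level shared by all 16 frames.
t = torch.rand(1).view(1, 1, 1, 1, 1)
noisy_scalar = (1 - t) * latents + t * noise

# Vectorized timestep: an independent noise level per frame.
tau = torch.rand(1, 16).view(1, 16, 1, 1, 1)
noisy_vector = (1 - tau) * latents + tau * noise

# Image-to-video "frame locking" becomes a sampling choice, not a training task:
# keep the first frame's clock at zero so it is never noised.
tau_locked = tau.clone()
tau_locked[:, 0] = 0.0
noisy_locked = (1 - tau_locked) * latents + tau_locked * noise  # frame 0 stays clean
```

Because locking a frame is just setting its clock to zero, the model never has to learn that behavior from data.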
2. Quick Start: Run Your First Clip in 30 Minutes
2.1 One-Command Install
git clone https://github.com/Yaofang-Liu/Pusa-VidGen.git
cd Pusa-VidGen/PusaV1
pip install uv # fast Python package manager
uv venv .venv && source .venv/bin/activate
uv pip install -r requirements.txt
2.2 Pull the Weights
huggingface-cli download RaphaelLiu/PusaV1 --local-dir ./models
# or unzip the manual download into ./models
2.3 Text-to-Video (T2V) Example
python scripts/infer_t2v.py \
--model_dir ./models \
--prompt "A slow-motion close-up of a hummingbird hovering near red flowers" \
--num_steps 10 \
--output hummingbird.mp4
Result: an 81-frame clip at 24 fps (about 3.4 seconds), ready for social media.
2.4 Image-to-Video (I2V) Example
python scripts/infer_i2v.py \
--model_dir ./models \
--image_path ./assets/photo.jpg \
--prompt "The camera slowly pulls back revealing a beach sunset" \
--noise_multiplier 0.2 \
--num_steps 10 \
--output sunset_zoom.mp4
2.5 Start-and-End Frame Interpolation
python scripts/infer_start_end.py \
--model_dir ./models \
--start_frame ./assets/start.jpg \
--end_frame ./assets/end.jpg \
--prompt "Smooth cinematic dolly between two viewpoints" \
--num_steps 10 \
--output dolly.mp4
3. Under the Hood: Why Vector Timesteps Work
3.1 The Math in One Paragraph
Instead of a single scalar t ∈ [0, 1] that tells the whole video how noisy it is, Pusa gives every frame its own scalar τᵢ ∈ [0, 1]. Training becomes a frame-aware flow-matching problem: the network only needs to learn how each frame's pixels should move given that frame's personal clock. Keeping the first frame unchanged is now a sampling trick, not a new training objective.
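Written out, the frame-aware objective looks roughly like this (a standard rectified-flow form with assumed notation; the report's exact weighting may differ):

```latex
% x_i: clean latent of frame i,  \epsilon_i: Gaussian noise,  \tau_i: that frame's timestep
x_i^{\tau_i} = (1 - \tau_i)\,x_i + \tau_i\,\epsilon_i,
\qquad
\mathcal{L} = \mathbb{E}\!\left[\sum_{i=1}^{N} \bigl\lVert v_\theta(x^{\boldsymbol{\tau}}, \boldsymbol{\tau})_i - (\epsilon_i - x_i) \bigr\rVert^2\right]
```

Setting τ₁ = 0 at sampling time keeps the conditioning frame clean, which is exactly the “sampling trick” mentioned above.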
3.2 Training Loop in 90 Seconds
- Pick 3 860 clips from the open-source VBench 2.0 set (already captioned).
- Sample each frame's τᵢ uniformly at random (no sync steps needed; see the sketch after this list).
- Fine-tune the Wan-T2V-14B backbone with LoRA rank-512 for 900 steps.
- Cost: 8×80 GB A100s × 2 h ≈ $500 on most cloud spot markets.
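One step of that loop, written as a conceptual sketch (the `model(noisy, tau)` interface is an assumption, not the repository's trainer):

```python
import torch
import torch.nn.functional as F

def training_step(model, latents):
    """One frame-aware flow-matching step (conceptual sketch, not Pusa's trainer).

    latents: (batch, frames, channels, height, width) video latents from the VAE.
    model:   a DiT-style network that accepts a per-frame timestep vector.
    """
    b, n = latents.shape[:2]
    noise = torch.randn_like(latents)

    # Independent timestep per frame -- the core "vectorized timestep" idea.
    tau = torch.rand(b, n, device=latents.device)
    tau_b = tau.view(b, n, 1, 1, 1)

    noisy = (1 - tau_b) * latents + tau_b * noise      # per-frame noising
    target = noise - latents                           # flow-matching velocity target

    pred = model(noisy, tau)                           # model consumes the tau vector
    return F.mse_loss(pred, target)
```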
3.3 What Actually Changes in the Code?
- Line 1: the time-embedding layer goes from Linear(1, D) to Linear(N, D).
- Line 2: each DiT block now receives per-frame modulation parameters.
That is it—no new attention heads, no extra losses.
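A simplified PyTorch sketch of both changes; layer names, widths, and the adaLN-style modulation shape are assumptions rather than the real Wan/Pusa modules:

```python
import torch
import torch.nn as nn

batch, N, D = 2, 16, 1024   # illustrative batch size, frame count, hidden width

# Before: the whole clip shares one embedded timestep.
time_embed = nn.Linear(1, D)
t = torch.rand(batch, 1)                    # one scalar timestep per video
cond = time_embed(t)                        # (batch, D), shared by all frames

# After: the embedding consumes the full timestep vector, so the conditioning
# signal carries every frame's individual noise level.
time_embed_vec = nn.Linear(N, D)
tau = torch.rand(batch, N)                  # one timestep per frame
cond_vec = time_embed_vec(tau)              # (batch, D)

# Inside each DiT block the conditioning is projected to per-frame scale/shift
# (adaLN-style) modulation -- sketched here, not copied from the real blocks.
to_modulation = nn.Linear(D, N * 2 * D)
scale, shift = to_modulation(cond_vec).view(batch, N, 2, D).unbind(dim=2)
# scale, shift: (batch, N, D) -- a separate modulation per frame
```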
4. Step-by-Step Fine-Tuning Guide
4.1 Prepare Your Own Dataset
my_dataset/
├─ clip_0001.mp4
├─ clip_0001.txt
├─ clip_0002.mp4
└─ clip_0002.txt
Each .txt file contains the caption for the matching video.
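A tiny sanity check (assumed layout only, not part of the Pusa repo) to confirm every clip has a non-empty caption before you launch training:

```python
from pathlib import Path

def check_dataset(root: str) -> None:
    """Verify every .mp4 in the dataset has a matching, non-empty .txt caption."""
    root_path = Path(root)
    clips = sorted(root_path.glob("*.mp4"))
    if not clips:
        raise SystemExit(f"No .mp4 files found in {root_path}")
    missing = [c.name for c in clips if not c.with_suffix(".txt").exists()]
    empty = [c.name for c in clips
             if c.with_suffix(".txt").exists()
             and not c.with_suffix(".txt").read_text().strip()]
    print(f"{len(clips)} clips, {len(missing)} missing captions, {len(empty)} empty captions")
    if missing or empty:
        raise SystemExit(f"Fix these first: {missing + empty}")

check_dataset("./my_dataset")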
4.2 Launch Training (LoRA)
accelerate launch --config_file ds_config_zero2.yaml \
train_lora.py \
--base_model ./models/Wan-T2V-14B \
--data_root ./my_dataset \
--rank 512 \
--alpha 1.7 \
--lr 1e-4 \
--max_steps 900 \
--output_dir ./runs/my_pusa_lora
4.3 Merge LoRA for Faster Inference
python scripts/merge_lora.py \
--base_model ./models/Wan-T2V-14B \
--lora_path ./runs/my_pusa_lora/checkpoint-900 \
--output_path ./models/my_pusa_full
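Merging folds the low-rank update back into the base weights so inference needs no separate adapter pass. Conceptually it is a single matrix addition per adapted layer; the sketch below uses assumed tensor names and leaves the alpha convention (alpha vs. alpha/rank) up to whatever the training script used:

```python
import torch

def merge_lora_weight(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                      scaling: float) -> torch.Tensor:
    """Fold one LoRA update into its frozen base weight.

    W: (out_features, in_features) base weight
    A: (rank, in_features)         LoRA down-projection
    B: (out_features, rank)        LoRA up-projection
    scaling: alpha-derived factor; conventions differ between trainers,
             so match the one used during training.
    """
    return W + scaling * (B @ A)

# Example with illustrative shapes (rank 512, as in the training command above):
W = torch.randn(1024, 1024)
A = torch.randn(512, 1024) * 0.01
B = torch.zeros(1024, 512)
W_merged = merge_lora_weight(W, A, B, scaling=1.7 / 512)
```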
5. Benchmarks: How Much Quality Do You Lose?
| Metric | Pusa LoRA | Wan-I2V (baseline) |
|---|---|---|
| VBench-I2V Total Score | 87.32 % | 86.86 % |
| Subject Consistency | 97.64 % | 96.95 % |
| Background Consistency | 99.24 % | 96.44 % |
| Training Budget | $500 | ≥ $100 k |
| Dataset Size | 4 k clips | ≥ 10 M clips |
Bonus: Pusa still scores 95 % on the original text-to-video task, showing no catastrophic forgetting.
6. Common Use-Case Recipes
| Goal | Script | Pro Tip |
|---|---|---|
| Product demo from a photo | infer_i2v.py | --noise_multiplier 0.2 adds gentle motion |
| Storyboard preview | infer_start_end.py | Lock the first and last storyboard frames |
| Extend an existing ad | infer_extension.py | Use the last 3 frames as context |
| Multi-keyframe ad | infer_multi_frames.py | Supply 3-5 keyframes at 10-frame intervals |
| Quick social post | infer_t2v.py | --num_steps 4 for speed, 10 for quality |
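For example, the “quick social post” recipe is just the T2V command from section 2.3 with fewer denoising steps (prompt and output name are placeholders):

```bash
python scripts/infer_t2v.py \
  --model_dir ./models \
  --prompt "A neon-lit city street in the rain, cinematic" \
  --num_steps 4 \
  --output quick_post.mp4
```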
7. Troubleshooting Quick Sheet
| Issue | Quick Fix |
|---|---|
| Out-of-memory on a 24 GB GPU | Add --offload or use gradient checkpointing |
| First frame too static | Increase --noise_multiplier to 0.3-0.5 |
| Colors look washed out | Encode the source image in sRGB, not Adobe RGB |
| Long generation time | Reduce --num_steps to 4-6 for previews |
| LoRA not loading | Ensure --alpha matches the training value (1.7) |
8. Version History & Roadmap
| Version | Base Model | Abilities | Release |
|---|---|---|---|
| V0.5 | Mochi-1-Preview | T2V, I2V, extension | Apr 2025 |
| V1.0 | Wan-T2V-14B | + start-end, transition | Jul 2025 |
| V1.1 | Wan-T2V-14B | 60-second clips (planned) | TBD |
| V2.0 | Next open model | 4K, longer context | Community driven |
9. Frequently Asked Questions
Q1: Do I need to retrain for each new task?
No. One LoRA checkpoint handles T2V, I2V, start-end, extension, and transitions out of the box.
Q2: How big is the download?
- Model weights: ~14 GB
- LoRA weights: ~300 MB
Q3: Can I run this on a single RTX 4090?
Yes. Use --offload and 25-step inference; 60-second clips finish in ~4 minutes.
Q4: Is commercial use allowed?
Check the Wan-T2V-14B license and LoRA weights license; both currently permit commercial use with attribution.
10. Citation & Links
If you use Pusa, please cite:
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen}
}
- Project page: https://yaofang-liu.github.io/Pusa_Web/
- Technical report: arXiv:2507.16116
- Model & dataset: Hugging Face RaphaelLiu/PusaV1
Ready to try?
- Clone the repo
- Download the weights
- Run your first prompt
Your next 30-minute coffee break could be the moment you generate a studio-quality video for less than the price of lunch.