From $500: How the New Pusa V1.0 Video Model Slashes Training Costs Without Cutting Corners

A plain-language guide for developers, artists, and small teams who want high-quality video generation on a tight budget.


TL;DR

  • Problem: Training a state-of-the-art image-to-video (I2V) model usually costs ≥ $100 k and needs ≥ 10 million clips.
  • Solution: Pusa V1.0 uses vectorized timesteps—a tiny change in how noise is handled—so you can reach the same quality with $500 and 4 000 clips.
  • Outcome: One checkpoint runs text-to-video, image-to-video, start-to-end frames, video extension, and transition tasks without extra training.
  • Time to first clip: 30 minutes on an 8-GPU node (or 2 hours if you rent cloud GPUs).

1. What Makes Pusa Different?

Traditional video diffusion models treat all frames as one big block: the same noise level is added to, or removed from, every frame at each step. That is simple, but it forces the model to relearn “keep the first frame unchanged” every time you ask it to animate a still picture. Pusa breaks the block into individual frame clocks: each frame gets its own noise level.

Traditional Scalar Timestep | Pusa Vectorized Timestep
One slider controls all 16 frames | 16 independent sliders
Needs giant datasets to learn frame locking | Frame locking is handled by code, not data
Fine-tuning can overwrite text-to-video skills | Original skills stay intact
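
To make the “16 independent sliders” idea concrete, here is a minimal PyTorch-style sketch. It is illustrative only; the variable names are assumptions, not Pusa's actual code.

import torch

num_frames = 16

# Traditional scalar timestep: one noise level shared by every frame.
t_scalar = torch.rand(1)                       # shape (1,)
t_broadcast = t_scalar.expand(num_frames)      # the same value, repeated 16 times

# Pusa-style vectorized timestep: one independent noise level per frame.
tau = torch.rand(num_frames)                   # shape (16,)

# "Keep the first frame unchanged" becomes a value choice, not a training task:
tau_i2v = tau.clone()
tau_i2v[0] = 0.0                               # frame 0 stays (almost) clean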

2. Quick Start: Run Your First Clip in 30 Minutes

2.1 One-Command Install

git clone https://github.com/Yaofang-Liu/Pusa-VidGen.git
cd Pusa-VidGen/PusaV1
pip install uv                # fast Python package manager
uv venv .venv && source .venv/bin/activate
uv pip install -r requirements.txt

2.2 Pull the Weights

huggingface-cli download RaphaelLiu/PusaV1 --local-dir ./models
# or unzip the manual download into ./models

2.3 Text-to-Video (T2V) Example

python scripts/infer_t2v.py \
  --model_dir ./models \
  --prompt "A slow-motion close-up of a hummingbird hovering near red flowers" \
  --num_steps 10 \
  --output hummingbird.mp4

Result: 81-frame 24 fps clip, ready for social media.

2.4 Image-to-Video (I2V) Example

python scripts/infer_i2v.py \
  --model_dir ./models \
  --image_path ./assets/photo.jpg \
  --prompt "The camera slowly pulls back revealing a beach sunset" \
  --noise_multiplier 0.2 \
  --num_steps 10 \
  --output sunset_zoom.mp4

2.5 Start-and-End Frame Interpolation

python scripts/infer_start_end.py \
  --model_dir ./models \
  --start_frame ./assets/start.jpg \
  --end_frame ./assets/end.jpg \
  --prompt "Smooth cinematic dolly between two viewpoints" \
  --num_steps 10 \
  --output dolly.mp4

3. Under the Hood: Why Vector Timesteps Work

3.1 The Math in One Paragraph

Instead of a single scalar t ∈ [0,1] that tells the whole video how noisy it is, Pusa gives every frame its own scalar τᵢ ∈ [0,1]. Training becomes a frame-aware flow-matching problem: the network only needs to learn how each pixel should move given its personal clock. Keeping the first frame unchanged is now a sampling trick, not a new training objective.
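
Written out per frame, and assuming the standard rectified-flow interpolation used by flow-matching models (the technical report gives the exact formulation), the noisy version of clean frame xᵢ⁰ is

  xᵢ(τᵢ) = (1 − τᵢ) · xᵢ⁰ + τᵢ · εᵢ,   with target velocity vᵢ = εᵢ − xᵢ⁰ and τᵢ ∈ [0, 1].

Setting τ₁ ≈ 0 at sampling time keeps the first frame essentially noise-free, which is exactly the sampling trick that replaces a separate image-to-video training objective.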

3.2 Training Loop in 90 Seconds

  1. Pick 3 860 clips from the open-source VBench 2.0 set (already captioned).
  2. Sample each frame’s τᵢ uniformly at random (no sync steps needed).
  3. Fine-tune the Wan-T2V-14B backbone with LoRA rank-512 for 900 steps.
  4. Cost: 8×80 GB A100s × 2 h ≈ $500 on most cloud spot markets.
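
In code, the per-frame sampling of step 2 and the loss that step 3 optimizes boil down to something like the sketch below. It is a minimal illustration assuming the standard flow-matching velocity loss; the function and argument names are hypothetical, not the repo's API.

import torch
import torch.nn.functional as F

def pusa_style_training_step(model, latents, text_emb):
    """One illustrative flow-matching step with a per-frame timestep vector.
    `model`, `latents` (B, frames, C, H, W) and `text_emb` are placeholders;
    names and shapes are assumptions, not the actual Pusa training code."""
    batch, num_frames = latents.shape[:2]
    tau = torch.rand(batch, num_frames, device=latents.device)  # step 2: one clock per frame
    noise = torch.randn_like(latents)
    t = tau.view(batch, num_frames, 1, 1, 1)                    # broadcast over C, H, W
    noisy = (1.0 - t) * latents + t * noise                     # per-frame interpolation
    target_v = noise - latents                                  # rectified-flow velocity target
    pred_v = model(noisy, tau, text_emb)                        # the network sees the whole tau vector
    return F.mse_loss(pred_v, target_v)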

3.3 What Actually Changes in the Code?

  • Line 1: Time-embedding layer: Linear(1, D) → Linear(N, D)
  • Line 2: Each DiT block now receives per-frame modulation parameters
    That is it: no new attention heads, no extra losses.
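
As a rough picture of those two changes, here is a hypothetical PyTorch module (not the repo's actual code) that turns a (batch, frames) timestep vector into per-frame modulation parameters:

import torch
import torch.nn as nn

class FrameAwareTimeEmbed(nn.Module):
    """Illustrative only: embeds a vector of per-frame timesteps so each DiT
    block can be modulated frame by frame. Names, shapes, and the exact layer
    layout are assumptions, not the Pusa source."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),                # the scalar version embeds a single t
            nn.SiLU(),
            nn.Linear(hidden_dim, 6 * hidden_dim),   # scale/shift/gate pairs, adaLN-style
        )

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: (batch, num_frames) -> (batch, num_frames, 6 * hidden_dim)
        return self.mlp(tau.unsqueeze(-1))

Each DiT block then slices its own scale, shift, and gate from the last dimension, so the modulation can differ per frame instead of being broadcast from one shared clock.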

4. Step-by-Step Fine-Tuning Guide

4.1 Prepare Your Own Dataset

my_dataset/
 ├─ clip_0001.mp4
 ├─ clip_0001.txt
 ├─ clip_0002.mp4
 └─ clip_0002.txt

Each .txt file contains the caption for the matching video.
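
A few lines of Python can catch missing or empty captions before you spend GPU time. This is a helper sketch, not something shipped with the repo.

from pathlib import Path

# Quick sanity check before training: every .mp4 should have a non-empty .txt caption.
root = Path("my_dataset")
clips = sorted(root.glob("*.mp4"))
missing = [c.name for c in clips if not c.with_suffix(".txt").exists()]
empty = [c.name for c in clips
         if c.with_suffix(".txt").exists() and not c.with_suffix(".txt").read_text().strip()]
print(f"{len(clips)} clips | {len(missing)} missing captions | {len(empty)} empty captions")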

4.2 Launch Training (LoRA)

accelerate launch --config_file ds_config_zero2.yaml \
  train_lora.py \
  --base_model ./models/Wan-T2V-14B \
  --data_root ./my_dataset \
  --rank 512 \
  --alpha 1.7 \
  --lr 1e-4 \
  --max_steps 900 \
  --output_dir ./runs/my_pusa_lora

4.3 Merge LoRA for Faster Inference

python scripts/merge_lora.py \
  --base_model ./models/Wan-T2V-14B \
  --lora_path ./runs/my_pusa_lora/checkpoint-900 \
  --output_path ./models/my_pusa_full
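
Conceptually, merging just folds the low-rank update back into the base weights, which is why the merged model runs at full speed with no adapter overhead. Below is a minimal sketch assuming the standard LoRA parameterization W' = W + (α / r)·B·A; merge_lora.py handles the real checkpoint formats and layer naming.

import torch

def merge_lora_weight(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                      alpha: float, rank: int) -> torch.Tensor:
    """W: (out, in) base weight; A: (rank, in) and B: (out, rank) LoRA factors.
    Returns the merged weight, so inference no longer needs the adapter."""
    return W + (alpha / rank) * (B @ A)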

5. Benchmarks: How Much Quality Do You Lose?

Metric | Pusa LoRA | Wan-I2V (baseline)
VBench-I2V Total Score | 87.32 % | 86.86 %
Subject Consistency | 97.64 % | 96.95 %
Background Consistency | 99.24 % | 96.44 %
Training Budget | $500 | ≥ $100 k
Dataset Size | 4 k clips | ≥ 10 M clips

Bonus: Pusa still scores 95 % on the original text-to-video task, showing no catastrophic forgetting.


6. Common Use-Case Recipes

Goal | Script | Pro Tip
Product demo from a photo | infer_i2v.py | --noise_multiplier 0.2 adds gentle motion
Storyboard preview | infer_start_end.py | lock first and last storyboard frames
Extend an existing ad | infer_extension.py | use last 3 frames as context
Multi-keyframe ad | infer_multi_frames.py | supply 3-5 keyframes at 10-frame intervals
Quick social post | infer_t2v.py | --num_steps 4 for speed, 10 for quality

7. Troubleshooting Quick Sheet

Issue | Quick Fix
Out-of-memory on 24 GB GPU | Add --offload or use gradient checkpointing
First frame too static | Increase --noise_multiplier to 0.3-0.5
Colors look washed out | Encode the source image in sRGB, not Adobe RGB
Long generation time | Reduce --num_steps to 4-6 for previews
LoRA not loading | Ensure --alpha matches the training value (1.7)

8. Version History & Roadmap

Version | Base Model | Abilities | Release
V0.5 | Mochi-1-Preview | T2V, I2V, extension | Apr 2025
V1.0 | Wan-T2V-14B | + start-end, transition | Jul 2025
V1.1 | Wan-T2V-14B | 60-second clips (planned) | TBD
V2.0 | Next open model | 4K, longer context | Community driven

9. Frequently Asked Questions

Q1: Do I need to retrain for each new task?
No. One LoRA checkpoint handles T2V, I2V, start-end, extension, and transitions out of the box.

Q2: How big is the download?

  • Model weights ~ 14 GB
  • LoRA weights ~ 300 MB

Q3: Can I run this on a single RTX 4090?
Yes. Use --offload and 25-step inference; 60-second clips finish in ~ 4 minutes.

Q4: Is commercial use allowed?
Check the Wan-T2V-14B license and LoRA weights license; both currently permit commercial use with attribution.


10. Citation & Links

If you use Pusa, please cite:

@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen}
}
  • Project page: https://yaofang-liu.github.io/Pusa_Web/
  • Technical report: arXiv:2507.16116
  • Model & dataset: Hugging Face RaphaelLiu/PusaV1

Ready to try?

  1. Clone repo
  2. Download weights
  3. Run your first prompt

Your next 30-minute coffee break could be the moment you generate a studio-quality video for less than the price of lunch.