From $500: How the New Pusa V1.0 Video Model Slashes Training Costs Without Cutting Corners
A plain-language guide for developers, artists, and small teams who want high-quality video generation on a tight budget.
TL;DR
- Problem: Training a state-of-the-art image-to-video (I2V) model usually costs ≥ $100 k and needs ≥ 10 million clips.
- Solution: Pusa V1.0 uses vectorized timesteps—a tiny change in how noise is handled—so you can reach the same quality with $500 and 4 000 clips.
- Outcome: One checkpoint runs text-to-video, image-to-video, start-to-end frames, video extension, and transition tasks without extra training.
- Time to first clip: 30 minutes on an 8-GPU node (or 2 hours if you rent cloud GPUs).
1. What Makes Pusa Different?
Traditional video diffusion models treat the whole clip as one big block: the same noise level is added or removed across every frame at each step. That is simple, but it forces the model to relearn “keep the first frame unchanged” every time you ask it to animate a still picture. Pusa breaks this block into individual frame clocks—each frame gets its own noise level.
| Traditional Scalar Timestep | Pusa Vectorized Timestep |
|---|---|
| One slider controls all 16 frames | 16 independent sliders |
| Needs giant datasets to learn frame locking | Frame locking is handled by code, not data |
| Fine-tuning can overwrite text-to-video skills | Original skills stay intact |
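To make the “independent sliders” idea concrete, here is a minimal PyTorch sketch; the tensor shapes and the simple linear noising rule are illustrative assumptions, not Pusa's actual code:

```python
import torch

# Illustrative latent shape: (batch, frames, channels, height, width)
latents = torch.randn(1, 16, 4, 60, 104)
noise = torch.randn_like(latents)

# Traditional scalar timestep: one noise level shared by all 16 frames.
t = torch.rand(1).view(1, 1, 1, 1, 1)
noisy_scalar = (1 - t) * latents + t * noise

# Vectorized timestep: an independent noise level per frame.
tau = torch.rand(1, 16).view(1, 16, 1, 1, 1)
noisy_vector = (1 - tau) * latents + tau * noise

# Image-to-video "frame locking" becomes a sampling choice, not a training task:
# keep the first frame's clock at zero so it is never noised.
tau_locked = tau.clone()
tau_locked[:, 0] = 0.0
noisy_locked = (1 - tau_locked) * latents + tau_locked * noise  # frame 0 stays clean
```

Because locking a frame is just setting its clock to zero, the model never has to learn that behavior from data.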
2. Quick Start: Run Your First Clip in 30 Minutes
2.1 One-Command Install
git clone https://github.com/Yaofang-Liu/Pusa-VidGen.git
cd Pusa-VidGen/PusaV1
pip install uv # fast Python package manager
uv venv .venv && source .venv/bin/activate
uv pip install -r requirements.txt
2.2 Pull the Weights
huggingface-cli download RaphaelLiu/PusaV1 --local-dir ./models
# or unzip the manual download into ./models
2.3 Text-to-Video (T2V) Example
python scripts/infer_t2v.py \
--model_dir ./models \
--prompt "A slow-motion close-up of a hummingbird hovering near red flowers" \
--num_steps 10 \
--output hummingbird.mp4
Result: an 81-frame clip at 24 fps (about 3.4 seconds), ready for social media.
2.4 Image-to-Video (I2V) Example
python scripts/infer_i2v.py \
--model_dir ./models \
--image_path ./assets/photo.jpg \
--prompt "The camera slowly pulls back revealing a beach sunset" \
--noise_multiplier 0.2 \
--num_steps 10 \
--output sunset_zoom.mp4
2.5 Start-and-End Frame Interpolation
python scripts/infer_start_end.py \
--model_dir ./models \
--start_frame ./assets/start.jpg \
--end_frame ./assets/end.jpg \
--prompt "Smooth cinematic dolly between two viewpoints" \
--num_steps 10 \
--output dolly.mp4
3. Under the Hood: Why Vector Timesteps Work
3.1 The Math in One Paragraph
Instead of a single scalar t ∈ [0, 1] that tells the whole video how noisy it is, Pusa gives every frame its own scalar τᵢ ∈ [0, 1]. Training becomes a frame-aware flow-matching problem: the network only needs to learn how each frame's pixels should move given that frame's personal clock. Keeping the first frame unchanged is now a sampling trick, not a new training objective.
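Written out, the frame-aware objective looks roughly like this (a standard rectified-flow form with assumed notation; the report's exact weighting may differ):

```latex
% x_i: clean latent of frame i,  \epsilon_i: Gaussian noise,  \tau_i: that frame's timestep
x_i^{\tau_i} = (1 - \tau_i)\,x_i + \tau_i\,\epsilon_i,
\qquad
\mathcal{L} = \mathbb{E}\!\left[\sum_{i=1}^{N} \bigl\lVert v_\theta(x^{\boldsymbol{\tau}}, \boldsymbol{\tau})_i - (\epsilon_i - x_i) \bigr\rVert^2\right]
```

Setting τ₁ = 0 at sampling time keeps the conditioning frame clean, which is exactly the “sampling trick” mentioned above.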
3.2 Training Loop in 90 Seconds
- Pick 3 860 clips from the open-source VBench 2.0 set (already captioned).
- Sample each frame's τᵢ uniformly at random (no sync steps needed; see the sketch after this list).
- Fine-tune the Wan-T2V-14B backbone with LoRA rank-512 for 900 steps.
- Cost: 8×80 GB A100s × 2 h ≈ $500 on most cloud spot markets.
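One step of that loop, written as a conceptual sketch (the `model(noisy, tau)` interface is an assumption, not the repository's trainer):

```python
import torch
import torch.nn.functional as F

def training_step(model, latents):
    """One frame-aware flow-matching step (conceptual sketch, not Pusa's trainer).

    latents: (batch, frames, channels, height, width) video latents from the VAE.
    model:   a DiT-style network that accepts a per-frame timestep vector.
    """
    b, n = latents.shape[:2]
    noise = torch.randn_like(latents)

    # Independent timestep per frame -- the core "vectorized timestep" idea.
    tau = torch.rand(b, n, device=latents.device)
    tau_b = tau.view(b, n, 1, 1, 1)

    noisy = (1 - tau_b) * latents + tau_b * noise      # per-frame noising
    target = noise - latents                           # flow-matching velocity target

    pred = model(noisy, tau)                           # model consumes the tau vector
    return F.mse_loss(pred, target)
```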
3.3 What Actually Changes in the Code?
- Line 1: the time-embedding layer goes from Linear(1, D) to Linear(N, D).
- Line 2: each DiT block now receives per-frame modulation parameters.
That is it—no new attention heads, no extra losses.
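A simplified PyTorch sketch of both changes; layer names, widths, and the adaLN-style modulation shape are assumptions rather than the real Wan/Pusa modules:

```python
import torch
import torch.nn as nn

batch, N, D = 2, 16, 1024   # illustrative batch size, frame count, hidden width

# Before: the whole clip shares one embedded timestep.
time_embed = nn.Linear(1, D)
t = torch.rand(batch, 1)                    # one scalar timestep per video
cond = time_embed(t)                        # (batch, D), shared by all frames

# After: the embedding consumes the full timestep vector, so the conditioning
# signal carries every frame's individual noise level.
time_embed_vec = nn.Linear(N, D)
tau = torch.rand(batch, N)                  # one timestep per frame
cond_vec = time_embed_vec(tau)              # (batch, D)

# Inside each DiT block the conditioning is projected to per-frame scale/shift
# (adaLN-style) modulation -- sketched here, not copied from the real blocks.
to_modulation = nn.Linear(D, N * 2 * D)
scale, shift = to_modulation(cond_vec).view(batch, N, 2, D).unbind(dim=2)
# scale, shift: (batch, N, D) -- a separate modulation per frame
```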
4. Step-by-Step Fine-Tuning Guide
4.1 Prepare Your Own Dataset
my_dataset/
├─ clip_0001.mp4
├─ clip_0001.txt
├─ clip_0002.mp4
└─ clip_0002.txt
Each .txt file contains the caption for the matching video.
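A tiny sanity check (assumed layout only, not part of the Pusa repo) to confirm every clip has a non-empty caption before you launch training:

```python
from pathlib import Path

def check_dataset(root: str) -> None:
    """Verify every .mp4 in the dataset has a matching, non-empty .txt caption."""
    root_path = Path(root)
    clips = sorted(root_path.glob("*.mp4"))
    if not clips:
        raise SystemExit(f"No .mp4 files found in {root_path}")
    missing = [c.name for c in clips if not c.with_suffix(".txt").exists()]
    empty = [c.name for c in clips
             if c.with_suffix(".txt").exists()
             and not c.with_suffix(".txt").read_text().strip()]
    print(f"{len(clips)} clips, {len(missing)} missing captions, {len(empty)} empty captions")
    if missing or empty:
        raise SystemExit(f"Fix these first: {missing + empty}")

check_dataset("./my_dataset")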
4.2 Launch Training (LoRA)
accelerate launch --config_file ds_config_zero2.yaml \
train_lora.py \
--base_model ./models/Wan-T2V-14B \
--data_root ./my_dataset \
--rank 512 \
--alpha 1.7 \
--lr 1e-4 \
--max_steps 900 \
--output_dir ./runs/my_pusa_lora
4.3 Merge LoRA for Faster Inference
python scripts/merge_lora.py \
--base_model ./models/Wan-T2V-14B \
--lora_path ./runs/my_pusa_lora/checkpoint-900 \
--output_path ./models/my_pusa_full
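Merging folds the low-rank update back into the base weights so inference needs no separate adapter pass. Conceptually it is a single matrix addition per adapted layer; the sketch below uses assumed tensor names and leaves the alpha convention (alpha vs. alpha/rank) up to whatever the training script used:

```python
import torch

def merge_lora_weight(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                      scaling: float) -> torch.Tensor:
    """Fold one LoRA update into its frozen base weight.

    W: (out_features, in_features) base weight
    A: (rank, in_features)         LoRA down-projection
    B: (out_features, rank)        LoRA up-projection
    scaling: alpha-derived factor; conventions differ between trainers,
             so match the one used during training.
    """
    return W + scaling * (B @ A)

# Example with illustrative shapes (rank 512, as in the training command above):
W = torch.randn(1024, 1024)
A = torch.randn(512, 1024) * 0.01
B = torch.zeros(1024, 512)
W_merged = merge_lora_weight(W, A, B, scaling=1.7 / 512)
```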
5. Benchmarks: How Much Quality Do You Lose?
| Metric | Pusa LoRA | Wan-I2V (baseline) |
|---|---|---|
| VBench-I2V Total Score | 87.32 % | 86.86 % |
| Subject Consistency | 97.64 % | 96.95 % |
| Background Consistency | 99.24 % | 96.44 % |
| Training Budget | $500 | ≥ $100 k |
| Dataset Size | 4 k clips | ≥ 10 M clips |
Bonus: Pusa still scores 95 % on the original text-to-video task, showing no catastrophic forgetting.
6. Common Use-Case Recipes
| Goal | Script | Pro Tip |
|---|---|---|
| Product demo from a photo | infer_i2v.py | --noise_multiplier 0.2 adds gentle motion |
| Storyboard preview | infer_start_end.py | Lock the first and last storyboard frames |
| Extend an existing ad | infer_extension.py | Use the last 3 frames as context |
| Multi-keyframe ad | infer_multi_frames.py | Supply 3-5 keyframes at 10-frame intervals |
| Quick social post | infer_t2v.py | --num_steps 4 for speed, 10 for quality |
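For example, the “quick social post” recipe is just the T2V command from section 2.3 with fewer denoising steps (prompt and output name are placeholders):

```bash
python scripts/infer_t2v.py \
  --model_dir ./models \
  --prompt "A neon-lit city street in the rain, cinematic" \
  --num_steps 4 \
  --output quick_post.mp4
```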
7. Troubleshooting Quick Sheet
| Issue | Quick Fix |
|---|---|
| Out-of-memory on a 24 GB GPU | Add --offload or use gradient checkpointing |
| First frame too static | Increase --noise_multiplier to 0.3-0.5 |
| Colors look washed out | Encode the source image in sRGB, not Adobe RGB |
| Long generation time | Reduce --num_steps to 4-6 for previews |
| LoRA not loading | Ensure --alpha matches the training value (1.7) |
8. Version History & Roadmap
| Version | Base Model | Abilities | Release |
|---|---|---|---|
| V0.5 | Mochi-1-Preview | T2V, I2V, extension | Apr 2025 |
| V1.0 | Wan-T2V-14B | + start-end, transition | Jul 2025 |
| V1.1 | Wan-T2V-14B | 60-second clips (planned) | TBD |
| V2.0 | Next open model | 4K, longer context | Community driven |
9. Frequently Asked Questions
Q1: Do I need to retrain for each new task?
No. One LoRA checkpoint handles T2V, I2V, start-end, extension, and transitions out of the box.
Q2: How big is the download?
- Model weights: ~14 GB
- LoRA weights: ~300 MB
Q3: Can I run this on a single RTX 4090?
Yes. Use --offload and 25-step inference; 60-second clips finish in ~4 minutes.
Q4: Is commercial use allowed?
Check the Wan-T2V-14B license and LoRA weights license; both currently permit commercial use with attribution.
10. Citation & Links
If you use Pusa, please cite:
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen}
}
- Project page: https://yaofang-liu.github.io/Pusa_Web/
- Technical report: arXiv:2507.16116
- Model & dataset: Hugging Face RaphaelLiu/PusaV1
Ready to try?
- Clone the repo
- Download the weights
- Run your first prompt
Your next 30-minute coffee break could be the moment you generate a studio-quality video for less than the price of lunch.