Video-Generation Models Can Also Be the Judge: How PRFL Finetunes a 14B Model in 67 GB of VRAM and Lifts Motion Scores by 56 Points
Train on all 81 frames at 720p without blowing memory, speed up the training loop 1.4×, and push motion scores from 25 to 81, all in latent space with no VAE decoding required.
1. Why a “Judge” Is Missing in Current Video Models
People type these questions into search boxes every day:
- "AI video motion looks fake—how to fix?"
- "Finetune large video model with limited GPU memory?"
- "Which method checks physics consistency during generation?"
Classic pipelines give a reward only after pixels are clean. That single decision point causes three pain points:
| Pain Point | What It Means in Practice |
|---|---|
| Late supervision | Structure and motion are settled in early denoising steps; final-stage reward arrives too late to correct them. |
| Memory wall | Decoding 81 RGB frames for a vision-language model often triggers out-of-memory (OOM) on 80 GB cards. |
| Slow iteration | Waiting for full 40-step denoising + decoding every training step burns GPU hours. |
2. PRFL in One Sentence
Process Reward Feedback Learning (PRFL) lets the video generator itself score every intermediate noisy latent, then back-propagates that signal without ever calling the VAE decoder. You get reward gradients at any point in the denoising chain, early guidance for motion, and a 67 GB peak-memory footprint.
3. Core Idea: Turn Generator into Judge
| Aspect | RGB ReFL (pixel space) | PRFL (latent space) |
|---|---|---|
| Input | Clean pixels | Any-step noisy latent |
| Reward net | Extra vision-language model | First 8 DiT blocks of your generator |
| VAE decode? | Yes, full frames | Never |
| When to teach? | Final step only | Random step t ∈ [0,1) |
| Peak VRAM | OOM on 81 frames | 67 GB |
4. Two-Part Recipe (PAVRM + PRFL)
4.1 PAVRM—building the judge
- Freeze VAE and text encoder; keep early DiT layers.
- One learnable query vector attends over space-time features → single 5120-D vector.
- Three-layer MLP outputs a 0–1 quality score; train with binary cross-entropy on human labels.
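A minimal sketch of what such a reward head could look like, assuming the frozen early DiT blocks already produce a (batch, tokens, 5120) space-time feature map. Names like `PAVRMHead` and `dit_features`, as well as the MLP hidden sizes, are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn


class PAVRMHead(nn.Module):
    """Illustrative reward head: a single learnable query attends over the
    space-time features from the frozen early DiT blocks, and a three-layer
    MLP maps the pooled 5120-D vector to a quality score in [0, 1]."""

    def __init__(self, dim: int = 5120, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learnable query
        self.pool = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(                                  # hidden sizes are guesses
            nn.Linear(dim, 1024), nn.SiLU(),
            nn.Linear(1024, 256), nn.SiLU(),
            nn.Linear(256, 1),
        )

    def forward(self, dit_features: torch.Tensor) -> torch.Tensor:
        # dit_features: (batch, tokens, dim) features of a noisy latent
        q = self.query.expand(dit_features.size(0), -1, -1)
        pooled, _ = self.pool(q, dit_features, dit_features)       # (batch, 1, dim)
        return torch.sigmoid(self.mlp(pooled.squeeze(1))).squeeze(-1)  # (batch,)


# Training on binary human labels (0 = fail, 1 = pass):
#   loss = torch.nn.functional.binary_cross_entropy(head(features), labels.float())
```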
4.2 PRFL—training the student
- Sample a random timestep t.
- Roll forward without gradients to t + Δt.
- One gradient step produces the latent at t.
- Ask PAVRM for a score; maximize it.
- Alternate with the normal supervised flow loss to avoid mode collapse.
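The loop above can be sketched roughly as follows. This is a simplification that assumes one Euler step per interval of a rectified-flow schedule; `generator`, `pavrm`, and the call signatures are hypothetical placeholders, not the repository's actual training code.

```python
import torch


def prfl_step(generator, pavrm, noise, cond, schedule, k, optimizer):
    """One illustrative PRFL update. `schedule` is a descending list of
    timesteps (1.0 -> 0.0); schedule[k] plays the role of t + Δt and
    schedule[k + 1] the role of the randomly sampled target step t."""
    z = noise
    # 1) Gradient-free rollout from pure noise down to schedule[k] (= t + Δt).
    with torch.no_grad():
        for s, s_next in zip(schedule[:k], schedule[1:k + 1]):
            v = generator(z, s, cond)              # predicted velocity at step s
            z = z + (s_next - s) * v               # Euler step in latent space
    # 2) A single gradient-carrying step produces the latent at t.
    s, s_next = schedule[k], schedule[k + 1]
    z_t = z + (s_next - s) * generator(z, s, cond)
    # 3) Score the intermediate latent with the frozen reward model; no VAE decode.
    reward = pavrm(z_t, s_next)
    loss = -reward.mean()                          # maximize the PAVRM score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.detach()


# In the full recipe this update is alternated with the ordinary supervised
# flow-matching loss so the generator does not collapse onto the reward.
```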
5. Numbers First: Did It Work?
| Task | Metric | Baseline | PRFL | Δ |
|---|---|---|---|---|
| T2V 720p | Dynamic Degree | 25.0 | 81.0 | +56.0 |
| T2V 720p | Human Anatomy | 68.9 | 90.4 | +21.5 |
| I2V 480p | Dynamic Degree | 57.0 | 87.0 | +30.0 |
| Training | Peak VRAM | 81 frames OOM | 67 GB | usable |
| Speed | Seconds/iter | 72.4 s | 51.1 s | 1.42× faster |
Dynamic Degree measures how much motion appears; Human Anatomy scores how well faces, hands, and torsos are formed, so higher means fewer distortions.
6. Human Evaluation
30 professionals cast 2,250 pairwise votes.
Win rate: PRFL 63%, competitors ≤21%, ties 16%.
Reasons cited: fewer body artifacts, smoother camera motion, better prompt adherence.
7. Inside the Code: Start-to-Finish Commands
7.1 Hardware & software
- Linux + CUDA 12.4
- ≥80 GB GPU recommended (A100 80 GB / H800)
7.2 Install
```bash
git clone https://github.com/Tencent-Hunyuan/HY-Video-PRFL.git
cd HY-Video-PRFL
conda create -n hyprfl python=3.10
conda activate hyprfl
pip3 install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 \
    --index-url https://download.pytorch.org/whl/cu121
pip3 install -e .
export PYTHONPATH=./
```
7.3 Download pretrained weights
```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P \
    --local-dir ./weights/Wan2.1-I2V-14B-720P
```
7.4 Pre-process your videos (latent extraction)
```bash
python3 scripts/preprocess/gen_wanx_latent.py \
    --config configs/pre_720.yaml
```
Edit json_path inside the YAML to point at your MP4 list.
7.5 Label each clip
Add two fields to the auto-generated meta_*.json:
```json
{"physics_quality": 1, "human_quality": 1}
```
0 = fail, 1 = pass. Discard “partial”.
Collect paths into temp_data_720.list.
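A small helper along these lines could attach the two labels and build the list file. The glob pattern, the rest of the meta-file layout, and the one-path-per-line list format are assumptions, so adapt them to what the preprocessing step actually emits.

```python
import glob
import json

# Attach binary labels to each auto-generated meta_*.json and collect the kept
# clips into temp_data_720.list. Only physics_quality / human_quality are
# documented above; everything else here is an assumed layout.
kept = []
for path in sorted(glob.glob("latents/meta_*.json")):
    with open(path) as f:
        meta = json.load(f)
    # Replace these constants with your real per-clip judgments
    # (0 = fail, 1 = pass); skip clips you would only rate "partial".
    meta["physics_quality"] = 1
    meta["human_quality"] = 1
    with open(path, "w") as f:
        json.dump(meta, f, ensure_ascii=False, indent=2)
    kept.append(path)

with open("temp_data_720.list", "w") as f:
    f.write("\n".join(kept) + "\n")
```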
7.6 Train the reward model (8-GPU example)
```bash
torchrun --nproc_per_node=8 scripts/pavrm/train_pavrm.py \
    --config configs/train_pavrm_i2v_720.yaml
```
7.7 Train the generator with PRFL
```bash
torchrun --nproc_per_node=8 scripts/prfl/train_prfl.py \
    --config configs/train_prfl_i2v_720.yaml
```
The config already points to your new PAVRM checkpoint.
7.8 Inference (same script as base model)
```bash
export neg="over-exposed, static, distorted, extra fingers, ..."
torchrun --nproc_per_node=8 scripts/prfl/inference_prfl.py \
    --dit_fsdp --t5_fsdp --task i2v-14B \
    --ckpt_dir ./weights/Wan2.1-I2V-14B-720P \
    --transformer_path <your-prfl-ckpt> \
    --size 1280*720 --frame_num 81 --sample_steps 40 \
    --negative_prompt "$neg" \
    --save_folder outputs/my_run
```
8. Ablation Highlights (What Matters?)
| Factor | Options Compared | Best |
|---|---|---|
| Timestep sampling | Early / Mid / Late only | Full range 0–1 |
| DiT blocks trainable | 8–40 | 16 (trade-off) |
| Aggregation | Mean / Max / Attention-w/-Query | Attention-w/-Query |
| Loss | BCE vs Bradley-Terry | BCE (marginally higher, simpler) |
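To make the aggregation row concrete, here is a minimal sketch of the three pooling strategies compared, each mapping a (batch, tokens, dim) feature tensor to one vector per clip. The class and function names are illustrative; the query variant mirrors the learnable-query mechanism sketched in Section 4.1.

```python
import torch
import torch.nn as nn


def mean_pool(feats: torch.Tensor) -> torch.Tensor:   # (B, T, D) -> (B, D)
    return feats.mean(dim=1)


def max_pool(feats: torch.Tensor) -> torch.Tensor:    # (B, T, D) -> (B, D)
    return feats.max(dim=1).values


class QueryPool(nn.Module):
    """Attention-with-query pooling, the winning variant in the ablation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # (B, T, D) -> (B, D)
        q = self.query.expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)
        return out.squeeze(1)
```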
9. FAQ (Predicting Your Next Questions)
Q1: I have only 40 GB—can I still use PRFL?
A: Yes. Reduce frame_num to 41 or 21; the reward model still sees full latent statistics because of query attention.
Q2: Does PAVRM work for non-portrait scenes?
A: The published checkpoint is tuned on portrait deformities. Re-label your own data for drones, animals, or cityscapes.
Q3: Can I plug PRFL into Stable-Diffusion video models?
A: Any rectified-flow video DiT works as long as you keep the first 8 blocks for feature extraction.
Q4: How many video pairs for decent quality?
A: The authors used 24k (real, generated) pairs. Users report noticeable gains with 5k+ carefully labeled clips.
Q5: Is the code Apache / MIT?
A: GitHub repo carries Apache-2.0. Commercial use is allowed; cite the paper.
10. Current Limits & Where It Can Go
- One reward head only → future multi-head rewards (aesthetics, text faithfulness, camera motion).
- Trained on 720p / 480p → needs checking at 1080p and above.
- Pure post-training → could be merged with ControlNet-style conditions for editable video.
11. Key Takeaway
PRFL shows that the same model can create and critique. By moving judgment into the latent realm you:
- save up to 40% VRAM,
- speed training by more than 1.4×,
- improve motion plausibility by more than 20 points,

all without extra vision-language models at train time. If you hit the memory wall with conventional RGB rewards, give your generator a second job: it might be the judge you need.
Citation
Mi, X., Yu, W., Lian, J. et al. Video Generation Models are Good Latent Reward Models. arXiv:2511.21541, 2025.
