Video-Generation Models Can Also Be the Judge: How PRFL Finetunes a 14B Model in 67 GB of VRAM and Lifts Motion Scores by 56 Points
Train on all 81 frames at 720p without blowing memory, speed up the training loop 1.4×, and push motion scores from 25 to 81, all in latent space with no VAE decoding required.
1. Why a “Judge” Is Missing in Current Video Models
People type these questions into search boxes every day:
- "AI video motion looks fake—how to fix?"
- "Finetune large video model with limited GPU memory?"
- "Which method checks physics consistency during generation?"
Classic pipelines give a reward only after pixels are clean. That single decision point causes three pain points:
| Pain Point | What It Means in Practice |
|---|---|
| Late supervision | Structure and motion are settled in early denoising steps; final-stage reward arrives too late to correct them. |
| Memory wall | Decoding 81 RGB frames for a vision-language model often triggers out-of-memory (OOM) on 80 GB cards. |
| Slow iteration | Waiting for full 40-step denoising + decoding every training step burns GPU hours. |
2. PRFL in One Sentence
Process Reward Feedback Learning (PRFL) lets the video generator itself score every intermediate noisy latent, then back-propagates that signal without ever calling the VAE decoder. You get reward gradients at any point in the denoising chain, early guidance for motion, and a 67 GB peak-memory footprint.
3. Core Idea: Turn Generator into Judge
| Aspect | RGB ReFL (pixel space) | PRFL (latent space) |
|---|---|---|
| Input | Clean pixels | Any-step noisy latent |
| Reward net | Extra vision-language model | First 8 DiT blocks of your generator |
| VAE decode? | Yes, full frames | Never |
| When to teach? | Final step only | Random step t ∈ [0,1) |
| Peak VRAM | OOM on 81 frames | 67 GB |
4. Two-Part Recipe (PAVRM + PRFL)
4.1 PAVRM—building the judge
- Freeze VAE and text encoder; keep early DiT layers.
- One learnable query vector attends over space-time features → single 5120-D vector.
- Three-layer MLP outputs a 0–1 quality score; train with binary cross-entropy on human labels.
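A minimal sketch of what such a reward head could look like, assuming the frozen early DiT blocks already produce a (batch, tokens, 5120) space-time feature map. Names like `PAVRMHead` and `dit_features`, as well as the MLP hidden sizes, are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn


class PAVRMHead(nn.Module):
    """Illustrative reward head: a single learnable query attends over the
    space-time features from the frozen early DiT blocks, and a three-layer
    MLP maps the pooled 5120-D vector to a quality score in [0, 1]."""

    def __init__(self, dim: int = 5120, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learnable query
        self.pool = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(                                  # hidden sizes are guesses
            nn.Linear(dim, 1024), nn.SiLU(),
            nn.Linear(1024, 256), nn.SiLU(),
            nn.Linear(256, 1),
        )

    def forward(self, dit_features: torch.Tensor) -> torch.Tensor:
        # dit_features: (batch, tokens, dim) features of a noisy latent
        q = self.query.expand(dit_features.size(0), -1, -1)
        pooled, _ = self.pool(q, dit_features, dit_features)       # (batch, 1, dim)
        return torch.sigmoid(self.mlp(pooled.squeeze(1))).squeeze(-1)  # (batch,)


# Training on binary human labels (0 = fail, 1 = pass):
#   loss = torch.nn.functional.binary_cross_entropy(head(features), labels.float())
```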
4.2 PRFL—training the student
- Sample a random timestep t.
- Roll forward without gradients to t + Δt.
- One gradient step produces the latent at t.
- Ask PAVRM for a score; maximize it.
- Alternate with the normal supervised flow loss to avoid mode collapse.
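The loop above can be sketched roughly as follows. This is a simplification that assumes one Euler step per interval of a rectified-flow schedule; `generator`, `pavrm`, and the call signatures are hypothetical placeholders, not the repository's actual training code.

```python
import torch


def prfl_step(generator, pavrm, noise, cond, schedule, k, optimizer):
    """One illustrative PRFL update. `schedule` is a descending list of
    timesteps (1.0 -> 0.0); schedule[k] plays the role of t + Δt and
    schedule[k + 1] the role of the randomly sampled target step t."""
    z = noise
    # 1) Gradient-free rollout from pure noise down to schedule[k] (= t + Δt).
    with torch.no_grad():
        for s, s_next in zip(schedule[:k], schedule[1:k + 1]):
            v = generator(z, s, cond)              # predicted velocity at step s
            z = z + (s_next - s) * v               # Euler step in latent space
    # 2) A single gradient-carrying step produces the latent at t.
    s, s_next = schedule[k], schedule[k + 1]
    z_t = z + (s_next - s) * generator(z, s, cond)
    # 3) Score the intermediate latent with the frozen reward model; no VAE decode.
    reward = pavrm(z_t, s_next)
    loss = -reward.mean()                          # maximize the PAVRM score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.detach()


# In the full recipe this update is alternated with the ordinary supervised
# flow-matching loss so the generator does not collapse onto the reward.
```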
5. Numbers First: Did It Work?
| Task | Metric | Baseline | PRFL | Δ |
|---|---|---|---|---|
| T2V 720p | Dynamic Degree | 25.0 | 81.0 | +56.0 |
| T2V 720p | Human Anatomy | 68.9 | 90.4 | +21.5 |
| I2V 480p | Dynamic Degree | 57.0 | 87.0 | +30.0 |
| Training | Peak VRAM | 81 frames OOM | 67 GB | usable |
| Speed | Seconds/iter | 72.4 s | 51.1 s | 1.42× faster |
Dynamic Degree measures how much motion appears; Human Anatomy scores how well faces, hands, and torsos are formed, so higher means fewer distortions.
6. Human Evaluation
30 professionals cast 2,250 pairwise votes.
Win rate: PRFL 63%, competitors ≤21%, ties 16%.
Reasons cited: fewer body artifacts, smoother camera motion, better prompt adherence.
7. Inside the Code: Start-to-Finish Commands
7.1 Hardware & software
- Linux + CUDA 12.4
- ≥80 GB GPU recommended (A100 80 GB / H800)
7.2 Install
```bash
git clone https://github.com/Tencent-Hunyuan/HY-Video-PRFL.git
cd HY-Video-PRFL
conda create -n hyprfl python=3.10
conda activate hyprfl
pip3 install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 \
    --index-url https://download.pytorch.org/whl/cu121
pip3 install -e .
export PYTHONPATH=./
```
7.3 Download pretrained weights
```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P \
    --local-dir ./weights/Wan2.1-I2V-14B-720P
```
7.4 Pre-process your videos (latent extraction)
```bash
python3 scripts/preprocess/gen_wanx_latent.py \
    --config configs/pre_720.yaml
```
Edit json_path inside the YAML to point at your MP4 list.
7.5 Label each clip
Add two fields to the auto-generated meta_*.json:
```json
{"physics_quality": 1, "human_quality": 1}
```
0 = fail, 1 = pass. Discard “partial”.
Collect paths into temp_data_720.list.
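A small helper along these lines could attach the two labels and build the list file. The glob pattern, the rest of the meta-file layout, and the one-path-per-line list format are assumptions, so adapt them to what the preprocessing step actually emits.

```python
import glob
import json

# Attach binary labels to each auto-generated meta_*.json and collect the kept
# clips into temp_data_720.list. Only physics_quality / human_quality are
# documented above; everything else here is an assumed layout.
kept = []
for path in sorted(glob.glob("latents/meta_*.json")):
    with open(path) as f:
        meta = json.load(f)
    # Replace these constants with your real per-clip judgments
    # (0 = fail, 1 = pass); skip clips you would only rate "partial".
    meta["physics_quality"] = 1
    meta["human_quality"] = 1
    with open(path, "w") as f:
        json.dump(meta, f, ensure_ascii=False, indent=2)
    kept.append(path)

with open("temp_data_720.list", "w") as f:
    f.write("\n".join(kept) + "\n")
```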
7.6 Train the reward model (8-GPU example)
```bash
torchrun --nproc_per_node=8 scripts/pavrm/train_pavrm.py \
    --config configs/train_pavrm_i2v_720.yaml
```
7.7 Train the generator with PRFL
```bash
torchrun --nproc_per_node=8 scripts/prfl/train_prfl.py \
    --config configs/train_prfl_i2v_720.yaml
```
The config already points to your new PAVRM checkpoint.
7.8 Inference (same script as base model)
```bash
export neg="over-exposed, static, distorted, extra fingers, ..."
torchrun --nproc_per_node=8 scripts/prfl/inference_prfl.py \
    --dit_fsdp --t5_fsdp --task i2v-14B \
    --ckpt_dir ./weights/Wan2.1-I2V-14B-720P \
    --transformer_path <your-prfl-ckpt> \
    --size 1280*720 --frame_num 81 --sample_steps 40 \
    --negative_prompt "$neg" \
    --save_folder outputs/my_run
```
8. Ablation Highlights (What Matters?)
| Factor | Options Compared | Best |
|---|---|---|
| Timestep sampling | Early / Mid / Late only | Full range 0–1 |
| DiT blocks trainable | 8–40 | 16 (trade-off) |
| Aggregation | Mean / Max / Attention-w/-Query | Attention-w/-Query |
| Loss | BCE vs Bradley-Terry | BCE (marginally higher, simpler) |
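To make the aggregation row concrete, here is a minimal sketch of the three pooling strategies compared, each mapping a (batch, tokens, dim) feature tensor to one vector per clip. The class and function names are illustrative; the query variant mirrors the learnable-query mechanism sketched in Section 4.1.

```python
import torch
import torch.nn as nn


def mean_pool(feats: torch.Tensor) -> torch.Tensor:   # (B, T, D) -> (B, D)
    return feats.mean(dim=1)


def max_pool(feats: torch.Tensor) -> torch.Tensor:    # (B, T, D) -> (B, D)
    return feats.max(dim=1).values


class QueryPool(nn.Module):
    """Attention-with-query pooling, the winning variant in the ablation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # (B, T, D) -> (B, D)
        q = self.query.expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)
        return out.squeeze(1)
```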
9. FAQ (Predicting Your Next Questions)
Q1: I have only 40 GB—can I still use PRFL?
A: Yes. Reduce frame_num to 41 or 21; the reward model still sees full latent statistics because of query attention.
Q2: Does PAVRM work for non-portrait scenes?
A: The published checkpoint is tuned on portrait deformities. Re-label your own data for drones, animals, or cityscapes.
Q3: Can I plug PRFL into Stable-Diffusion video models?
A: Any rectified-flow video DiT works as long as you keep the first 8 blocks for feature extraction.
Q4: How many video pairs for decent quality?
A: The authors used 24k (real, generated) pairs. Users report noticeable gains with 5k+ carefully labeled clips.
Q5: Is the code Apache / MIT?
A: GitHub repo carries Apache-2.0. Commercial use is allowed; cite the paper.
10. Current Limits & Where It Can Go
- One reward head only → future multi-head rewards (aesthetics, text faithfulness, camera motion).
- Trained on 720p / 480p → needs checking at 1080p and above.
- Pure post-training → could be merged with ControlNet-style conditions for editable video.
11. Key Takeaway
PRFL shows that the same model can create and critique. By moving judgment into the latent realm you:
- save up to 40% VRAM,
- speed training by more than 1.4×,
- improve motion plausibility by more than 20 points,

all without extra vision-language models at train time. If you hit the memory wall with conventional RGB rewards, give your generator a second job: it might be the judge you need.
Citation
Mi, X., Yu, W., Lian, J. et al. Video Generation Models are Good Latent Reward Models. arXiv:2511.21541, 2025.
