Video Generation Models Can Also Be the Judge: How PRFL Finetunes a 14B Model in 67 GB VRAM and Lifts Motion Scores by 56 Points

Train on every frame (81 frames at 720P) without blowing memory, speed the training loop up 1.4×, and push motion scores from 25 to 81. All of it happens in latent space, with no VAE decoding required.


1. Why a “Judge” Is Missing in Current Video Models

People type these questions into search boxes every day:

  • “AI video motion looks fake—how to fix?”
  • “Finetune large video model with limited GPU memory?”
  • “Which method checks physics consistency during generation?”

Classic pipelines give a reward only after pixels are clean. That single decision point causes three pain points:

| Pain point | What it means in practice |
| --- | --- |
| Late supervision | Structure and motion are settled in the early denoising steps; a final-stage reward arrives too late to correct them. |
| Memory wall | Decoding 81 RGB frames for a vision-language reward model often triggers out-of-memory (OOM) errors on 80 GB cards. |
| Slow iteration | Waiting for full 40-step denoising plus decoding on every training step burns GPU hours. |

2. PRFL in One Sentence

Process Reward Feedback Learning (PRFL) lets the video generator itself score every intermediate noisy latent, then back-propagates that signal without ever calling the VAE decoder. You get gradient flow across the whole denoising chain, early guidance for structure and motion, and a 67 GB peak-VRAM footprint.


3. Core Idea: Turn Generator into Judge

| Aspect | RGB ReFL | PRFL (latent) |
| --- | --- | --- |
| Input | Clean pixels | Noisy latent at any step |
| Reward net | Extra vision-language model | First 8 DiT blocks of your generator |
| VAE decode? | Yes, full frames | Never |
| When to teach? | Final step only | Random step t ∈ [0, 1) |
| Peak VRAM | OOM on 81 frames | 67 GB |

4. Two-Part Recipe (PAVRM + PRFL)

4.1 PAVRM—building the judge

  • Freeze the VAE and text encoder; keep only the early DiT blocks as the feature extractor.
  • One learnable query vector attends over the space-time features and pools them into a single 5120-D vector.
  • A three-layer MLP outputs a 0–1 quality score; train it with binary cross-entropy on the human labels (a minimal sketch follows).
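
Below is a minimal PyTorch sketch of such a reward head. The 5120-D feature width, the single learnable query, the three-layer MLP, and the BCE objective come from the recipe above; the class name, number of attention heads, and token counts are illustrative assumptions, not the repo's actual code.

import torch
import torch.nn as nn

class LatentRewardHead(nn.Module):
    """Pools space-time latent features with one learnable query, then scores them."""
    def __init__(self, dim: int = 5120, heads: int = 8):
        super().__init__()
        # One learnable query summarizes a variable number of space-time tokens.
        self.query = nn.Parameter(torch.randn(1, 1, dim) * dim ** -0.5)
        self.pool = nn.MultiheadAttention(dim, num_heads=heads, batch_first=True)
        # Three-layer MLP maps the pooled vector to a single quality logit.
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(),
            nn.Linear(dim, dim), nn.SiLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_space_time_tokens, dim) from the early DiT blocks
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.pool(q, feats, feats)           # (batch, 1, dim)
        return self.mlp(pooled.squeeze(1)).squeeze(-1)   # (batch,) quality logit

# Quick shape check with a small width (the real model uses dim=5120):
head = LatentRewardHead(dim=512)
feats = torch.randn(2, 256, 512)                  # dummy space-time tokens
labels = torch.tensor([1.0, 0.0])                 # pass / fail human labels
loss = nn.functional.binary_cross_entropy_with_logits(head(feats), labels)

Since the sigmoid is folded into the BCE-with-logits loss during training, the 0–1 score at inference is simply torch.sigmoid applied to the logit.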

4.2 PRFL—training the student

  1. Sample a random timestep t.
  2. Denoise from pure noise without gradients until the latent reaches timestep t + Δt.
  3. Take one gradient-carrying denoising step to obtain the latent at t.
  4. Ask PAVRM to score that latent; maximize the score.
  5. Alternate with the normal supervised flow loss to avoid mode collapse (see the sketch below).
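
As a concrete picture of steps 1–5, here is a minimal sketch of one PRFL update under a simple Euler rectified-flow sampler. The generator call signature, the reward-model call, and the step schedule are placeholders; only the structure (no-grad rollout, one gradient step, reward maximization) mirrors the recipe.

import torch

def prfl_loss(generator, reward_model, noise, cond, num_steps=40, reward_weight=1.0):
    # Timestep grid from pure noise (t = 1) down to clean (t = 0).
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    # 1. Pick the random target step t.
    t_idx = torch.randint(1, num_steps + 1, (1,)).item()

    # 2. Roll the sampler WITHOUT gradients down to t + Δt.
    z = noise
    with torch.no_grad():
        for i in range(t_idx - 1):
            v = generator(z, ts[i], cond)            # predicted velocity
            z = z + (ts[i + 1] - ts[i]) * v          # Euler step of the flow ODE

    # 3. One gradient-carrying step yields the latent at t.
    v = generator(z, ts[t_idx - 1], cond)
    z_t = z + (ts[t_idx] - ts[t_idx - 1]) * v

    # 4. Score the noisy latent with PAVRM and maximize the score.
    reward = reward_model(z_t, ts[t_idx])
    return -reward_weight * reward.mean()

# 5. In the outer loop, alternate this loss with the usual supervised flow loss
#    so the generator keeps matching the data distribution (mode-collapse guard).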

5. Numbers First: Did It Work?

| Task | Metric | Baseline | PRFL | Δ |
| --- | --- | --- | --- | --- |
| T2V 720P | Dynamic Degree | 25.0 | 81.0 | +56.0 |
| T2V 720P | Human Anatomy | 68.9 | 90.4 | +21.5 |
| I2V 480P | Dynamic Degree | 57.0 | 87.0 | +30.0 |
| Training | Peak VRAM | OOM at 81 frames | 67 GB | usable |
| Speed | Seconds per iteration | 72.4 s | 51.1 s | 1.42× faster |

Dynamic Degree measures how much motion appears; Human Anatomy scores how well faces, hands, and torsos are rendered (higher means fewer distortions).


6. Human Evaluation

30 professionals cast 2,250 pairwise votes.
Win rate: PRFL 63%, competitors ≤ 21%, ties 16%.
Reasons cited: fewer body artifacts, smoother camera motion, better prompt adherence.


7. Inside the Code: Start-to-Finish Commands

7.1 Hardware & software

  • Linux + CUDA 12.4
  • ≥80 GB GPU recommended (A100 80 GB / H800)

7.2 Install

git clone https://github.com/Tencent-Hunyuan/HY-Video-PRFL.git
cd HY-Video-PRFL
conda create -n hyprfl python=3.10
conda activate hyprfl
pip3 install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 \
            --index-url https://download.pytorch.org/whl/cu121
pip3 install -e .
export PYTHONPATH=./

7.3 Download pretrained weights

pip install -U "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P \
            --local-dir ./weights/Wan2.1-I2V-14B-720P

7.4 Pre-process your videos (latent extraction)

python3 scripts/preprocess/gen_wanx_latent.py \
        --config configs/pre_720.yaml

Edit json_path inside the YAML to point at your MP4 list.
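
For example, the edit might look like the excerpt below; only the json_path key is prescribed by the repo, and the value is a hypothetical path to whatever list file you generated.

# configs/pre_720.yaml (excerpt)
json_path: ./data/my_videos.json   # hypothetical path to your MP4 list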

7.5 Label each clip

Add two fields to the auto-generated meta_*.json:

{"physics_quality": 1, "human_quality": 1}

0 = fail, 1 = pass. Discard clips you would only rate as “partial”.
Collect the meta-file paths into temp_data_720.list (a helper sketch follows).
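
A tiny helper along these lines can stamp the labels and build the list file. The glob pattern and output locations are assumptions; only the two field names and the 0/1 convention come from above.

import glob, json

paths = []
for meta_path in sorted(glob.glob("latents/meta_*.json")):
    with open(meta_path) as f:
        meta = json.load(f)
    # Replace the hard-coded 1/1 with your actual human annotations.
    meta["physics_quality"] = 1
    meta["human_quality"] = 1
    with open(meta_path, "w") as f:
        json.dump(meta, f)
    paths.append(meta_path)

# One meta-file path per line, as expected by the training list.
with open("temp_data_720.list", "w") as f:
    f.write("\n".join(paths) + "\n")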

7.6 Train the reward model (8-GPU example)

torchrun --nproc_per_node=8 scripts/pavrm/train_pavrm.py \
         --config configs/train_pavrm_i2v_720.yaml

7.7 Train the generator with PRFL

torchrun --nproc_per_node=8 scripts/prfl/train_prfl.py \
         --config configs/train_prfl_i2v_720.yaml

The config already points to your new PAVRM checkpoint.

7.8 Inference (same script as base model)

export neg="over-exposed, static, distorted, extra fingers, ..."

torchrun --nproc_per_node=8 scripts/prfl/inference_prfl.py \
  --dit_fsdp --t5_fsdp --task i2v-14B \
  --ckpt_dir ./weights/Wan2.1-I2V-14B-720P \
  --transformer_path <your-prfl-ckpt> \
  --size 1280*720 --frame_num 81 --sample_steps 40 \
  --negative_prompt "$neg" \
  --save_folder outputs/my_run

8. Ablation Highlights (What Matters?)

| Factor | Settings tried | Best |
| --- | --- | --- |
| Timestep sampling | Early / mid / late only, or full range | Full range 0–1 |
| Trainable DiT blocks | 8–40 | 16 (best trade-off) |
| Feature aggregation | Mean / max / attention with learnable query | Attention with learnable query |
| Reward loss | BCE vs. Bradley–Terry | BCE (marginally higher, simpler) |

9. FAQ (Predicting Your Next Questions)

Q1: I have only 40 GB—can I still use PRFL?
A: Yes. Reduce frame_num to 41 or 21; the reward model still sees full latent statistics because of query attention.

Q2: Does PAVRM work for non-portrait scenes?
A: The published checkpoint is tuned to catch portrait and human-body deformities. Re-label your own data for drones, animals, or cityscapes.

Q3: Can I plug PRFL into Stable-Diffusion video models?
A: Any rectified-flow video DiT works as long as you keep the first 8 blocks for feature extraction.
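
For orientation, the adaptation boils down to tapping those early blocks as the reward model's feature extractor, roughly as in the hypothetical sketch below (attribute names such as embed and blocks vary between DiT implementations).

# Hypothetical: reuse the first 8 transformer blocks of a rectified-flow DiT
# as the feature extractor feeding the reward head from section 4.1.
def extract_reward_features(dit, z_t, t, cond, n_blocks=8):
    h = dit.embed(z_t, t, cond)          # whatever patch/time/text embedding the model uses
    for block in dit.blocks[:n_blocks]:  # early blocks carry structure and motion cues
        h = block(h)
    return h                             # space-time tokens for the reward head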

Q4: How many video pairs for decent quality?
A: The authors used 24k (real, generated) pairs. Users report noticeable gains with 5k+ carefully labeled clips.

Q5: Is the code Apache / MIT?
A: GitHub repo carries Apache-2.0. Commercial use is allowed; cite the paper.


10. Current Limits & Where It Can Go

  • One reward head only → a future version could add multiple heads (aesthetics, text faithfulness, camera motion).
  • Trained on 720P / 480P → still needs validation at 1080P and above.
  • Purely a post-training method → could be combined with ControlNet-style conditioning for editable video.

11. Key Takeaway

PRFL shows that the same model can create and critique. By moving judgment into the latent realm you:

  • save up to 40% of VRAM,
  • speed up training by more than 1.4×,
  • improve motion-plausibility scores by more than 20 points,

all without extra vision-language models at train time. If you hit the memory wall with conventional RGB rewards, give your generator a second job—it might be the judge you need.


Citation

Mi, X., Yu, W., Lian, J., et al. Video Generation Models are Good Latent Reward Models. arXiv:2511.21541, 2025.