Vivid-VR: Turning Blurry Footage into Cinematic Clarity with a Text-to-Video Transformer

Authors: Haoran Bai, Xiaoxu Chen, Canqian Yang, Zongyao He, Sibin Deng, Ying Chen (Alibaba – Taobao & Tmall Group)
Paper: arXiv:2508.14483
Project page: https://csbhr.github.io/projects/vivid-vr/


1. Why Should You Care About Video Restoration?

If you have ever tried to upscale an old family video, salvage a live-stream recording, or polish AI-generated clips, you have probably asked:

“Photos can be enhanced—why not videos?”

Traditional tools either leave the footage smeared or create disturbing “AI faces.”
Pure diffusion image models fix one frame beautifully but give the next frame a new haircut.

Vivid-VR takes a different route:
it borrows the conceptual understanding built into large text-to-video (T2V) models and distills that knowledge into a restoration network.
The result is stable, photo-realistic, temporally coherent video—without the plastic look.


2. A 30-Second Overview

  1. Describe the low-quality video with an automatic captioner.
  2. Re-create a clean version with the T2V backbone, conditioned on that caption.
  3. Restore the original clip by letting a lightweight ControlNet follow both the caption and the degraded frames.

3. How Vivid-VR Works, Step by Step

| Stage | Everyday Analogy | Core Technique | Section in Paper |
|---|---|---|---|
| 1. Auto-caption | “Write a short script of what’s happening.” | CogVLM2-Video VLM | 3.1 |
| 2. Concept alignment | “Have the director re-shoot the scene.” | T2V denoising guided by caption | 3.2 |
| 3. Guided restoration | “Match the new footage to the old camera angles.” | ControlNet with dual-branch connector | 3.1 |
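
In code, the three stages compose roughly as follows. This is a minimal sketch: captioner, t2v_model, and controlnet are hypothetical placeholders, not the actual Vivid-VR API.

# Minimal sketch of the three-stage pipeline; all names are illustrative.
def restore_video(lq_frames, captioner, t2v_model, controlnet):
    # Stage 1: describe the degraded clip (CogVLM2-Video in the paper).
    caption = captioner(lq_frames)
    # Stages 2-3: the T2V backbone re-imagines a clean version of the scene,
    # while ControlNet features extracted from the degraded frames keep the
    # generation locked to the original content and camera motion.
    latents = t2v_model.sample(prompt=caption, control=controlnet(lq_frames))
    return t2v_model.decode(latents)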

4. Frequently Asked Questions

Q1: Can I just run Stable Diffusion frame-by-frame?

No.

  • It breaks temporal coherence—faces warp between frames.
  • Vivid-VR uses a Diffusion Transformer (DiT) that models motion natively, so the eyes stay in place.

Q2: Why do some T2V restoration models look “plasticky”?

Distribution drift.
Fine-tuning data rarely aligns perfectly with the giant pre-training corpus.
Vivid-VR fixes this by letting the T2V model re-synthesize the training pairs first, locking the distribution in place.

Q3: Will it run on my laptop?

  • Inference fits in 12 GB VRAM at 1024 × 1024 (RTX 3060/4060).
  • Training took 32 H20-96G GPUs for 6,000 GPU-hours; hobbyists can fine-tune LoRA adapters in ~3 days on 8×A100-80G.

5. Installation Guide (Copy-and-Paste Ready)

5.1 Clone the repository

git clone https://github.com/csbhr/Vivid-VR.git
cd Vivid-VR

5.2 Create a conda environment

conda create -n Vivid-VR python=3.10
conda activate Vivid-VR

# CUDA 12.1 wheels
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 \
            --index-url https://download.pytorch.org/whl/cu121

pip install -r requirements.txt

5.3 Download checkpoints

Place every file under ./ckpts/ exactly like this:

ckpts/
├── CogVideoX1.5-5B/           # main T2V backbone
├── cogvlm2-llama3-caption/    # automatic caption model
├── Vivid-VR/
│   ├── controlnet/
│   ├── connectors.pt
│   ├── control_feat_proj.pt
│   └── control_patch_embed.pt
├── easyocr/                   # optional text-region detector (used by --textfix)
└── RealESRGAN/

Direct links

  • CogVideoX1.5-5B: Hugging Face zai-org/CogVideoX1.5-5B
  • CogVLM2 caption: Hugging Face zai-org/cogvlm2-llama3-caption
  • Vivid-VR weights: Hugging Face csbhr/Vivid-VR
  • EasyOCR weights (optional):
    – https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip
    – https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/zh_sim_g2.zip
    – https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip
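
If you prefer scripting the Hugging Face downloads, the snippet below is one way to do it, assuming the huggingface_hub package is installed and the repo IDs above are publicly accessible (the EasyOCR and Real-ESRGAN weights still come from the direct links):

# Fetch the Hugging Face checkpoints into the layout shown in 5.3.
# Adjust local_dir if a repo's internal layout differs from that tree.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="zai-org/CogVideoX1.5-5B",
                  local_dir="./ckpts/CogVideoX1.5-5B")
snapshot_download(repo_id="zai-org/cogvlm2-llama3-caption",
                  local_dir="./ckpts/cogvlm2-llama3-caption")
snapshot_download(repo_id="csbhr/Vivid-VR",
                  local_dir="./ckpts/Vivid-VR")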

5.4 Quick inference

python VRDiT/inference.py \
    --ckpt_dir ./ckpts \
    --cogvideox_ckpt_path ./ckpts/CogVideoX1.5-5B \
    --cogvlm2_ckpt_path ./ckpts/cogvlm2-llama3-caption \
    --input_dir /path/to/low_quality_videos \
    --output_dir /path/to/enhanced_videos \
    --upscale 0 \
    --textfix \
    --save_images

| Flag | Purpose |
|---|---|
| --upscale 0 | output short side = 1024 px |
| --textfix | sharpen on-screen text with Real-ESRGAN |
| --save_images | also export PNG frames |
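
To process several folders in one go (for example, one folder per project), a thin wrapper around the CLI works. The script below is a hypothetical sketch that reuses the exact flags shown above and assumes one subfolder of clips per project:

# Hypothetical batch driver: run VRDiT/inference.py once per input subfolder.
import subprocess
from pathlib import Path

CKPTS = "./ckpts"
INPUT_ROOT = Path("/path/to/projects")    # each subfolder holds low-quality clips
OUTPUT_ROOT = Path("/path/to/enhanced")

for folder in sorted(p for p in INPUT_ROOT.iterdir() if p.is_dir()):
    out_dir = OUTPUT_ROOT / folder.name
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "python", "VRDiT/inference.py",
        "--ckpt_dir", CKPTS,
        "--cogvideox_ckpt_path", f"{CKPTS}/CogVideoX1.5-5B",
        "--cogvlm2_ckpt_path", f"{CKPTS}/cogvlm2-llama3-caption",
        "--input_dir", str(folder),
        "--output_dir", str(out_dir),
        "--upscale", "0",
        "--textfix",
        "--save_images",
    ], check=True)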

6. Benchmarks & Real-World Test

| Dataset | Metric | Real-ESRGAN | SUPIR | SeedVR-7B | Vivid-VR |
|---|---|---|---|---|---|
| SPMCS (synthetic) | DOVER ↑ | 8.24 | 10.07 | 9.78 | 11.35 |
| VideoLQ (real) | NIQE ↓ | 5.87 | 5.40 | 5.66 | 4.36 |
| UGC50 (user clips) | CLIP-IQA ↑ | 0.35 | 0.38 | 0.38 | 0.45 |
| AIGC50 (AI clips) | MD-VQA ↑ | 84.56 | 84.80 | 81.47 | 89.69 |

Tested on RTX 4090, 16-frame clips, 50 denoising steps.
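
If you want to sanity-check the per-frame, no-reference numbers on your own outputs, one option is the third-party pyiqa package, which implements NIQE and CLIP-IQA (DOVER and MD-VQA are video-level metrics with their own official repositories and are not covered here). The snippet below is a rough sketch; the frame directory is a placeholder:

# Assumes `pip install pyiqa` and PNG frames exported with --save_images.
import torch
import pyiqa
from pathlib import Path
from torchvision.io import read_image, ImageReadMode

device = "cuda" if torch.cuda.is_available() else "cpu"
niqe = pyiqa.create_metric("niqe", device=device)        # lower is better
clipiqa = pyiqa.create_metric("clipiqa", device=device)  # higher is better

frames = sorted(Path("/path/to/enhanced_videos/clip_0001").glob("*.png"))
imgs = torch.stack([read_image(str(f), ImageReadMode.RGB).float() / 255.0
                    for f in frames]).to(device)          # (N, 3, H, W) in [0, 1]

print("NIQE    :", niqe(imgs).mean().item())
print("CLIP-IQA:", clipiqa(imgs).mean().item())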


7. Training Your Own Data (Tips from the Authors)

| Factor | Recommendation |
|---|---|
| Dataset size | ≥20,000 clips, ≥10 s each, 720p+ |
| Captions | regenerate noisy captions with CogVLM2-Video, then distill |
| Hardware | 8×A100-80G, LoRA fine-tune ≈ 3 days |
| Learning rate | 1e-4 with cosine annealing |
| Steps | 30,000 iterations (can stop earlier for domain-specific data) |
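
Collected as a single configuration block, the recommendations above look roughly like this; the key names are illustrative, not the actual training-script arguments:

# Hypothetical fine-tuning configuration mirroring the table above.
finetune_config = {
    "min_clips": 20_000,           # dataset size
    "min_clip_seconds": 10,
    "min_short_side": 720,         # resolution, in pixels
    "recaption_model": "cogvlm2-llama3-caption",
    "hardware": "8x A100-80G",     # ~3 days for a LoRA fine-tune
    "learning_rate": 1e-4,
    "lr_schedule": "cosine_annealing",
    "max_steps": 30_000,           # stop earlier for narrow, domain-specific data
}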

8. Deep Dive: The Math Behind “Concept Distillation”

  1. Loss function (v-prediction)

    L = E‖v − v_θ(x_t, x_lq, x_text, t)‖²
    
    • v = target velocity in latent space
    • x_t = noisy latent at step t
    • x_lq = low-quality video latent
    • x_text = text tokens from caption
  2. Distribution alignment trick

    • Take a high-quality clip.
    • Add noise to the halfway timestep (T/2).
    • Let CogVideoX1.5-5B denoise it conditioned on its own caption.
    • The result is an aligned pair (noisy, clean) that keeps the T2V prior intact.
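
Put together, the two pieces look roughly like the PyTorch-style sketch below. It is a minimal illustration, not the actual training code: model, the latent shapes, the text tokens, and the denoise_from method are all placeholders.

# Minimal sketch of the v-prediction loss and the halfway-timestep re-synthesis.
# `model`, `t2v_model.denoise_from`, and the latent shapes are placeholders.
import torch

def v_prediction_loss(model, x0, x_lq, x_text, alphas_cumprod):
    """L = E || v - v_theta(x_t, x_lq, x_text, t) ||^2"""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a = alphas_cumprod[t].view(b, 1, 1, 1, 1)            # broadcast over (B, C, F, H, W)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise         # noisy latent at step t
    v_target = a.sqrt() * noise - (1 - a).sqrt() * x0    # velocity target
    v_pred = model(x_t, x_lq, x_text, t)
    return torch.mean((v_target - v_pred) ** 2)

@torch.no_grad()
def resynthesize_target(t2v_model, x_hq, caption_tokens, alphas_cumprod):
    """Re-noise a clean latent to T/2, then let the frozen T2V backbone denoise it
    conditioned on its own caption, giving a distribution-aligned training target."""
    t_half = len(alphas_cumprod) // 2
    a = alphas_cumprod[t_half]
    x_noisy = a.sqrt() * x_hq + (1 - a).sqrt() * torch.randn_like(x_hq)
    return t2v_model.denoise_from(x_noisy, t_start=t_half, text=caption_tokens)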

9. Ablation Study Highlights

| Setting | NIQE ↓ | DOVER ↑ | Notes |
|---|---|---|---|
| w/o Control Feature Projector | 4.62 | 13.98 | artifacts leak into generation |
| w/o Dual-Branch Connector | 5.18 | 13.04 | poor detail recovery |
| w/o Concept Distillation | 5.36 | 12.99 | texture oversharpened |

10. Known Limitations

  • Speed: 50-step DPM-Solver is not real-time (~2 fps on RTX 4090).
  • Very fast motion: sports scenes may show subtle ghost trails.
  • Memory: 1024 × 1024 × 37 frames peaks at ~18 GB; tiling or frame-windowing can help.
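
For the memory point above, one simple frame-windowing scheme is to restore overlapping windows and stitch the results. The helper below is a hypothetical sketch (the 8-frame overlap is an arbitrary choice):

# Split a long clip into overlapping frame windows for piecewise restoration.
def frame_windows(num_frames, window=37, overlap=8):
    """Yield (start, end) index pairs covering [0, num_frames) with overlap."""
    step = window - overlap
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        yield start, end
        if end == num_frames:
            break
        start += step

# Example: a 120-frame clip -> [(0, 37), (29, 66), (58, 95), (87, 120)]
print(list(frame_windows(120)))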

11. Quick Comparison Images


[Comparison panels: Input, Bicubic, SUPIR, SeedVR-7B, Vivid-VR]
  • Windows and doors stay consistent across frames.
  • Facial pores and animal fur regain natural detail.

12. Citation

If you use this work, please cite:

@article{bai2025vividvr,
  title={Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration},
  author={Bai, Haoran and Chen, Xiaoxu and Yang, Canqian and He, Zongyao and Deng, Sibin and Chen, Ying},
  journal={arXiv preprint arXiv:2508.14483},
  year={2025},
  url={https://arxiv.org/abs/2508.14483}
}

13. License Summary

  • Vivid-VR code: Apache 2.0 (same as diffusers v0.31.0)
  • CogVideoX1.5-5B: CogVideoX License
  • CogVLM2 caption: CogVLM2 + LLaMA3 dual license
  • Real-ESRGAN: BSD-3-Clause
  • EasyOCR: JAIDED.AI Terms

14. One-Sentence Takeaway

Vivid-VR teaches a text-to-video model to re-imagine your degraded footage, then trains a lightweight ControlNet to reproduce that vision, giving you stable, cinematic, largely artifact-free results straight out of the box.