LongCat-Video: Building the Foundation Model for Long-Form Video Generation

「Core question: Why did Meituan build a new video generation model?」
Video generation is not just about creating moving images — it’s about building world models that can simulate dynamic reality.
LongCat-Video is Meituan’s first large-scale foundation model designed to understand and generate temporally coherent, realistic, and long-duration videos.


1. The New Era of Long-Form Video Generation

「Core question: What problem does LongCat-Video solve?」

Most text-to-video models today can only produce a few seconds of coherent footage.
As the clip grows longer, familiar failure modes appear:

  • 「Color drift」 between frames
  • 「Inconsistent motion」 or abrupt scene cuts
  • 「Disappearing characters or background reconstruction errors」

「LongCat-Video」 directly tackles these limitations.
With 「13.6 billion parameters」, it unifies three major tasks within a single framework:

  • 「Text-to-Video (T2V)」
  • 「Image-to-Video (I2V)」
  • 「Video Continuation (VC)」

It’s not just a generator — it’s a unified video understanding and synthesis model that scales up temporal reasoning.


2. Unified Model for Three Video Tasks

「Core question: What can LongCat-Video actually do?」

| Task Type | Input | Output | Typical Use Case |
| --- | --- | --- | --- |
| Text-to-Video | Text prompt | Video | Storytelling, marketing, creative production |
| Image-to-Video | Single image | Video | Product animation, cinematic rendering |
| Video Continuation | Partial video | Extended video | Long-form content, digital humans, scene extension |

LongCat-Video automatically adapts to each task by detecting how many frames are provided as “conditions”:

  • 「No condition frame →」 text-to-video
  • 「Single frame →」 image-to-video
  • 「Multiple frames →」 video continuation

This unified input format allows shared representations and efficient fine-tuning across tasks — a major advantage for industrial-scale deployment.
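
As a rough illustration (the function below is a hypothetical sketch, not the repository API), the conditioning rule boils down to:

def select_task(condition_frames=None):
    # Illustrative dispatch: the unified model keys its behavior off the
    # number of condition frames supplied alongside the text prompt.
    n = 0 if condition_frames is None else len(condition_frames)
    if n == 0:
        return "text-to-video"       # generate purely from the prompt
    if n == 1:
        return "image-to-video"      # animate a single reference image
    return "video-continuation"      # extend the provided clip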


3. Inside the Architecture: The Diffusion Transformer Core

「Core question: How does the model achieve unified video generation?」

LongCat-Video is built on a 「single-stream Diffusion Transformer (DiT)」, optimized for spatiotemporal data.

Each transformer block integrates:

  1. 「3D Self-Attention」 – jointly models spatial structure and temporal dynamics across frames
  2. 「Cross-Attention」 – aligns visual tokens with textual context
  3. 「SwiGLU Activation + AdaLN-Zero Modulation」 – ensures stable training and expressive scaling

Additional architectural components:

  • 「3D RoPE positional encoding」 to represent spatial and temporal positions precisely
  • 「RMSNorm + QKNorm」 to stabilize long-sequence gradients
  • 「WAN 2.1 VAE」 for latent compression at 4×16×16
  • 「umT5 text encoder」 for bilingual (Chinese/English) text conditioning
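
To make this concrete, here is a minimal PyTorch-style sketch of one such block; the module names, dimensions, and wiring are illustrative assumptions rather than the actual LongCat-Video implementation, and RoPE / QKNorm are omitted for brevity:

import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    # Simplified single-stream DiT block: 3D self-attention over spatiotemporal
    # tokens, cross-attention to text tokens, a SwiGLU feed-forward network,
    # all gated by AdaLN-Zero modulation derived from the timestep embedding.
    def __init__(self, dim, heads, text_dim):
        super().__init__()
        self.norm1 = nn.RMSNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.RMSNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.RMSNorm(dim)
        self.ffn_gate = nn.Linear(dim, 4 * dim)   # SwiGLU: gate branch
        self.ffn_up = nn.Linear(dim, 4 * dim)     # SwiGLU: value branch
        self.ffn_down = nn.Linear(4 * dim, dim)
        # AdaLN-Zero: per-block gates predicted from the timestep embedding,
        # zero-initialized so each block starts as an identity mapping.
        self.ada = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, text, t_emb):
        # x: [B, T*H*W, dim] video tokens; text: [B, L, text_dim]; t_emb: [B, dim]
        g1, g2, g3 = self.ada(t_emb).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm1(x)
        x = x + g1 * self.self_attn(h, h, h)[0]
        x = x + g2 * self.cross_attn(self.norm2(x), text, text)[0]
        h = self.norm3(x)
        x = x + g3 * self.ffn_down(nn.functional.silu(self.ffn_gate(h)) * self.ffn_up(h))
        return x

The zero-initialized modulation is what lets a deep stack of such blocks train stably: each block starts as a near-identity mapping and gradually learns its contribution.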

When given a prompt like “a cat stretching under sunlight”, the model generates a latent motion trajectory that the VAE decodes into a temporally continuous, visually coherent video.


4. Data Pipeline: Millions of Videos, Curated for Learning

「Core question: Where does LongCat-Video’s capability come from?」

The foundation of any generative model is data quality, not parameter count.
The LongCat team designed a rigorous two-stage 「data curation pipeline」.

Stage 1: Preprocessing

  • 「Deduplication」 – by video ID and MD5 hash
  • 「Scene segmentation」 – using PySceneDetect + TransNetV2
  • 「Black-border cropping」 – via FFmpeg
  • 「Efficient packaging」 – for streaming data loading
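
A hedged sketch of how the first two steps could be wired together with the named tools (PySceneDetect's high-level API plus an MD5 check; file paths and the dedup store are illustrative):

import hashlib
from pathlib import Path
from scenedetect import detect, ContentDetector  # PySceneDetect

def md5_of(path, chunk=1 << 20):
    # Hash file contents in chunks so large videos are not loaded into memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

seen = set()
for video in Path("raw_videos").glob("*.mp4"):
    digest = md5_of(video)
    if digest in seen:          # deduplicate by content hash
        continue
    seen.add(digest)
    # Split into single-shot clips; TransNetV2 could then refine these boundaries.
    scenes = detect(str(video), ContentDetector())
    print(video.name, [(s.get_timecode(), e.get_timecode()) for s, e in scenes])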

Stage 2: Annotation

Each video clip receives multi-dimensional annotations:

| Label Type | Key Attributes |
| --- | --- |
| Basic metadata | duration, resolution, fps, bitrate |
| Visual quality | blur level, aesthetic score, watermark |
| Dynamic metrics | optical flow stability |
| Semantic consistency | text–video alignment score |

Beyond basic labels, multi-modal LLMs like 「LLaVA-Video」 and 「Qwen2.5VL」 enrich the data with semantic layers:

  • Camera motion (pan, tilt, zoom, dolly)
  • Shot types (close-up, medium, wide)
  • Style classification (realistic, 2D, 3D render)
  • Bilingual captioning and summarization
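
To give a feel for the result, a fully annotated clip might look something like the record below (a hypothetical Python dict; field names and values are illustrative, not the actual schema):

clip_annotation = {
    # Basic metadata
    "duration_s": 6.2, "resolution": "1920x1080", "fps": 30, "bitrate_kbps": 4500,
    # Visual quality
    "blur_score": 0.08, "aesthetic_score": 5.6, "has_watermark": False,
    # Dynamics and semantic consistency
    "optical_flow_stability": 0.91, "text_video_alignment": 0.87,
    # Layers added by multi-modal LLMs
    "camera_motion": "slow dolly-in", "shot_type": "medium", "style": "realistic",
    "caption_en": "A chef plates a bowl of noodles in a busy kitchen.",
    "caption_zh": "厨师在繁忙的后厨为一碗面条摆盘。",
}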

「Author’s reflection:」
Labeling isn’t just data cleaning — it’s world reconstruction.
Every tag helps the model learn how humans perceive and describe motion, a cornerstone for world-model intelligence.


5. Training Strategy: From Base Model to Human Preference

「Core question: How does the model learn visual and temporal alignment?」

Training proceeds in three main stages.

Stage 1: Base Model with Flow Matching

Instead of the conventional denoising-diffusion objective, LongCat-Video adopts 「Flow Matching」: the model learns to predict the velocity field that transports noise into clean video latents.

Loss function:

L = E[‖v_pred(x_t, c, t; θ) − (x0 − ε)‖²],  where x_t = t·x0 + (1 − t)·ε interpolates from Gaussian noise ε (at t = 0) to the clean latent x0 (at t = 1)

Progressive training is applied:
「image → low-res video → multi-task joint learning」, improving semantic, motion, and temporal understanding step by step.
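
A minimal sketch of this objective in PyTorch, assuming a model callable that takes the noisy latents, the conditioning, and the timestep (the signature is an assumption):

import torch

def flow_matching_loss(model, x0, cond):
    # x0: clean video latents [B, C, T, H, W]; cond: text/frame conditioning.
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)        # one timestep per sample
    eps = torch.randn_like(x0)                 # Gaussian noise
    t_ = t.view(b, 1, 1, 1, 1)
    xt = t_ * x0 + (1.0 - t_) * eps            # interpolate noise -> data
    v_target = x0 - eps                        # velocity dx_t/dt
    v_pred = model(xt, cond, t)                # hypothetical model signature
    return ((v_pred - v_target) ** 2).mean()

At inference, the learned velocity field is integrated from pure noise to a clean latent over a fixed number of steps, which is what the step counts in Section 6 refer to.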

Stage 2: Human Feedback Optimization (RLHF)

The model applies 「Group Relative Policy Optimization (GRPO)」 with several innovations:

  • 「Fixed-time sampling」 to stabilize temporal credit assignment
  • 「Gradient reweighting」 for noisy stages
  • 「Max-group standard deviation」 for reward balance

A 「multi-reward system」 ensures balanced learning:

| Reward Type | Objective | Evaluation Model |
| --- | --- | --- |
| Visual Quality (VQ) | Clarity, aesthetics, fidelity | HPSv3 |
| Motion Quality (MQ) | Continuity, realism | VideoAlign |
| Text Alignment (TA) | Semantic match | VideoAlign |

This prevents reward hacking and encourages stable convergence between clarity, motion, and textual alignment.
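
For intuition, the group-relative core of GRPO combined with a weighted multi-reward signal might look like the sketch below; the weights are illustrative, and the fixed-time sampling and gradient-reweighting tricks listed above are not shown:

import torch

def group_relative_advantages(vq, mq, ta, w=(1.0, 1.0, 1.0)):
    # vq, mq, ta: reward tensors of shape [G], scored by HPSv3 / VideoAlign
    # for G videos sampled from the same prompt (one GRPO "group").
    r = w[0] * vq + w[1] * mq + w[2] * ta
    # Advantages are relative to the group, so the policy is rewarded for
    # beating its own other samples rather than chasing an absolute score.
    return (r - r.mean()) / (r.std() + 1e-6)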

Stage 3: Accelerated Refinement

Two optimization techniques boost performance dramatically:

  • 「Coarse-to-Fine (C2F) Generation」:
    Generate a rough 480p/15fps video first, then refine it to 720p/30fps using expert LoRA layers — achieving 「10× faster inference」.
  • 「Block Sparse Attention (BSA)」:
    Compute only the roughly 10% most relevant attention blocks, with almost no loss in quality.
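
Sketched as a two-stage pipeline (the object and argument names below are hypothetical stand-ins, not the released API):

def generate_coarse_to_fine(base_model, refine_lora, prompt):
    # Stage 1: cheap draft that fixes layout, motion, and semantics.
    draft = base_model.sample(prompt, height=480, fps=15, steps=16)
    # Stage 2: expert LoRA layers upsample and sharpen the draft in a few steps;
    # block sparse attention keeps this 720p pass affordable.
    return refine_lora.sample(prompt, init_video=draft, height=720, fps=30, steps=5)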

「Author’s insight:」
This is the true mark of engineering excellence: turning a research-level model into a deployable production engine.


6. Inference Efficiency: Minutes to Generate 720p Video

「Core question: How fast can it run in practice?」

| Mode | Steps | Resolution | Time (s) | Speedup |
| --- | --- | --- | --- | --- |
| Baseline | 50 | 720p | 1429 | 1.0× |
| LoRA Distilled | 16 | 720p | 244 | 5.8× |
| Coarse-to-Fine | 16/5 | 720p | 135 | 10.6× |
| C2F + Sparse Attention | 16/5 | 720p | 116 | 12.3× |

With these optimizations, a 「single H800 GPU」 can generate a full 720p 30fps video in just a few minutes.


7. Quick Start Guide

「Core question: How can I try LongCat-Video locally?」

# Clone repository
git clone https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video

# Create environment
conda create -n longcat-video python=3.10
conda activate longcat-video

# Install dependencies
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.7.4.post1
pip install -r requirements.txt

Download weights:

pip install "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video

Run demo scripts:

# Text-to-Video
torchrun run_demo_text_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

# Image-to-Video
torchrun run_demo_image_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

# Video Continuation
torchrun run_demo_video_continuation.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
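
As noted in the FAQ below, the demo scripts also accept a --prompt argument, so a custom run would look something like:

# Text-to-Video with a custom prompt
torchrun run_demo_text_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile --prompt "a dog running in the snow"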

8. Benchmark Results: Competitive and Efficient

「Core question: How does it perform compared to other open models?」

Text-to-Video Evaluation

| Model | TextAlign | Visual | Motion | Overall |
| --- | --- | --- | --- | --- |
| Veo 3 | 3.99 | 3.23 | 3.86 | 3.48 |
| PixVerse V5 | 3.81 | 3.13 | 3.81 | 3.36 |
| Wan 2.2 T2V | 3.70 | 3.26 | 3.78 | 3.35 |
| 「LongCat-Video」 | 3.76 | 3.25 | 3.74 | 3.38 |

Image-to-Video Evaluation

| Model | ImgAlign | TextAlign | Visual | Motion | Overall |
| --- | --- | --- | --- | --- | --- |
| Seedance 1.0 | 4.12 | 3.70 | 3.22 | 3.77 | 3.35 |
| Hailuo 02 | 4.18 | 3.85 | 3.18 | 3.80 | 3.27 |
| Wan 2.2 I2V | 4.18 | 3.33 | 3.23 | 3.79 | 3.26 |
| 「LongCat-Video」 | 4.04 | 3.49 | 3.27 | 3.59 | 3.17 |

Even with a smaller parameter count (13.6B vs 28B), LongCat-Video maintains an excellent balance between quality, coherence, and efficiency.


9. Open Source and Responsible Use

「Core question: What should users know before deploying?」

  • Fully open-sourced under the 「MIT License」
  • Trademarks and patents remain with Meituan
  • Users must comply with content safety and data regulations
  • The model is still under evaluation for downstream robustness — self-assessment is advised before deployment

10. Conclusion: From Video Generation to World Modeling

LongCat-Video is not just about rendering pixels — it’s about teaching AI to understand time, causality, and the continuous nature of the real world.
From its unified architecture to the multi-reward RLHF and refined data pipeline, it showcases what industrial-grade AIGC truly looks like.

「Author’s reflection:」
The future of world models won’t just store knowledge — they’ll encode how the world evolves.
Every frame LongCat-Video generates is a glimpse into how machines begin to understand time itself.


🧭 Practical Summary / Quick Checklist

  • 「Download」: Hugging Face – meituan-longcat/LongCat-Video
  • 「Core Tasks」: Text-to-Video, Image-to-Video, Video Continuation
  • 「Run Command」: torchrun run_demo_text_to_video.py
  • 「Performance」: 720p 30fps in minutes (single H800 GPU)
  • 「Key Innovations」:
    • Unified multi-task architecture
    • Coarse-to-Fine generation
    • Block Sparse Attention
    • Multi-Reward GRPO optimization
  • 「License」: MIT (commercial use allowed)


📄 One-Page Summary

| Dimension | Key Facts |
| --- | --- |
| Parameters | 13.6B |
| Framework | Diffusion Transformer (DiT) |
| Tasks | T2V / I2V / VC unified |
| Strength | Long-form temporal stability |
| Acceleration | C2F + Sparse Attention |
| Reward | Multi-Reward GRPO |
| Data | Multi-stage curation + semantic labeling |
| License | MIT License |

❓FAQ

「Q1: What GPU is required?」
A single H800 (80GB) GPU can handle 720p generation efficiently.

「Q2: Does it support Chinese prompts?」
Yes. The umT5 encoder natively supports bilingual text.

「Q3: Can it be used commercially?」
Yes, under the MIT license, as long as legal compliance is maintained.

「Q4: How long can generated videos be?」
The model natively supports minute-long continuation via the VC task.

「Q5: Can I run it on consumer GPUs?」
480p generation runs on an A100 or RTX 4090; for long-form 720p output, data-center GPUs such as the H800 are recommended.

「Q6: How do I customize prompts?」
Use the --prompt argument in demo scripts, e.g. “a dog running in the snow”.

「Q7: Will training code be released?」
Core components and inference scripts are already public; training modules are expected to be released progressively.


「Project Page:」 https://meituan-longcat.github.io/longcatvideo/
「Model Repository:」 Hugging Face – LongCat-Video