LongCat-Video: Building the Foundation Model for Long-Form Video Generation
「Core question: Why did Meituan build a new video generation model?」
Video generation is not just about creating moving images — it’s about building world models that can simulate dynamic reality.
LongCat-Video is Meituan’s first large-scale foundation model designed to understand and generate temporally coherent, realistic, and long-duration videos.
1. The New Era of Long-Form Video Generation
「Core question: What problem does LongCat-Video solve?」
Most text-to-video models today can only produce a few seconds of coherent footage.
As time extends, problems appear:
- 「Color drift」 between frames
- 「Inconsistent motion」 or abrupt scene cuts
- 「Disappearing characters or background reconstruction errors」
「LongCat-Video」 directly tackles these limitations.
With 「13.6 billion parameters」, it unifies three major tasks within a single framework:
- 「Text-to-Video (T2V)」
- 「Image-to-Video (I2V)」
- 「Video Continuation (VC)」
It’s not just a generator — it’s a unified video understanding and synthesis model that scales up temporal reasoning.
2. Unified Model for Three Video Tasks
「Core question: What can LongCat-Video actually do?」
| Task Type | Input | Output | Typical Use Case |
|---|---|---|---|
| Text-to-Video | Text prompt | Video | Storytelling, marketing, creative production |
| Image-to-Video | Single image | Video | Product animation, cinematic rendering |
| Video Continuation | Partial video | Extended video | Long-form content, digital humans, scene extension |
LongCat-Video automatically adapts to each task by detecting how many frames are provided as “conditions”:
- 「No condition frame」 → text-to-video
- 「Single frame」 → image-to-video
- 「Multiple frames」 → video continuation
This unified input format allows shared representations and efficient fine-tuning across tasks — a major advantage for industrial-scale deployment.
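A minimal sketch of this routing rule is shown below; the function name and tensor layout are illustrative assumptions rather than the repository's actual API.

```python
import torch

def route_task(condition_frames: torch.Tensor | None) -> str:
    """Illustrative routing: choose the task from the number of condition
    frames supplied alongside the text prompt.

    condition_frames: tensor of shape (num_frames, C, H, W), or None.
    """
    if condition_frames is None or condition_frames.shape[0] == 0:
        return "text_to_video"        # no visual condition at all
    if condition_frames.shape[0] == 1:
        return "image_to_video"       # a single reference frame
    return "video_continuation"       # a prefix clip to extend

# Example: a 16-frame prefix clip selects the continuation task
prefix = torch.zeros(16, 3, 480, 832)
print(route_task(None), route_task(prefix[:1]), route_task(prefix))
# -> text_to_video image_to_video video_continuation
```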
3. Inside the Architecture: The Diffusion Transformer Core
「Core question: How does the model achieve unified video generation?」
LongCat-Video is built on a 「single-stream Diffusion Transformer (DiT)」, optimized for spatiotemporal data.
Each transformer block integrates:
- 「3D Self-Attention」 – models temporal dynamics across frames
- 「Cross-Attention」 – aligns visual tokens with textual context
- 「SwiGLU Activation + AdaLN-Zero Modulation」 – ensures stable training and expressive scaling
Additional architectural components:
- 「3D RoPE positional encoding」 for precise temporal representation
- 「RMSNorm + QKNorm」 to stabilize long-sequence gradients
- 「WAN 2.1 VAE」 for latent compression at 4×16×16
- 「umT5 text encoder」 for bilingual (Chinese/English) text conditioning
When given a prompt like “a cat stretching under sunlight”, the model translates it into high-dimensional latent motion, producing temporally continuous, visually coherent video.
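The released implementation is far more elaborate, but a compact PyTorch sketch of such a block, with full attention over flattened spatiotemporal tokens standing in for 3D self-attention, cross-attention to text tokens, a SwiGLU feed-forward, and AdaLN-Zero gating driven by the timestep embedding, could look like this (all dimensions and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: (SiLU(x W1) * x W2) W3."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1, self.w2 = nn.Linear(dim, hidden), nn.Linear(dim, hidden)
        self.w3 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w3(nn.functional.silu(self.w1(x)) * self.w2(x))

class DiTBlock(nn.Module):
    """Illustrative single-stream DiT block: self-attention over flattened
    video tokens, cross-attention to text tokens, SwiGLU MLP, with
    AdaLN-Zero scale/shift/gate predicted from the timestep embedding."""
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = SwiGLU(dim, 4 * dim)
        # AdaLN-Zero: 3 x (scale, shift, gate), gates zero-initialized
        self.ada = nn.Linear(dim, 9 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, text, t_emb):
        # x: (B, N_video_tokens, dim), text: (B, N_text_tokens, dim), t_emb: (B, dim)
        s1, b1, g1, s2, b2, g2, s3, b3, g3 = self.ada(t_emb).chunk(9, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.cross_attn(h, text, text, need_weights=False)[0]
        h = self.norm3(x) * (1 + s3.unsqueeze(1)) + b3.unsqueeze(1)
        return x + g3.unsqueeze(1) * self.mlp(h)
```

Because the AdaLN-Zero gates are zero-initialized, each block starts out as an identity mapping, which is the usual motivation for this modulation scheme at scale.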
4. Data Pipeline: Millions of Videos, Curated for Learning
「Core question: Where does LongCat-Video’s capability come from?」
The foundation of any generative model is data quality, not parameter count.
The LongCat team designed a rigorous two-stage 「data curation pipeline」.
Stage 1: Preprocessing
- 「Deduplication」 – by video ID and MD5 hash
- 「Scene segmentation」 – using PySceneDetect + TransNetV2 (a sketch of these first two steps follows below)
- 「Black-border cropping」 – via FFmpeg
- 「Efficient packaging」 – for streaming data loading
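As a rough illustration of the first two steps, the sketch below hashes files to drop exact duplicates and runs PySceneDetect's high-level API for shot boundaries; the directory layout and threshold are assumptions, and TransNetV2 would typically be applied as a second pass.

```python
import hashlib
from pathlib import Path

from scenedetect import ContentDetector, detect  # pip install scenedetect[opencv]

def md5_of(path: Path, chunk: int = 1 << 20) -> str:
    """Hash the raw bytes so exact duplicate files can be dropped."""
    h = hashlib.md5()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def dedupe_and_segment(video_dir: str):
    seen: set[str] = set()
    for path in sorted(Path(video_dir).glob("*.mp4")):
        digest = md5_of(path)
        if digest in seen:                      # deduplication by content hash
            continue
        seen.add(digest)
        # Content-based shot boundary detection (PySceneDetect)
        scenes = detect(str(path), ContentDetector(threshold=27.0))
        for start, end in scenes:
            print(path.name, start.get_timecode(), end.get_timecode())

# dedupe_and_segment("./raw_videos")   # hypothetical directory
```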
Stage 2: Annotation
Each video clip receives multi-dimensional annotations:
| Label Type | Key Attributes |
|---|---|
| Basic metadata | duration, resolution, fps, bitrate |
| Visual quality | blur level, aesthetic score, watermark |
| Dynamic metrics | optical flow stability |
| Semantic consistency | text–video alignment score |
Beyond basic labels, multi-modal LLMs like 「LLaVA-Video」 and 「Qwen2.5VL」 enrich the data with semantic layers:
- Camera motion (pan, tilt, zoom, dolly)
- Shot types (close-up, medium, wide)
- Style classification (realistic, 2D, 3D render)
- Bilingual captioning and summarization
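Put together, each clip can be represented by a structured record along these lines; the field names and types are illustrative, not the team's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ClipAnnotation:
    """Illustrative per-clip record combining basic metadata,
    quality/dynamics scores, and MLLM-generated semantic labels."""
    clip_id: str
    duration_s: float
    resolution: tuple[int, int]
    fps: float
    blur_score: float
    aesthetic_score: float
    has_watermark: bool
    optical_flow_stability: float
    text_video_alignment: float
    camera_motion: list[str] = field(default_factory=list)   # e.g. ["pan", "zoom"]
    shot_type: str = "medium"                                 # close-up / medium / wide
    style: str = "realistic"                                  # realistic / 2D / 3D render
    caption_en: str = ""
    caption_zh: str = ""
```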
❝
「Author’s reflection:」
Labeling isn’t just data cleaning — it’s world reconstruction.
Every tag helps the model learn how humans perceive and describe motion, a cornerstone for world-model intelligence.❞
5. Training Strategy: From Base Model to Human Preference
「Core question: How does the model learn visual and temporal alignment?」
Training proceeds in three main stages.
Stage 1: Base Model with Flow Matching
Replacing traditional diffusion sampling, LongCat-Video adopts 「Flow Matching」, where the model learns to predict the velocity field that transforms noise into clear video frames.
Loss function, where x_t = t·x_0 + (1 − t)·ε interpolates between Gaussian noise ε and the clean latent x_0:
L = E_{x_0, ε, t} [ ‖ v_θ(x_t, c, t) − (x_0 − ε) ‖² ]
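A minimal PyTorch sketch of one flow-matching training step under this convention (the model signature and latent shapes are assumptions):

```python
import torch

def flow_matching_step(model, x0, cond):
    """x0: clean video latents (B, C, T, H, W); cond: text/frame conditioning.
    The model is trained to predict the constant velocity x0 - eps along
    the straight path x_t = t * x0 + (1 - t) * eps."""
    b = x0.shape[0]
    eps = torch.randn_like(x0)                  # Gaussian noise sample
    t = torch.rand(b, device=x0.device)         # uniform timestep in (0, 1)
    t_ = t.view(b, 1, 1, 1, 1)
    x_t = t_ * x0 + (1.0 - t_) * eps            # point on the noise-to-data path
    v_target = x0 - eps                         # ground-truth velocity
    v_pred = model(x_t, cond, t)                # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)
```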
Progressive training is applied:
「image → low-res video → multi-task joint learning」, improving semantic, motion, and temporal understanding step by step.
Stage 2: Human Feedback Optimization (RLHF)
The model applies 「Group Relative Policy Optimization (GRPO)」 with several innovations:
- 「Fixed-time sampling」 to stabilize temporal credit assignment
- 「Gradient reweighting」 for noisy stages
- 「Max-group standard deviation」 for reward balance
A 「multi-reward system」 ensures balanced learning:
| Reward Type | Objective | Evaluation Model |
|---|---|---|
| Visual Quality (VQ) | Clarity, aesthetic, fidelity | HPSv3 |
| Motion Quality (MQ) | Continuity, realism | VideoAlign |
| Text Alignment (TA) | Semantic match | VideoAlign |
This prevents reward hacking and encourages stable convergence between clarity, motion, and textual alignment.
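A condensed sketch of how group-relative advantages could be formed from the three rewards, including a max-group-standard-deviation style normalization; the equal weighting and simple summation are assumptions, not the paper's exact formulation.

```python
import torch

def grouped_advantages(vq, mq, ta, weights=(1.0, 1.0, 1.0)):
    """vq, mq, ta: rewards for a group of rollouts of one prompt, each shape (G,).
    Centers each reward within the group and scales by the largest
    per-reward standard deviation so no single reward dominates."""
    rewards = torch.stack([w * r for w, r in zip(weights, (vq, mq, ta))])  # (3, G)
    centered = rewards - rewards.mean(dim=1, keepdim=True)                 # group-relative baseline
    scale = rewards.std(dim=1).max().clamp_min(1e-6)                       # max-group standard deviation
    return (centered / scale).sum(dim=0)                                   # combined advantage per rollout

# Example: 4 rollouts sampled for the same prompt
adv = grouped_advantages(
    vq=torch.tensor([0.7, 0.6, 0.8, 0.5]),
    mq=torch.tensor([0.5, 0.7, 0.6, 0.6]),
    ta=torch.tensor([0.9, 0.8, 0.7, 0.8]),
)
print(adv)
```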
Stage 3: Accelerated Refinement
Two optimization techniques boost performance dramatically:
- 「Coarse-to-Fine (C2F) Generation」: generate a rough 480p/15fps video first, then refine it to 720p/30fps with expert LoRA layers, achieving roughly 「10× faster inference」.
- 「Block Sparse Attention (BSA)」: compute only the most relevant ~10% of attention weights, keeping quality almost intact (sketched below).
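The production BSA kernel is fused and GPU-optimized; the reference loop below only illustrates the selection idea, with the block size and top-k fraction as assumptions.

```python
import torch

def block_sparse_attention(q, k, v, block: int = 64, keep_frac: float = 0.1):
    """q, k, v: (B, H, N, D) with N divisible by `block`.
    For each query block, estimate relevance from mean-pooled blocks,
    keep only the top `keep_frac` key blocks, and attend within them."""
    B, H, N, D = q.shape
    nb = N // block
    qb = q.view(B, H, nb, block, D).mean(dim=3)          # pooled query blocks
    kb = k.view(B, H, nb, block, D).mean(dim=3)          # pooled key blocks
    scores = qb @ kb.transpose(-1, -2) / D ** 0.5        # (B, H, nb, nb) block relevance
    top = max(1, int(keep_frac * nb))
    idx = scores.topk(top, dim=-1).indices               # kept key blocks per query block
    out = torch.zeros_like(q)
    for qi in range(nb):                                  # simple reference loop, not a kernel
        sel = idx[..., qi, :]                             # (B, H, top) selected block ids
        gather = sel.unsqueeze(-1) * block + torch.arange(block, device=q.device)
        gather = gather.reshape(B, H, top * block, 1).expand(-1, -1, -1, D)
        k_sel = torch.gather(k, 2, gather)                # gathered keys for this query block
        v_sel = torch.gather(v, 2, gather)                # gathered values
        q_blk = q[:, :, qi * block:(qi + 1) * block]
        attn = torch.softmax(q_blk @ k_sel.transpose(-1, -2) / D ** 0.5, dim=-1)
        out[:, :, qi * block:(qi + 1) * block] = attn @ v_sel
    return out
```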
❝
「Author’s insight:」
This is the true mark of engineering excellence: turning a research-level model into a deployable production engine.❞
6. Inference Efficiency: Minutes to Generate 720p Video
「Core question: How fast can it run in practice?」
| Mode | Steps | Resolution | Time | Speedup |
|---|---|---|---|---|
| Baseline | 50 | 720p | 1429s | 1× |
| LoRA Distilled | 16 | 720p | 244s | 5.8× |
| Coarse-to-Fine | 16/5 | 720p | 135s | 10.6× |
| C2F + Sparse Attention | 16/5 | 720p | 116s | 12.3× |
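The speedup column follows directly from the wall-clock times, for example:

```python
baseline_s, c2f_bsa_s = 1429, 116               # seconds, from the table above
print(f"{baseline_s / c2f_bsa_s:.1f}x faster")  # -> 12.3x faster
```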
With these optimizations, a 「single H800 GPU」 can generate a full 720p 30fps video in just a few minutes.
7. Quick Start Guide
「Core question: How can I try LongCat-Video locally?」
```bash
# Clone repository
git clone https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video

# Create environment
conda create -n longcat-video python=3.10
conda activate longcat-video

# Install dependencies
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.7.4.post1
pip install -r requirements.txt
```
Download weights:
pip install "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
Run demo scripts:
```bash
# Text-to-Video
torchrun run_demo_text_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

# Image-to-Video
torchrun run_demo_image_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

# Video Continuation
torchrun run_demo_video_continuation.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
```
8. Benchmark Results: Competitive and Efficient
「Core question: How does it perform compared to other open models?」
Text-to-Video Evaluation
| Model | TextAlign | Visual | Motion | Overall |
|---|---|---|---|---|
| Veo 3 | 3.99 | 3.23 | 3.86 | 3.48 |
| PixVerse V5 | 3.81 | 3.13 | 3.81 | 3.36 |
| Wan 2.2 T2V | 3.70 | 3.26 | 3.78 | 3.35 |
| 「LongCat-Video」 | 3.76 | 3.25 | 3.74 | 3.38 |
Image-to-Video Evaluation
| Model | ImgAlign | TextAlign | Visual | Motion | Overall |
|---|---|---|---|---|---|
| Seedance 1.0 | 4.12 | 3.70 | 3.22 | 3.77 | 3.35 |
| Hailuo 02 | 4.18 | 3.85 | 3.18 | 3.80 | 3.27 |
| Wan 2.2 I2V | 4.18 | 3.33 | 3.23 | 3.79 | 3.26 |
| 「LongCat-Video」 | 4.04 | 3.49 | 3.27 | 3.59 | 3.17 |
Even with a smaller parameter count than some competing open models (13.6B vs. roughly 28B), LongCat-Video maintains an excellent balance between quality, coherence, and efficiency.
9. Open Source and Responsible Use
「Core question: What should users know before deploying?」
- Fully open-sourced under the 「MIT License」
- Trademarks and patents remain with Meituan
- Users must comply with content safety and data regulations
- The model is still under evaluation for downstream robustness; self-assessment is advised before deployment
10. Conclusion: From Video Generation to World Modeling
LongCat-Video is not just about rendering pixels — it’s about teaching AI to understand time, causality, and the continuous nature of the real world.
From its unified architecture to the multi-reward RLHF and refined data pipeline, it showcases what industrial-grade AIGC truly looks like.
❝
「Author’s reflection:」
The future of world models won’t just store knowledge — they’ll encode how the world evolves.
Every frame LongCat-Video generates is a glimpse into how machines begin to understand time itself.❞
🧭 Practical Summary / Quick Checklist
- 「Download」: Hugging Face – meituan-longcat/LongCat-Video
- 「Core Tasks」: Text-to-Video, Image-to-Video, Video Continuation
- 「Run Command」: torchrun run_demo_text_to_video.py
- 「Performance」: 720p 30fps in minutes (single H800 GPU)
- 「Key Innovations」:
  - Unified multi-task architecture
  - Coarse-to-Fine generation
  - Block Sparse Attention
  - Multi-Reward GRPO optimization
- 「License」: MIT (commercial use allowed)
📄 One-Page Summary
| Dimension | Key Facts |
|---|---|
| Parameters | 13.6B |
| Framework | Diffusion Transformer |
| Tasks | T2V / I2V / VC unified |
| Strength | Long-form temporal stability |
| Acceleration | C2F + Sparse Attention |
| Reward | Multi-Reward GRPO |
| Data | Multi-stage curation + semantic labeling |
| License | MIT License |
❓FAQ
「Q1: What GPU is required?」
A single H800 (80GB) GPU can handle 720p generation efficiently.
「Q2: Does it support Chinese prompts?」
Yes. The umT5 encoder natively supports bilingual text.
「Q3: Can it be used commercially?」
Yes, under the MIT license, as long as legal compliance is maintained.
「Q4: How long can generated videos be?」
The model natively supports minute-long continuation via the VC task.
「Q5: Can I run it on consumer GPUs?」
480p generation can run on an A100 or RTX 4090; for long-form 720p generation, data-center GPUs such as the H800 are recommended.
「Q6: How do I customize prompts?」
Use the --prompt argument in demo scripts, e.g. “a dog running in the snow”.
「Q7: Will training code be released?」
Core components and inference scripts are already public; training modules will open progressively.
「Project Page:」 https://meituan-longcat.github.io/longcatvideo/
「Model Repository:」 Hugging Face – LongCat-Video

