
VideoX-Fun: A Comprehensive Guide to AI Video Generation

😊 Welcome!

Introduction

VideoX-Fun is a video generation pipeline that can be used to generate AI images and videos and to train baseline and LoRA models for Diffusion Transformers. It supports direct prediction from pre-trained baseline models to generate videos at different resolutions, durations, and frame rates (FPS), and it also allows users to train their own baseline and LoRA models for style customization.

We will gradually support quick launches from different platforms. Please refer to Quick Start for more information.

New Features:

  • Updated Wan2.1-Fun-V1.1 version: adds Control + reference-image models and camera control for both the 14B and 1.3B models. In addition, the Inpaint model has been retrained with improved performance. [2025.04.25]
  • Updated Wan2.1-Fun-V1.0 version: supports I2V and Control models for the 14B and 1.3B models, and supports start-end image prediction. [2025.03.26]
  • Updated CogVideoX-Fun-V1.5 version: uploaded the I2V model and the related training and prediction code. [2024.12.16]
  • Reward LoRA support: train LoRAs through reward backpropagation to optimize generated videos so that they better align with human preferences (see more details). The new version of the control model supports different control conditions such as Canny, Depth, Pose, MLSD, etc. [2024.11.21]
  • Diffusers support: CogVideoX-Fun Control is now supported in diffusers. Thanks to a-r-r-o-w for contributing support in this PR; check the documentation for more information. [2024.10.16]
  • Updated CogVideoX-Fun-V1.1 version: retrained the I2V model and added noise to increase the motion amplitude of generated videos; uploaded the Control models and their training code. [2024.09.29]
  • Updated CogVideoX-Fun-V1.0 version: initial code release, now supporting Windows and Linux. Supports video generation at arbitrary resolutions from 256x256x49 to 1024x1024x49 for the 2b and 5b models. [2024.09.18]

Feature Overview:


Quick Start

Coming soon! We are working on providing quick start guides for different platforms to help you get started with VideoX-Fun easily. Stay tuned for updates on how to quickly launch and use the pipeline on various operating systems and environments.

Video Examples

Wan2.1-Fun-V1.1-14B-InP & Wan2.1-Fun-V1.1-1.3B-InP

CogVideoX-Fun-V1.1-5B

Resolution-1024

Resolution-768

Resolution-512

CogVideoX-Fun-V1.1-5B-Control

Example prompts:

  • A young woman with beautiful clear eyes and blonde hair, wearing white clothes and twisting her body, with the camera focused on her face. High quality, masterpiece, best quality, high resolution, ultra-fine, dreamlike.
  • A young bear.

Wan2.2-VACE-Fun-A14B

Generic Control Video + Reference Image:

Columns: reference image, control video, Wan2.2-VACE-Fun-A14B result.

How to Use

1. Video Generation

c. Running Python Files

i. Single-card Operation:
  • Video-to-video generation:
    • Modify validation_video, prompt, neg_prompt, guidance_scale, and seed in the examples/cogvideox_fun/predict_v2v.py file.
    • validation_video is the reference video for video-to-video generation. You can use the following video for demonstration: [Demo Video](https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/cogvideox_fun/asset/v1/play_guitar.mp4)
    • Then run the examples/cogvideox_fun/predict_v2v.py file and wait for the generation result, which will be saved in the samples/cogvideox-fun-videos_v2v folder.
  • Normal controlled video generation (Canny, Pose, Depth, etc.):
    • Modify control_video, validation_image_end, prompt, neg_prompt, guidance_scale, and seed in the examples/cogvideox_fun/predict_v2v_control.py file.
    • control_video is the control video for controlled video generation, which is a video extracted using operators such as Canny, Pose, Depth, etc. You can use the following video for demonstration: Demo Video
    • Then run the examples/cogvideox_fun/predict_v2v_control.py file and wait for the generation result, which will be saved in the samples/cogvideox-fun-videos_v2v_control folder. A sketch of these parameter edits follows this list.
  • Step 3: If you want to combine other backbones you have trained yourself with a LoRA, modify lora_path in examples/{model_name}/predict_t2v.py and examples/{model_name}/predict_i2v.py as appropriate.
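
The exact layout of examples/cogvideox_fun/predict_v2v_control.py can vary between versions, but the edits described above amount to setting a handful of variables near the top of the script. The following is an illustrative sketch only: the parameter names come from the instructions above, while the file paths and values are hypothetical examples.

```python
# Illustrative sketch of the parameter edits for examples/cogvideox_fun/predict_v2v_control.py.
# Parameter names follow the instructions above; paths and values are hypothetical examples.

# Control video extracted with an operator such as Canny, Pose, or Depth.
control_video = "asset/pose_control_demo.mp4"   # hypothetical local path
validation_image_end = None                     # optional end image for start-end prediction

prompt = (
    "A young woman with beautiful clear eyes and blonde hair, wearing white clothes "
    "and twisting her body, with the camera focused on her face. High quality, masterpiece."
)
neg_prompt = "blurry, low quality, deformed hands, watermark"

guidance_scale = 6.0   # how strongly the sampler follows the prompt
seed = 43              # fix the seed to make results reproducible
```

After saving these edits, running the script writes the generated video to the samples/cogvideox-fun-videos_v2v_control folder as described above.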

2. Model Training

A complete model training run consists of data preprocessing and Video DiT training. The training process is similar across the different models, and so are the data formats:

a. Data Preprocessing

We provide a simple demo of training a LoRA model with image data; for details, see the wiki.

A complete data preprocessing pipeline, including long-video segmentation, cleaning, and captioning, can be carried out by following the README in the video caption section.

If you want to train a text-to-image-and-video generation model, you need to arrange the dataset in the following format:

📦 project/
├── 📂 datasets/
│   ├── 📂 internal_datasets/
│       ├── 📂 train/
│       │   ├── 📄 00000001.mp4
│       │   ├── 📄 00000002.jpg
│       │   └── 📄 .....
│       └── 📄 json_of_internal_datasets.json

json_of_internal_datasets.json is a standard JSON file. The file_path in the JSON can be set as a relative path, as shown below:

[
    {
      "file_path": "train/00000001.mp4",
      "text": "A group of young men in suits and sunglasses are walking down a city street.",
      "type": "video"
    },
    {
      "file_path": "train/00000002.jpg",
      "text": "A group of young men in suits and sunglasses are walking down a city street.",
      "type": "image"
    },
    .....
]

You can also set the path as an absolute path:

[
    {
      "file_path": "/mnt/data/videos/00000001.mp4",
      "text": "A group of young men in suits and sunglasses are walking down a city street.",
      "type": "video"
    },
    {
      "file_path": "/mnt/data/train/00000001.jpg",
      "text": "A group of young men in suits and sunglasses are walking down a city street.",
      "type": "image"
    },
    .....
]
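
If your clips and images already sit under datasets/internal_datasets/train/, a short script along the following lines can assemble the metadata file. This is a minimal sketch, not part of the repository; the placeholder caption should be replaced by the output of your captioning step, and only the file_path, text, and type fields are required by the format above.

```python
import json
from pathlib import Path

# Build the metadata JSON for datasets/internal_datasets/ (file_path relative to the dataset root).
# Captions are placeholders here; in practice they come from your captioning pipeline.
dataset_root = Path("datasets/internal_datasets")
entries = []
for path in sorted((dataset_root / "train").iterdir()):
    suffix = path.suffix.lower()
    if suffix in {".mp4", ".avi", ".mov"}:
        media_type = "video"
    elif suffix in {".jpg", ".jpeg", ".png"}:
        media_type = "image"
    else:
        continue
    entries.append({
        "file_path": str(path.relative_to(dataset_root)),  # e.g. "train/00000001.mp4"
        "text": "A group of young men in suits and sunglasses are walking down a city street.",
        "type": media_type,
    })

with open(dataset_root / "json_of_internal_datasets.json", "w") as f:
    json.dump(entries, f, indent=2, ensure_ascii=False)
```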

b. Video DiT Training

If the file_path entries produced during data preprocessing are relative paths, edit scripts/{model_name}/train.sh and make the following settings:

export DATASET_NAME="datasets/internal_datasets/"
export DATASET_META_NAME="datasets/internal_datasets/json_of_internal_datasets.json"

If the file_path entries are absolute paths, edit scripts/{model_name}/train.sh and make the following settings:

export DATASET_NAME=""
export DATASET_META_NAME="/mnt/data/json_of_internal_datasets.json"

Finally, run scripts/{model_name}/train.sh:

sh scripts/{model_name}/train.sh
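
Before launching training, it can be worth checking that every file_path in the metadata JSON actually resolves, either relative to DATASET_NAME (relative layout) or on its own (absolute layout). The helper below is a hypothetical pre-flight check, not part of the repository:

```python
import json
import os
from pathlib import Path

# Hypothetical pre-flight check (not part of the repository): verify that every file_path
# in DATASET_META_NAME resolves, either relative to DATASET_NAME or as an absolute path.
dataset_name = os.environ.get("DATASET_NAME", "")   # empty string when absolute paths are used
meta_name = os.environ["DATASET_META_NAME"]

with open(meta_name) as f:
    entries = json.load(f)

# Joining an absolute file_path onto dataset_name leaves the absolute path unchanged,
# so the same check covers both layouts.
missing = [e["file_path"] for e in entries
           if not (Path(dataset_name) / e["file_path"]).exists()]

print(f"{len(entries) - len(missing)} of {len(entries)} files found")
for file_path in missing[:10]:
    print("missing:", file_path)
```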

For details on some parameter settings:

Model Addresses

1. Wan2.2-Fun

| Name | Storage Space | Hugging Face | Model Scope | Description |
|------|---------------|--------------|-------------|-------------|
| Wan2.2-Fun-5B-InP | 23.0 GB | 🤗Link | 😄Link | Wan2.2-Fun-5B text-to-video and image-to-video weights, trained with 121 frames at 24 frames per second, supporting start-end image prediction. |
| Wan2.2-Fun-5B-Control | 23.0 GB | 🤗Link | 😄Link | Wan2.2-Fun-5B video control weights, supporting different control conditions such as Canny, Depth, Pose, MLSD, etc., as well as trajectory control. Trained with 121 frames at 24 frames per second, supporting multilingual prediction. |
| Wan2.2-Fun-5B-Control-Camera | 23.0 GB | 🤗Link | 😄Link | Wan2.2-Fun-5B camera lens control weights. Trained with 121 frames at 24 frames per second, supporting multilingual prediction. |

2. Wan2.2

| Name | Hugging Face | Model Scope | Description |
|------|--------------|-------------|-------------|
| Wan2.2-TI2V-5B | 🤗Link | 😄Link | Wan2.2-5B text-and-image-to-video weights |
| Wan2.2-T2V-A14B | 🤗Link | 😄Link | Wan2.2-14B text-to-video weights |
| Wan2.2-I2V-A14B | 🤗Link | 😄Link | Wan2.2-14B image-to-video weights |

3. Wan2.1-Fun

V1.1:

| Name | Storage Space | Hugging Face | Model Scope | Description |
|------|---------------|--------------|-------------|-------------|
| Wan2.1-Fun-V1.1-1.3B-InP | 19.0 GB | 🤗Link | 😄Link | Wan2.1-Fun-V1.1-1.3B text-to-video and image-to-video weights, trained at multiple resolutions, supporting start-end image prediction. |
| Wan2.1-Fun-V1.1-14B-InP | 47.0 GB | 🤗Link | 😄Link | Wan2.1-Fun-V1.1-14B text-to-video and image-to-video weights, trained at multiple resolutions, supporting start-end image prediction. |
| Wan2.1-Fun-V1.1-1.3B-Control | 19.0 GB | 🤗Link | 😄Link | Wan2.1-Fun-V1.1-1.3B video control weights, supporting different control conditions such as Canny, Depth, Pose, MLSD, etc., control via a reference image plus a control condition, and trajectory control. Supports multi-resolution (512, 768, 1024) video prediction, trained with 81 frames at 16 frames per second, with multilingual prediction support. |
| Wan2.1-Fun-V1.1-14B-Control | 47.0 GB | 🤗Link | 😄Link | Wan2.1-Fun-V1.1-14B video control weights, supporting different control conditions such as Canny, Depth, Pose, MLSD, etc., control via a reference image plus a control condition, and trajectory control. Supports multi-resolution (512, 768, 1024) video prediction, trained with 81 frames at 16 frames per second, with multilingual prediction support. |
| Wan2.1-Fun-V1.1-1.3B-Control-Camera | 19.0 GB | 🤗Link | 😄Link | Wan2.1-Fun-V1.1-1.3B camera lens control weights. Supports multi-resolution (512, 768, 1024) video prediction, trained with 81 frames at 16 frames per second, with multilingual prediction support. |

4. Wan2.1

| Name | Hugging Face | Model Scope | Description |
|------|--------------|-------------|-------------|
| Wan2.1-T2V-1.3B | 🤗Link | 😄Link | Wan2.1-1.3B text-to-video weights |
| Wan2.1-T2V-14B | 🤗Link | 😄Link | Wan2.1-14B text-to-video weights |
| Wan2.1-I2V-14B-480P | 🤗Link | 😄Link | Wan2.1-14B-480P image-to-video weights |
| Wan2.1-I2V-14B-720P | 🤗Link | 😄Link | Wan2.1-14B-720P image-to-video weights |

5. CogVideoX-Fun

V1.5:

| Name | Storage Space | Hugging Face | Model Scope | Description |
|------|---------------|--------------|-------------|-------------|
| CogVideoX-Fun-V1.5-5b-InP | 20.0 GB | 🤗Link | 😄Link | Official image-to-video weights. Supports multi-resolution (512, 768, 1024) video prediction, trained with 85 frames at 8 frames per second. |
| CogVideoX-Fun-V1.5-Reward-LoRAs | - | 🤗Link | 😄Link | Official reward backpropagation model, optimizing videos generated by CogVideoX-Fun-V1.5 to better match human preferences. |

V1.1:

| Name | Storage Space | Hugging Face | Model Scope | Description |
|------|---------------|--------------|-------------|-------------|
| CogVideoX-Fun-V1.1-2b-InP | 13.0 GB | 🤗Link | 😄Link | Official image-to-video weights. Supports multi-resolution (512, 768, 1024, 1280) video prediction, trained with 49 frames at 8 frames per second. |
| CogVideoX-Fun-V1.1-5b-InP | 20.0 GB | 🤗Link | 😄Link | Official image-to-video weights. Noise has been added, giving larger motion amplitude than V1.0. Supports multi-resolution (512, 768, 1024, 1280) video prediction, trained with 49 frames at 8 frames per second. |
| CogVideoX-Fun-V1.1-2b-Pose | 13.0 GB | 🤗Link | 😄Link | Official pose-control video generation weights. Supports multi-resolution (512, 768, 1024, 1280) video prediction, trained with 49 frames at 8 frames per second. |
| CogVideoX-Fun-V1.1-2b-Control | 13.0 GB | 🤗Link | 😄Link | Official control video generation weights. Supports multi-resolution (512, 768, 1024, 1280) video prediction, trained with 49 frames at 8 frames per second. Supports different control conditions such as Canny, Depth, Pose, MLSD, etc. |
| CogVideoX-Fun-V1.1-5b-Pose | 20.0 GB | 🤗Link | 😄Link | Official pose-control video generation weights. Supports multi-resolution (512, 768, 1024, 1280) video prediction, trained with 49 frames at 8 frames per second. |
| CogVideoX-Fun-V1.1-5b-Control | 20.0 GB | 🤗Link | 😄Link | Official control video generation weights. Supports multi-resolution (512, 768, 1024, 1280) video prediction, trained with 49 frames at 8 frames per second. Supports different control conditions such as Canny, Depth, Pose, MLSD, etc. |
| CogVideoX-Fun-V1.1-Reward-LoRAs | - | 🤗Link | 😄Link | Official reward backpropagation model, optimizing videos generated by CogVideoX-Fun-V1.1 to better match human preferences. |

(Obsolete) V1.0:

| Name | Storage Space | Hugging Face | Model Scope | Description |
|------|---------------|--------------|-------------|-------------|
| CogVideoX-Fun-2b-InP | 13.0 GB | 🤗Link | 😄Link | Official image-to-video weights. Supports multi-resolution (512, 768, 1024, 1280) video prediction, trained with 49 frames at 8 frames per second. |
| CogVideoX-Fun-5b-InP | 20.0 GB | 🤗Link | 😄Link | Official image-to-video weights. Supports multi-resolution (512, 768, 1024, 1280) video prediction, trained with 49 frames at 8 frames per second. |

References

  • CogVideo: https://github.com/THUDM/CogVideo/
  • EasyAnimate: https://github.com/aigc-apps/EasyAnimate
  • Wan2.1: https://github.com/Wan-Video/Wan2.1/
  • Wan2.2: https://github.com/Wan-Video/Wan2.2/
  • ComfyUI-KJNodes: https://github.com/kijai/ComfyUI-KJNodes
  • ComfyUI-EasyAnimateWrapper: https://github.com/kijai/ComfyUI-EasyAnimateWrapper
  • ComfyUI-CameraCtrl-Wrapper: https://github.com/chaojie/ComfyUI-CameraCtrl-Wrapper
  • CameraCtrl: https://github.com/hehao13/CameraCtrl
  • VACE: https://github.com/ali-vilab/VACE

License

This project is licensed under the Apache License (Version 2.0).

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.

The CogVideoX-5B model (Transformer module) is released under the CogVideoX License.
