InfinityStar: Revolutionizing Video Generation with Unified Spacetime Autoregressive Modeling

InfinityStar: Unified Spacetime Autoregressive Modeling for Visual Generation

Introduction: What is InfinityStar and How Does It Address Challenges in Visual Generation?

This article aims to answer the core question: What is InfinityStar, how does it unify image and video generation tasks, and why does it improve efficiency and quality? InfinityStar is a unified spacetime autoregressive framework designed for high-resolution image and dynamic video synthesis. It leverages recent advances in autoregressive modeling from both vision and language domains, using a purely discrete approach to jointly capture spatial and temporal dependencies in a single architecture.

Visual synthesis has seen remarkable advancements in recent years, driven by the scaling of Transformer architectures. Video generation, in particular, has gained significant attention from academia and industry due to its applications in content creation and world simulation. Currently, diffusion models dominate the field by iteratively denoising latent representations to create high-fidelity clips. Meanwhile, autoregressive models have been explored for their ability to unify image and video generation while generalizing to longer time horizons.

However, both paradigms have notable limitations. Video diffusion models excel at fixed-length sequences through bidirectional attention but require substantial computational resources for dozens or hundreds of denoising steps, making seamless video extrapolation difficult. Autoregressive methods, based on next-token prediction, enable streaming generation but often lag in visual fidelity and suffer from high latency due to thousands of inference steps.

InfinityStar addresses these issues with its spacetime pyramid modeling framework. It represents a video as an image pyramid combined with multiple clip pyramids, inheriting text-to-image capabilities while decoupling static appearance from dynamic motion. Extensive experiments show that InfinityStar achieves a score of 83.74 on VBench, outperforming all autoregressive models and even surpassing diffusion competitors such as HunyuanVideo. Without additional optimizations, it generates a 5-second 720p video about 10 times faster than leading diffusion methods. As the first discrete autoregressive video generator capable of producing industrial-level 720p videos, InfinityStar opens the way for further research into efficient, high-quality video generation.


Image source: Project documentation

InfinityStar’s Core Architecture and Principles

Spacetime Pyramid Modeling: How Does It Unify Image and Video Generation?

This section addresses the core question: How does spacetime pyramid modeling enable unified generation of images and videos, and how is it applied in real-world scenarios? Spacetime pyramid modeling decomposes a video into sequential clip sequences, with each clip modeled as a 3D volume pyramid to handle both static and dynamic elements uniformly.

In InfinityStar, a video is broken down into clips {c1, c2, …, cN}, where the first clip c1 is a single frame (T = 1) that encodes static appearance, and each subsequent clip spans a duration T > 1. Each clip has K scales, represented as residual token blocks rk with dimensions (T, hk, wk); the pyramid grows only along the spatial dimensions. Tokens in the first clip are generated autoregressively across scales:

p(r1_1, …, r1_K) = ∏_{k=1}^{K} p(r1_k | r1_1, …, r1_{k-1}, ψ(t))

The autoregressive likelihood for the entire video is:

p(r1_1, …, rN_K) = ∏_{c=1}^{N} ∏_{k=1}^{K} p(rc_k | r1_1, …, rc_{k-1}, ψ(t))

This design supports theoretically infinite video lengths. In practical scenarios, such as animation production, users input a text prompt to generate an initial static frame pyramid, then extend it clip by clip to add dynamic motions, ensuring seamless transitions from static images to extended videos.
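
To make the factorization concrete, here is a minimal sketch of how the spacetime pyramid could be traversed at generation time. The model.predict_next_block helper and its arguments are hypothetical placeholders that only mirror the likelihood above; they are not the project's released API.

# A minimal sketch of the spacetime pyramid generation order (all names are
# illustrative placeholders, not the released API). A video is an image pyramid
# (first clip, T = 1) followed by clip pyramids (T > 1); each clip is generated
# scale by scale, conditioned on every previously generated block.
def generate_video(model, text_embedding, num_clips, num_scales):
    history = []                      # all residual blocks generated so far
    clips = []
    for c in range(num_clips):        # clip 0 encodes static appearance (T = 1)
        clip_blocks = []
        for k in range(num_scales):   # K scales per clip, coarse to fine
            # p(rc_k | r1_1, ..., rc_{k-1}, ψ(t)): predict the next residual block
            block = model.predict_next_block(history, text_embedding,
                                             clip_index=c, scale_index=k)
            clip_blocks.append(block)
            history.append(block)
        clips.append(clip_blocks)
    return clips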

Reflection: During development, I realized that decoupling appearance and motion was a pivotal insight, preventing fitting difficulties in coupled designs and enhancing the model’s flexibility across tasks.

Visual Tokenizer: How Does It Enhance Reconstruction Quality and Training Efficiency?

This section answers the core question: How does the visual tokenizer improve video reconstruction through knowledge inheritance and stochastic quantizer depth, and how is it implemented in training? The visual tokenizer compresses raw images or videos into discrete token sequences, with InfinityStar optimizing via inheritance from a continuous video tokenizer and stochastic quantizer depth.

Training video tokenizers is more challenging than training image tokenizers, due to the higher computational cost of processing multiple frames and the imbalanced distribution of information across scales. Most discrete video tokenizers are trained from scratch or initialized from pretrained image tokenizers, leading to inefficiency or suboptimal video reconstruction. InfinityStar instead inherits the architecture and weights of a trained continuous video tokenizer: the encoder compresses raw inputs into latents, which are transformed into K discrete residual token blocks using bitwise multi-scale residual quantization. Each block contains hk × wk d-dimensional tokens with a vocabulary size of 2^d.

Experiments show that initializing from pretrained video tokenizer weights yields a PSNR of 22.6, outperforming image-tokenizer initialization (16.4) and training from scratch (11.0), while also converging faster. Stochastic quantizer depth randomly drops the last N scales during training with probability p, rebalancing the information carried by each scale.
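
As a rough illustration of stochastic quantizer depth, the sketch below drops the last few residual scales with some probability during training. The function name and the default values of p and n_drop are assumptions for illustration; the exact settings are not specified here.

import random

# Illustrative sketch of stochastic quantizer depth (helper name and defaults
# are assumptions, not the project's released API): with probability p during
# training, the last n_drop residual scales are dropped, so coarser scales are
# forced to carry more of the reconstruction signal.
def quantize_with_stochastic_depth(residual_blocks, p=0.5, n_drop=2, training=True):
    """residual_blocks: list of K residual token blocks, ordered coarse to fine."""
    if training and n_drop > 0 and random.random() < p:
        return residual_blocks[:-n_drop]   # keep only the first K - n_drop scales
    return residual_blocks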

In applications like video editing, users upload a video, which the tokenizer compresses into a pyramid for autoregressive extension. Code example (module and method names are illustrative, not the released API):

# Initialize with pretrained video tokenizer
from tokenizer import VideoTokenizer

tokenizer = VideoTokenizer.from_pretrained('continuous_video_tokenizer')
# Quantize video
latents = tokenizer.encode(video)
residual_blocks = tokenizer.quantize(latents, stochastic_depth=True)

This approach significantly boosts detail fidelity in high-resolution video generation.

Autoregressive Transformer: How Does It Handle Spacetime Dependencies?

This section tackles the core question: How does the autoregressive Transformer capture dependencies in the spacetime pyramid, and how is it used in generation tasks? The autoregressive Transformer predicts the next residual token block conditioned on text embedding and prior blocks.

Based on the VAR framework, it predefines a scale schedule {(h1, w1), …, (hK, wK)} that forms a pyramid as the scales grow. During training, it learns to predict p(rk | r<k, ψ(t)); at inference, it runs autoregressively K times, merges the predicted tokens, and decodes them into pixels.

For videos, it extends to spacetime: the first clip acts as an image pyramid, with subsequent clips conditioned on priors. Semantic scale repetition refines early semantic scale predictions, improving video details and motion complexity.
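
The sketch below shows one possible shape of this K-step loop for a single clip. transformer.predict, tokenizer.decode, the assumed (T, C, h, w) block layout, and the nearest-neighbor merge are illustrative assumptions rather than the project's actual interface.

import torch.nn.functional as F

# Sketch of next-scale inference for a single clip (all names are illustrative,
# not the released API). The transformer runs K autoregressive steps; each
# predicted residual block is upsampled to the final resolution and accumulated
# before the tokenizer decodes the merged result into frames.
def infer_clip(transformer, tokenizer, text_embedding, scale_schedule, prior_blocks):
    target_h, target_w = scale_schedule[-1]
    history = list(prior_blocks)               # blocks from earlier clips/scales
    accumulated = None
    for (h, w) in scale_schedule:              # K steps, coarse to fine
        # residual: assumed layout (T, C, h, w) for the current scale
        residual = transformer.predict(history, text_embedding, scale=(h, w))
        history.append(residual)
        # nearest-neighbor upsampling is only a stand-in for the model's merge operator
        upsampled = F.interpolate(residual, size=(target_h, target_w), mode="nearest")
        accumulated = upsampled if accumulated is None else accumulated + upsampled
    return tokenizer.decode(accumulated)       # merged latent -> pixel frames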

In scenarios like interactive video synthesis, users provide an initial video and multiple prompts; the model generates extended videos clip by clip. For instance, in anime sequences, start with “cute kitten” for the first frame, then extend with “jumping motion.”

Reflection: A key lesson learned was that while a unified architecture is elegant, balancing spatial and temporal extensions is crucial to avoid flickering, as validated through pseudo-spacetime pyramid experiments.

InfinityStar’s Related Work and Comparisons

Video Diffusion Models: Where Does InfinityStar Hold Advantages?

This section explores the core question: Compared to video diffusion models, what advantages does InfinityStar offer in quality and efficiency, and how does it perform in benchmarks? Video diffusion models generate high-fidelity data by progressively denoising random noise but suffer from slow generation speeds that limit high-resolution, long-duration video production.

Early U-Net-based models produced sharp, temporally coherent frames but were limited by model capacity. Diffusion Transformers, such as the DiT backbone popularized by Sora, process spatiotemporal patches at scale, enhancing consistency and quality and sparking a wave of industry innovation. Despite their excellent quality, diffusion models still require many denoising steps per sample.

InfinityStar, as an autoregressive model, scores 83.74 on VBench, surpassing HunyuanVideo (83.24). It generates 5s 720p videos 10 times faster. Benchmark table:

Model Type                            | VBench Score | Inference Speed (Relative to Diffusion) | Resolution Support
Diffusion Models (HunyuanVideo)       | 83.24        | 1x                                      | 720p
Autoregressive Models (InfinityStar)  | 83.74        | ~10x                                    | 720p

In news video creation, InfinityStar quickly produces high-fidelity clips for real-time applications.

Video Autoregressive Models: What Are InfinityStar’s Innovations?

This section answers the core question: How does InfinityStar improve upon existing video autoregressive models, and how does it apply to generation efficiency? Video autoregressive models use Transformers to predict tokens along spatial and temporal axes but require hundreds to thousands of inference steps, leading to low efficiency.

Models like Emu3 predict next tokens across axes, while Nova handles spatial then temporal. Despite progress, efficiency remains poor. InfinityStar extends next-scale prediction to unified image-video generation, supporting zero-shot image-to-video and video extrapolation.

In anime generation, input an image, and the model creates video continuations more efficiently than traditional autoregressives.

Discrete Video Tokenizers: InfinityStar’s Enhancement Strategies

This section addresses the core question: How does InfinityStar improve discrete video tokenizers through knowledge inheritance, and how is it applied in reconstruction? Discrete and continuous video tokenizers developed independently, with misaligned architectures hindering knowledge reuse.

InfinityStar inherits from continuous video tokenizers, avoiding scratch training inefficiencies and image pretraining mismatches. Experiments show significant convergence boosts.

Application: In biological simulation videos, inherited weights enable quick high-detail frame reconstruction.

Training and Inference with InfinityStar

Training Scripts: How to Organize Data and Launch Training?

This section answers the core question: How do you use the provided training scripts to organize data and start InfinityStar training? The training process involves data organization, feature extraction, and script execution.

Install the environment: use torch>=2.5.1 (required for FlexAttention), then run pip install -r requirements.txt.

Data organization: Refer to data/README.md for large-scale video corpora supporting 720p and variable durations.

Launch training: Scripts cover tokenizer and Transformer stages. Example:

# Stage 1: Train tokenizer
python train_tokenizer.py --data_path /path/to/videos --pretrained continuous_video_tokenizer
# Stage 2: Train VAR Transformer
python train_var.py --tokenizer_path /path/to/trained_tokenizer --scale_schedule '1x1,2x2,...,HxW'

On world-simulation datasets, this pipeline trains models toward effectively unbounded video generation.

Reflection: In training, stochastic quantizer depth proved essential for balancing information, preventing later scales from dominating optimization.

Inference Scripts: How to Generate Images and Videos?

This section tackles the core question: How do you use inference scripts to generate videos at different resolutions, and apply them in interactive modes? Inference supports 720p and 480p video generation.

For 720p: Use tools/infer_video_720p.py for text-to-video and image-to-video.

python3 tools/infer_video_720p.py --prompt "cute kitten jumping" --image_path /path/to/image.jpg

Generates 5s 720p videos.

For 480p variable lengths: Set generation_duration to 5 or 10, supporting image-to-video and video continuation.

python3 tools/infer_video_480p.py --duration 10 --video_path /path/to/input.mp4

For long interactive videos: tools/infer_interact_480p.py enables multi-prompt interaction.

python3 tools/infer_interact_480p.py --ref_video /path/to/ref.mp4 --prompts "prompt1,prompt2"

In product prototyping, this facilitates user-interactive content creation.


InfinityStar’s Benchmarks and Visualizations

Benchmark Performance: How Does InfinityStar Excel in Image and Video Generation?

This section answers the core question: How does InfinityStar outperform competitors in benchmarks, and how is it evaluated in practice? InfinityStar achieves state-of-the-art performance on image generation benchmarks.

On video benchmarks: VBench score of 83.74, surpassing diffusion models.

Human evaluations: Outperforms HunyuanVideo.

Application: In journalism, benchmarks validate quality for reporting videos.

Visualization Examples: What Do Actual Generations Look Like?

This section explores the core question: How do InfinityStar’s generated images and videos demonstrate its capabilities? Examples in general aesthetics, anime, and motion showcase high fidelity.

Text-to-image: Such as cute kitten images.

Image-to-video: Extending static images dynamically.

Video extrapolation: Continuing input videos.


Image source: Project documentation

Reflection: Visualizations highlighted autoregressive advantages in motion complexity due to clip-by-clip prediction, avoiding global denoising limitations.

Conclusion: The Value and Future of InfinityStar

InfinityStar’s unified spacetime autoregressive framework enables high-quality, efficient visual generation across tasks, suitable for content creation and beyond.

Future extensions could include longer videos and enhanced interactions.

Practical Summary / Action Checklist

  • Installation: pip install -r requirements.txt, with torch>=2.5.1.
  • Training: Organize data, run tokenizer and VAR scripts.
  • Inference: Use infer_video_720p.py for 720p; infer_interact_480p.py for interactions.
  • Evaluation: Refer to VBench benchmarks.

One-Page Summary

InfinityStar: Unified spacetime autoregressive model supporting T2I, T2V, I2V, V2V. Architecture: Spacetime pyramid, inherited tokenizer, stochastic depth. Performance: VBench 83.74, 10x faster than diffusion. Training: Two-stage scripts. Inference: 720p/480p scripts, interactive support. Examples: Anime, motion videos.

Frequently Asked Questions (FAQ)

  1. What generation tasks does InfinityStar support?
    It supports text-to-image, text-to-video, image-to-video, and video extrapolation.

  2. How does the tokenizer improve reconstruction?
    Through knowledge inheritance from continuous video tokenizers and stochastic quantizer depth.

  3. What is InfinityStar’s inference speed advantage?
    It generates 5s 720p videos 10 times faster than diffusion models.

  4. What environment is needed for training?
    torch>=2.5.1 and packages from requirements.txt.

  5. How to generate long interactive videos?
    Use infer_interact_480p.py with a reference video and multiple prompts.

  6. How does spacetime pyramid work?
    It decomposes videos into clip pyramids, predicting autoregressively across scales.

  7. What is InfinityStar’s benchmark score?
    VBench 83.74, surpassing HunyuanVideo.

  8. What resolutions and durations are supported?
    720p at 5s, 480p at 5-10s, with potential for infinite extensions.
