Exploring LTX-2: How to Generate Synchronized Audio-Video with Open-Source Models

Summary

LTX-2 is a DiT-based audio-video foundation model that generates synchronized video and audio in a single framework, supporting high-fidelity outputs and multiple performance modes. Using its PyTorch codebase, you can run it locally to create videos whose width and height are divisible by 32 and whose frame count is a multiple of 8 plus 1. The model ships in 19B-parameter dev and distilled versions, is well suited to text-to-video and image-to-video tasks, and comes with open weights and training support.

What Is LTX-2? Why Should You Care About This Model?

Imagine wanting to create a short video where the visuals flow seamlessly with perfectly timed background music or sound effects—everything in sync without juggling multiple tools. In the past, this might have required complex workflows, but LTX-2 changes that. Developed by Lightricks, LTX-2 is the first DiT-based audio-video foundation model that packs all the core features of modern video generation into one system. Simply put, it lets you generate videos with audio directly from text or images, and it’s fully open-source.

If you’re a recent engineering graduate or developer dipping into AI-generated content, you might wonder: “What exactly can LTX-2 do?” It handles synchronized audio and video creation, delivers high-fidelity results, and offers various performance modes like fast inference or production-ready outputs. It even includes API access and open availability, making it easy to integrate into your projects. Don’t worry—I’ll guide you through getting started, from installation to actually generating videos.

At its heart, LTX-2 uses a diffusion model architecture, meaning it creates content through a step-by-step denoising process. Unlike traditional video tools, it emphasizes practicality and local execution, so you can run it on your own hardware without relying on cloud services. This is a huge plus for anyone experimenting with AI generation locally.

LTX-2 Model Checkpoints: Picking the Right Version for Your Needs

LTX-2 comes with several checkpoints, each tailored to specific use cases. Are you focused on training the model or just quick video generation? These versions cover those scenarios. Here's a quick rundown to help you compare them at a glance.

  • ltx-2-19b-dev: Full model, supports bf16 training and flexible usage
  • ltx-2-19b-dev-fp8: Full model with fp8 quantization for a reduced memory footprint
  • ltx-2-19b-dev-fp4: Full model with nvfp4 quantization for even more memory optimization
  • ltx-2-19b-distilled: Distilled version with fixed 8-step sampling and CFG=1, perfect for fast generation
  • ltx-2-19b-distilled-lora-384: LoRA adaptation of the distilled model, applicable to the full version
  • ltx-2-spatial-upscaler-x2-1.0: x2 spatial upscaler for multi-stage pipelines to boost resolution
  • ltx-2-temporal-upscaler-x2-1.0: x2 temporal upscaler for multi-stage pipelines to increase frame rates

All these checkpoints are hosted on Hugging Face for easy downloads. For instance, if you're short on memory, go for the fp8 or fp4 variants to save resources. The dev version is the full 19B-parameter model and the natural choice for training or fine-tuning, while the distilled version prioritizes speed, generating content in just 8 sampling steps.

You might ask: “Which one should I choose?” For beginners, start with ltx-2-19b-distilled—it strikes a great balance between quality and efficiency. Advanced users will appreciate the dev version for full training or fine-tuning flexibility.

Model Details: The Technical Foundations Behind LTX-2

LTX-2 is a diffusion-based audio-video foundation model designed primarily for English. It integrates video and audio generation seamlessly, ensuring that your outputs aren’t silent—audio like background tracks or ambient sounds syncs perfectly with visual actions.

Lightricks, the development team, focused on making this model practical. As a foundation model, it serves as a starting point for various tasks, such as text-to-video or image-to-video. With a 19B parameter scale, it ensures high-fidelity outputs but requires adequate computational resources.

A standout feature is its multi-stage pipeline support. Using spatial and temporal upscalers, you can begin with low-resolution generation and scale up to higher resolutions or frame rates. This approach is highly practical for production, as it lets you manage generation time and output quality effectively.
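
To make the multi-stage idea concrete, here is a minimal back-of-the-envelope sketch. The base resolution and frame count are illustrative numbers chosen to satisfy the model's divisibility rules, not LTX-2's actual defaults.

# Illustrative arithmetic only: generate a small, short clip first,
# then apply the x2 spatial and x2 temporal upscalers.
base_width, base_height = 768, 512   # stage-1 resolution (divisible by 32)
base_frames = 121                    # stage-1 frame count (8 * 15 + 1)

# The x2 spatial upscaler doubles width and height
hi_width, hi_height = base_width * 2, base_height * 2   # 1536 x 1024

# The x2 temporal upscaler roughly doubles the frame count
hi_frames = (base_frames - 1) * 2 + 1                    # 241 frames

print(f"Stage 1: {base_width}x{base_height}, {base_frames} frames")
print(f"Stage 2: {hi_width}x{hi_height}, {hi_frames} frames")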

How to Try LTX-2 Online? Quick Demo Guide

Not ready to install yet? LTX-2 offers online demos for instant access. Head to LTX-Studio’s text-to-video page or image-to-video playground, where you can input prompts and generate content right in your browser.

For example, type in “A cat chasing butterflies in a park with birds chirping in the background,” and the model will produce a matching video with audio. This is an ideal entry point for newcomers—play around with the demo to gauge prompt effectiveness before diving into local setup.

Running LTX-2 Locally: Step-by-Step Installation Tutorial

Let’s dive into setting up LTX-2 on your machine. This is a straightforward how-to guide with numbered steps. Keep in mind, the codebase is PyTorch-based, requiring Python 3.12 or higher, CUDA 12.7+, and PyTorch around 2.7.

Step 1: Clone the Repository

Start by cloning from GitHub:

git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2

This pulls down the entire monorepo, including core model, pipelines, and training tools.

Step 2: Set Up the Environment

Sync dependencies using uv:

uv sync
source .venv/bin/activate

This creates a virtual environment with all necessary packages installed.
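
Before moving on, a quick sanity check helps confirm that PyTorch sees your GPU. This is a generic PyTorch check, not an LTX-2-specific command:

# Verify that PyTorch is installed and CUDA is visible inside the new environment
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))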

Step 3: Download Required Models

Grab checkpoints from Hugging Face. Essentials include:

  • LTX-2 model checkpoint (pick one, like ltx-2-19b-dev-fp8.safetensors)

  • Spatial upscaler: ltx-2-spatial-upscaler-x2-1.0.safetensors

  • Temporal upscaler: ltx-2-temporal-upscaler-x2-1.0.safetensors

  • Distilled LoRA: ltx-2-19b-distilled-lora-384.safetensors (for most pipelines)

  • Gemma text encoder: Download all files from google/gemma-3-12b-it-qat-q4_0-unquantized

You can also download additional LoRAs for effects like camera control (e.g., ltx-2-19b-lora-camera-control-dolly-in.safetensors); these add features such as dolly-in shots or static framing.

Place them in the appropriate directory, typically the repo root or a specified path.
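
If you prefer scripting the downloads, the huggingface_hub client can fetch files for you. Treat the repo_id values below as placeholders and substitute the exact repository and file names listed on the Hugging Face model pages:

# Sketch: fetch checkpoints with huggingface_hub.
# NOTE: repo_id values are placeholders -- use the real ones from Hugging Face.
from huggingface_hub import hf_hub_download, snapshot_download

checkpoint_path = hf_hub_download(
    repo_id="Lightricks/LTX-2",                   # placeholder repo id
    filename="ltx-2-19b-dev-fp8.safetensors",     # pick the variant you need
    local_dir="./checkpoints",
)

# The Gemma text encoder is a full repository, so grab everything in one call
text_encoder_dir = snapshot_download(
    repo_id="google/gemma-3-12b-it-qat-q4_0-unquantized",
    local_dir="./checkpoints/gemma-3-12b-it",
)

print(checkpoint_path, text_encoder_dir)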

Step 4: Run Inference

Inference lives in the ltx-pipelines package. Check its README for details. Basically, import a pipeline like TI2VidTwoStagesPipeline, then generate from a prompt.

Sample Python script:

from ltx_pipelines import TI2VidTwoStagesPipeline

# Load the two-stage text/image-to-video pipeline from a downloaded checkpoint
pipeline = TI2VidTwoStagesPipeline.from_pretrained("ltx-2-19b-dev-fp8")

# Generate a clip with synchronized audio from a text prompt
video = pipeline("A serene forest scene with birds chirping")

This outputs a video file with integrated audio.

Usage License: What Can You Do with LTX-2?

LTX-2’s license is user-friendly—you can leverage the full model, distilled versions, upscalers, or derivatives as per the terms on Hugging Face. This covers personal projects, research, or even commercial applications, but always verify the specifics.

ComfyUI Integration: Graphical Interface for Video Generation

Prefer a no-code approach? ComfyUI is excellent. It includes built-in LTXVideo nodes installable via ComfyUI Manager. For manual setup, refer to the documentation site.

In ComfyUI, drag and drop nodes to build workflows—connect text prompts to the LTX-2 model and output videos. It’s great for visual debugging.

PyTorch Codebase: Diving into the Core

LTX-2’s codebase is a monorepo with key packages:

  • ltx-core: Model definitions, inference stack, and utilities

  • ltx-pipelines: High-level pipelines for text-to-video and more

  • ltx-trainer: Tools for training and fine-tuning LoRA or IC-LoRA

Each has its own README. After installation, explore ltx-pipelines’ README for inference examples.

Diffusers Support: Seamless Workflow Integration

LTX-2 works with the Diffusers library for image-to-video tasks. If you’re already using Diffusers, load the model and use its pipeline for generation.
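
As a rough sketch of what that looks like: the checkpoint ID, the pipeline class Diffusers resolves to, and the image keyword below are assumptions, so check the Diffusers documentation and the LTX-2 model card for the exact names before running it.

# Hedged Diffusers sketch for image-to-video; names marked as placeholders are assumptions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2",              # placeholder checkpoint id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = load_image("first_frame.png")   # conditioning image for image-to-video
result = pipe(
    prompt="A serene forest scene with birds chirping",
    image=image,                         # assumed keyword for the conditioning image
)
export_to_video(result.frames[0], "output.mp4", fps=24)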

Prompting Tips for LTX-2: Crafting Effective Descriptions

Prompts are crucial for LTX-2. You might wonder: “How do I write prompts that yield great videos?” Focus on detailed, chronological descriptions in a single flowing paragraph, including specific movements, appearances, camera angles, and environmental details.

Suggested structure:

  1. Begin with the main action.

  2. Add details on movements and gestures.

  3. Describe appearances precisely.

  4. Include background elements.

  5. Specify camera movements.

  6. Note lighting and colors.

  7. Mention any changes or events.

Keep it under 200 words. Pipelines include an enhance_prompt option for automatic optimization.

Example: “A girl dancing in a sunlit garden wearing a flowing white dress, with buzzing bees and birdsong in the background, camera slowly zooming in on her smile.”

Widths and heights must be divisible by 32, and frame counts must be a multiple of 8 plus 1 (e.g., 121 or 257). If your inputs don't meet these constraints, pad them with -1 and crop the outputs accordingly.
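
A small helper makes the rule concrete. This function is hypothetical (not part of the LTX-2 codebase); it simply snaps requested dimensions to the nearest valid values:

# Hypothetical helper: snap requested sizes to LTX-2's constraints.
# Width/height must be a multiple of 32; frame count must be 8 * n + 1.
def snap_resolution(width: int, height: int, num_frames: int) -> tuple[int, int, int]:
    snapped_w = round(width / 32) * 32
    snapped_h = round(height / 32) * 32
    snapped_f = round((num_frames - 1) / 8) * 8 + 1
    return snapped_w, snapped_h, snapped_f

print(snap_resolution(1280, 720, 120))   # -> (1280, 704, 121)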

Available Pipelines: Choosing Your Generation Mode

LTX-2 offers multiple pipelines for different needs:

  • TI2VidTwoStagesPipeline: Production-quality text-to-video with 2x upsampling (recommended)

  • TI2VidOneStagePipeline: Single-stage for rapid prototyping

  • DistilledPipeline: Fastest inference with 8 steps

  • ICLoraPipeline: For video-to-video or image-to-video transformations

  • KeyframeInterpolationPipeline: Interpolate between keyframe images

With DistilledPipeline, expect 8 steps in stage 1 and 4 in stage 2.
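
As a rough illustration of switching modes, the snippet below mirrors the earlier sample script; the class names come from the list above, but the exact loading calls and arguments are assumptions, so confirm them in the ltx-pipelines README.

# Sketch: pick a pipeline based on whether you want speed or quality.
from ltx_pipelines import DistilledPipeline, TI2VidTwoStagesPipeline

FAST_ITERATION = True

if FAST_ITERATION:
    # Distilled: fixed 8-step sampling in stage 1, 4 steps in stage 2
    pipeline = DistilledPipeline.from_pretrained("ltx-2-19b-distilled")
else:
    # Two-stage: production quality with 2x upsampling
    pipeline = TI2VidTwoStagesPipeline.from_pretrained("ltx-2-19b-dev-fp8")

video = pipeline("A serene forest scene with birds chirping")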

Optimization Tips: Making Generation Faster and Better

Looking to speed things up? Try these:

  • Use DistilledPipeline for fixed 8-step sampling.

  • Enable FP8: via --enable-fp8 or fp8transformer=True to cut memory use.

  • Install xFormers or Flash Attention 3.

  • Apply gradient estimation: Reduce from 40 to 20-30 steps while preserving quality.

  • Skip memory cleanup if VRAM allows.

  • Opt for single-stage pipelines when high resolution isn’t essential.

These tweaks can dramatically reduce generation time without sacrificing output.
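
To see how much headroom a given optimization buys you, measure peak VRAM around a generation call with PyTorch's built-in memory statistics (the generation call itself is whatever pipeline you are already using):

# Compare optimizations by measuring peak GPU memory around a generation call
import torch

torch.cuda.reset_peak_memory_stats()

# ... run your generation here, e.g. video = pipeline(prompt) ...

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM used: {peak_gb:.1f} GiB")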

Limitations: What LTX-2 Isn’t Perfect For

To be transparent, LTX-2 has some constraints:

  • It doesn’t provide factual information.

  • It may amplify societal biases.

  • Outputs might not perfectly match prompts.

  • Prompt style heavily influences results.

  • It could generate inappropriate content.

  • Audio without speech may be lower quality.

Understanding these helps set realistic expectations.

How to Train LTX-2? Fine-Tuning Your Own Model

The base dev model is fully trainable. Using ltx-trainer, replicating published LoRAs and IC-LoRAs is straightforward. Training for motion, style, or likeness (sound + appearance) can take under an hour in many cases.

Refer to ltx-trainer’s README for step-by-step guidance. Ideal for customizing the model to your needs.

FAQ: Common Questions About LTX-2

What languages does LTX-2 support?

English only.

How much VRAM do I need for video generation?

It varies by version; fp8 needs less, but 24GB+ is recommended.

What can LoRAs do?

They add controls like camera dolly-in or canny edge detection.

What if my prompt is too short?

Results may be inaccurate—aim for detailed descriptions.

How to handle non-compliant resolutions?

Pad inputs with -1, then crop outputs.

What data is needed for training?

Check the trainer README for custom dataset support.

ComfyUI vs. PyTorch: Which is better?

ComfyUI for visual workflows; PyTorch for scripting.

Conclusion: Start Your LTX-2 Journey Today

LTX-2 unlocks new possibilities in audio-video generation, from local setups to custom training—all open-source and practical. Whether you’re creating fun clips or integrating into apps, it’s worth exploring. Try the demo, then go local—you’ll be amazed. Got questions? Join the Discord community for discussions.