StableAvatar: Generating Infinite-Length Audio-Driven Avatar Videos with AI
The field of artificial intelligence is continuously evolving, and one of the most exciting challenges researchers and developers face is creating virtual avatars that can speak, sing, or perform based solely on audio input—without limitations on video length. Meet StableAvatar, a groundbreaking solution designed to tackle this very problem.
This advanced AI model can generate high-fidelity, identity-consistent avatar videos of theoretically infinite length, entirely from a reference image and an audio clip. What sets it apart is its complete end-to-end generation capability—it does not rely on any external face-processing tools like FaceFusion, GFP-GAN, or CodeFormer.
What Is StableAvatar?
StableAvatar is an audio-driven video generation model based on a Diffusion Transformer (DiT). By providing a single reference image and an audio track, the model synthesizes a video in which the avatar’s lip movements, expressions, and gestures are synchronized with the audio—and it can do so for videos of any duration.
🔥 Project Page: https://francis-rings.github.io/StableAvatar
📄 Research Paper: https://arxiv.org/abs/2508.08248
🧠 Model Weights: https://huggingface.co/FrancisRing/StableAvatar
Why Is This a Breakthrough?
Existing audio-driven video generation models typically suffer from several limitations:
- Inability to generate long videos without losing lip synchronization;
- Identity inconsistency—the character gradually looks less like the reference;
- Dependence on post-processing tools, making the pipeline complex and inefficient.
StableAvatar introduces several innovations to overcome these challenges:
- Time-step-aware Audio Adapter: Prevents error accumulation during generation and avoids latent distribution drift between video segments.
- Audio Native Guidance Mechanism: Enhances audio-video synchronization during inference.
- Dynamic Weighted Sliding-window Strategy: Improves temporal coherence and smoothness in long-duration videos (a minimal sketch of this idea appears right after this list).
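To give an intuition for how a weighted sliding-window strategy can stitch overlapping video segments into one long sequence, here is a minimal, illustrative sketch. It is not taken from the StableAvatar codebase: the segment length, overlap, and linear blending weights are assumptions chosen purely for demonstration.

```python
import torch

def blend_sliding_windows(segments, overlap):
    """Illustrative only: merge overlapping latent segments of shape [T, C, H, W]
    using linearly ramped weights, so each overlapping frame is a weighted average
    of the windows that cover it (this is NOT the exact StableAvatar code)."""
    total = sum(s.shape[0] for s in segments) - overlap * (len(segments) - 1)
    out = torch.zeros(total, *segments[0].shape[1:])
    weight = torch.zeros(total, 1, 1, 1)
    pos = 0
    for seg in segments:
        t = seg.shape[0]
        w = torch.ones(t, 1, 1, 1)
        if pos > 0:  # ramp up across the region shared with the previous window
            w[:overlap, 0, 0, 0] = torch.linspace(0.0, 1.0, overlap)
        out[pos:pos + t] += seg * w
        weight[pos:pos + t] += w
        pos += t - overlap
    return out / weight.clamp(min=1e-6)

# Example: three 16-frame latent segments with a 5-frame overlap
segs = [torch.randn(16, 4, 60, 104) for _ in range(3)]
video_latents = blend_sliding_windows(segs, overlap=5)
print(video_latents.shape)  # torch.Size([38, 4, 60, 104])
```

The key property is that frames inside an overlap are never taken from a single window, which is what smooths the seams between consecutively generated segments.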
Demonstrations of StableAvatar
Below are sample videos generated by StableAvatar. All results are raw outputs without any post-processing:
https://github.com/user-attachments/assets/d7eca208-6a14-46af-b337-fb4d2b66ba8d
https://github.com/user-attachments/assets/b5902ac4-8188-4da8-b9e6-6df280690ed1
https://github.com/user-attachments/assets/87faa5c1-a118-4a03-a071-45f18e87e6a0
https://github.com/user-attachments/assets/531eb413-8993-4f8f-9804-e3c5ec5794d4
https://github.com/user-attachments/assets/cdc603e2-df46-4cf8-a14e-1575053f996f
https://github.com/user-attachments/assets/7022dc93-f705-46e5-b8fc-3a3fb755795c
https://github.com/user-attachments/assets/0ba059eb-ff6f-4d94-80e6-f758c613b737
https://github.com/user-attachments/assets/03e6c1df-85c6-448d-b40d-aacb8add4e45
https://github.com/user-attachments/assets/90b78154-dda0-4eaa-91fd-b5485b718a7f
How Does StableAvatar Work?
The overall framework of StableAvatar is illustrated in the architecture diagram on the project page.
The key stages include:
- Input: a reference image plus driving audio;
- Processing: custom training and inference modules;
- Output: high-quality video of unlimited length.
Unlike previous approaches that often depend on third-party audio feature extractors, StableAvatar integrates audio processing internally. This deep fusion of audio and visual signals significantly improves the stability and consistency of long video generation.
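As a rough illustration of what "deep fusion" of audio into a diffusion backbone can look like, the toy block below injects audio embeddings into video tokens through cross-attention. This is a conceptual sketch only, not StableAvatar's actual Time-step-aware Audio Adapter; the dimensions and module layout are assumptions.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Toy audio-to-video cross-attention block (illustrative only).
    Video latent tokens act as queries; audio embeddings act as keys/values,
    so every denoising step can condition directly on the audio signal."""
    def __init__(self, dim=1024, audio_dim=768, heads=8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)  # map audio features into the DiT width
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_feats):
        audio_ctx = self.audio_proj(audio_feats)
        fused, _ = self.attn(self.norm(video_tokens), audio_ctx, audio_ctx)
        return video_tokens + fused  # residual fusion of audio into the video stream

# Example shapes: 1 clip, 1560 video tokens, 50 frames of wav2vec2-style audio features
block = AudioCrossAttention()
video_tokens = torch.randn(1, 1560, 1024)
audio_feats = torch.randn(1, 50, 768)
print(block(video_tokens, audio_feats).shape)  # torch.Size([1, 1560, 1024])
```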
How to Get Started with StableAvatar
If you are a developer or researcher interested in experimenting with StableAvatar, follow the steps below.
Environment Setup
Choose the installation command based on your GPU type:
For Most NVIDIA GPUs (CUDA 12.4):
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install flash_attn # Optional, for accelerated attention computation
For Blackwell Series GPUs (e.g., RTX 6000 Pro):
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install flash_attn
Download Model Weights
If direct downloads from Hugging Face are slow in your region, you can optionally set a mirror endpoint first:
export HF_ENDPOINT=https://hf-mirror.com
pip install "huggingface_hub[cli]"
mkdir checkpoints
huggingface-cli download FrancisRing/StableAvatar --local-dir ./checkpoints
After downloading, organize the files as follows:
StableAvatar/
├── checkpoints
│   ├── Kim_Vocal_2.onnx
│   ├── wav2vec2-base-960h
│   ├── Wan2.1-Fun-V1.1-1.3B-InP
│   └── StableAvatar-1.3B
…
Audio Extraction and Vocal Separation
If you have a video file (e.g., .mp4), first extract the audio:
python audio_extractor.py --video_path="path/to/video.mp4" --saved_audio_path="path/to/audio.wav"
To improve lip-sync quality, separate vocal sounds from background music:
pip install "audio-separator[gpu]"
python vocal_seperator.py --audio_separator_model_file="checkpoints/Kim_Vocal_2.onnx" --audio_file_path="path/to/audio.wav" --saved_vocal_path="path/to/vocal.wav"
Generate Videos via Inference
Use the provided script to start generating videos:
bash inference.sh
You can modify parameters in inference.sh, such as the resolution (512×512, 480×832, and 832×480 are supported), the reference image path, the audio path, and more.
💡 Helpful Tips:
- Recommended inference steps (sample_steps): 30–50 (more steps yield higher quality);
- Overlap window length (overlap_window_length): 5–15 (longer overlaps improve smoothness but slow down inference; the short calculation after this list shows why);
- If GPU memory is limited, set --GPU_memory_mode to sequential_cpu_offload, which uses only about 3 GB of VRAM.
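To see why a longer overlap slows inference, here is a small back-of-the-envelope helper. The frame counts and window length are hypothetical, and this is not StableAvatar's actual scheduler; the point is simply that every overlapping frame is denoised twice, so more overlap means more windows and more compute.

```python
import math

def num_windows(total_frames, window_length, overlap):
    """How many sliding windows are needed to cover total_frames latent frames
    when consecutive windows share `overlap` frames (illustrative arithmetic)."""
    stride = window_length - overlap
    return 1 + math.ceil(max(total_frames - window_length, 0) / stride)

# Hypothetical example: 300 latent frames, 49-frame windows
for overlap in (5, 10, 15):
    print(overlap, num_windows(300, 49, overlap))
# overlap=5 -> 7 windows, overlap=10 -> 8, overlap=15 -> 9 (more windows = more compute)
```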
Add Audio to the Generated Video
StableAvatar outputs video without audio. Use ffmpeg to merge audio and video:
ffmpeg -i video_no_audio.mp4 -i audio.wav -c:v copy -c:a aac -shortest output_with_audio.mp4
How to Train Your Own StableAvatar Model
If you wish to train the model on your custom dataset, StableAvatar provides full training support.
Prepare Your Data
Organize your dataset in the following structure:
talking_face_data/
├── square/                              # 512x512 videos
├── rec/                                 # 832x480 videos
└── vec/                                 # 480x832 videos
    └── {speech, singing, dancing}/      # each resolution folder contains these subfolders
        └── {id}/
            ├── images/                  # video frames
            ├── face_masks/              # face masks
            ├── lip_masks/               # lip masks
            ├── sub_clip.mp4             # original video
            └── audio.wav                # audio track
Use FFmpeg to extract frames from videos:
ffmpeg -i raw_video.mp4 -q:v 1 -start_number 0 talking_face_data/rec/speech/00001/images/frame_%d.png
For face mask extraction, refer to StableAnimator. For lip masks, use:
python lip_mask_extractor.py --folder_root="talking_face_data/rec/singing" --start=1 --end=500
Start Training
Single-Machine Training (512×512):
bash train_1B_square.sh
Multi-Machine Training (Mixed Resolutions):
bash train_1B_rec_vec_64.sh
Important notes during training:
- Mixed-resolution training requires approximately 50 GB of VRAM;
- Use videos with static backgrounds for better reconstruction loss calculation;
- Audio should be clear with minimal background noise.
Model Fine-Tuning and LoRA Training
StableAvatar supports both full fine-tuning and lightweight LoRA fine-tuning:
# Full fine-tuning
bash train_1B_rec_vec.sh
# LoRA fine-tuning
bash train_1B_rec_vec_lora.sh
Adjust rank and network_alpha to control the capacity and adaptability of the LoRA adapter; the sketch after this paragraph shows how the two values interact.
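For readers unfamiliar with these two knobs, here is a minimal LoRA linear layer: rank sets the size of the low-rank update, while network_alpha scales it, with an effective scale of alpha / rank. This is a generic LoRA sketch under those standard conventions, not StableAvatar's training code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper (illustrative): y = W x + (alpha / rank) * B A x.
    `rank` controls the capacity of the low-rank update; `network_alpha`
    controls how strongly that update is applied to the frozen base weight."""
    def __init__(self, base: nn.Linear, rank=64, network_alpha=64):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)      # start as a zero (identity) update
        self.scale = network_alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))

layer = LoRALinear(nn.Linear(1024, 1024), rank=64, network_alpha=32)
print(layer(torch.randn(2, 1024)).shape)  # torch.Size([2, 1024])
```

A higher rank gives the adapter more capacity at the cost of more trainable parameters, while raising network_alpha strengthens the adapter's influence relative to the frozen base model.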
System Requirements and Performance Reference
| Task | VRAM Usage | Time (RTX 4090) |
|---|---|---|
| 5-second video (480×832) | ~18 GB | ~3 minutes |
| Training (mixed resolution) | ~50 GB | Depends on epochs |
| LoRA fine-tuning | Reduced | Significantly faster |
Note: Although StableAvatar can generate hours of video, decoding extremely long sequences may require running the VAE decoder on the CPU to save VRAM.
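As a rough idea of what that looks like in practice, the snippet below decodes the latent video in small temporal chunks on the CPU. The `vae` object and its `decode` call are placeholders standing in for whatever decoder the pipeline actually uses; treat this as a memory-saving pattern, not StableAvatar's implementation.

```python
import torch

@torch.no_grad()
def decode_latents_on_cpu(vae, latents, chunk_frames=8):
    """Decode video latents of shape [T, C, h, w] a few frames at a time on the CPU
    to avoid exhausting GPU memory on very long sequences (illustrative pattern;
    `vae.decode` is a placeholder for the pipeline's real decoder)."""
    vae = vae.to("cpu").eval()
    frames = []
    for start in range(0, latents.shape[0], chunk_frames):
        chunk = latents[start:start + chunk_frames].to("cpu", dtype=torch.float32)
        frames.append(vae.decode(chunk))  # -> pixel-space frames for this chunk
    return torch.cat(frames, dim=0)
```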
Frequently Asked Questions (FAQ)
Q1: What resolutions does StableAvatar support?
It currently supports 512×512, 480×832, and 832×480 resolutions, configurable via parameters.
Q2: Can StableAvatar be used for commercial purposes?
Please refer to the project license. Currently, the model weights are available for free on Hugging Face for research purposes.
Q3: Is there a length limit for generated videos?
Theoretically, no. However, hardware memory constraints apply. The sliding-window mechanism allows segment-by-segment generation.
Q4: Can it generate non-facial or object videos?
The current model is optimized for human faces, but the architecture has the potential to be extended to other object categories.
Q5: How can I improve lip-sync quality?
Use vocal-separated audio and increase the audio guidance scale (sample_audio_guide_scale).
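For intuition, a larger guidance scale pushes each denoising step further toward the audio-conditioned prediction, in the spirit of classifier-free guidance. The line below is a generic formulation of that idea with an arbitrary example scale; it is not necessarily the exact form of StableAvatar's Audio Native Guidance.

```python
# Generic guidance step (illustrative): the scale amplifies how strongly
# the audio condition steers the denoising direction.
# noise_uncond: model prediction without audio; noise_audio: prediction with audio.
def guided_noise(noise_uncond, noise_audio, audio_guide_scale=4.0):
    return noise_uncond + audio_guide_scale * (noise_audio - noise_uncond)
```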
Conclusion
StableAvatar represents a significant leap forward in audio-driven video generation. Its ability to produce infinite-length, high-fidelity, and identity-consistent videos holds great promise not only for research but also for practical applications such as virtual humans, digital spokespersons, and online education.
If you are interested in this technology, visit the project page or download the code to try it yourself. If you use StableAvatar in your research or applications, please consider citing the paper to support open-source development.
@article{tu2025stableavatar,
title={StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation},
author={Tu, Shuyuan and Pan, Yueming and Huang, Yinming and Han, Xintong and Xing, Zhen and Dai, Qi and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2508.08248},
year={2025}
}