StableAvatar: Generating Infinite-Length Audio-Driven Avatar Videos with AI

The field of artificial intelligence is continuously evolving, and one of the most exciting challenges researchers and developers face is creating virtual avatars that can speak, sing, or perform based solely on audio input—without limitations on video length. Meet StableAvatar, a groundbreaking solution designed to tackle this very problem.

This advanced AI model can generate high-fidelity, identity-consistent avatar videos of theoretically infinite length, entirely from a reference image and an audio clip. What sets it apart is its complete end-to-end generation capability—it does not rely on any external face-processing tools like FaceFusion, GFP-GAN, or CodeFormer.


What Is StableAvatar?

StableAvatar is an audio-driven video generation model based on a Diffusion Transformer (DiT). By providing a single reference image and an audio track, the model synthesizes a video in which the avatar’s lip movements, expressions, and gestures are synchronized with the audio—and it can do so for videos of any duration.

🔥 Project Page: https://francis-rings.github.io/StableAvatar
📄 Research Paper: https://arxiv.org/abs/2508.08248
🧠 Model Weights: https://huggingface.co/FrancisRing/StableAvatar

Why Is This a Breakthrough?

Existing audio-driven video generation models typically suffer from several limitations:

  • Inability to generate long videos without losing lip synchronization;
  • Identity inconsistency—the character gradually looks less like the reference;
  • Dependence on post-processing tools, making the pipeline complex and inefficient.

StableAvatar introduces several innovations to overcome these challenges:

  1. Time-step-aware Audio Adapter: Prevents error accumulation during generation and avoids latent distribution drift between video segments.
  2. Audio Native Guidance Mechanism: Enhances audio-video synchronization during inference.
  3. Dynamic Weighted Sliding-window Strategy: Improves temporal coherence and smoothness in long-duration videos (a minimal sketch of this idea follows below).
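To make the third point more concrete, here is a minimal sketch of sliding-window blending. It is not StableAvatar's actual implementation: the segment length, overlap, and the linear weighting are hypothetical, and the real model applies its dynamic weighting to latents inside the diffusion process.

import numpy as np

def blend_windows(segments, overlap):
    """Fuse overlapping segments into one sequence.

    segments: list of arrays of shape (frames, ...), each sharing `overlap`
    frames with its neighbour. Overlapping frames are mixed with linearly
    ramped weights so transitions stay smooth.
    """
    merged = segments[0].astype(np.float64)
    for seg in segments[1:]:
        seg = seg.astype(np.float64)
        # Weights ramp from 1 -> 0 for the tail of `merged`
        # and 0 -> 1 for the head of `seg`.
        w = np.linspace(1.0, 0.0, overlap)[:, None]
        tail = merged[-overlap:]
        head = seg[:overlap]
        blended = w * tail + (1.0 - w) * head
        merged = np.concatenate([merged[:-overlap], blended, seg[overlap:]], axis=0)
    return merged

# Toy usage: three 16-frame segments with a 5-frame overlap.
segs = [np.random.rand(16, 4) for _ in range(3)]
video = blend_windows(segs, overlap=5)
print(video.shape)  # (16 + 2 * (16 - 5), 4) = (38, 4)

The key intuition is that frames shared by two windows are mixed with complementary weights, so no hard cut appears at segment boundaries.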

Demonstrations of StableAvatar

Below are sample videos generated by StableAvatar. All results are raw outputs without any post-processing:

https://github.com/user-attachments/assets/d7eca208-6a14-46af-b337-fb4d2b66ba8d

https://github.com/user-attachments/assets/b5902ac4-8188-4da8-b9e6-6df280690ed1

https://github.com/user-attachments/assets/87faa5c1-a118-4a03-a071-45f18e87e6a0

https://github.com/user-attachments/assets/531eb413-8993-4f8f-9804-e3c5ec5794d4

https://github.com/user-attachments/assets/cdc603e2-df46-4cf8-a14e-1575053f996f

https://github.com/user-attachments/assets/7022dc93-f705-46e5-b8fc-3a3fb755795c

https://github.com/user-attachments/assets/0ba059eb-ff6f-4d94-80e6-f758c613b737

https://github.com/user-attachments/assets/03e6c1df-85c6-448d-b40d-aacb8add4e45

https://github.com/user-attachments/assets/90b78154-dda0-4eaa-91fd-b5485b718a7f


How Does StableAvatar Work?

The following diagram illustrates the overall framework of StableAvatar:

[Figure: StableAvatar framework overview]

The key stages include:

  • Input: Reference image + driving audio;
  • Processing: Custom training and inference modules;
  • Output: High-quality video of unlimited length.

Whereas many previous approaches simply bolt the output of a third-party audio feature extractor onto the video model, StableAvatar fuses the audio signal inside the diffusion backbone through its Time-step-aware Audio Adapter. This deep fusion of audio and visual signals significantly improves the stability and consistency of long video generation.
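On the audio side, the checkpoint layout shown later includes a wav2vec2-base-960h encoder, which turns raw waveforms into frame-level features before the adapter fuses them with the visual latents. Purely as an illustration of that first step (not the project's actual code, and assuming the downloaded folder is a standard Hugging Face snapshot), such features can be extracted with the transformers library:

import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Paths are placeholders; StableAvatar ships the encoder under checkpoints/wav2vec2-base-960h.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("checkpoints/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("checkpoints/wav2vec2-base-960h").eval()

waveform, sr = torchaudio.load("path/to/vocal.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)  # mono, 16 kHz

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_features = encoder(**inputs).last_hidden_state  # (1, time, 768)
print(audio_features.shape)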


How to Get Started with StableAvatar

If you are a developer or researcher interested in experimenting with StableAvatar, follow the steps below.

Environment Setup

Choose the installation command based on your GPU type:

For Most NVIDIA GPUs (CUDA 12.4):

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install flash_attn  # Optional, for accelerated attention computation

For Blackwell Series GPUs (e.g., RTX 6000 Pro):

pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install flash_attn
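After installation, a quick sanity check (a generic snippet, not part of the repository) confirms that PyTorch can see the GPU and whether flash_attn is usable:

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

try:
    import flash_attn  # optional accelerated attention, per the install notes above
    print("flash_attn:", flash_attn.__version__)
except ImportError:
    print("flash_attn not installed (optional)")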

Download Model Weights

If downloads from Hugging Face are slow in your region, you can optionally route them through a mirror by setting HF_ENDPOINT, then fetch the weights with the CLI:

export HF_ENDPOINT=https://hf-mirror.com
pip install "huggingface_hub[cli]"
mkdir checkpoints
huggingface-cli download FrancisRing/StableAvatar --local-dir ./checkpoints
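If you prefer to stay in Python, the same snapshot can be fetched with huggingface_hub (the HF_ENDPOINT environment variable above is honored here as well):

from huggingface_hub import snapshot_download

# Download the full FrancisRing/StableAvatar repository into ./checkpoints.
snapshot_download(repo_id="FrancisRing/StableAvatar", local_dir="./checkpoints")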

After downloading, organize the files as follows:

StableAvatar/
├── checkpoints
│   ├── Kim_Vocal_2.onnx
│   ├── wav2vec2-base-960h
│   ├── Wan2.1-Fun-V1.1-1.3B-InP
│   └── StableAvatar-1.3B
…

Audio Extraction and Vocal Separation

If you have a video file (e.g., .mp4), first extract the audio:

python audio_extractor.py --video_path="path/to/video.mp4" --saved_audio_path="path/to/audio.wav"

To improve lip-sync quality, separate vocal sounds from background music:

pip install "audio-separator[gpu]"
python vocal_seperator.py --audio_separator_model_file="checkpoints/Kim_Vocal_2.onnx" --audio_file_path="path/to/audio.wav" --saved_vocal_path="path/to/vocal.wav"

Generate Videos via Inference

Use the provided script to start generating videos:

bash inference.sh

You can modify parameters in inference.sh, such as resolution (supports 512×512, 480×832, 832×480), reference image path, audio path, and more.

💡 Helpful Tips:

  • Recommended inference steps (sample_steps): 30–50 (more steps yield higher quality);
  • Overlap window length (overlap_window_length): 5–15 (longer overlaps improve smoothness but slow down inference);
  • If GPU memory is limited, set --GPU_memory_mode to sequential_cpu_offload, which uses only ~3 GB VRAM.

Add Audio to the Generated Video

StableAvatar outputs video without audio. Use ffmpeg to merge audio and video:

ffmpeg -i video_no_audio.mp4 -i audio.wav -c:v copy -c:a aac -shortest output_with_audio.mp4
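If you generate many clips, the same command can be scripted. The helper below is hypothetical (it is not part of the repository) and assumes each generated outputs/<name>.mp4 has a matching audio/<name>.wav:

import subprocess
from pathlib import Path

def mux_audio(video_path: str, audio_path: str, out_path: str) -> None:
    """Copy the video stream, encode the audio as AAC, and stop at the shorter input."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
         "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
        check=True,
    )

# Hypothetical layout: outputs/clip_XX.mp4 paired with audio/clip_XX.wav.
for video in sorted(Path("outputs").glob("*.mp4")):
    audio = Path("audio") / (video.stem + ".wav")
    if audio.exists():
        mux_audio(str(video), str(audio), str(video.with_name(video.stem + "_with_audio.mp4")))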

How to Train Your Own StableAvatar Model

If you wish to train the model on your custom dataset, StableAvatar provides full training support.

Prepare Your Data

Organize your dataset in the following structure:

talking_face_data/
├── square/    # 512x512 videos
├── rec/       # 832x480 videos
└── vec/       # 480x832 videos
    └── {speech, singing, dancing}/
        └── {id}/
            ├── images/          # video frames
            ├── face_masks/      # face masks
            ├── lip_masks/       # lip masks
            ├── sub_clip.mp4     # original video
            └── audio.wav        # audio track

Each of square/, rec/, and vec/ follows the same {speech, singing, dancing}/{id}/ layout (e.g., talking_face_data/rec/speech/00001/).

Use FFmpeg to extract frames from videos:

ffmpeg -i raw_video.mp4 -q:v 1 -start_number 0 talking_face_data/rec/speech/00001/images/frame_%d.png
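For more than a handful of clips, the same extraction can be scripted. This helper is hypothetical and assumes the talking_face_data layout above, with one sub_clip.mp4 per {id} directory:

import subprocess
from pathlib import Path

def extract_frames(video: Path, images_dir: Path) -> None:
    """Dump video frames as PNGs named frame_0.png, frame_1.png, ..."""
    images_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-q:v", "1", "-start_number", "0",
         str(images_dir / "frame_%d.png")],
        check=True,
    )

# Hypothetical example: every sub_clip.mp4 under talking_face_data/rec/speech/.
for clip in Path("talking_face_data/rec/speech").glob("*/sub_clip.mp4"):
    extract_frames(clip, clip.parent / "images")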

For face mask extraction, refer to StableAnimator. For lip masks, use:

python lip_mask_extractor.py --folder_root="talking_face_data/rec/singing" --start=1 --end=500

Start Training

Single-Machine Training (512×512):

bash train_1B_square.sh

Multi-Machine Training (Mixed Resolutions):

bash train_1B_rec_vec_64.sh

Important notes during training:

  • Mixed-resolution training requires approximately 50 GB of VRAM;
  • Use videos with static backgrounds for better reconstruction loss calculation;
  • Audio should be clear with minimal background noise.

Model Fine-Tuning and LoRA Training

StableAvatar supports both full fine-tuning and lightweight LoRA fine-tuning:

# Full fine-tuning
bash train_1B_rec_vec.sh

# LoRA fine-tuning
bash train_1B_rec_vec_lora.sh

Adjust rank and network_alpha to balance the LoRA adapter's capacity (how closely it can fit your data) against its parameter count and training cost.
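To illustrate what these two knobs mean, here is a generic LoRA configuration using the peft library. This is not how the repository's training script wires up LoRA (train_1B_rec_vec_lora.sh defines its own parameters); peft's lora_alpha plays the role of network_alpha, and the target module names are hypothetical:

from peft import LoraConfig

# Illustrative only: the rank sets the adapter's capacity, while alpha scales its
# contribution (the effective scale is alpha / rank). Larger ranks fit the data
# more closely but add parameters and training cost.
lora_config = LoraConfig(
    r=64,                                                 # the "rank" knob
    lora_alpha=64,                                        # plays the role of "network_alpha"
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # hypothetical attention projections
)
print(lora_config)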


System Requirements and Performance Reference

Task                            VRAM Usage   Time (RTX 4090)
5-second video (480×832)        ~18 GB       ~3 minutes
Training (mixed resolution)     ~50 GB       Depends on epochs
LoRA fine-tuning                Reduced      Significantly faster

Note: Although StableAvatar can generate hours of video, decoding extremely long sequences may require running the VAE decoder on the CPU to save VRAM.
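What "running the VAE decoder on the CPU" looks like depends on the pipeline code. As a generic sketch (assuming a diffusers-style VAE object, which may not match StableAvatar's actual interface), the idea is simply to move the decoder and the latents to the CPU before decoding:

import torch
from diffusers import AutoencoderKL  # generic image VAE, used here only for illustration

# Hypothetical pattern: denoise on the GPU, decode on the CPU to save VRAM.
vae = AutoencoderKL.from_pretrained("path/to/model", subfolder="vae").to("cpu")

def decode_on_cpu(latents: torch.Tensor) -> torch.Tensor:
    latents = latents.to("cpu", dtype=vae.dtype) / vae.config.scaling_factor
    with torch.no_grad():
        return vae.decode(latents).sample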


Frequently Asked Questions (FAQ)

Q1: What resolutions does StableAvatar support?

It currently supports 512×512, 480×832, and 832×480 resolutions, configurable via parameters.

Q2: Can StableAvatar be used for commercial purposes?

Please refer to the project license. Currently, the model weights are available for free on Hugging Face for research purposes.

Q3: Is there a length limit for generated videos?

Theoretically, no. However, hardware memory constraints apply, and the sliding-window mechanism generates the video segment by segment.

Q4: Can it generate non-facial or object videos?

The current model is optimized for human faces, but the architecture has the potential to be extended to other object categories.

Q5: How can I improve lip-sync quality?

Use vocal-separated audio and increase the audio guidance scale (sample_audio_guide_scale).


Conclusion

StableAvatar represents a significant leap forward in audio-driven video generation. Its ability to produce infinite-length, high-fidelity, and identity-consistent videos holds great promise not only for research but also for practical applications such as virtual humans, digital spokespersons, and online education.

If you are interested in this technology, visit the project page or download the code to try it yourself. If you use StableAvatar in your research or applications, please consider citing the paper to support open-source development.

@article{tu2025stableavatar,
  title={StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation},
  author={Tu, Shuyuan and Pan, Yueming and Huang, Yinming and Han, Xintong and Xing, Zhen and Dai, Qi and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2508.08248},
  year={2025}
}