Generating Long-Form Narrative Audio with Large Language Models: Introducing AudioStory

Have you ever wondered how to turn a detailed story description into a seamless audio track that lasts for minutes, complete with smooth transitions and consistent emotions? For instance, imagine creating an audio clip where a musician plays a complex piece on the ukulele, gets applause from the audience, and then talks about their career in an interview—all in one continuous flow. Traditional tools for turning text into audio often fall short when it comes to longer narratives because they lack the ability to maintain coherence over time or handle layered storytelling. That’s where AudioStory comes in. This innovative framework combines large language models with text-to-audio systems to produce structured, extended audio stories. In this post, we’ll explore what AudioStory is, how it works, its key features, real-world examples, and practical steps for getting started.

What Is AudioStory?

AudioStory is a system designed to create long-form narrative audio using large language models. It builds on a unified approach that handles both understanding the input and generating the output. This makes it suitable for tasks like dubbing videos, extending existing audio clips, and synthesizing complete narrative soundscapes from scratch.

At its core, AudioStory addresses a common problem in text-to-audio generation: while many tools excel at short clips, they struggle with longer pieces that require logical flow and emotional consistency. AudioStory uses large language models to break down complex instructions into a series of timed sub-tasks, each with contextual details to ensure everything connects smoothly.

Two standout features make AudioStory effective:

  • A decoupled bridging mechanism that separates the work between the language model and the audio generator. One part focuses on aligning details within each event, while the other ensures consistency across the entire sequence.
  • End-to-end training, which integrates instruction understanding and audio creation in a single process, avoiding the need for separate modules and improving how the parts work together.

The project also includes a benchmark called AudioStory-10K, which covers various areas like animated sound effects and natural sound stories. Tests show that AudioStory outperforms earlier text-to-audio methods in following instructions accurately and producing high-quality sound.

[Figure: AudioStory overview]

How Does AudioStory Work?

Let’s break it down step by step. AudioStory operates on a framework that unifies understanding and generation. Whether you start with a text description or an existing audio clip, the large language model first analyzes the input and splits it into structured audio sub-events, each with relevant context.

From there, the model engages in interleaved reasoning and generation. It produces captions for each segment, followed by semantic tokens (which capture the meaning) and residual tokens (which add finer details). These tokens are fused and fed into a diffusion transformer, which serves as the bridge between the language model and the audio generator. Through progressive, staged training, the system builds up both strong instruction comprehension and clear audio output.

[Figure: The AudioStory framework]

To make this clearer, consider the process in phases:

  1. Input Analysis: The model takes a text prompt or audio stream and identifies key elements, such as events in sequence.
  2. Decomposition: It divides the narrative into time-ordered parts. For example, a story about a fire truck might include leaving the station, sirens activating, and driving off.
  3. Token Generation: For each part, it creates descriptive captions, semantic tokens for core content, and residual tokens for nuances.
  4. Fusion and Output: The tokens merge and go through the diffusion transformer to generate the final audio.

This method ensures temporal coherence—meaning the audio flows logically over time—and compositional reasoning, where parts build on each other meaningfully.
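
To make the decomposition step more concrete, here is a minimal sketch of what a time-ordered breakdown could look like. The data structure and the timing values are purely illustrative; they are not the actual representation AudioStory uses internally.

    # Illustrative only: a possible shape for time-ordered sub-events.
    from dataclasses import dataclass

    @dataclass
    class AudioSubEvent:
        caption: str       # what happens in this segment
        start_s: float     # when it begins, in seconds
        duration_s: float  # how long it lasts, in seconds

    # Hypothetical breakdown of the fire-truck story from step 2 above.
    fire_truck_story = [
        AudioSubEvent("Fire truck engine starts inside the station", 0.0, 8.0),
        AudioSubEvent("Sirens activate as the truck pulls out", 8.0, 12.0),
        AudioSubEvent("Truck drives away, sirens fading into the distance", 20.0, 15.1),
    ]

    total = sum(event.duration_s for event in fire_truck_story)
    print(f"Total duration: {total:.1f} seconds")  # 35.1, matching the later example

The key idea is that each sub-event carries its own description and timing, so later segments can build on what came before.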

Key Features of AudioStory

What sets AudioStory apart? Beyond the basics, it has specific strengths that make it versatile for narrative audio creation.

  • Instruction-Following Capabilities: The system excels at interpreting detailed prompts, ensuring the output matches the user’s intent closely.
  • Coherence in Long Sequences: By using contextual cues in sub-tasks, it maintains smooth transitions and consistent tones, like shifting from excitement to calm in a story.
  • Decoupled Bridging: This splits the collaboration into intra-event alignment (matching sounds to descriptions within a segment) and cross-event preservation (keeping the overall narrative intact).
  • Unified Training: Everything is trained end-to-end, fostering better integration without relying on pieced-together components.

Additionally, the AudioStory-10K benchmark provides a solid testing ground, with examples from animated scenarios to everyday natural sounds. Experiments highlight improvements in both single-clip generation and full narratives, particularly in fidelity—the clarity and realism of the audio.

Real-World Demonstrations of AudioStory

Seeing AudioStory in action helps illustrate its potential. The project includes several demos across different applications.

Video Dubbing in Tom & Jerry Style

One application is dubbing videos using a model trained on Tom & Jerry content. It incorporates visual captions from the video to keep the generated audio in sync with the on-screen action.

Here are a few examples:

  • A chase scene where sounds match the on-screen action, like footsteps and crashes.
  • An interaction between characters with exaggerated effects typical of the cartoon.
  • A third clip showing emotional shifts, such as surprise or laughter.

These demonstrate how AudioStory can enhance existing videos with fitting audio.

Cross-Domain Video Dubbing in Tom & Jerry Style

AudioStory isn’t limited to one style; it can apply Tom & Jerry dubbing to other types of videos, bridging domains.

Examples include:

  • Dubbing a real-world animal video with cartoonish sounds.
  • Adding audio to a different animation, maintaining the fun tone.
  • Applying it to human scenes, like people in everyday situations, for a playful twist.

This shows the system’s flexibility in adapting styles across content types.

Text-to-Long Audio with Natural Sounds

For pure text inputs, AudioStory generates extended audio based on descriptions.

  • Prompt: “Develop a comprehensive audio that fully represents Jake Shimabukuro performs a complex ukulele piece in a studio, receives applause, and discusses his career in an interview. The total duration is 49.9 seconds.”
    • Result: A 49.9-second clip covering the performance, crowd reaction, and conversation.
  • Prompt: “Develop a comprehensive audio that fully represents a fire truck leaves the station with sirens blaring, signaling an emergency response, and drives away. The total duration is 35.1 seconds.”
    • Result: 35.1 seconds of audio capturing the departure, sirens, and fading engine noise.
  • Prompt: “Understand the input audio, infer the subsequent events, and generate the continued audio of the coach giving basketball lessons to the players. The total duration is 36.6 seconds.”
    • Result: An extension of 36.6 seconds, building on the initial audio with logical follow-up events.

These examples highlight AudioStory’s ability to handle natural, real-world sounds in narrative form.

Getting Started with AudioStory: Installation Guide

Ready to try it yourself? Installing AudioStory is straightforward if you have the right setup.

Requirements

You’ll need:

  • Python version 3.10 or higher. Using Anaconda is a good idea for managing environments.
  • PyTorch version 2.1.0 or later.
  • An NVIDIA GPU with CUDA support for efficient processing.

Step-by-Step Installation

  1. Clone the repository from GitHub:

    git clone https://github.com/TencentARC/AudioStory.git
    
  2. Navigate into the project folder:

    cd AudioStory
    
  3. Create and activate a new virtual environment:

    conda create -n audiostory python=3.10 -y
    conda activate audiostory
    
  4. Run the installation script to set up dependencies:

    bash install_audiostory.sh
    

This process installs everything needed. If you run into issues, double-check your GPU drivers and CUDA installation.
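
Before moving on, a quick sanity check can save time. The snippet below uses only standard PyTorch calls (it is not part of the AudioStory codebase) to confirm that your environment meets the requirements above:

    # check_env.py -- quick environment sanity check (not part of AudioStory)
    import torch

    print("PyTorch version:", torch.__version__)         # should be 2.1.0 or later
    print("CUDA available:", torch.cuda.is_available())  # should be True

    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
        print("CUDA version:", torch.version.cuda)

If CUDA shows as unavailable, revisit your GPU drivers and CUDA toolkit before attempting inference.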

Using AudioStory for Inference

Once installed, you can generate audio using the inference script. This is where you test the model with your own prompts.

Inference Command

Run this in your terminal:

python evaluate/inference.py --model_path /path/to/ckpt --guidance 4.0 --save_folder_name audiostory --total_duration 50
  • --model_path: Points to your model checkpoint file.
  • --guidance: Sets the guidance level (4.0 is a common starting point).
  • --save_folder_name: Names the folder for output files.
  • --total_duration: Specifies the audio length in seconds (e.g., 50 for a 50-second clip).

Adjust these as needed for different results. The output will be saved in the specified folder.
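
If you want to compare settings, a small wrapper can call the same command with several guidance values. The sketch below uses only the flags documented above; the checkpoint path is a placeholder you need to replace with your own:

    # Sketch: run the documented inference command with several guidance values.
    import subprocess

    CKPT = "/path/to/ckpt"  # placeholder checkpoint path

    for guidance in (3.0, 4.0, 5.0):
        subprocess.run(
            [
                "python", "evaluate/inference.py",
                "--model_path", CKPT,
                "--guidance", str(guidance),
                "--save_folder_name", f"audiostory_g{guidance}",
                "--total_duration", "50",
            ],
            check=True,  # raise an error if a run fails
        )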

Evaluation and Performance Insights

AudioStory has been tested extensively, showing strong results in key areas.

  • Instruction following: high accuracy in matching complex prompts; superior to prior text-to-audio tools.
  • Audio fidelity: clear, realistic sound quality; better audio preservation than baselines.
  • Narrative coherence: smooth transitions across events; outperforms baselines on long-form generation.

The AudioStory-10K dataset supports these findings, with diverse examples ensuring broad applicability.

Frequently Asked Questions About AudioStory

Based on common curiosities, here are direct answers to potential questions.

What kinds of tasks can AudioStory handle?

It manages video dubbing, audio continuation, and synthesizing long narrative audio from text.

How does AudioStory ensure the audio stays consistent over time?

By breaking prompts into sub-events with context and using bridging mechanisms for alignment.

Is the training process modular or integrated?

It’s end-to-end, unifying understanding and generation for better results.

Can I use AudioStory for dubbing non-cartoon videos?

Yes, as shown in cross-domain examples applying Tom & Jerry style to various content.

What if I want to generate audio longer than 50 seconds?

Simply increase the --total_duration in the inference command, though hardware limits may apply.

Does AudioStory work with natural sounds only, or animated too?

It covers both, as per the benchmark including animated soundscapes and natural narratives.

What should I do if installation fails due to dependencies?

Verify Python and PyTorch versions, and ensure CUDA is properly set up.

Are there plans for more features?

The project aims to release a Gradio demo, model checkpoints, and training code for all stages.

How do I cite AudioStory in my work?

Use this BibTeX entry:

@misc{guo2025audiostory,
      title={AudioStory: Generating Long-Form Narrative Audio with Large Language Models}, 
      author={Yuxin Guo and Teng Wang and Yuying Ge and Shijie Ma and Yixiao Ge and Wei Zou and Ying Shan},
      year={2025},
      eprint={2508.20088},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.20088}, 
}

What license does AudioStory use?

It’s released under the Apache 2.0 License, which allows open use.

Deeper Dive into AudioStory’s Technical Aspects

For those with a bit more background, let’s explore the mechanics further.

The unified framework is key: it starts with input decomposition, where the large language model reasons about sub-events. This is crucial for effective instruction-following audio generation, as understanding the prompt or audio stream allows for relevant breakdowns.

The interleaved process of generating captions, then semantic and residual tokens, bridges the gap between text reasoning and audio output. The diffusion transformer handles the final synthesis, benefiting from the decoupled approach that separates intra-event alignment from cross-event preservation.

Training progresses in stages, building from basic clips to full-length narratives, which strengthens the coupling between instruction understanding and audio generation.
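
To visualize this flow, here is a purely conceptual sketch of the interleaved loop in Python. Every method name on the llm and diffusion_transformer objects is a hypothetical stand-in, not the real AudioStory API; the sketch only illustrates how captions, tokens, and cross-event context might be threaded together.

    # Conceptual sketch only: all method names below are illustrative stand-ins,
    # not the real AudioStory API.
    from dataclasses import dataclass

    @dataclass
    class SubEventTokens:
        caption: str    # natural-language description of the segment
        semantic: list  # coarse "what happens" tokens
        residual: list  # finer acoustic-detail tokens

    def generate_narrative(instruction, llm, diffusion_transformer):
        """Interleave reasoning (captions, tokens) with generation, event by event."""
        context = []  # cross-event context, preserved for narrative coherence
        events = []
        for step in llm.decompose(instruction):            # hypothetical call
            caption = llm.write_caption(step, context)     # intra-event description
            semantic = llm.emit_semantic_tokens(caption, context)
            residual = llm.emit_residual_tokens(caption, semantic)
            events.append(SubEventTokens(caption, semantic, residual))
            context.append(caption)                        # carry the narrative forward
        # Fused tokens condition the diffusion transformer for final synthesis.
        return diffusion_transformer.synthesize(events)    # hypothetical call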

Practical Applications and Case Studies

Consider a musician wanting to simulate a performance sequence: Input a prompt, run inference, and get a full audio story.

Or for educators: Extend a coaching audio clip logically, inferring next steps like in the basketball lesson example.

In content creation, dub videos across styles, adding narrative depth.

Steps for a basic use case:

  1. Prepare your prompt with duration.
  2. Execute the inference script.
  3. Review and refine based on output.
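
As an illustration of step 1, the helper below assembles a prompt in the same format as the text-to-long-audio examples shown earlier. The function and the sample description are hypothetical conveniences for this post; check the repository for how prompts are actually passed to the inference script, since the flags shown above do not include one.

    # Illustrative helper: build a prompt string in the format used by the
    # text-to-long-audio examples above. Not part of the AudioStory tooling.
    def build_prompt(description: str, duration_s: float) -> str:
        return (
            "Develop a comprehensive audio that fully represents "
            f"{description} The total duration is {duration_s} seconds."
        )

    prompt = build_prompt(
        "a street performer plays guitar, passersby applaud, and coins drop into a case.",
        40.0,
    )
    print(prompt)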

Acknowledgments and Contacts

AudioStory builds on prior work like SEED-X and TangoFlux for certain code elements. The team includes Yuxin Guo, Teng Wang, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou, and Ying Shan from the Institute of Automation at the Chinese Academy of Sciences and the ARC Lab at Tencent PCG.

For questions, reach out to guoyuxin2021@ia.ac.cn. Discussions are encouraged.

Wrapping Up: The Value of AudioStory

AudioStory represents a step forward in creating engaging, long-form audio narratives. By leveraging large language models for smart decomposition and generation, it opens doors for creators in various fields. Whether you’re experimenting with sound stories or dubbing content, it’s a tool worth exploring. As the project evolves with upcoming releases, it promises even more accessibility.