LongCat-Audio-Codec: The Audio Tokenizer and Detokenizer Solution Revolutionizing Speech Large Language Models
In the rapidly evolving landscape of speech large language models, achieving high-quality audio reconstruction at low bitrates has emerged as a critical technological bottleneck. The open-source audio codec from Meituan’s LongCat team offers a compelling answer to this challenge.
Understanding Audio Codecs and Their Critical Role in Speech LLMs
If you’ve ever used voice assistants, video conferencing software, or any audio processing tool, you’ve indirectly experienced audio codec technology. In simple terms, an audio codec acts as a “compression package” for audio data—it condenses massive raw audio signals into smaller data packets for efficient storage and transmission, then decompresses them back when needed.
In the realm of speech large language models, codecs play an even more crucial role. Traditional large language models like GPT series process text tokens, while speech LLMs need to handle audio signals. This presents a fundamental challenge: how to convert continuous speech signals into discrete token sequences while preserving sufficient semantic and acoustic information?
This is the core problem that LongCat-Audio-Codec addresses. It’s a specialized audio tokenization and detokenization solution designed specifically for speech large language models, enabling high-quality audio reconstruction at extremely low bitrates while providing robust infrastructure support for speech LLM development.
Core Innovations of LongCat-Audio-Codec
Decoupled Semantic-Acoustic Tokenization Architecture
Traditional audio codecs typically focus solely on reconstructing acoustic details while neglecting semantic information preservation. LongCat-Audio-Codec employs an innovative dual-path architecture that separately processes semantic and acoustic information.
Think about when you listen to someone speak: your brain simultaneously processes two layers of information—what the person is saying (semantics), and how they’re saying it, including tone, pitch, and rhythm (acoustic characteristics). LongCat-Audio-Codec mimics this process:
- Semantic Encoder: Focuses on extracting linguistic content from speech, similar to understanding “what is being said”
- Acoustic Encoder: Specializes in capturing detailed speech characteristics, akin to recognizing “how it’s being said”
This decoupled architecture offers significant advantages. Semantic tokens ensure accurate transmission of speech content, while acoustic tokens guarantee naturalness and authenticity. In practical applications, this separation also allows users to flexibly adjust the number of acoustic tokens based on specific needs, finding the optimal balance between bitrate and reconstruction quality.
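To make the dual-path idea concrete, here is a minimal sketch of tokenization with decoupled encoders. The function and parameter names below are hypothetical placeholders rather than the actual LongCat-Audio-Codec API; the point is only that one waveform yields a semantic token stream plus a separate, adjustable stack of acoustic token streams.

```python
import torch

def tokenize(waveform: torch.Tensor, semantic_encoder, acoustic_encoder,
             n_acoustic_codebooks: int = 3):
    """Hypothetical dual-path tokenization: one waveform in, two token streams out.

    semantic_encoder / acoustic_encoder are placeholder callables, not the
    real LongCat-Audio-Codec modules.
    """
    semantic_tokens = semantic_encoder(waveform)    # "what is being said"
    acoustic_tokens = acoustic_encoder(waveform)    # "how it is being said"
    # The number of acoustic codebooks can be reduced to trade quality for bitrate.
    return semantic_tokens, acoustic_tokens[:n_acoustic_codebooks]
```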
Ultra-Low Frame Rate and Multi-Codebook Configuration
LongCat-Audio-Codec encodes speech at an ultra-low frame rate of 16.67 Hz, meaning it generates only about 17 token frames per second. Compared with traditional codecs operating at 50-100 Hz frame rates, this significantly reduces the token count and eases the processing burden on downstream language models.
The codec supports flexible codebook configurations, allowing users to choose 2, 3, or 4 codebooks, with corresponding bitrates of 0.43 kbps, 0.65 kbps, and 0.87 kbps respectively (a quick sanity check of these figures follows the list below). This flexibility enables the model to adapt to various application scenarios:
- 2-Codebook Configuration (0.43 kbps): Ideal for bandwidth-sensitive scenarios like real-time communication
- 3-Codebook Configuration (0.65 kbps): Strikes a balance between bitrate and quality
- 4-Codebook Configuration (0.87 kbps): Suitable for scenarios requiring higher audio fidelity
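As a sanity check, the advertised bitrates line up with the 16.67 Hz frame rate if each codebook index costs roughly 13 bits per frame (i.e., codebooks on the order of 8,192 entries; the exact codebook size is an assumption for illustration, not a documented figure):

```python
# Back-of-the-envelope bitrate check. The 13 bits-per-index figure (a codebook
# of about 8,192 entries) is an assumption chosen to match the reported bitrates.
frame_rate_hz = 16.67        # token frames per second
bits_per_index = 13          # assumed bits per codebook index

for n_codebooks in (2, 3, 4):
    bitrate_kbps = frame_rate_hz * n_codebooks * bits_per_index / 1000
    print(f"{n_codebooks} codebooks -> {bitrate_kbps:.2f} kbps")

# Prints roughly 0.43, 0.65, and 0.87 kbps, matching the figures above.
```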
Low-Latency Streaming Decoder
In practical applications, many scenarios like real-time voice communication and interactive voice assistants are extremely sensitive to latency. LongCat-Audio-Codec’s decoder employs specialized streaming design, requiring only 3 frames (approximately 180 milliseconds) of future information to generate high-quality audio output.
This low-latency characteristic makes it particularly suitable for real-time applications, where users experience minimal processing delay for more natural and fluid interactions.
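The 180 ms figure follows directly from the frame rate: 3 frames divided by 16.67 frames per second is about 0.18 s. In a streaming setup this means the decoder only has to buffer three not-yet-decoded frames before emitting audio for the current one. Below is a small, hypothetical sketch of that buffering logic; `decode_frame` and the frame layout are placeholders, not the real decoder interface.

```python
from collections import deque

LOOKAHEAD = 3  # future frames the decoder needs (~180 ms at 16.67 Hz)

def stream_decode(token_frames, decode_frame):
    """Yield audio chunks as soon as enough lookahead has accumulated.

    token_frames: iterable of per-frame token groups (e.g. from a live feed).
    decode_frame: hypothetical callable (current_frame, future_frames) -> audio chunk.
    """
    buffer = deque()
    for frame in token_frames:
        buffer.append(frame)
        # Emit the oldest frame once LOOKAHEAD newer frames are available.
        if len(buffer) > LOOKAHEAD:
            yield decode_frame(buffer.popleft(), list(buffer))
    # Flush the tail when the stream ends (less future context remains).
    while buffer:
        yield decode_frame(buffer.popleft(), list(buffer))
```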
Multi-Stage Training Strategy
LongCat-Audio-Codec’s training process is divided into three distinct stages, each with clear objectives:
Stage 1: Encoder Pre-training
In this phase, the team used approximately 500,000 hours of diverse speech data so that the encoder can adapt to a wide range of speech patterns and acoustic environments. The emphasis is on improving speech intelligibility rather than on reproducing fine acoustic detail.
Stage 2: Decoder Pre-training
Encoder parameters are frozen, and only the decoder is trained. This stage uses approximately 1,000 hours of high-quality recorded data and 250,000 hours of enhanced processed speech data. The decoder learns to “translate” tokens back into high-quality audio, even capable of repairing some defects in the input audio.
Stage 3: Decoder Fine-tuning (Optional)
Further optimization for specific speakers improves reconstruction quality for target speaker timbres. This stage is particularly valuable for speech synthesis applications.
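The Stage 2 recipe itself is not published in code form here, but the freeze-the-encoder, train-the-decoder pattern it describes is a standard PyTorch idiom. The sketch below uses placeholder modules and a placeholder reconstruction loss purely to illustrate that pattern; it is not the team's actual training code.

```python
import torch

# Placeholder modules standing in for the pretrained encoder and the decoder
# being trained; the MSE loss is likewise only illustrative.
encoder = torch.nn.Linear(80, 64)
decoder = torch.nn.Linear(64, 80)

for p in encoder.parameters():
    p.requires_grad = False          # Stage 2: encoder weights stay frozen
encoder.eval()

optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

features = torch.randn(8, 80)        # dummy batch of acoustic features
with torch.no_grad():
    latents = encoder(features)      # frozen encoder output
reconstruction = decoder(latents)
loss = torch.nn.functional.mse_loss(reconstruction, features)
loss.backward()                      # gradients flow only into the decoder
optimizer.step()
```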
Train-More-Use-Less (TMUL) Technology
This is an ingenious training strategy: the team first trains a codec with more codebooks (such as 4 codebooks), then uses only a subset of them (such as 2 codebooks) during actual deployment. Experiments show this approach yields better results than directly training a 2-codebook codec.
Why does this work? Think of it this way: when all information must be compressed into a single codebook, outliers encroach upon the representation space needed for mainstream information. With multi-codebook training, the first-layer codebooks can capture the most crucial information, resulting in lower average reconstruction error when only those codebooks are kept.
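In deployment, Train-More-Use-Less reduces to discarding the higher-order codebooks. Assuming tokens are stacked along a codebook dimension (an assumption about layout made for illustration, not the documented format), keeping fewer codebooks is just a slice, which is also the choice the `--n_acoustic_codebooks` flag shown later in this article exposes:

```python
import torch

# Hypothetical token layout: (n_codebooks, n_frames). Earlier codebooks carry
# the most important information, so deployment can keep only a prefix.
all_tokens = torch.randint(0, 8192, (4, 50))   # trained with 4 codebooks

n_use = 2                                      # "use less" at deployment time
deployed_tokens = all_tokens[:n_use]           # keep only the first 2 codebooks
print(deployed_tokens.shape)                   # torch.Size([2, 50])
```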
Performance Evaluation of LongCat-Audio-Codec
Comparative Analysis with Similar Technologies
Evaluation results on the standardized test set LibriTTS testset-B demonstrate LongCat-Audio-Codec’s excellent performance across different bitrates:
In the 0.85-2 kbps range (using 4-codebook configuration, 0.87 kbps):
- Word Error Rate (WER): 1.48, significantly outperforming comparable semantic codecs
- Perceptual Evaluation of Speech Quality (PESQ): 2.30, the best result in this range
- Short-Time Objective Intelligibility (STOI): 0.921, strongly competitive
In the 0.65-0.85 kbps range (using 3-codebook configuration, 0.65 kbps):
- Word Error Rate (WER): 1.70, substantially ahead of comparable codecs
- Perceptual Evaluation of Speech Quality (PESQ): 2.01, a good result
In the <0.65 kbps range (using 2-codebook configuration, 0.43 kbps):
- Word Error Rate (WER): 2.10, maintaining good intelligibility even at this extremely low bitrate
- Short-Time Objective Intelligibility (STOI): 0.839, clearly ahead of competing codecs
Reconstruction Quality Trends Across Bitrates
As the bitrate increases, LongCat-Audio-Codec’s metrics improve consistently across the board:
- Word Error Rate (WER): Decreases from 2.10 to 1.48, a 29.5% improvement
- Gross Pitch Error (GPE): Decreases from 3.69 to 1.65, a 55.3% improvement
- Perceptual Evaluation of Speech Quality (PESQ): Increases from 1.47 to 2.30, a 56.5% improvement
- Short-Time Objective Intelligibility (STOI): Increases from 0.839 to 0.921, a 9.8% improvement
These results show that LongCat-Audio-Codec delivers a predictable quality level at each bitrate, allowing users to choose the configuration that best fits their application scenario.
Speaker Similarity and Audio Quality Enhancement
Through the multi-stage training strategy, LongCat-Audio-Codec has also made significant progress in speaker similarity and audio quality:
According to the team’s reported results, speaker similarity improved markedly from 0.717 to 0.938 after Stage 2 and Stage 3 training. This means the reconstructed speech is not only accurate in content but also closer to the original speaker in timbre, tone, and prosody.
In terms of audio quality, the 24kHz decoder trained in Stage 2 showed significant improvement across multiple objective quality metrics, even surpassing the quality level of the original test set in some dimensions.
Getting Started with LongCat-Audio-Codec
Environment Setup and Configuration
LongCat-Audio-Codec is built on PyTorch, making installation straightforward:
```bash
# Create conda environment
conda create -n LongCat-Audio-Codec python=3.10
conda activate LongCat-Audio-Codec

# Install PyTorch (choose the appropriate version for your hardware configuration)
pip install torch==2.7.1 torchaudio==2.7.1

# Install other dependencies
pip install -r requirements.txt
```
Model Download
LongCat-Audio-Codec provides multiple pre-trained models for users to choose from based on their needs:
| Model Name | Description | Download Link |
|---|---|---|
| LongCatAudioCodec_encoder | Weights including semantic encoder and acoustic encoder | Hugging Face |
| LongCatAudioCodec_encoder_cmvn | Cepstral Mean and Variance Normalization coefficients | Hugging Face |
| LongCatAudioCodec_decoder16k_4codebooks | 16kHz decoder, supporting up to 3 acoustic codebooks | Hugging Face |
| LongCatAudioCodec_decoder24k_2codebooks | 24kHz decoder, supporting 1 acoustic codebook, fine-tuned on limited speakers | Hugging Face |
| LongCatAudioCodec_decoder24k_4codebooks | 24kHz decoder, supporting up to 3 acoustic codebooks | Hugging Face |
Project Structure Setup
After downloading the models, ensure the project structure is correct:
```
LongCat-Audio-Codec/
├── ckpts/
│   ├── LongCatAudioCodec_decoder_16k_4codebooks.pt
│   ├── LongCatAudioCodec_decoder_24k_2codebooks.pt
│   ├── LongCatAudioCodec_decoder_24k_4codebooks.pt
│   ├── LongCatAudioCodec_encoder.pt
│   └── LongCatAudioCodec_encoder_cmvn.npy
├── configs/
│   ├── LongCatAudioCodec_decoder_16k_4codebooks.yaml
│   ├── LongCatAudioCodec_decoder_24k_2codebooks.yaml
│   ├── LongCatAudioCodec_decoder_24k_4codebooks.yaml
│   └── LongCatAudioCodec_encoder.yaml
├── inference.py
└── run_inference.sh
```
Running Demos
The project provides a simple demonstration script to quickly experience LongCat-Audio-Codec’s functionality:
```bash
bash ./run_inference.sh
```
This script automatically processes demonstration audio files and generates reconstructed audio output. Users can find the processing results in the demo_audio_output/ directory.
Custom Usage
Users with specific requirements can call the inference.py script directly with custom parameters:
```bash
python inference.py \
    --encoder_config "configs/LongCatAudioCodec_encoder.yaml" \
    --decoder16k_config "configs/LongCatAudioCodec_decoder_16k_4codebooks.yaml" \
    --decoder24k_config "configs/LongCatAudioCodec_decoder_24k_4codebooks.yaml" \
    --output_dir "my_custom_output" \
    --n_acoustic_codebooks 3 \
    --audio_files "path/to/my.wav"
```
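For processing many files, a thin wrapper around this command line is usually enough. The sketch below shells out to `inference.py` once per file using the flags from the example above; the input directory and the one-call-per-file strategy are illustrative choices, not documented behavior.

```python
import subprocess
from pathlib import Path

# Illustrative batch driver around the documented inference.py command line.
AUDIO_DIR = Path("my_audio")        # hypothetical input directory
OUTPUT_DIR = "my_custom_output"

for wav in sorted(AUDIO_DIR.glob("*.wav")):
    subprocess.run(
        [
            "python", "inference.py",
            "--encoder_config", "configs/LongCatAudioCodec_encoder.yaml",
            "--decoder16k_config", "configs/LongCatAudioCodec_decoder_16k_4codebooks.yaml",
            "--decoder24k_config", "configs/LongCatAudioCodec_decoder_24k_4codebooks.yaml",
            "--output_dir", OUTPUT_DIR,
            "--n_acoustic_codebooks", "3",
            "--audio_files", str(wav),
        ],
        check=True,  # stop if inference fails on any file
    )
```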
Practical Application Scenarios for LongCat-Audio-Codec
Infrastructure for Speech Large Language Models
As a codec specifically designed for speech large language models, LongCat-Audio-Codec can convert speech signals into discrete token sequences that can directly serve as input for large language models. Simultaneously, it can transform token sequences generated by models back into high-quality speech signals, completing end-to-end speech processing pipelines.
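Conceptually, the codec brackets the language model on both sides: the tokenizer turns incoming speech into discrete tokens, the LLM operates purely on token sequences, and the detokenizer turns generated tokens back into a waveform. A hedged pseudocode sketch of that round trip, with every callable a placeholder rather than a real API:

```python
# Conceptual round trip for a speech LLM; every callable below is a placeholder.
def speech_round_trip(user_audio, codec_encode, speech_llm, codec_decode):
    input_tokens = codec_encode(user_audio)    # speech -> discrete tokens
    output_tokens = speech_llm(input_tokens)   # the LLM reasons over token sequences
    return codec_decode(output_tokens)         # tokens -> output speech waveform
```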
Low-Bitrate Speech Communication
In bandwidth-constrained environments, such as mobile networks or remote area networks, LongCat-Audio-Codec’s extremely low bitrate characteristics (minimum 0.43 kbps) can significantly improve speech communication quality while reducing data usage.
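To put 0.43 kbps in perspective, here is the data usage for one hour of one-way speech compared with conventional 64 kbps G.711 telephony; the numbers are simple arithmetic rather than measurements, and packet overhead is ignored.

```python
# Data transferred for one hour of one-way speech, ignoring packet overhead.
seconds = 3600
for name, kbps in [("LongCat-Audio-Codec, 2 codebooks", 0.43), ("G.711 telephony", 64.0)]:
    megabytes = kbps * 1000 * seconds / 8 / 1e6
    print(f"{name}: {megabytes:.2f} MB per hour")

# Roughly 0.19 MB versus 28.8 MB per hour, a reduction of about 150x.
```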
Speech Synthesis and Cloning
Leveraging the multi-stage training strategy, particularly the speaker-specific fine-tuning in Stage 3, LongCat-Audio-Codec can generate highly natural speech and accurately reproduce specific speaker timbre characteristics.
Audio Storage and Archiving
For applications requiring large-scale speech data storage, such as voice assistant conversation records or customer service recordings, LongCat-Audio-Codec can dramatically reduce storage space requirements while maintaining intelligibility.
Limitations of LongCat-Audio-Codec
Despite LongCat-Audio-Codec’s excellent performance in multiple aspects, the current version still has some limitations:
- Primarily Optimized for Speech: The current version is mainly optimized for speech signals, with limited support for music and sound effects.
- Input Length Restrictions: The model can process at most 30 seconds of audio per input; longer audio must be pre-segmented (a simple chunking sketch follows below).
- Speaker Dependency of Specific Decoders: LongCatAudioCodec_decoder_24k_2codebooks.pt has been fine-tuned on a limited set of speakers, so reconstruction quality may degrade for input audio from speakers outside the training set.
- Mono Audio Support Only: The current version only supports mono audio processing, not stereo.
The team has indicated that these limitations will be addressed in subsequent versions.
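For inputs longer than 30 seconds, a simple pre-segmentation pass is all that is needed. The sketch below splits a recording into fixed-length chunks with torchaudio; the chunk naming is an illustrative choice, and the input is assumed to already be mono, since the codec only supports mono audio.

```python
import torchaudio

MAX_SECONDS = 30  # documented per-input limit

def split_audio(path, out_prefix="chunk"):
    """Split a long recording into segments of at most 30 seconds each."""
    wav, sr = torchaudio.load(path)              # wav: (channels, samples); assumed mono
    chunk_samples = MAX_SECONDS * sr
    out_paths = []
    for i in range(0, wav.shape[-1], chunk_samples):
        segment = wav[:, i : i + chunk_samples]
        out_path = f"{out_prefix}_{i // chunk_samples:04d}.wav"
        torchaudio.save(out_path, segment, sr)
        out_paths.append(out_path)
    return out_paths
```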
Frequently Asked Questions
How does LongCat-Audio-Codec compare to other audio codecs like EnCodec and DAC?
LongCat-Audio-Codec is specifically designed for speech large language models, employing a semantic-acoustic decoupled architecture that maintains good semantic understanding capabilities even at extremely low bitrates. Compared to traditional codecs, it significantly reduces bitrate while maintaining high speech intelligibility, making it more suitable as front-end and back-end for large language models.
How to choose the appropriate number of codebooks?
The choice of codebook quantity depends on specific application scenarios:
- Extremely bandwidth-sensitive scenarios: Choose 2 codebooks (0.43 kbps)
- Scenarios balancing quality and efficiency: Choose 3 codebooks (0.65 kbps)
- Scenarios requiring higher audio quality: Choose 4 codebooks (0.87 kbps)
Users can determine the most suitable configuration for their needs through experimentation.
Does LongCat-Audio-Codec support real-time processing?
Yes, LongCat-Audio-Codec’s decoder uses streaming design, requiring only 180 milliseconds of future information to generate high-quality audio, making it very suitable for real-time applications.
Is specialized hardware required to run LongCat-Audio-Codec?
No specialized hardware is needed. LongCat-Audio-Codec can run in standard CPU and GPU environments. However, using GPU can significantly improve processing speed, especially when handling large amounts of audio data.
Can the model be fine-tuned with my own data?
Yes, the team provides a multi-stage training strategy. Users can refer to the Stage 3 method to fine-tune the decoder with their own data to adapt to specific speakers or audio characteristics.
Conclusion
LongCat-Audio-Codec represents a significant advancement in audio codec technology integrated with large language models. Through innovative semantic-acoustic separation architecture, multi-stage training strategy, and flexible codebook configuration, it achieves high-quality audio reconstruction at extremely low bitrates, providing reliable infrastructure for speech large language model development.
As speech technology continues to advance, we have reason to believe that technologies like LongCat-Audio-Codec will play increasingly important roles in future speech interaction, speech synthesis, and speech communication fields. Both researchers and developers can utilize this open-source tool to explore more possibilities in speech technology.
Citations and Resources
If you use LongCat-Audio-Codec in your research, please cite the following paper:
```bibtex
@article{longcataudiocodec,
  title={LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models},
  author={Xiaohan Zhao and Hongyu Xiang and Shengze Ye and Song Li and Zhengkun Tian and Guanyu Chen and Ke Ding and Guanglu Wan},
  journal={arXiv preprint arXiv:2510.15227},
  organization={LongCat Team, Meituan},
  year={2025}
}
```
Project Resources:
