LongCat-Audio-Codec: The Audio Tokenizer and Detokenizer Solution Revolutionizing Speech Large Language Models
In the rapidly evolving landscape of speech large language models, achieving high-quality audio reconstruction at low bitrates has emerged as a critical technological bottleneck. The open-source audio codec from Meituan’s LongCat team offers a compelling answer to this challenge.
Understanding Audio Codecs and Their Critical Role in Speech LLMs
If you’ve ever used voice assistants, video conferencing software, or any audio processing tool, you’ve indirectly experienced audio codec technology. In simple terms, an audio codec acts as a “compression package” for audio data—it condenses massive raw audio signals into smaller data packets for efficient storage and transmission, then decompresses them back when needed.
In the realm of speech large language models, codecs play an even more crucial role. Traditional large language models like GPT series process text tokens, while speech LLMs need to handle audio signals. This presents a fundamental challenge: how to convert continuous speech signals into discrete token sequences while preserving sufficient semantic and acoustic information?
This is the core problem that LongCat-Audio-Codec addresses. It’s a specialized audio tokenization and detokenization solution designed specifically for speech large language models, enabling high-quality audio reconstruction at extremely low bitrates while providing robust infrastructure support for speech LLM development.
Core Innovations of LongCat-Audio-Codec
Decoupled Semantic-Acoustic Tokenization Architecture
Traditional audio codecs typically focus solely on reconstructing acoustic details while neglecting semantic information preservation. LongCat-Audio-Codec employs an innovative dual-path architecture that separately processes semantic and acoustic information.
Think about when you listen to someone speak: your brain simultaneously processes two layers of information—what the person is saying (semantics), and how they’re saying it, including tone, pitch, and rhythm (acoustic characteristics). LongCat-Audio-Codec mimics this process:
- Semantic Encoder: Focuses on extracting linguistic content from speech, similar to understanding “what is being said”
- Acoustic Encoder: Specializes in capturing detailed speech characteristics, akin to recognizing “how it’s being said”
This decoupled architecture offers significant advantages. Semantic tokens ensure accurate transmission of speech content, while acoustic tokens guarantee naturalness and authenticity. In practical applications, this separation also allows users to flexibly adjust the number of acoustic tokens based on specific needs, finding the optimal balance between bitrate and reconstruction quality.
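To make the dual-path idea concrete, here is a minimal sketch of tokenization with decoupled encoders. The function and parameter names below are hypothetical placeholders rather than the actual LongCat-Audio-Codec API; the point is only that one waveform yields a semantic token stream plus a separate, adjustable stack of acoustic token streams.

```python
import torch

def tokenize(waveform: torch.Tensor, semantic_encoder, acoustic_encoder,
             n_acoustic_codebooks: int = 3):
    """Hypothetical dual-path tokenization: one waveform in, two token streams out.

    semantic_encoder / acoustic_encoder are placeholder callables, not the
    real LongCat-Audio-Codec modules.
    """
    semantic_tokens = semantic_encoder(waveform)    # "what is being said"
    acoustic_tokens = acoustic_encoder(waveform)    # "how it is being said"
    # The number of acoustic codebooks can be reduced to trade quality for bitrate.
    return semantic_tokens, acoustic_tokens[:n_acoustic_codebooks]
```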
Ultra-Low Frame Rate and Multi-Codebook Configuration
LongCat-Audio-Codec encodes speech at an ultra-low frame rate of 16.67 Hz, meaning it generates only about 17 token frames per second. Compared with traditional codecs operating at 50-100 Hz frame rates, this significantly reduces the token count and eases the processing burden on downstream language models.
The codec supports flexible codebook configurations, allowing users to choose 2, 3, or 4 codebooks, with corresponding bitrates of 0.43 kbps, 0.65 kbps, and 0.87 kbps respectively (a quick sanity check of these figures follows the list below). This flexibility enables the model to adapt to various application scenarios:
- 2-Codebook Configuration (0.43 kbps): Ideal for bandwidth-sensitive scenarios like real-time communication
- 3-Codebook Configuration (0.65 kbps): Strikes a balance between bitrate and quality
- 4-Codebook Configuration (0.87 kbps): Suitable for scenarios requiring higher audio fidelity
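As a sanity check, the advertised bitrates line up with the 16.67 Hz frame rate if each codebook index costs roughly 13 bits per frame (i.e., codebooks on the order of 8,192 entries; the exact codebook size is an assumption for illustration, not a documented figure):

```python
# Back-of-the-envelope bitrate check. The 13 bits-per-index figure (a codebook
# of about 8,192 entries) is an assumption chosen to match the reported bitrates.
frame_rate_hz = 16.67        # token frames per second
bits_per_index = 13          # assumed bits per codebook index

for n_codebooks in (2, 3, 4):
    bitrate_kbps = frame_rate_hz * n_codebooks * bits_per_index / 1000
    print(f"{n_codebooks} codebooks -> {bitrate_kbps:.2f} kbps")

# Prints roughly 0.43, 0.65, and 0.87 kbps, matching the figures above.
```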
Low-Latency Streaming Decoder
In practical applications, many scenarios like real-time voice communication and interactive voice assistants are extremely sensitive to latency. LongCat-Audio-Codec’s decoder employs specialized streaming design, requiring only 3 frames (approximately 180 milliseconds) of future information to generate high-quality audio output.
This low-latency characteristic makes it particularly suitable for real-time applications, where users experience minimal processing delay for more natural and fluid interactions.
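The 180 ms figure follows directly from the frame rate: 3 frames divided by 16.67 frames per second is about 0.18 s. In a streaming setup this means the decoder only has to buffer three not-yet-decoded frames before emitting audio for the current one. Below is a small, hypothetical sketch of that buffering logic; `decode_frame` and the frame layout are placeholders, not the real decoder interface.

```python
from collections import deque

LOOKAHEAD = 3  # future frames the decoder needs (~180 ms at 16.67 Hz)

def stream_decode(token_frames, decode_frame):
    """Yield audio chunks as soon as enough lookahead has accumulated.

    token_frames: iterable of per-frame token groups (e.g. from a live feed).
    decode_frame: hypothetical callable (current_frame, future_frames) -> audio chunk.
    """
    buffer = deque()
    for frame in token_frames:
        buffer.append(frame)
        # Emit the oldest frame once LOOKAHEAD newer frames are available.
        if len(buffer) > LOOKAHEAD:
            yield decode_frame(buffer.popleft(), list(buffer))
    # Flush the tail when the stream ends (less future context remains).
    while buffer:
        yield decode_frame(buffer.popleft(), list(buffer))
```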
Multi-Stage Training Strategy
LongCat-Audio-Codec’s training process is divided into three distinct stages, each with clear objectives:
Stage 1: Encoder Pre-training
In this phase, the team used approximately 500,000 hours of diverse speech data so that the encoder can adapt to a wide range of speech patterns and acoustic environments. The emphasis is on improving speech intelligibility rather than on reproducing fine acoustic detail.
Stage 2: Decoder Pre-training
Encoder parameters are frozen, and only the decoder is trained. This stage uses approximately 1,000 hours of high-quality recorded data and 250,000 hours of enhanced processed speech data. The decoder learns to “translate” tokens back into high-quality audio, even capable of repairing some defects in the input audio.
Stage 3: Decoder Fine-tuning (Optional)
Further optimization for specific speakers improves reconstruction quality for target speaker timbres. This stage is particularly valuable for speech synthesis applications.
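The Stage 2 recipe itself is not published in code form here, but the freeze-the-encoder, train-the-decoder pattern it describes is a standard PyTorch idiom. The sketch below uses placeholder modules and a placeholder reconstruction loss purely to illustrate that pattern; it is not the team's actual training code.

```python
import torch

# Placeholder modules standing in for the pretrained encoder and the decoder
# being trained; the MSE loss is likewise only illustrative.
encoder = torch.nn.Linear(80, 64)
decoder = torch.nn.Linear(64, 80)

for p in encoder.parameters():
    p.requires_grad = False          # Stage 2: encoder weights stay frozen
encoder.eval()

optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

features = torch.randn(8, 80)        # dummy batch of acoustic features
with torch.no_grad():
    latents = encoder(features)      # frozen encoder output
reconstruction = decoder(latents)
loss = torch.nn.functional.mse_loss(reconstruction, features)
loss.backward()                      # gradients flow only into the decoder
optimizer.step()
```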
Train-More-Use-Less (TMUL) Technology
This is an ingenious training strategy: the team first trains a codec with more codebooks (such as 4 codebooks), then uses only a subset of them (such as 2 codebooks) during actual deployment. Experiments show this approach yields better results than directly training a 2-codebook codec.
Why does this work? Think of it this way: when all information must be compressed into a single codebook, outliers encroach upon the representation space needed for mainstream information. With multi-codebook training, the first-layer codebooks can capture the most crucial information, resulting in lower average reconstruction error when only those codebooks are kept.
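In deployment, Train-More-Use-Less reduces to discarding the higher-order codebooks. Assuming tokens are stacked along a codebook dimension (an assumption about layout made for illustration, not the documented format), keeping fewer codebooks is just a slice, which is also the choice the `--n_acoustic_codebooks` flag shown later in this article exposes:

```python
import torch

# Hypothetical token layout: (n_codebooks, n_frames). Earlier codebooks carry
# the most important information, so deployment can keep only a prefix.
all_tokens = torch.randint(0, 8192, (4, 50))   # trained with 4 codebooks

n_use = 2                                      # "use less" at deployment time
deployed_tokens = all_tokens[:n_use]           # keep only the first 2 codebooks
print(deployed_tokens.shape)                   # torch.Size([2, 50])
```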
Performance Evaluation of LongCat-Audio-Codec
Comparative Analysis with Similar Technologies
Evaluation results on the standardized test set LibriTTS testset-B demonstrate LongCat-Audio-Codec’s excellent performance across different bitrates:
In the 0.85-2 kbps range (using 4-codebook configuration, 0.87 kbps):
- Word Error Rate (WER): 1.48, significantly outperforming comparable semantic codecs
- Perceptual Evaluation of Speech Quality (PESQ): 2.30, the best result in this range
- Short-Time Objective Intelligibility (STOI): 0.921, strongly competitive
In the 0.65-0.85 kbps range (using 3-codebook configuration, 0.65 kbps):
- Word Error Rate (WER): 1.70, substantially ahead of comparable codecs
- Perceptual Evaluation of Speech Quality (PESQ): 2.01, a good result
In the <0.65 kbps range (using 2-codebook configuration, 0.43 kbps):
- Word Error Rate (WER): 2.10, maintaining good intelligibility even at this extremely low bitrate
- Short-Time Objective Intelligibility (STOI): 0.839, clearly ahead of competing codecs
Reconstruction Quality Trends Across Bitrates
As the bitrate increases, LongCat-Audio-Codec’s metrics improve consistently across the board:
- Word Error Rate (WER): Decreases from 2.10 to 1.48, a 29.5% improvement
- Gross Pitch Error (GPE): Decreases from 3.69 to 1.65, a 55.3% improvement
- Perceptual Evaluation of Speech Quality (PESQ): Increases from 1.47 to 2.30, a 56.5% improvement
- Short-Time Objective Intelligibility (STOI): Increases from 0.839 to 0.921, a 9.8% improvement
These results show that LongCat-Audio-Codec delivers a predictable quality level at each bitrate, allowing users to choose the configuration that best fits their application scenario.
Speaker Similarity and Audio Quality Enhancement
Through the multi-stage training strategy, LongCat-Audio-Codec has also made significant progress in speaker similarity and audio quality:
According to the team’s reported results, speaker similarity improved markedly from 0.717 to 0.938 after Stage 2 and Stage 3 training. This means the reconstructed speech is not only accurate in content but also closer to the original speaker in timbre, tone, and prosody.
In terms of audio quality, the 24kHz decoder trained in Stage 2 showed significant improvement across multiple objective quality metrics, even surpassing the quality level of the original test set in some dimensions.
Getting Started with LongCat-Audio-Codec
Environment Setup and Configuration
LongCat-Audio-Codec is built on PyTorch, making installation straightforward:
```bash
# Create conda environment
conda create -n LongCat-Audio-Codec python=3.10
conda activate LongCat-Audio-Codec

# Install PyTorch (choose the appropriate version for your hardware configuration)
pip install torch==2.7.1 torchaudio==2.7.1

# Install other dependencies
pip install -r requirements.txt
```
Model Download
LongCat-Audio-Codec provides multiple pre-trained models for users to choose from based on their needs:
| Model Name | Description | Download Link |
|---|---|---|
| LongCatAudioCodec_encoder | Weights including semantic encoder and acoustic encoder | Hugging Face |
| LongCatAudioCodec_encoder_cmvn | Cepstral Mean and Variance Normalization coefficients | Hugging Face |
| LongCatAudioCodec_decoder16k_4codebooks | 16kHz decoder, supporting up to 3 acoustic codebooks | Hugging Face |
| LongCatAudioCodec_decoder24k_2codebooks | 24kHz decoder, supporting 1 acoustic codebook, fine-tuned on limited speakers | Hugging Face |
| LongCatAudioCodec_decoder24k_4codebooks | 24kHz decoder, supporting up to 3 acoustic codebooks | Hugging Face |
Project Structure Setup
After downloading the models, ensure the project structure is correct:
```
LongCat-Audio-Codec/
├── ckpts/
│   ├── LongCatAudioCodec_decoder_16k_4codebooks.pt
│   ├── LongCatAudioCodec_decoder_24k_2codebooks.pt
│   ├── LongCatAudioCodec_decoder_24k_4codebooks.pt
│   ├── LongCatAudioCodec_encoder.pt
│   └── LongCatAudioCodec_encoder_cmvn.npy
├── configs/
│   ├── LongCatAudioCodec_decoder_16k_4codebooks.yaml
│   ├── LongCatAudioCodec_decoder_24k_2codebooks.yaml
│   ├── LongCatAudioCodec_decoder_24k_4codebooks.yaml
│   └── LongCatAudioCodec_encoder.yaml
├── inference.py
└── run_inference.sh
```
Running Demos
The project provides a simple demonstration script to quickly experience LongCat-Audio-Codec’s functionality:
```bash
bash ./run_inference.sh
```
This script automatically processes demonstration audio files and generates reconstructed audio output. Users can find the processing results in the demo_audio_output/ directory.
Custom Usage
Users with specific requirements can call the inference.py script directly with custom parameters:
```bash
python inference.py \
    --encoder_config "configs/LongCatAudioCodec_encoder.yaml" \
    --decoder16k_config "configs/LongCatAudioCodec_decoder_16k_4codebooks.yaml" \
    --decoder24k_config "configs/LongCatAudioCodec_decoder_24k_4codebooks.yaml" \
    --output_dir "my_custom_output" \
    --n_acoustic_codebooks 3 \
    --audio_files "path/to/my.wav"
```
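For processing many files, a thin wrapper around this command line is usually enough. The sketch below shells out to `inference.py` once per file using the flags from the example above; the input directory and the one-call-per-file strategy are illustrative choices, not documented behavior.

```python
import subprocess
from pathlib import Path

# Illustrative batch driver around the documented inference.py command line.
AUDIO_DIR = Path("my_audio")        # hypothetical input directory
OUTPUT_DIR = "my_custom_output"

for wav in sorted(AUDIO_DIR.glob("*.wav")):
    subprocess.run(
        [
            "python", "inference.py",
            "--encoder_config", "configs/LongCatAudioCodec_encoder.yaml",
            "--decoder16k_config", "configs/LongCatAudioCodec_decoder_16k_4codebooks.yaml",
            "--decoder24k_config", "configs/LongCatAudioCodec_decoder_24k_4codebooks.yaml",
            "--output_dir", OUTPUT_DIR,
            "--n_acoustic_codebooks", "3",
            "--audio_files", str(wav),
        ],
        check=True,  # stop if inference fails on any file
    )
```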
Practical Application Scenarios for LongCat-Audio-Codec
Infrastructure for Speech Large Language Models
As a codec specifically designed for speech large language models, LongCat-Audio-Codec can convert speech signals into discrete token sequences that can directly serve as input for large language models. Simultaneously, it can transform token sequences generated by models back into high-quality speech signals, completing end-to-end speech processing pipelines.
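Conceptually, the codec brackets the language model on both sides: the tokenizer turns incoming speech into discrete tokens, the LLM operates purely on token sequences, and the detokenizer turns generated tokens back into a waveform. A hedged pseudocode sketch of that round trip, with every callable a placeholder rather than a real API:

```python
# Conceptual round trip for a speech LLM; every callable below is a placeholder.
def speech_round_trip(user_audio, codec_encode, speech_llm, codec_decode):
    input_tokens = codec_encode(user_audio)    # speech -> discrete tokens
    output_tokens = speech_llm(input_tokens)   # the LLM reasons over token sequences
    return codec_decode(output_tokens)         # tokens -> output speech waveform
```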
Low-Bitrate Speech Communication
In bandwidth-constrained environments, such as mobile networks or remote area networks, LongCat-Audio-Codec’s extremely low bitrate characteristics (minimum 0.43 kbps) can significantly improve speech communication quality while reducing data usage.
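To put 0.43 kbps in perspective, here is the data usage for one hour of one-way speech compared with conventional 64 kbps G.711 telephony; the numbers are simple arithmetic rather than measurements, and packet overhead is ignored.

```python
# Data transferred for one hour of one-way speech, ignoring packet overhead.
seconds = 3600
for name, kbps in [("LongCat-Audio-Codec, 2 codebooks", 0.43), ("G.711 telephony", 64.0)]:
    megabytes = kbps * 1000 * seconds / 8 / 1e6
    print(f"{name}: {megabytes:.2f} MB per hour")

# Roughly 0.19 MB versus 28.8 MB per hour, a reduction of about 150x.
```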
Speech Synthesis and Cloning
Leveraging the multi-stage training strategy, particularly the speaker-specific fine-tuning in Stage 3, LongCat-Audio-Codec can generate highly natural speech and accurately reproduce specific speaker timbre characteristics.
Audio Storage and Archiving
For applications requiring large-scale speech data storage, such as voice assistant conversation records or customer service recordings, LongCat-Audio-Codec can dramatically reduce storage space requirements while maintaining intelligibility.
Limitations of LongCat-Audio-Codec
Despite LongCat-Audio-Codec’s excellent performance in multiple aspects, the current version still has some limitations:
- Primarily Optimized for Speech: The current version is mainly optimized for speech signals, with limited support for music and sound effects.
- Input Length Restrictions: The model can process at most 30 seconds of audio per input; longer audio must be pre-segmented (a simple chunking sketch follows below).
- Speaker Dependency of Specific Decoders: LongCatAudioCodec_decoder_24k_2codebooks.pt has been fine-tuned on a limited set of speakers, so reconstruction quality may degrade for input audio from speakers outside the training set.
- Mono Audio Support Only: The current version only supports mono audio processing, not stereo.
The team has indicated that these limitations will be addressed in subsequent versions.
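For inputs longer than 30 seconds, a simple pre-segmentation pass is all that is needed. The sketch below splits a recording into fixed-length chunks with torchaudio; the chunk naming is an illustrative choice, and the input is assumed to already be mono, since the codec only supports mono audio.

```python
import torchaudio

MAX_SECONDS = 30  # documented per-input limit

def split_audio(path, out_prefix="chunk"):
    """Split a long recording into segments of at most 30 seconds each."""
    wav, sr = torchaudio.load(path)              # wav: (channels, samples); assumed mono
    chunk_samples = MAX_SECONDS * sr
    out_paths = []
    for i in range(0, wav.shape[-1], chunk_samples):
        segment = wav[:, i : i + chunk_samples]
        out_path = f"{out_prefix}_{i // chunk_samples:04d}.wav"
        torchaudio.save(out_path, segment, sr)
        out_paths.append(out_path)
    return out_paths
```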
Frequently Asked Questions
How does LongCat-Audio-Codec compare to other audio codecs like EnCodec and DAC?
LongCat-Audio-Codec is specifically designed for speech large language models, employing a semantic-acoustic decoupled architecture that maintains good semantic understanding capabilities even at extremely low bitrates. Compared to traditional codecs, it significantly reduces bitrate while maintaining high speech intelligibility, making it more suitable as front-end and back-end for large language models.
How to choose the appropriate number of codebooks?
The choice of codebook quantity depends on specific application scenarios:
- Extremely bandwidth-sensitive scenarios: Choose 2 codebooks (0.43 kbps)
- Scenarios balancing quality and efficiency: Choose 3 codebooks (0.65 kbps)
- Scenarios requiring higher audio quality: Choose 4 codebooks (0.87 kbps)
Users can determine the most suitable configuration for their needs through experimentation.
Does LongCat-Audio-Codec support real-time processing?
Yes, LongCat-Audio-Codec’s decoder uses streaming design, requiring only 180 milliseconds of future information to generate high-quality audio, making it very suitable for real-time applications.
Is specialized hardware required to run LongCat-Audio-Codec?
No specialized hardware is needed. LongCat-Audio-Codec can run in standard CPU and GPU environments. However, using GPU can significantly improve processing speed, especially when handling large amounts of audio data.
Can the model be fine-tuned with my own data?
Yes, the team provides a multi-stage training strategy. Users can refer to the Stage 3 method to fine-tune the decoder with their own data to adapt to specific speakers or audio characteristics.
Conclusion
LongCat-Audio-Codec represents a significant advancement in audio codec technology integrated with large language models. Through innovative semantic-acoustic separation architecture, multi-stage training strategy, and flexible codebook configuration, it achieves high-quality audio reconstruction at extremely low bitrates, providing reliable infrastructure for speech large language model development.
As speech technology continues to advance, we have reason to believe that technologies like LongCat-Audio-Codec will play increasingly important roles in future speech interaction, speech synthesis, and speech communication fields. Both researchers and developers can utilize this open-source tool to explore more possibilities in speech technology.
Citations and Resources
If you use LongCat-Audio-Codec in your research, please cite the following paper:
```bibtex
@article{longcataudiocodec,
  title={LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models},
  author={Xiaohan Zhao and Hongyu Xiang and Shengze Ye and Song Li and Zhengkun Tian and Guanyu Chen and Ke Ding and Guanglu Wan},
  journal={arXiv preprint arXiv:2510.15227},
  organization={LongCat Team, Meituan},
  year={2025}
}
```
Project Resources:
