Kimi-Audio: A Groundbreaking Technology in Audio Processing
In today’s digital age, audio processing technology is becoming increasingly vital, playing a crucial role in various fields such as speech recognition, music generation, emotion expression, and environmental perception. However, traditional audio processing methods have limitations as they often handle each task separately, making it difficult to adapt to diverse scenarios. Against this backdrop, Kimi-Audio, an open-source audio foundation model developed by MoonshotAI, is reshaping the audio processing landscape with its superior audio understanding, generation, and conversation capabilities.
Core Architecture of Kimi-Audio
Kimi-Audio boasts a sophisticated architecture comprising three key components: the Audio Tokenizer, the Audio LLM, and the Audio Detokenizer. This tripartite structure empowers Kimi-Audio to handle a wide range of audio-related tasks within a unified framework; a schematic sketch of how the pieces connect follows the component list below.
- Audio Tokenizer: The Audio Tokenizer converts input audio into discrete semantic tokens at a rate of 12.5 Hz via vector quantization. It also extracts continuous acoustic vectors from the Whisper encoder to enhance the model’s perception capabilities. This hybrid tokenization strategy combines the efficiency of discrete tokens with the rich acoustic detail of continuous representations, providing a comprehensive foundation for diverse audio processing tasks.
- Audio LLM: At the heart of Kimi-Audio, the Audio LLM is based on a pre-trained text LLM. It processes multimodal inputs through shared transformer layers and branches into two parallel heads for text and audio generation. This design allows Kimi-Audio to generate both audio semantic tokens and the corresponding textual response, significantly improving its generation capabilities.
- Audio Detokenizer: The Audio Detokenizer converts the discrete semantic tokens predicted by the Audio LLM back into coherent audio waveforms using a flow-matching approach. Kimi-Audio employs a chunk-wise streaming detokenizer with a look-ahead mechanism to reduce speech generation latency and improve audio quality.
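To make the division of labor concrete, here is a minimal, hypothetical sketch of how the three components could be wired together. The class names, token vocabulary size, feature dimensions, and sample rates are illustrative placeholders, not Kimi-Audio's actual API; only the 12.5 Hz token rate comes from the description above.

```python
# Illustrative wiring of the three Kimi-Audio components.
# All names, shapes, and values below are assumptions for explanation only,
# not the real Kimi-Audio interfaces.
import numpy as np

TOKEN_RATE_HZ = 12.5  # discrete semantic tokens per second (from the paper)

class AudioTokenizer:
    """Maps a waveform to discrete semantic tokens plus continuous acoustic features."""
    def encode(self, waveform: np.ndarray, sample_rate: int):
        num_tokens = int(len(waveform) / sample_rate * TOKEN_RATE_HZ)
        semantic_tokens = np.random.randint(0, 16384, size=num_tokens)  # placeholder codebook ids
        acoustic_vectors = np.random.randn(num_tokens, 1280)            # placeholder Whisper-style features
        return semantic_tokens, acoustic_vectors

class AudioLLM:
    """Shared transformer trunk with parallel text and audio heads (stubbed)."""
    def generate(self, semantic_tokens, acoustic_vectors, text_prompt: str):
        response_text = "placeholder response"
        response_audio_tokens = np.random.randint(0, 16384, size=int(2.0 * TOKEN_RATE_HZ))
        return response_text, response_audio_tokens

class AudioDetokenizer:
    """Turns predicted semantic tokens back into a waveform, chunk by chunk (stubbed)."""
    def decode_chunk(self, token_chunk, sample_rate: int = 24000):
        samples_per_token = int(sample_rate / TOKEN_RATE_HZ)
        return np.zeros(len(token_chunk) * samples_per_token)           # silent placeholder audio

# End-to-end flow: audio in -> tokens -> LLM -> tokens -> audio out
tokenizer, llm, detokenizer = AudioTokenizer(), AudioLLM(), AudioDetokenizer()
wave = np.random.randn(16000 * 3)                                       # 3 s of fake 16 kHz input
tokens, feats = tokenizer.encode(wave, sample_rate=16000)
text, audio_tokens = llm.generate(tokens, feats, text_prompt="Transcribe and reply.")
waveform_out = detokenizer.decode_chunk(audio_tokens)
print(text, waveform_out.shape)
```

The key point the sketch captures is that the LLM never touches raw waveforms: it consumes and produces compact semantic tokens, while the tokenizer and detokenizer handle the conversion at the boundaries.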
Large-scale Data Processing: The Cornerstone of Kimi-Audio
Kimi-Audio’s remarkable performance is underpinned by its large-scale audio data processing capabilities. The development team curated a pre-training dataset exceeding 13 million hours of audio data, encompassing speech, sound, and music. To ensure data quality and diversity, they established a data processing pipeline that includes speech enhancement, speech segmentation, speech transcription, and data filtering; a rough sketch of how these stages fit together follows the list below.
- Speech Enhancement: Using a Band-Split RNN (BSRNN) architecture, the team developed a speech enhancement model to suppress background noise and reverberation. During pre-training, they randomly selected between the original and enhanced audio to retain environmental sounds and music information.
- Speech Segmentation: Leveraging speaker diarization, long-form audio was segmented into individual speaker turns. A series of post-processing steps, such as speaker cluster merging, chunk-based reassignment, and segment merging, improved the accuracy and consistency of the segmentation results.
- Speech Transcription: The Whisper-large-v3 model was used to detect the spoken language and to generate transcriptions for English segments. For Mandarin segments, the Paraformer-Zh model from the FunASR toolkit was employed, and a time-gap strategy was applied to add punctuation to the Mandarin transcriptions.
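As a rough, hypothetical illustration of how such a pipeline might be strung together, the snippet below stubs out the enhancement and diarization stages and uses the openai-whisper package for language detection and English transcription; the Paraformer-Zh branch is indicated only as a comment, and none of this reflects the team's actual implementation.

```python
# Hypothetical sketch of the described pre-training data pipeline.
# Only the Whisper calls use a real public API (openai-whisper); the
# enhancement and diarization steps are placeholders, and the Paraformer-Zh
# branch is left as a comment.
import whisper

SR = 16000  # whisper.load_audio resamples to 16 kHz mono

def enhance(path: str) -> str:
    # Placeholder for the BSRNN-based speech enhancement model
    # (noise and reverberation suppression). Returns a path to enhanced audio.
    return path

def diarize(path: str) -> list[tuple[float, float]]:
    # Placeholder for speaker diarization plus post-processing
    # (cluster merging, chunk-based reassignment, segment merging).
    return [(0.0, 30.0)]  # one fake speaker turn (start, end) in seconds

def transcribe_segments(path: str, turns: list[tuple[float, float]]) -> list[dict]:
    model = whisper.load_model("large-v3")
    audio = whisper.load_audio(path)
    records = []
    for start, end in turns:
        clip = audio[int(start * SR):int(end * SR)]
        result = model.transcribe(clip)
        if result["language"] == "zh":
            # In the real pipeline, Mandarin segments would instead go to
            # FunASR's Paraformer-Zh, with punctuation added via the time-gap rule.
            continue
        records.append({"start": start, "end": end, "text": result["text"]})
    return records

raw = "example_longform.wav"   # hypothetical input file
clean = enhance(raw)
turns = diarize(clean)
dataset = transcribe_segments(clean, turns)
```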
Training Strategy: From Pre-training to Fine-tuning
Kimi-Audio’s training process consists of two stages: pre-training and supervised fine-tuning (SFT). The pre-training stage aims to learn knowledge from both audio and text domains and align them in the model’s latent space. The fine-tuning stage further enhances the model’s performance on specific tasks.
- Pre-training Tasks: Pre-training covers single-modal (audio-only and text-only) pre-training, audio-to-text mapping pre-training, and audio-text interleaving pre-training. These tasks fuse the audio and text modalities through automatic speech recognition (ASR) and text-to-speech (TTS) objectives, as well as audio-to-semantic-token, audio-to-text, and interleaved audio-semantic-token-plus-text sequences (a toy sketch of such sequences appears after this list).
- Supervised Fine-tuning (SFT): In the fine-tuning stage, Kimi-Audio was equipped with instruction-following capabilities. Natural language was used as task prompts, and both audio and text versions of the prompts were constructed. Additionally, multiple instructions were created for different tasks to enhance the model’s robustness in following instructions.
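To illustrate how the interleaving objectives could be realized at the data level, the snippet below assembles token sequences for three task formats. The special marker tokens and the interleaving granularity are assumptions for illustration, not the published recipe.

```python
# Hypothetical construction of pre-training sequences for three task formats:
# audio-to-text (ASR-style), text-to-audio (TTS-style), and audio/text interleaving.
# Marker tokens and interleaving granularity are assumptions, not Kimi-Audio's recipe.
AUDIO_START, AUDIO_END = "<|audio_start|>", "<|audio_end|>"

def audio_to_text_example(audio_tokens: list[int], transcript_tokens: list[int]) -> list:
    # ASR-style: the model sees audio tokens and must predict the transcript.
    return [AUDIO_START, *audio_tokens, AUDIO_END, *transcript_tokens]

def text_to_audio_example(transcript_tokens: list[int], audio_tokens: list[int]) -> list:
    # TTS-style: the model sees text and must predict the audio semantic tokens.
    return [*transcript_tokens, AUDIO_START, *audio_tokens, AUDIO_END]

def interleaved_example(chunks: list[tuple[list[int], list[int]]]) -> list:
    # chunks: aligned (audio_tokens, text_tokens) pairs, e.g. per sentence,
    # alternated within one training sequence.
    seq = []
    for audio_toks, text_toks in chunks:
        seq += [AUDIO_START, *audio_toks, AUDIO_END, *text_toks]
    return seq

# Toy usage with fake token ids
print(audio_to_text_example([101, 102, 103], [7, 8, 9]))
print(interleaved_example([([101, 102], [7]), ([103, 104], [8, 9])]))
```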
Inference and Deployment: Practical Applications of Kimi-Audio
Kimi-Audio is designed to handle various audio-related tasks, such as speech recognition, audio understanding, audio-to-text chat, and speech-to-speech conversation. Taking real-time speech-to-speech conversation as an example, inference between the client (e.g., the Kimi APP or a web browser) and the server (the Kimi-Audio service) proceeds as follows: the client collects audio and streams it to the server; the server runs voice activity detection (VAD); once the user finishes speaking, the server starts inference; and the generated audio chunks are streamed back to the client and played in real time.
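A schematic, non-production sketch of that server-side loop is shown below. The VAD, inference, and streaming primitives are all stubbed, and none of the names correspond to the actual Kimi-Audio service interfaces.

```python
# Schematic server-side loop for speech-to-speech conversation.
# All components are stubs; names do not reflect the real Kimi-Audio services.
from typing import Iterable, Iterator
import numpy as np

def vad_end_of_utterance(chunk: np.ndarray, silence_threshold: float = 1e-3) -> bool:
    # Placeholder VAD: treat a near-silent chunk as "user finished speaking".
    return float(np.abs(chunk).mean()) < silence_threshold

def run_inference(utterance: np.ndarray) -> Iterator[np.ndarray]:
    # Placeholder for tokenize -> Audio LLM -> chunk-wise streaming detokenizer.
    for _ in range(3):
        yield np.zeros(1920)  # ~80 ms of silence at 24 kHz as a stand-in audio chunk

def conversation_loop(incoming_chunks: Iterable[np.ndarray]) -> Iterator[np.ndarray]:
    buffer = []
    for chunk in incoming_chunks:
        buffer.append(chunk)
        if vad_end_of_utterance(chunk):
            utterance = np.concatenate(buffer)
            buffer.clear()
            # Stream generated audio chunks back to the client as they are produced.
            yield from run_inference(utterance)

# Toy usage: two noisy chunks followed by a silent chunk that triggers inference.
mic = [np.random.randn(1600) * 0.1, np.random.randn(1600) * 0.1, np.zeros(1600)]
for out_chunk in conversation_loop(mic):
    print("play", out_chunk.shape)
```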
In a production environment, all core components of Kimi-Audio require substantial computational resources. Therefore, the team designed a scalable and efficient infrastructure architecture, including the Kimi-Audio RTC Service, Inference Scheduler, and Tokenizer/Detokenizer/LLM Services, to ensure Kimi-Audio can meet the performance demands of real-time speech interaction while maintaining low latency and high availability.
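To show why splitting the stack into separate services helps, here is a toy scheduler that fans a request out across tokenizer, LLM, and detokenizer workers. It is a stand-in under the assumption that each stage runs as an independently scalable service; the function names and latencies are invented and do not describe MoonshotAI's infrastructure.

```python
# Toy scheduler routing one request through per-stage services.
# All services are simulated; nothing here reflects the real deployment.
import asyncio

async def tokenizer_service(audio: bytes) -> list[int]:
    await asyncio.sleep(0.01)   # simulate network/compute latency
    return [1, 2, 3]

async def llm_service(tokens: list[int]) -> list[int]:
    await asyncio.sleep(0.05)
    return [4, 5, 6]

async def detokenizer_service(tokens: list[int]) -> bytes:
    await asyncio.sleep(0.01)
    return b"\x00" * 960

async def schedule(request_audio: bytes) -> bytes:
    # Keeping the stages as separate services lets each one be scaled and
    # load-balanced independently, which is the point of the split architecture.
    semantic = await tokenizer_service(request_audio)
    response_tokens = await llm_service(semantic)
    return await detokenizer_service(response_tokens)

print(len(asyncio.run(schedule(b"fake-pcm-audio"))))
```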
Evaluation Results: Outstanding Performance of Kimi-Audio
To evaluate Kimi-Audio’s performance and compare it with other state-of-the-art systems, the development team created a fair, reproducible, and comprehensive evaluation toolkit. Based on this toolkit, they assessed Kimi-Audio across various audio processing tasks, including automatic speech recognition (ASR), audio understanding, audio-to-text chat, and speech conversation.
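The toolkit itself is open-sourced by the team; as a minimal illustration of the kind of metric behind the ASR numbers reported below, word error rate can be computed with the jiwer package. The transcripts here are made-up examples, not drawn from any benchmark.

```python
# Minimal WER illustration with the jiwer package; toy sentences only.
import jiwer

references = ["the quick brown fox jumps over the lazy dog",
              "audio foundation models unify many tasks"]
hypotheses = ["the quick brown fox jumped over the lazy dog",
              "audio foundation models unify many task"]

wer = jiwer.wer(references, hypotheses)   # fraction of word-level edits
print(f"WER: {wer:.2%}")
```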
- Automatic Speech Recognition (ASR): Kimi-Audio demonstrated superior ASR capabilities across diverse datasets in multiple languages and acoustic conditions. For instance, on the widely used LibriSpeech benchmark, Kimi-Audio achieved error rates of 1.28 on test-clean and 2.42 on test-other, outperforming models like Qwen2-Audio-base and Qwen2.5-Omni. It also set new state-of-the-art results on Mandarin ASR benchmarks such as AISHELL-1 (0.60) and AISHELL-2 ios (2.56).
- Audio Understanding: Beyond speech recognition, Kimi-Audio showcased advanced capabilities in understanding diverse audio signals, including music, sound events, and speech. On the MMAU benchmark, it achieved scores of 61.68 in the music, 73.27 in the sound, and 60.66 in the speech category. It also led in tasks such as speech emotion recognition on the MELD benchmark (59.13) and non-speech sound classification on the VocalSound (94.85) and Nonspeech7k (93.93) datasets.
- Audio-to-Text Chat: Kimi-Audio excelled at conducting text conversations based on audio input. On the OpenAudioBench and VoiceBench benchmarks, it achieved state-of-the-art performance on several sub-tasks, including AlpacaEval, Llama Questions, and TriviaQA, and highly competitive results on Reasoning QA and Web Questions.
- Speech Conversation: Subjective evaluations across multiple dimensions showed that Kimi-Audio achieved the highest scores in emotion control, empathy, and speed control, with an overall average score of 3.90, surpassing models like Step-Audio-chat (3.33), GPT-4o-mini (3.45), and GLM-4-Voice (3.65).
Challenges and Future Trends
Despite Kimi-Audio’s significant advances toward universal audio foundation models, challenges remain on the path to more capable and intelligent audio processing systems. For example, current audio foundation models typically rely on audio-text pre-training to bridge the gap between text and audio, where the text is obtained from speech via ASR transcription. Such transcription captures only the content of the spoken words, discarding important audio information such as paralinguistic features, acoustic scenes, and non-linguistic sounds. Future research directions include incorporating descriptive text to enrich audio context, developing better audio representations that integrate transcription-oriented semantic information with description-oriented acoustic features, and moving away from reliance on ASR and TTS toward more autonomous audio intelligence.
Conclusion
Kimi-Audio, with its innovative architecture, large-scale data processing capabilities, well-designed training strategy, and outstanding evaluation performance, has set a new benchmark in audio processing. It has achieved state-of-the-art performance in speech recognition, audio understanding, audio generation, and speech conversation, and has made a significant contribution to the audio processing community by being open-source. As technology continues to evolve, Kimi-Audio is expected to further push the boundaries of audio processing technology, paving the way for smarter and more efficient audio applications.