Recent Advances in Speech Language Models: A Comprehensive Technical Survey
The Evolution of Voice AI
🎉 Cutting-Edge Research Alert: Our comprehensive survey paper “Recent Advances in Speech Language Models” has been accepted for publication at ACL 2025, the premier natural language processing conference. This work systematically examines Speech Language Models (SpeechLMs) – transformative AI systems enabling end-to-end voice conversations with human-like fluidity. [Full Paper]
Why SpeechLMs Matter
Traditional voice assistants follow a fragmented ASR (Automatic Speech Recognition) → LLM (Large Language Model) → TTS (Text-to-Speech) pipeline with inherent limitations:
- Information Loss: Conversion to text strips vocal emotions and intonations
- Error Propagation: Mistakes compound across processing stages
- Latency Issues: Multi-stage processing delays responses
SpeechLMs revolutionize interaction by modeling raw audio signals directly (a toy contrast is sketched after the list below):
- ✅ Preserve paralinguistic features (tone, emotion, emphasis)
- ✅ Enable real-time bidirectional conversations
- ✅ Eliminate error-prone intermediate steps
- ✅ Unify understanding and generation in a single architecture
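To make the contrast concrete, here is a toy, runnable sketch of both designs. Every function is a hypothetical stub (no real vendor API or model is invoked); only the shape of the data flow matters.

```python
# Toy contrast between a cascaded pipeline and a direct SpeechLM.
# All functions are hypothetical stubs, not any real vendor API.

def asr(audio: bytes) -> str:        # speech -> text: paralinguistic cues are lost here
    return "hello"

def llm(text: str) -> str:           # text -> text: ASR errors propagate into this stage
    return f"you said: {text}"

def tts(text: str) -> bytes:         # text -> speech: a third stage adds latency
    return text.encode()

def cascaded_assistant(audio: bytes) -> bytes:
    return tts(llm(asr(audio)))

def speech_tokenizer(audio: bytes) -> list:   # speech -> discrete audio tokens
    return list(audio)

def speech_lm(tokens: list) -> list:          # tokens -> tokens in a single model
    return tokens[::-1]

def vocoder(tokens: list) -> bytes:           # tokens -> waveform
    return bytes(tokens)

def speechlm_assistant(audio: bytes) -> bytes:
    # One end-to-end pass: no text bottleneck, no compounding stage errors.
    return vocoder(speech_lm(speech_tokenizer(audio)))

print(cascaded_assistant(b"\x01\x02\x03"))   # b'you said: hello'
print(speechlm_assistant(b"\x01\x02\x03"))   # b'\x03\x02\x01'
```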

Technical Taxonomy of SpeechLMs
We propose a novel classification framework categorizing SpeechLMs by architectural paradigm (a skeletal code sketch follows the table):

Core Architecture Types
| Category | Representative Models | Key Innovation |
|---|---|---|
| Unified Encoder-Decoder | Qwen2.5-Omni, MiniCPM-o | Single-model speech-to-speech processing |
| Hybrid Multimodal | AudioPaLM, VioLA | Joint text-speech representations |
| Speech-First | GSLM, pGSLM | Audio-only training without text labels |
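All three categories share a common skeleton: discrete audio tokens go in, a causal model predicts the next tokens, and a vocoder turns them back into sound. Below is a minimal PyTorch sketch of that skeleton; the dimensions, module choices, and omitted positional encoding are illustrative assumptions, not the configuration of any cited model.

```python
import torch
import torch.nn as nn

class TinySpeechLM(nn.Module):
    """Causal LM over discrete audio tokens (positional encoding omitted for brevity)."""

    def __init__(self, n_audio_tokens=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(n_audio_tokens, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_audio_tokens)

    def forward(self, tokens):  # tokens: (batch, time) audio token ids
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.backbone(self.embed(tokens), mask=causal)
        return self.head(x)     # next-token logits at every position

lm = TinySpeechLM()
frames = torch.randint(0, 1024, (2, 50))  # e.g. 50 tokenizer frames per clip
print(lm(frames).shape)                   # torch.Size([2, 50, 1024])
```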
SpeechLM Ecosystem: Models & Capabilities
Industry Production Systems
- OpenAI Advanced Voice Mode: Enterprise-grade conversational AI [Technical Documentation]
- Claude Voice Mode: Low-latency mobile interaction system [Implementation Details]
- Kimi-Audio: Chinese-optimized voice assistant [Technical Report]
Efficiency-Optimized Architectures
- VITA-Audio: Cross-modal token interleaving for real-time response [Research Paper]
- Slamming: Trains a SpeechLM in one day on a single GPU [Methodology]
- Mini-Omni: Mobile-optimized for on-device deployment [Implementation]
Conversation-Specialized Models
- NTPP (Next-Token-Pair Prediction): Dual-channel dialogue architecture [Novel Approach]; see the sketch after this list
- SyncLLM: Full-duplex interaction beyond turn-taking [Breakthrough]
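As referenced above, here is one illustrative reading of next-token-pair prediction: at each step the model consumes the current (speaker A, speaker B) token pair and predicts both channels' next tokens from a joint state, which is what allows speech to flow in both directions at once. The architecture and sizes below are assumptions for the sketch, not the NTPP paper's configuration.

```python
import torch
import torch.nn as nn

class PairPredictor(nn.Module):
    """Predicts the next (channel A, channel B) token pair from a joint state."""

    def __init__(self, vocab=1024, d=128):
        super().__init__()
        self.embed_a = nn.Embedding(vocab, d)
        self.embed_b = nn.Embedding(vocab, d)
        self.rnn = nn.GRU(2 * d, d, batch_first=True)  # one state for both channels
        self.head_a = nn.Linear(d, vocab)              # next token on channel A
        self.head_b = nn.Linear(d, vocab)              # next token on channel B

    def forward(self, tok_a, tok_b):  # each: (batch, time)
        x = torch.cat([self.embed_a(tok_a), self.embed_b(tok_b)], dim=-1)
        h, _ = self.rnn(x)
        return self.head_a(h), self.head_b(h)          # a pair of predictions per step

model = PairPredictor()
a = torch.randint(0, 1024, (1, 10))  # speaker A stream (may be mostly "silence" tokens)
b = torch.randint(0, 1024, (1, 10))  # speaker B stream, running simultaneously
logits_a, logits_b = model(a, b)
print(logits_a.shape, logits_b.shape)  # both channels predicted at every step
```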
Complete model registry: SpeechLM Collection on GitHub
Tokenization: The Foundation of SpeechLMs
Converting continuous speech to discrete tokens enables language modeling:
Semantic Tokenizers (Meaning-Focused)
| System | Core Innovation | Key Application |
|---|---|---|
| Whisper | Weakly-supervised learning | Multilingual ASR |
| HuBERT | Masked unit prediction | General speech pre-training |
| WavLM | Full-stack speech processing | Conversational optimization |
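Semantic tokenizers like HuBERT are typically discretized the same way: frame-level encoder features are clustered with k-means, and each frame's cluster id becomes its unit. A minimal sketch, with random vectors standing in for real SSL encoder features (an assumption made so the snippet runs standalone):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for frame-level SSL features, e.g. (frames, feature_dim) HuBERT activations.
frame_features = rng.normal(size=(500, 768))

# Learn 100 discrete "units" over the feature space.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frame_features)

# Tokenize: each frame becomes the id of its nearest cluster centroid.
units = kmeans.predict(frame_features[:20])
print(units)  # e.g. [42 42 17 ...]; repeated ids are often deduplicated downstream
```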
Acoustic Tokenizers (Sound-Focused)
| Codec | Compression Ratio | Fidelity Preservation |
|---|---|---|
| Encodec | 8x-24x | Studio-quality reconstruction |
| SoundStream | 12x-32x | Low-latency streaming |
| WavTokenizer | Dynamic codebooks | Efficient LM integration |
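The mechanism underlying codecs in this family (SoundStream, Encodec) is residual vector quantization: each stage quantizes the residual the previous stage left behind, so a frame becomes a short stack of code indices. A toy NumPy version with random codebooks, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Four quantizer stages, each with 256 codes over 64-dim frames (random for illustration).
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]

def rvq_encode(frame, codebooks):
    """Greedy RVQ: each stage quantizes the residual left by the previous one."""
    residual, codes = frame, []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest code
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction is just the sum of the selected code vectors."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

frame = rng.normal(size=64)
codes = rvq_encode(frame, codebooks)
error = np.linalg.norm(frame - rvq_decode(codes, codebooks))
print(codes, round(float(error), 3))  # 4 token ids per frame + leftover error
```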
Hybrid Tokenizers
- SpeechTokenizer: Unified speech tokenizer for speech LLMs [Technical Approach]
- Mimi: Streaming codec that distills semantic information into acoustic tokens [Design]
Training Data Landscape
Pre-training Corpora
| Dataset | Content Type | Volume (Hours) | Key Features |
|---|---|---|---|
| LibriLight | Read speech | 60,000 | Unlabeled large-scale collection |
| Multilingual LibriSpeech | 8-language ASR | 50,500 | Cross-lingual coverage |
| Spotify Podcasts | Natural conversations | 47,000 | Spontaneous speech diversity |
| VoxCeleb2 | Speaker identification | 2,400 | 6,112 unique speakers |
Instruction-Tuning Datasets
- SpeechInstruct: Voice command collection [Dataset Card]
- VoiceAssistant-400K: 400K human-AI dialogue samples [Access]
Evaluation Frameworks
Capability Assessment
| Benchmark | Evaluation Dimension | Tasks | Core Focus |
|---|---|---|---|
| sBLIMP | Linguistic competence | 1 | Grammatical sensitivity |
| STSP | Paralinguistic analysis | 1 | Emotion/prosody preservation |
| Dynamic-SUPERB | Multi-task generalization | 180 | Speech/music/environmental sound |
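sBLIMP-style linguistic-competence scores reduce to a pairwise preference test: the model is correct on a pair when it assigns higher log-probability to the grammatical utterance than to its ungrammatical twin. A minimal sketch, with a hypothetical stand-in scoring function in place of a real SpeechLM:

```python
import math

def sequence_logprob(tokens):
    # Stand-in for a real SpeechLM's log p(tokens); swap in model scoring here.
    return -sum(math.log(2 + t % 7) for t in tokens)

# Each pair: (token ids of the grammatical utterance, ids of the ungrammatical one).
pairs = [([1, 2, 3], [6, 6, 6]), ([2, 4], [5, 5]), ([0, 1], [6, 5])]

correct = sum(sequence_logprob(good) > sequence_logprob(bad) for good, bad in pairs)
print(f"accuracy: {correct / len(pairs):.2f}")  # chance level is 0.50
```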
Real-World Performance
- VoxEval: 56-task comprehensive evaluation [GitHub]
- VoiceBench: Pure speech input-output assessment [Framework]
- AIR-Bench: 20 cross-modal tasks [Toolkit]
Emerging Trends & Future Directions
- Resource Efficiency: Models like Slamming enabling single-GPU training
- Full-Duplex Interaction: SyncLLM achieving natural conversation flow
- Multilingual Expansion: VITA-1.5 supporting 50+ languages
- Open-Source Momentum: Mini-Omni and other community-driven initiatives
New Development: We’ve launched the VoxEval benchmark for comprehensive knowledge evaluation [Paper] [Code]
References
@article{cui2024recent,
  title={Recent advances in speech language models: A survey},
  author={Cui, Wenqian and Yu, Dianzhi and Jiao, Xiaoqi and Meng, Ziqiao and Zhang, Guangyan and Wang, Qichao and Guo, Yiwen and King, Irwin},
  journal={arXiv preprint arXiv:2410.03751},
  year={2024}
}