Speech Processingarchive | Efficient Coder

LongCat-Audio-Codec Revolutionizes Speech LLMs with Ultra-Low Bitrate Speech Encoding

4 months ago 高效码农

LongCat-Audio-Codec: The Audio Tokenizer and Detokenizer Solution Revolutionizing Speech Large Language Models In the rapidly evolving landscape of speech large language models, achieving high-quality audio reconstruction at low bitrates has emerged as a critical technological bottleneck. The open-source audio codec from Meituan’s LongCat team delivers a stunning solution to this challenge. Understanding Audio Codecs and Their Critical Role in Speech LLMs If you’ve ever used voice assistants, video conferencing software, or any audio processing tool, you’ve indirectly experienced audio codec technology. In simple terms, an audio codec acts as a “compression package” for audio data—it condenses massive raw audio signals …

Ming-UniAudio: A Revolutionary Framework Unifying Speech Understanding, Generation, and Editing

4 months ago 高效码农

Introduction Core question this article addresses: How can we build a single model capable of simultaneously handling speech understanding, generation, and editing tasks? Ming-UniAudio achieves this breakthrough through its innovative unified continuous speech tokenizer and end-to-end speech language model, pioneering timestamp-free free-form speech editing that transforms the speech processing landscape. In artificial intelligence, speech processing has long faced fragmentation between understanding, generation, and editing tasks. Traditional approaches either separated speech representations for different tasks or used discrete representations that lost speech details. Ming-UniAudio emerges as the first framework unifying speech understanding, generation, and editing through its core unified continuous speech …

Breaking Language Barriers: How MiniMax-Speech’s Zero-Shot TTS Redefines Voice Cloning

10 months ago 高效码农

MiniMax-Speech: Revolutionizing Zero-Shot Text-to-Speech with Learnable Speaker Encoder and Flow-VAE Technology 1. Core Innovations and Architecture Design 1.1 Architectural Overview MiniMax-Speech leverages an 「autoregressive Transformer architecture」 to achieve breakthroughs in zero-shot voice cloning. Key components include: 「Learnable Speaker Encoder」: Extracts speaker timbre from reference audio without transcriptions (jointly trained end-to-end) 「Flow-VAE Hybrid Model」: Combines variational autoencoder (VAE) and flow models, achieving KL divergence of 0.62 (vs. 0.67 in traditional VAEs) 「Multilingual Support」: 32 languages with Word Error Rate (WER) as low as 0.83 (Chinese) and 1.65 (English) Figure 1: MiniMax-Speech system diagram (Conceptual illustration) 1.2 Technical Breakthroughs (1) Zero-Shot Voice …