Voxtral Mini 4B Realtime 2602: Open-Source Speech Recognition for Live, Low-Latency Transcription

5 hours ago 高效码农

Voxtral Mini 4B Realtime 2602: Low-Latency Open-Source Real-Time Speech Transcription Model

Voxtral Mini 4B Realtime 2602 is a multilingual real-time speech-to-text model. At 480 ms latency it achieves an average word error rate (WER) of 8.72% on the FLEURS benchmark across 13 languages, approaching the 5.90% WER of its offline counterpart. The 4B-parameter model uses a native streaming architecture with a causal audio encoder and sliding window attention, and supports configurable delays from 240 ms to 2.4 s. It runs at over 12.5 tokens/second on a single GPU with ≥16GB VRAM, making it suitable for voice assistants, live subtitling, and on-device deployment under Apache 2.0 …
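For readers unfamiliar with the WER figures quoted above, here is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not part of the Voxtral release. Published benchmark scores usually also apply text normalization and aggregate over a whole corpus, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Illustrative helper (hypothetical name); no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(f"WER: {word_error_rate(ref, hyp):.2%}")
```

Read this way, an average WER of 8.72% means roughly 9 word-level errors per 100 reference words, against about 6 per 100 for the offline model.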