Voxtral Mini 4B Realtime 2602: Low-Latency Open-Source Real-Time Speech Transcription Model
Voxtral Mini 4B Realtime 2602 is a multilingual real-time speech-to-text model that achieves an average word error rate (WER) of 8.72% on the FLEURS benchmark at 480ms latency across 13 languages, approaching the 5.90% WER of its offline counterpart. The 4B-parameter model uses a native streaming architecture with causal audio encoder and sliding window attention, supporting configurable delays from 240ms to 2.4s. It runs at over 12.5 tokens/second on a single GPU with ≥16GB VRAM, making it suitable for voice assistants, live subtitling, and on-device deployment under Apache 2.0 license.
Introduction to Voxtral Mini 4B Realtime 2602
Real-time speech transcription has become essential for voice assistants, live subtitling, and multilingual meetings. Voxtral Mini 4B Realtime 2602 stands out as an open-source solution that delivers accuracy comparable to offline systems while maintaining latency under 500ms. It supports 13 languages: Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, Portuguese, Chinese, Japanese, Korean, and Russian.
The model totals approximately 4 billion parameters, consisting of a 3.4B language model and a 0.6B audio encoder. The audio encoder was trained from scratch with causal attention for streaming. Both components use sliding window attention, enabling effectively unlimited streaming without fixed context windows. At the recommended 480ms delay, its performance matches leading offline open-source transcription models while significantly outperforming existing open-source real-time baselines.
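To make the sliding-window idea concrete, here is a toy attention mask in plain Python: each query position may attend only to itself and the most recent positions inside a fixed window, so memory per step stays bounded and streaming can run indefinitely. The window size here is illustrative only, not the model's actual configuration.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list:
    """Boolean mask: entry [q][k] is True when query position q may attend
    to key position k. Causal (k <= q) plus sliding window (k > q - window),
    so each step looks at most `window` positions back."""
    return [
        [(k <= q) and (k > q - window) for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = sliding_window_causal_mask(seq_len=6, window=3)
print(mask[5])  # position 5 attends to positions 3, 4, 5 only
```

Because the attended span never grows beyond the window, the cost per emitted token is constant regardless of how long the stream has been running.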
Released in BF16 precision under the Apache 2.0 license, the model supports both research and commercial use. Its efficient design allows real-time operation on edge devices, ensuring privacy by keeping sensitive audio processing local.
Key Features
What makes Voxtral Mini 4B Realtime 2602 different from chunk-based approaches? It uses a native streaming architecture that transcribes audio as it arrives, with fully configurable latency.
Key capabilities include:
- High-quality transcription with confidence scores
- Native support for 13 languages
- Real-time streaming ASR designed for live use cases
- Configurable transcription delays from 240ms to 2.4s
The architecture enables users to trade latency for accuracy depending on the application. For example, shorter delays suit interactive voice agents, while longer delays improve accuracy for subtitling.
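As a small illustration of this trade-off, the average FLEURS WERs from the benchmark table in this article can be used to pick the lowest configured delay that fits an accuracy budget (the helper function is a sketch, not part of the model's API):

```python
# Average FLEURS WER (%) per configured delay (ms), from the benchmark table.
FLEURS_AVG_WER = {160: 12.60, 240: 10.80, 480: 8.72, 960: 7.70, 2400: 6.73}

def smallest_delay_for(wer_budget_pct: float):
    """Return the lowest delay (ms) whose average WER meets the budget,
    or None if no configuration is accurate enough."""
    for delay in sorted(FLEURS_AVG_WER):
        if FLEURS_AVG_WER[delay] <= wer_budget_pct:
            return delay
    return None

print(smallest_delay_for(9.0))  # 480
print(smallest_delay_for(7.0))  # 2400
```

For an interactive agent that tolerates roughly 9% average WER, the 480ms setting is the responsive choice; a subtitling pipeline chasing the lowest error rate would step up to 2.4s.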
Benchmark Results
Voxtral Mini 4B Realtime 2602 demonstrates competitive performance against both offline and real-time models. Here are the detailed results.
FLEURS Benchmark (13 languages)
| Model | Delay | AVG | Arabic | German | English | Spanish | French | Hindi | Italian | Dutch | Portuguese | Chinese | Japanese | Korean | Russian |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 5.90% | 13.54% | 3.54% | 3.32% | 2.63% | 4.32% | 10.33% | 2.17% | 4.78% | 3.56% | 7.30% | 4.14% | 12.29% | 4.75% |
| Voxtral Mini 4B Realtime 2602 | 480 ms | 8.72% | 22.53% | 6.19% | 4.90% | 3.31% | 6.42% | 12.88% | 3.27% | 7.07% | 5.03% | 10.45% | 9.59% | 15.74% | 6.02% |
| | 160 ms | 12.60% | 24.33% | 9.50% | 6.46% | 5.34% | 9.75% | 15.28% | 5.59% | 11.39% | 10.01% | 17.67% | 19.17% | 19.81% | 9.53% |
| | 240 ms | 10.80% | 23.95% | 8.15% | 5.91% | 4.59% | 8.00% | 14.26% | 4.41% | 9.23% | 7.51% | 13.84% | 15.17% | 17.56% | 7.87% |
| | 960 ms | 7.70% | 20.32% | 4.87% | 4.34% | 2.98% | 5.68% | 11.82% | 2.46% | 6.76% | 4.57% | 8.99% | 6.80% | 14.90% | 5.56% |
| | 2400 ms | 6.73% | 14.71% | 4.15% | 4.05% | 2.71% | 5.23% | 10.73% | 2.37% | 5.91% | 3.93% | 8.48% | 5.50% | 14.30% | 5.41% |
At 480ms, the average WER is 8.72%, with English at 4.90% (vs. offline 3.32%) and Spanish at 3.31% (vs. offline 2.63%). Extending to 2.4s brings the average down to 6.73%.

*Word error rate (lower is better) across languages in the FLEURS transcription benchmark.*
Long-form English Benchmarks
| Model | Delay | Meanwhile (<10m) | E-21 (<10m) | E-22 (<10m) | TEDLIUM (<20m) |
|---|---|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 4.08% | 9.81% | 11.69% | 2.86% |
| Voxtral Mini 4B Realtime 2602 | 480ms | 5.05% | 10.23% | 12.30% | 3.17% |
Short-form English Benchmarks
| Model | Delay | CHiME-4 | GigaSpeech 2k Subset | AMI IHM | Switchboard | CHiME-4 SP | GISpeech 2k Subset |
|---|---|---|---|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 10.39% | 6.81% | 14.43% | 11.54% | 10.42% | 1.74% |
| Voxtral Mini 4B Realtime 2602 | 480ms | 10.50% | 7.35% | 15.05% | 11.65% | 12.41% | 1.73% |
The results show that at 480ms latency the model maintains practical accuracy for real-world use, with gaps to offline performance typically under one percentage point on the English short-form tests.
Use Cases
Have you wondered where low-latency real-time transcription really shines? Here are practical applications:
- Private meeting transcription
- Live subtitle creation for broadcasts and events
- Real-time assistants that understand speech instantly
- Voice agents requiring natural conversation flow
The 480ms sweet spot makes interactions feel responsive, while the multilingual capability supports global teams and international content.
Its 4B parameter size and edge-friendly design make it particularly valuable for privacy-sensitive deployments where audio never leaves the device.
Recommended Settings and Best Practices
For optimal results, follow these precise recommendations:
- Set temperature to 0.0 for deterministic output
- One text token corresponds to approximately 80ms of audio. For a 1-hour meeting, set `--max-model-len >= 45000`. The default of 131072 tokens supports roughly 3 hours
- Use WebSockets for audio streaming sessions
- Start with a 480ms transcription delay, adjustable via `"transcription_delay_ms": 480` in `params.json`
These settings balance throughput, memory usage, and latency effectively.
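The token-budget rule above (one text token ≈ 80ms of audio) can be turned into a quick sizing helper; the 45,000-token figure for a one-hour meeting falls out directly. The function name is illustrative, not part of any tool:

```python
import math

MS_PER_TOKEN = 80  # approximate audio duration covered by one text token

def max_model_len_for(session_seconds: float) -> int:
    """Rough --max-model-len needed for a session of the given length,
    under the 80ms-per-token rule of thumb."""
    return math.ceil(session_seconds * 1000 / MS_PER_TOKEN)

print(max_model_len_for(3600))  # 45000 tokens for a 1-hour meeting

# Capacity of the default context the other way around:
default_hours = 131072 * MS_PER_TOKEN / 1000 / 3600
print(round(default_hours, 1))  # the default 131072 tokens cover ~2.9 hours
```

Sizing the context to the session length rather than leaving the default saves RoPE cache memory, as noted in the serving flags below.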
Step-by-Step: How to Deploy and Use Voxtral Mini 4B Realtime 2602
Installation
1. Install the nightly vLLM package:

   ```shell
   uv pip install -U vllm \
     --torch-backend=auto \
     --extra-index-url https://wheels.vllm.ai/nightly
   ```

2. Verify the mistral_common version:

   ```shell
   python -c "import mistral_common; print(mistral_common.__version__)"
   ```

3. Install audio processing dependencies:

   ```shell
   uv pip install soxr librosa soundfile
   ```
Docker images are also available for easier deployment.
Serving the Model
The model runs on a single GPU with at least 16GB VRAM. Launch command:

```shell
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
```
Optional flags:
- Adjust `--max-num-batched-tokens` to trade throughput for latency
- Reduce `--max-model-len` if transcribing shorter sessions to save RoPE memory
After startup, the realtime endpoint appears at /v1/realtime.
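The exact message framing on `/v1/realtime` follows the server's API (see the vLLM usage examples below). Purely as an illustration of the client-side bookkeeping, here is how raw mono 16-bit PCM might be cut into fixed-duration chunks before sending; the sample rate and chunk length are assumptions for the example, not requirements of the endpoint:

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def chunk_pcm(pcm: bytes, sample_rate_hz: int, chunk_ms: int) -> list:
    """Split mono 16-bit PCM into fixed-duration chunks for streaming.
    The final chunk may be shorter than the rest."""
    chunk_bytes = sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]

# One second of silence at 16 kHz is 32000 bytes; 80ms chunks are 2560 bytes,
# giving 12 full chunks plus one partial.
chunks = chunk_pcm(b"\x00" * 32000, sample_rate_hz=16000, chunk_ms=80)
print(len(chunks), len(chunks[0]))  # 13 2560
```

An 80ms chunk size mirrors the model's one-token-per-80ms pacing, so the client feeds audio at roughly the granularity the decoder consumes it.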
Usage Examples
vLLM provides ready examples for:
- Streaming audio files
- Gradio-based live microphone transcription

Try the live demo: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtime
WebSocket streaming ensures continuous low-latency audio processing.
License
Voxtral Mini 4B Realtime 2602 is released under the Apache 2.0 License. Usage must not infringe any third-party rights, including intellectual property.
Frequently Asked Questions (FAQ)
Which languages does Voxtral Mini 4B Realtime 2602 support?
It supports 13 languages: Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, Portuguese, Chinese, Japanese, Korean, and Russian. Performance is strong in Romance languages, with Spanish achieving 3.31% WER at 480ms.
How do I balance latency and accuracy?
Use the recommended 480ms delay (average WER 8.72%). Shorter 240ms increases WER to 10.80%, while 2.4s reduces it to 6.73%. Modify "transcription_delay_ms" in params.json.
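For reference, a minimal `params.json` fragment carrying just this setting might look as follows; any other fields the file contains are omitted here, since only this key is documented above:

```json
{
  "transcription_delay_ms": 480
}
```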
What hardware is required?
A single GPU with ≥16GB VRAM delivers real-time performance exceeding 12.5 tokens/second. Default max-model-len of 131072 supports approximately 3 hours of audio.
How should I handle long audio recordings?
Calculate tokens needed using 1 token ≈ 80ms. Set --max-model-len accordingly. The default handles ~3-hour sessions reliably.
Why set temperature to 0.0?
It ensures consistent, deterministic transcription results without randomness.
How does the realtime endpoint work?
vLLM’s Realtime API uses WebSockets to handle continuous audio streaming for production-grade sessions.
How does real-time performance compare to offline models?
At 480ms, the FLEURS average gap to offline is 2.82 percentage points (8.72% vs. 5.90%). On many English tests the difference is under one percentage point, making it viable for interactive applications.
Does it support speaker diarization?
The current real-time version focuses on transcription. Batch models in the Voxtral family provide speaker labels and word-level timestamps.
What deployment best practices should I follow?
Always use WebSockets for streaming, install nightly vLLM for compatibility, and consider Docker for platform consistency.
This guide covers the essential details to get you started with Voxtral Mini 4B Realtime 2602 for production real-time transcription.

