Voxtral Mini 4B Realtime 2602: Low-Latency Open-Source Real-Time Speech Transcription Model

Voxtral Mini 4B Realtime 2602 is a multilingual real-time speech-to-text model that achieves an average word error rate (WER) of 8.72% on the FLEURS benchmark at 480ms latency across 13 languages, approaching the 5.90% WER of its offline counterpart. The 4B-parameter model uses a native streaming architecture with a causal audio encoder and sliding window attention, supporting configurable delays from 240ms to 2.4s. It runs at over 12.5 tokens/second on a single GPU with ≥16GB VRAM, making it suitable for voice assistants, live subtitling, and on-device deployment under the Apache 2.0 license.

Introduction to Voxtral Mini 4B Realtime 2602

Real-time speech transcription has become essential for voice assistants, live subtitling, and multilingual meetings. Voxtral Mini 4B Realtime 2602 stands out as an open-source solution that delivers accuracy comparable to offline systems while maintaining latency under 500ms. It supports 13 languages: Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, Portuguese, Chinese, Japanese, Korean, and Russian.

The model totals approximately 4 billion parameters, consisting of a 3.4B language model and a 0.6B audio encoder. The audio encoder was trained from scratch with causal attention for streaming. Both components use sliding window attention, enabling effectively unlimited streaming without fixed context windows. At the recommended 480ms delay, its performance matches leading offline open-source transcription models while significantly outperforming existing open-source real-time baselines.
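
The sliding-window mechanism is easy to see in a toy attention mask. The sketch below is illustrative only (the model's actual window sizes and encoder internals are not specified here); it shows how each timestep attends to itself and a fixed number of preceding timesteps, which is what keeps memory bounded for arbitrarily long streams:

    import torch

    def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
        """Boolean mask where position i may attend only to positions
        max(0, i - window + 1) .. i: causal attention plus a sliding window."""
        i = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
        j = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
        return (j <= i) & (j > i - window)

    # Each row shows which past positions one timestep can see.
    # Window size 3 here is arbitrary, chosen for readability.
    print(sliding_window_causal_mask(seq_len=6, window=3).int())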

Released in BF16 precision under the Apache 2.0 license, the model supports both research and commercial use. Its efficient design allows real-time operation on edge devices, ensuring privacy by keeping sensitive audio processing local.

Key Features

What makes Voxtral Mini 4B Realtime 2602 different from chunk-based approaches? It uses a native streaming architecture that transcribes audio as it arrives, with fully configurable latency.

Key capabilities include:

  • High-quality transcription with confidence scores
  • Native support for 13 languages
  • Real-time streaming ASR designed for live use cases
  • Configurable transcription delays from 240ms to 2.4s

The architecture enables users to trade latency for accuracy depending on the application. For example, shorter delays suit interactive voice agents, while longer delays improve accuracy for subtitling.

Benchmark Results

Voxtral Mini 4B Realtime 2602 demonstrates competitive performance against both offline and real-time models. Here are the detailed results.

FLEURS Benchmark (13 languages)

| Model | Delay | AVG | Arabic | German | English | Spanish | French | Hindi | Italian | Dutch | Portuguese | Chinese | Japanese | Korean | Russian |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 5.90% | 13.54% | 3.54% | 3.32% | 2.63% | 4.32% | 10.33% | 2.17% | 4.78% | 3.56% | 7.30% | 4.14% | 12.29% | 4.75% |
| Voxtral Mini 4B Realtime 2602 | 160 ms | 12.60% | 24.33% | 9.50% | 6.46% | 5.34% | 9.75% | 15.28% | 5.59% | 11.39% | 10.01% | 17.67% | 19.17% | 19.81% | 9.53% |
| Voxtral Mini 4B Realtime 2602 | 240 ms | 10.80% | 23.95% | 8.15% | 5.91% | 4.59% | 8.00% | 14.26% | 4.41% | 9.23% | 7.51% | 13.84% | 15.17% | 17.56% | 7.87% |
| Voxtral Mini 4B Realtime 2602 | 480 ms | 8.72% | 22.53% | 6.19% | 4.90% | 3.31% | 6.42% | 12.88% | 3.27% | 7.07% | 5.03% | 10.45% | 9.59% | 15.74% | 6.02% |
| Voxtral Mini 4B Realtime 2602 | 960 ms | 7.70% | 20.32% | 4.87% | 4.34% | 2.98% | 5.68% | 11.82% | 2.46% | 6.76% | 4.57% | 8.99% | 6.80% | 14.90% | 5.56% |
| Voxtral Mini 4B Realtime 2602 | 2400 ms | 6.73% | 14.71% | 4.15% | 4.05% | 2.71% | 5.23% | 10.73% | 2.37% | 5.91% | 3.93% | 8.48% | 5.50% | 14.30% | 5.41% |

At 480ms, the average WER is 8.72%, with English at 4.90% (vs. offline 3.32%) and Spanish at 3.31% (vs. offline 2.63%). Extending to 2.4s brings the average down to 6.73%.

[Figure: Word error rate (lower is better) across languages in the FLEURS transcription benchmark.]

Long-form English Benchmarks

| Model | Delay | Meanwhile (<10 min) | Earnings-21 (<10 min) | Earnings-22 (<10 min) | TEDLIUM (<20 min) |
|---|---|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 4.08% | 9.81% | 11.69% | 2.86% |
| Voxtral Mini 4B Realtime 2602 | 480 ms | 5.05% | 10.23% | 12.30% | 3.17% |

Short-form English Benchmarks

| Model | Delay | CHiME-4 | GigaSpeech 2k Subset | AMI IHM | SwitchBoard | CHiME-4 | SPGISpeech 2k Subset |
|---|---|---|---|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 10.39% | 6.81% | 14.43% | 11.54% | 10.42% | 1.74% |
| Voxtral Mini 4B Realtime 2602 | 480 ms | 10.50% | 7.35% | 15.05% | 11.65% | 12.41% | 1.73% |

The results show that at 480ms latency the model maintains practical accuracy for real-world use, with gaps to offline performance under one percentage point on most of these English benchmarks.

Use Cases

Have you wondered where low-latency real-time transcription really shines? Here are practical applications:

  • Private meeting transcriptions
  • Live subtitle creation for broadcasts and events
  • Real-time assistants that understand speech instantly
  • Voice agents requiring natural conversation flow

The 480ms sweet spot makes interactions feel responsive, while the multilingual capability supports global teams and international content.

Its 4B parameter size and edge-friendly design make it particularly valuable for privacy-sensitive deployments where audio never leaves the device.

Recommended Settings and Best Practices

For optimal results, follow these precise recommendations:

  • Set temperature to exactly 0.0 for deterministic output
  • One text token corresponds to approximately 80ms of audio. For a 1-hour meeting, set --max-model-len >= 45000; the default of 131072 tokens covers roughly 3 hours (see the sizing sketch after this list)
  • Use WebSockets for audio streaming sessions
  • Start with a 480ms transcription delay, set via "transcription_delay_ms": 480 in params.json

These settings balance throughput, memory usage, and latency effectively.
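
To make the token arithmetic concrete, here is a small Python sketch of the sizing rule above; the 10% headroom factor is our own safety margin, not an official recommendation:

    import math

    TOKEN_MS = 80  # ~80 ms of audio per text token (from the model card)

    def required_max_model_len(duration_s: float, headroom: float = 1.1) -> int:
        """Tokens needed for a session of duration_s seconds, plus ~10% headroom."""
        return math.ceil(duration_s * 1000 / TOKEN_MS * headroom)

    print(required_max_model_len(3600))      # 1-hour meeting -> 49500 tokens
    print(131072 * TOKEN_MS / 1000 / 3600)   # default budget -> ~2.9 hours of audio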

Step-by-Step: How to Deploy and Use Voxtral Mini 4B Realtime 2602

Installation

  1. Install the nightly vLLM package:

    uv pip install -U vllm \
        --torch-backend=auto \
        --extra-index-url https://wheels.vllm.ai/nightly
    
  2. Verify mistral_common version:

    python -c "import mistral_common; print(mistral_common.__version__)"
    
  3. Install audio processing dependencies:

    uv pip install soxr librosa soundfile
    

Docker images are also available for easier deployment.

Serving the Model

The model runs on a single GPU with at least 16GB VRAM. Launch command:

    VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
        --compilation_config '{"cudagraph_mode": "PIECEWISE"}'

Optional flags:

  • Adjust --max-num-batched-tokens to trade throughput for latency
  • Reduce --max-model-len when transcribing shorter sessions to lower memory usage

After startup, the realtime endpoint is available at /v1/realtime.
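
For a sense of what a client looks like, here is a minimal Python sketch that assumes an OpenAI-Realtime-style JSON event schema over the WebSocket; the exact message types, event names, and audio format depend on your vLLM version, so treat these as assumptions and consult vLLM's shipped examples:

    import asyncio
    import base64
    import json

    import websockets  # pip install websockets

    async def stream_pcm(path: str) -> None:
        # Hypothetical client sketch: the "input_audio_buffer.append" event and
        # raw-PCM audio format are assumptions modeled on the OpenAI Realtime
        # API, not confirmed details of vLLM's /v1/realtime implementation.
        async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
            with open(path, "rb") as f:  # assumes raw 16 kHz 16-bit mono PCM
                while chunk := f.read(3200):  # ~100 ms of audio per message
                    await ws.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(chunk).decode("ascii"),
                    }))
            async for event in ws:  # print transcription events as they arrive
                print(json.loads(event))

    asyncio.run(stream_pcm("meeting.pcm"))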

Usage Examples

vLLM provides ready examples for:

  • Streaming audio files
  • Gradio-based live microphone transcription

Try the live demo: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtime

WebSocket streaming ensures continuous low-latency audio processing.

License

Voxtral Mini 4B Realtime 2602 is released under the Apache 2.0 License. Usage must not infringe any third-party rights, including intellectual property.

Frequently Asked Questions (FAQ)

Which languages does Voxtral Mini 4B Realtime 2602 support?
It supports 13 languages: Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, Portuguese, Chinese, Japanese, Korean, and Russian. Performance is strong in Romance languages, with Spanish achieving 3.31% WER at 480ms.

How do I balance latency and accuracy?
Use the recommended 480ms delay (average WER 8.72%). A shorter 240ms delay increases the average WER to 10.80%, while 2.4s reduces it to 6.73%. Adjust "transcription_delay_ms" in params.json (see the sketch below).
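
A minimal sketch in Python, assuming params.json sits in your working directory; the "transcription_delay_ms" key comes from the model's documented settings, but the file's location and other contents depend on your deployment:

    import json

    # Assumption: params.json is the model's config file mentioned above.
    with open("params.json") as f:
        params = json.load(f)

    params["transcription_delay_ms"] = 480  # supported range: 240-2400 ms
    with open("params.json", "w") as f:
        json.dump(params, f, indent=2)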

What hardware is required?
A single GPU with ≥16GB VRAM delivers real-time performance: the model generates over 12.5 tokens/second, and since each token covers roughly 80ms of audio, 12.5 tokens/second corresponds exactly to real time. The default max-model-len of 131072 supports approximately 3 hours of audio.

How should I handle long audio recordings?
Calculate tokens needed using 1 token ≈ 80ms. Set --max-model-len accordingly. The default handles ~3-hour sessions reliably.

Why set temperature to 0.0?
It ensures consistent, deterministic transcription results without randomness.

How does the realtime endpoint work?
vLLM’s Realtime API uses WebSockets to handle continuous audio streaming for production-grade sessions.

How does real-time performance compare to offline models?
At 480ms, the FLEURS average gap is 2.82 percentage points (8.72% vs. 5.90%). On many English tests the difference is under one percentage point, making it viable for interactive applications.

Does it support speaker diarization?
The current real-time version focuses on transcription. Batch models in the Voxtral family provide speaker labels and word-level timestamps.

What deployment best practices should I follow?
Always use WebSockets for streaming, install nightly vLLM for compatibility, and consider Docker for platform consistency.

This guide covers the essential details to get you started with Voxtral Mini 4B Realtime 2602 for production real-time transcription.