Voxtral Mini 4B Realtime 2602: Low-Latency Open-Source Real-Time Speech Transcription Model

Voxtral Mini 4B Realtime 2602 is a multilingual real-time speech-to-text model that achieves an average word error rate (WER) of 8.72% on the FLEURS benchmark at 480ms latency across 13 languages, approaching the 5.90% WER of its offline counterpart. The 4B-parameter model uses a native streaming architecture with a causal audio encoder and sliding window attention, supporting configurable delays from 240ms to 2.4s. It runs at over 12.5 tokens/second on a single GPU with ≥16GB VRAM, making it suitable for voice assistants, live subtitling, and on-device deployment under the Apache 2.0 license.

Introduction to Voxtral Mini 4B Realtime 2602

Real-time speech transcription has become essential for voice assistants, live subtitling, and multilingual meetings. Voxtral Mini 4B Realtime 2602 stands out as an open-source solution that delivers accuracy comparable to offline systems while maintaining latency under 500ms. It supports 13 languages: Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, Portuguese, Chinese, Japanese, Korean, and Russian.

The model totals approximately 4 billion parameters, consisting of a 3.4B language model and a 0.6B audio encoder. The audio encoder was trained from scratch with causal attention for streaming. Both components use sliding window attention, enabling effectively unlimited streaming without fixed context windows. At the recommended 480ms delay, its performance matches leading offline open-source transcription models while significantly outperforming existing open-source real-time baselines.
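
The sliding-window mechanism is easy to see in a toy attention mask. The sketch below is illustrative only (the model's actual window sizes and encoder internals are not specified here); it shows how each timestep attends to itself and a fixed number of preceding timesteps, which is what keeps memory bounded for arbitrarily long streams:

    import torch

    def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
        """Boolean mask where position i may attend only to positions
        max(0, i - window + 1) .. i: causal attention plus a sliding window."""
        i = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
        j = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
        return (j <= i) & (j > i - window)

    # Each row shows which past positions one timestep can see.
    # Window size 3 here is arbitrary, chosen for readability.
    print(sliding_window_causal_mask(seq_len=6, window=3).int())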

Released in BF16 precision under the Apache 2.0 license, the model supports both research and commercial use. Its efficient design allows real-time operation on edge devices, ensuring privacy by keeping sensitive audio processing local.

Key Features

What makes Voxtral Mini 4B Realtime 2602 different from chunk-based approaches? It uses a native streaming architecture that transcribes audio as it arrives, with fully configurable latency.

Key capabilities include:

  • High-quality transcription with confidence scores
  • Native support for 13 languages
  • Real-time streaming ASR designed for live use cases
  • Configurable transcription delays from 240ms to 2.4s

The architecture enables users to trade latency for accuracy depending on the application. For example, shorter delays suit interactive voice agents, while longer delays improve accuracy for subtitling.

Benchmark Results

Voxtral Mini 4B Realtime 2602 demonstrates competitive performance against both offline and real-time models. Here are the detailed results.

FLEURS Benchmark (13 languages)

| Model | Delay | AVG | Arabic | German | English | Spanish | French | Hindi | Italian | Dutch | Portuguese | Chinese | Japanese | Korean | Russian |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 5.90% | 13.54% | 3.54% | 3.32% | 2.63% | 4.32% | 10.33% | 2.17% | 4.78% | 3.56% | 7.30% | 4.14% | 12.29% | 4.75% |
| Voxtral Mini 4B Realtime 2602 | 160 ms | 12.60% | 24.33% | 9.50% | 6.46% | 5.34% | 9.75% | 15.28% | 5.59% | 11.39% | 10.01% | 17.67% | 19.17% | 19.81% | 9.53% |
| Voxtral Mini 4B Realtime 2602 | 240 ms | 10.80% | 23.95% | 8.15% | 5.91% | 4.59% | 8.00% | 14.26% | 4.41% | 9.23% | 7.51% | 13.84% | 15.17% | 17.56% | 7.87% |
| Voxtral Mini 4B Realtime 2602 | 480 ms | 8.72% | 22.53% | 6.19% | 4.90% | 3.31% | 6.42% | 12.88% | 3.27% | 7.07% | 5.03% | 10.45% | 9.59% | 15.74% | 6.02% |
| Voxtral Mini 4B Realtime 2602 | 960 ms | 7.70% | 20.32% | 4.87% | 4.34% | 2.98% | 5.68% | 11.82% | 2.46% | 6.76% | 4.57% | 8.99% | 6.80% | 14.90% | 5.56% |
| Voxtral Mini 4B Realtime 2602 | 2400 ms | 6.73% | 14.71% | 4.15% | 4.05% | 2.71% | 5.23% | 10.73% | 2.37% | 5.91% | 3.93% | 8.48% | 5.50% | 14.30% | 5.41% |

At 480ms, the average WER is 8.72%, with English at 4.90% (vs. offline 3.32%) and Spanish at 3.31% (vs. offline 2.63%). Extending to 2.4s brings the average down to 6.73%.

[Figure: Word error rate (lower is better) across languages in the FLEURS transcription benchmark.]

Long-form English Benchmarks

| Model | Delay | Meanwhile (<10 min) | Earnings-21 (<10 min) | Earnings-22 (<10 min) | TEDLIUM (<20 min) |
|---|---|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 4.08% | 9.81% | 11.69% | 2.86% |
| Voxtral Mini 4B Realtime 2602 | 480 ms | 5.05% | 10.23% | 12.30% | 3.17% |

Short-form English Benchmarks

| Model | Delay | CHiME-4 | GigaSpeech 2k Subset | AMI IHM | SwitchBoard | CHiME-4 | SPGISpeech 2k Subset |
|---|---|---|---|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 10.39% | 6.81% | 14.43% | 11.54% | 10.42% | 1.74% |
| Voxtral Mini 4B Realtime 2602 | 480 ms | 10.50% | 7.35% | 15.05% | 11.65% | 12.41% | 1.73% |

The results show that at 480ms latency the model maintains practical accuracy for real-world use, with gaps to offline performance under one percentage point on most of these English benchmarks.

Use Cases

Have you wondered where low-latency real-time transcription really shines? Here are practical applications:

  • Private meeting transcriptions
  • Live subtitle creation for broadcasts and events
  • Real-time assistants that understand speech instantly
  • Voice agents requiring natural conversation flow

The 480ms sweet spot makes interactions feel responsive, while the multilingual capability supports global teams and international content.

Its 4B parameter size and edge-friendly design make it particularly valuable for privacy-sensitive deployments where audio never leaves the device.

Recommended Settings and Best Practices

For optimal results, follow these precise recommendations:

  • Set temperature to exactly 0.0 for deterministic output
  • One text token corresponds to approximately 80ms of audio. For a 1-hour meeting, set --max-model-len >= 45000; the default of 131072 tokens covers roughly 3 hours (see the sizing sketch after this list)
  • Use WebSockets for audio streaming sessions
  • Start with a 480ms transcription delay, set via "transcription_delay_ms": 480 in params.json

These settings balance throughput, memory usage, and latency effectively.
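
To make the token arithmetic concrete, here is a small Python sketch of the sizing rule above; the 10% headroom factor is our own safety margin, not an official recommendation:

    import math

    TOKEN_MS = 80  # ~80 ms of audio per text token (from the model card)

    def required_max_model_len(duration_s: float, headroom: float = 1.1) -> int:
        """Tokens needed for a session of duration_s seconds, plus ~10% headroom."""
        return math.ceil(duration_s * 1000 / TOKEN_MS * headroom)

    print(required_max_model_len(3600))      # 1-hour meeting -> 49500 tokens
    print(131072 * TOKEN_MS / 1000 / 3600)   # default budget -> ~2.9 hours of audio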

Step-by-Step: How to Deploy and Use Voxtral Mini 4B Realtime 2602

Installation

  1. Install the nightly vLLM package:

    uv pip install -U vllm \
        --torch-backend=auto \
        --extra-index-url https://wheels.vllm.ai/nightly
    
  2. Verify mistral_common version:

    python -c "import mistral_common; print(mistral_common.__version__)"
    
  3. Install audio processing dependencies:

    uv pip install soxr librosa soundfile
    

Docker images are also available for easier deployment.

Serving the Model

The model runs on a single GPU with at least 16GB VRAM. Launch command:

    VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
        --compilation_config '{"cudagraph_mode": "PIECEWISE"}'

Optional flags:

  • Adjust --max-num-batched-tokens to trade throughput for latency
  • Reduce --max-model-len when transcribing shorter sessions to lower memory usage

After startup, the realtime endpoint is available at /v1/realtime.
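
For a sense of what a client looks like, here is a minimal Python sketch that assumes an OpenAI-Realtime-style JSON event schema over the WebSocket; the exact message types, event names, and audio format depend on your vLLM version, so treat these as assumptions and consult vLLM's shipped examples:

    import asyncio
    import base64
    import json

    import websockets  # pip install websockets

    async def stream_pcm(path: str) -> None:
        # Hypothetical client sketch: the "input_audio_buffer.append" event and
        # raw-PCM audio format are assumptions modeled on the OpenAI Realtime
        # API, not confirmed details of vLLM's /v1/realtime implementation.
        async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
            with open(path, "rb") as f:  # assumes raw 16 kHz 16-bit mono PCM
                while chunk := f.read(3200):  # ~100 ms of audio per message
                    await ws.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(chunk).decode("ascii"),
                    }))
            async for event in ws:  # print transcription events as they arrive
                print(json.loads(event))

    asyncio.run(stream_pcm("meeting.pcm"))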

Usage Examples

vLLM provides ready examples for:

  • Streaming audio files
  • Gradio-based live microphone transcription

Try the live demo: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtime

WebSocket streaming ensures continuous low-latency audio processing.

License

Voxtral Mini 4B Realtime 2602 is released under the Apache 2.0 License. Usage must not infringe any third-party rights, including intellectual property.

Frequently Asked Questions (FAQ)

Which languages does Voxtral Mini 4B Realtime 2602 support?
It supports 13 languages: Arabic, German, English, Spanish, French, Hindi, Italian, Dutch, Portuguese, Chinese, Japanese, Korean, and Russian. Performance is strong in Romance languages, with Spanish achieving 3.31% WER at 480ms.

How do I balance latency and accuracy?
Use the recommended 480ms delay (average WER 8.72%). A shorter 240ms delay increases the average WER to 10.80%, while 2.4s reduces it to 6.73%. Adjust "transcription_delay_ms" in params.json (see the sketch below).
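
A minimal sketch in Python, assuming params.json sits in your working directory; the "transcription_delay_ms" key comes from the model's documented settings, but the file's location and other contents depend on your deployment:

    import json

    # Assumption: params.json is the model's config file mentioned above.
    with open("params.json") as f:
        params = json.load(f)

    params["transcription_delay_ms"] = 480  # supported range: 240-2400 ms
    with open("params.json", "w") as f:
        json.dump(params, f, indent=2)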

What hardware is required?
A single GPU with ≥16GB VRAM delivers real-time performance: the model generates over 12.5 tokens/second, and since each token covers roughly 80ms of audio, 12.5 tokens/second corresponds exactly to real time. The default max-model-len of 131072 supports approximately 3 hours of audio.

How should I handle long audio recordings?
Calculate tokens needed using 1 token ≈ 80ms. Set --max-model-len accordingly. The default handles ~3-hour sessions reliably.

Why set temperature to 0.0?
It ensures consistent, deterministic transcription results without randomness.

How does the realtime endpoint work?
vLLM’s Realtime API uses WebSockets to handle continuous audio streaming for production-grade sessions.

How does real-time performance compare to offline models?
At 480ms, the FLEURS average gap is 2.82 percentage points (8.72% vs. 5.90%). On many English tests the difference is under one percentage point, making it viable for interactive applications.

Does it support speaker diarization?
The current real-time version focuses on transcription. Batch models in the Voxtral family provide speaker labels and word-level timestamps.

What deployment best practices should I follow?
Always use WebSockets for streaming, install nightly vLLM for compatibility, and consider Docker for platform consistency.

This guide covers the essential details to get you started with Voxtral Mini 4B Realtime 2602 for production real-time transcription.