WhisperLiveKit: Real-Time, On-Device Speech-to-Text with Speaker Diarization
“Can I transcribe meetings in real time without uploading any audio or paying a cloud bill?”
WhisperLiveKit answers: yes—just one command and your browser.
1. What Exactly Is WhisperLiveKit?
WhisperLiveKit is a small open-source package that bundles:
- A ready-to-run backend that listens to your microphone stream and returns text.
- A web page that you open in any browser to see the words appear as you speak.
- Everything stays on your computer: no audio ever leaves your machine.
Core capabilities (all included)
| Capability | What it does | Typical use |
| --- | --- | --- |
| Real-time transcription | Converts speech to text while you talk | Meeting notes, lecture captions |
| Speaker diarization | Labels who is speaking | Interview minutes, customer-service logs |
| Voice activity detection (VAD) | Ignores silence to save CPU | Long recordings |
| Ultra-low-latency option (SimulStreaming) | Cuts output delay using the AlignAtt policy from 2025 research | Live-stream subtitles |
| Multi-user support | One server can serve many browsers | Small-office deployment |
2. Ten-Minute Quick Start
2.1 Install the system dependency: FFmpeg
FFmpeg converts raw microphone data into the format the model expects.
| OS | How to install |
| --- | --- |
| Ubuntu / Debian | sudo apt install ffmpeg |
| macOS | brew install ffmpeg |
| Windows | Download the binary from ffmpeg.org and add it to your PATH |
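To confirm FFmpeg is installed and visible on your PATH, run:

```bash
ffmpeg -version
```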
2.2 Install the Python package
```bash
pip install whisperlivekit
```

Need speaker labels too?

```bash
pip install whisperlivekit[diarization]
```
2.3 Start the server
```bash
whisperlivekit-server --model tiny.en
```

You will see a line like Uvicorn running on http://localhost:8000.
2.4 Open your browser
Visit http://localhost:8000, allow microphone access, and start talking: text appears in real time.
First run? The tiny.en model (~75 MB) downloads automatically.
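Prefer to sanity-check from a terminal before opening the browser? A plain GET against the same address should return the bundled web page (assuming the default host and port):

```bash
curl http://localhost:8000
```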
3. Going Deeper: Configuration Cheat Sheet
3.1 Frequently used command-line flags
| Flag | Default | Purpose |
| --- | --- | --- |
| --model | tiny | Larger models are slower but more accurate: tiny < base < small < medium < large |
| --language | en | Set auto for automatic language detection |
| --diarization | False | Requires Hugging Face model access (see below) |
| --backend | faster-whisper | Swap to simulstreaming for ultra-low latency |
| --host / --port | localhost / 8000 | Change the host to 0.0.0.0 if you want LAN access |
Run whisperlivekit-server --help for the full list.
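For illustration, the flags above can be combined freely; the command below serves a multilingual model with automatic language detection to other machines on the LAN:

```bash
whisperlivekit-server --model small --language auto --host 0.0.0.0 --port 8000
```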
3.2 Ultra-low-latency demos with SimulStreaming
```bash
whisperlivekit-server \
  --backend simulstreaming \
  --model large-v3 \
  --frame-threshold 20
```
- A lower frame-threshold reacts faster but is slightly less accurate.
- Note: .en monolingual models do not work with SimulStreaming.
4. Speaker Diarization in Three Steps
The “who said what” feature uses pre-trained models hosted on Hugging Face that require you to accept their license.
1. Log in to Hugging Face and accept the license for each model:
   - pyannote/segmentation
   - pyannote/segmentation-3.0
   - pyannote/embedding
2. Log in from the terminal:
   ```bash
   huggingface-cli login
   ```
3. Launch the server:
   ```bash
   whisperlivekit-server --model medium --diarization
   ```
When it works, every JSON message includes a speaker field:

```json
{"text": "Let's meet at three tomorrow.", "speaker": "A"}
```
5. Embedding WhisperLiveKit into Your Own Python Project
The repo contains a minimal example called basic_server.py. The idea is simple:
- Create one global TranscriptionEngine (it is heavy; create it once).
- For every new WebSocket connection, create an AudioProcessor that feeds the engine.
- Stream the results back to the browser.
Minimal working snippet:

```python
from whisperlivekit import TranscriptionEngine, AudioProcessor
from fastapi import FastAPI, WebSocket

app = FastAPI()

# Heavy object: load the model once and share it across all connections.
engine = TranscriptionEngine(model="medium", diarization=True)

@app.websocket("/asr")
async def asr(websocket: WebSocket):
    await websocket.accept()
    # One lightweight processor per client, all feeding the shared engine.
    audio_proc = AudioProcessor(transcription_engine=engine)
    async for result in audio_proc.start():
        await websocket.send_json(result)
```
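Assuming the snippet is saved as main.py (the file name is only a convention), it can be served for local testing with Uvicorn:

```bash
pip install uvicorn
uvicorn main:app --host localhost --port 8000
```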
6. Frontend: Ready-Made HTML
The package ships with live_transcription.html. It already handles:
- Microphone permission prompt
- Auto-reconnect on network hiccups
- Real-time subtitle styling
- Colored speaker labels when diarization is on
You can also import the HTML directly in Python:
```python
from whisperlivekit import get_web_interface_html

html = get_web_interface_html()
```
7. Production Deployment: From Laptop to Server
7.1 Run with Gunicorn for multiple workers
```bash
pip install uvicorn gunicorn
gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app
```
The -w 4 flag spins up four workers, good for a four-core machine.
7.2 Reverse proxy with Nginx
```nginx
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        # WebSocket upgrades require HTTP/1.1 plus the Upgrade/Connection headers.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```
For HTTPS, add your certificate or use the built-in flags --ssl-certfile and --ssl-keyfile.
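As a sketch of the second option (the certificate paths below are placeholders), the server can terminate TLS itself:

```bash
whisperlivekit-server --model base \
  --ssl-certfile /etc/ssl/certs/your-domain.pem \
  --ssl-keyfile /etc/ssl/private/your-domain.key
```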
8. Run in Docker with One Command
The included Dockerfile already has FFmpeg and common extras baked in.
```bash
# Build the image
docker build -t whisperlivekit .

# Run with an NVIDIA GPU
docker run --gpus all -p 8000:8000 whisperlivekit --model base

# CPU-only machines: omit --gpus all
```
Build arguments
| Argument | Example | Purpose |
| --- | --- | --- |
| EXTRAS | whisper-timestamped | Add extra Python dependencies |
| HF_PRECACHE_DIR | ./.cache/ | Cache models at build time for faster start-ups |
| HF_TKN_FILE | ./token | Bake your Hugging Face token into the image |
9. Frequently Asked Questions
Q1: Does it need an internet connection?
No. Once the models are downloaded, everything runs locally; the network is only needed again if you explicitly fetch new models.
Q2: Which languages are supported?
Whisper supports 99 languages. Remove .en from the model name and set --language auto for automatic detection.
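For example, swapping base.en for its multilingual counterpart:

```bash
whisperlivekit-server --model base --language auto
```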
Q3: How low is the latency?
- faster-whisper + tiny on a desktop: ~300–500 ms
- simulstreaming + large-v3: under 200 ms

Latency depends on model size, CPU/GPU, and chunk size (--min-chunk-size).
Q4: No speech is detected—what now?
Check microphone permissions, or disable VAD with --no-vad for testing. You may also lower --min-chunk-size to 0.5 s.
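A quick troubleshooting run that combines both suggestions might look like this (the model choice is arbitrary):

```bash
whisperlivekit-server --model tiny.en --no-vad --min-chunk-size 0.5
```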
Q5: Can I install it completely offline?
Yes. Run it once on a connected machine, then copy the folders ~/.cache/huggingface and ~/.cache/whisper to the offline machine.
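One possible way to do the copy over the local network (offline-box is a placeholder hostname):

```bash
scp -r ~/.cache/huggingface ~/.cache/whisper offline-box:~/.cache/
```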
Q6: Is it free for commercial use?
The code is MIT-licensed. The SimulStreaming backend is dual-licensed: open-source projects remain free, while closed-source products need a commercial license.
10. Use-Case Inspiration
| Scenario | Recommended flags | Extra tips |
| --- | --- | --- |
| Small meeting room | --model medium --diarization | Use a 360° microphone for better separation |
| Live-stream captions | --backend simulstreaming --frame-threshold 15 | Subtitles lag by only ~200 ms, almost imperceptible |
| Customer-service QA | --language zh --diarization | Pipe the speaker field into your CRM |
| Voice-note app | Embed the HTML in a Flutter WebView | Bridge the native microphone to the WebSocket |
11. Takeaway: Why It’s Worth a Try
- Zero friction: one command and any browser works.
- Fully on-device: your audio never leaves your network.
- Extensible: Python API, Docker, Nginx; mix and match.
- State-of-the-art: 2025 research (SimulStreaming, Streaming Sortformer) ready to use today.
If you need a “hears, labels, and reacts fast” speech-to-text stack that runs entirely on your own hardware, WhisperLiveKit has already built the bridge—your only remaining task is to speak.