WhisperLiveKit: Real-Time, On-Device Speech-to-Text with Speaker Diarization

“Can I transcribe meetings in real time without uploading any audio or paying a cloud bill?”
WhisperLiveKit answers: yes—just one command and your browser.


1. What Exactly Is WhisperLiveKit?

WhisperLiveKit is a small open-source package that bundles:

  • A ready-to-run backend that listens to your microphone stream and returns text.
  • A web page that you open in any browser to see the words appear as you speak.
  • A local-only design: everything stays on your computer, and no audio ever leaves your machine.

Core capabilities (all included)

| Capability | What it does | Typical use |
| --- | --- | --- |
| Real-time transcription | Converts speech to text while you talk | Meeting notes, lecture captions |
| Speaker diarization | Labels who is speaking | Interview minutes, customer-service logs |
| Voice activity detection (VAD) | Ignores silence to save CPU | Long recordings |
| Ultra-low-latency option (SimulStreaming) | 2025 research based on the AlignAtt policy | Live-stream subtitles |
| Multi-user support | One server can serve many browsers | Small-office deployment |

2. Ten-Minute Quick Start

2.1 Install the system dependency: FFmpeg

FFmpeg converts raw microphone data into the format the model expects.

| OS | How to install |
| --- | --- |
| Ubuntu / Debian | sudo apt install ffmpeg |
| macOS | brew install ffmpeg |
| Windows | Download the binary from ffmpeg.org and add it to your PATH |
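
If you want to confirm that FFmpeg is actually visible before starting the server, a quick Python check works on any OS. This is just a convenience sketch, not part of WhisperLiveKit itself:

import shutil
import subprocess

# Look up the ffmpeg executable on the current PATH.
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise SystemExit("FFmpeg was not found on PATH; install it before starting the server.")

# Print FFmpeg's version banner as a final sanity check.
subprocess.run([ffmpeg_path, "-version"], check=True)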

2.2 Install the Python package

pip install whisperlivekit

Need speaker labels too?

pip install whisperlivekit[diarization]

2.3 Start the server

whisperlivekit-server --model tiny.en

You will see a line like Uvicorn running on http://localhost:8000.

2.4 Open your browser

Visit http://localhost:8000, allow microphone access, and start talking—text appears in real time.

First run? The tiny.en model (~75 MB) downloads automatically.


3. Going Deeper: Configuration Cheat Sheet

3.1 Frequently used command-line flags

| Flag | Default | Purpose |
| --- | --- | --- |
| --model | tiny | Larger models are slower but more accurate: tiny < base < small < medium < large |
| --language | en | Set auto for automatic language detection |
| --diarization | False | Requires Hugging Face model access (see below) |
| --backend | faster-whisper | Swap to simulstreaming for ultra-low latency |
| --host / --port | localhost / 8000 | Change the host to 0.0.0.0 if you want LAN access |

Run whisperlivekit-server --help for the full list.
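
If you prefer to keep your launch options in a script rather than your shell history, the server can also be started from Python. A minimal sketch built only from the flags listed above; adjust the values to whatever combination you actually use:

import subprocess

# Flags taken from the cheat sheet above; adjust to the combination you need.
cmd = [
    "whisperlivekit-server",
    "--model", "base",
    "--language", "auto",
    "--host", "0.0.0.0",   # serve on the LAN instead of localhost only
    "--port", "8000",
]

# Blocks until you stop the server with Ctrl+C.
subprocess.run(cmd, check=True)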

3.2 Ultra-low-latency demos with SimulStreaming

whisperlivekit-server \
  --backend simulstreaming \
  --model large-v3 \
  --frame-threshold 20

  • A lower --frame-threshold reacts faster at a small cost in accuracy.
  • Note: .en monolingual models do not work with SimulStreaming.

4. Speaker Diarization in Three Steps

The “who said what” feature uses pre-trained models hosted on Hugging Face that require you to accept their license.

  1. Log in to Hugging Face and accept the license for each model:

    • pyannote/segmentation
    • pyannote/segmentation-3.0
    • pyannote/embedding
  2. Log in from the terminal

    huggingface-cli login
    
  3. Launch the server

    whisperlivekit-server --model medium --diarization
    

When it works, every JSON message includes a speaker field:

{"text": "Let's meet at three tomorrow.", "speaker": "A"}

5. Embedding WhisperLiveKit into Your Own Python Project

The repo contains a minimal example called basic_server.py. The idea is simple:

  1. Create one global TranscriptionEngine (it is heavy; create it once).
  2. For every new WebSocket connection, create an AudioProcessor that feeds the engine.
  3. Stream the results back to the browser.

Minimal working snippet:

from fastapi import FastAPI, WebSocket
from whisperlivekit import TranscriptionEngine, AudioProcessor

app = FastAPI()

# Heavy object: load the model once and share it across all connections.
engine = TranscriptionEngine(model="medium", diarization=True)

@app.websocket("/asr")
async def asr(websocket: WebSocket):
    await websocket.accept()
    # One lightweight processor per client, all feeding the shared engine.
    audio_proc = AudioProcessor(transcription_engine=engine)
    async for result in audio_proc.start():
        await websocket.send_json(result)
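
The snippet above focuses on the result path; see the repo's basic_server.py for how the incoming audio is fed to the processor. To try it during development, you can launch the app with Uvicorn. A minimal sketch, assuming the file is saved as main.py:

import uvicorn

if __name__ == "__main__":
    # Development server only; see section 7 for a production setup.
    uvicorn.run("main:app", host="localhost", port=8000)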

6. Frontend: Ready-Made HTML

The package ships with live_transcription.html. It already handles:

  • Microphone permission prompt
  • Auto-reconnect on network hiccups
  • Real-time subtitle styling
  • Colored speaker labels when diarization is on

You can also import the HTML directly in Python:

from whisperlivekit import get_web_interface_html
html = get_web_interface_html()
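
When you embed the engine yourself (section 5), you can hand this HTML back from an ordinary HTTP route so the page and the /asr WebSocket live on the same server. A minimal, self-contained sketch:

from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from whisperlivekit import get_web_interface_html

app = FastAPI()  # in your own project, reuse the app from section 5

@app.get("/")
async def index():
    # Serve the bundled live-transcription page at the root URL.
    return HTMLResponse(get_web_interface_html())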

7. Production Deployment: From Laptop to Server

7.1 Run with Gunicorn for multiple workers

pip install uvicorn gunicorn
gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app

-w 4 spins up four workers—good for a four-core machine.
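
Gunicorn can also read these settings from a gunicorn.conf.py file, which is plain Python. A small sketch equivalent to the command above:

# gunicorn.conf.py - read automatically when gunicorn starts from this directory
worker_class = "uvicorn.workers.UvicornWorker"
workers = 4                  # roughly one worker per CPU core
bind = "127.0.0.1:8000"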

7.2 Reverse proxy with Nginx

server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

For HTTPS, add your certificate or use the built-in flags --ssl-certfile and --ssl-keyfile.


8. Run in Docker with One Command

The included Dockerfile already has FFmpeg and common extras baked in.

# Build the image
docker build -t whisperlivekit .

# Run with NVIDIA GPU
docker run --gpus all -p 8000:8000 whisperlivekit --model base

# CPU-only machines: omit --gpus all

Build arguments

| Argument | Example | Purpose |
| --- | --- | --- |
| EXTRAS | whisper-timestamped | Add extra Python dependencies |
| HF_PRECACHE_DIR | ./.cache/ | Cache models at build time for faster start-ups |
| HF_TKN_FILE | ./token | Bake your Hugging Face token into the image |

9. Frequently Asked Questions

Q1: Does it need an internet connection?
No. Once the models are downloaded, everything runs locally; you only need a connection again when you fetch a new model.

Q2: Which languages are supported?
Whisper supports 99 languages. Remove .en from the model name and set --language auto for automatic detection.

Q3: How low is the latency?

  • faster-whisper + tiny on a desktop: ~300–500 ms
  • simulstreaming + large-v3: under 200 ms

Latency depends on model size, CPU/GPU, and chunk size (--min-chunk-size).

Q4: No speech is detected—what now?
Check microphone permissions, or disable VAD with --no-vad for testing.
You may also lower --min-chunk-size to 0.5 s.

Q5: Can I install it completely offline?
Yes. Run once on a connected machine, then copy the folders ~/.cache/huggingface and ~/.cache/whisper to the offline machine.
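
A throwaway Python helper can stage those cache folders onto removable media. The destination below is only a placeholder for wherever your drive mounts:

import shutil
from pathlib import Path

home = Path.home()
dest = Path("/media/usb/whisperlivekit-cache")  # placeholder destination

for folder in (".cache/huggingface", ".cache/whisper"):
    src = home / folder
    if src.exists():
        # dirs_exist_ok lets the copy be re-run to refresh the stash.
        shutil.copytree(src, dest / folder, dirs_exist_ok=True)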

Q6: Is it free for commercial use?
The code is MIT-licensed. The SimulStreaming backend is dual-licensed; open-source projects remain free, closed-source products need a commercial license.


10. Use-Case Inspiration

| Scenario | Recommended flags | Extra tips |
| --- | --- | --- |
| Small meeting room | --model medium --diarization | Use a 360° microphone for better separation |
| Live-stream captions | --backend simulstreaming --frame-threshold 15 | Subtitles lag the audio by only ~200 ms, which is almost imperceptible |
| Customer-service QA | --language zh --diarization | Pipe the speaker field into your CRM |
| Voice-note app | Embed the HTML in a Flutter WebView | Bridge the native microphone to the WebSocket |

11. Takeaway: Why It’s Worth a Try

  • Zero friction: one command and any browser works.
  • Fully on-device: your audio never leaves your network.
  • Extensible: Python API, Docker, Nginx—mix and match.
  • State-of-the-art: 2025 research (SimulStreaming, Streaming Sortformer) ready to use today.

If you need a “hears, labels, and reacts fast” speech-to-text stack that runs entirely on your own hardware, WhisperLiveKit has already built the bridge—your only remaining task is to speak.
