WhisperLiveKit: Real-Time, On-Device Speech-to-Text with Speaker Diarization

“Can I transcribe meetings in real time without uploading any audio or paying a cloud bill?”
WhisperLiveKit answers: yes—just one command and your browser.


1. What Exactly Is WhisperLiveKit?

WhisperLiveKit is a small open-source package that bundles:

  • A ready-to-run backend that listens to your microphone stream and returns text.
  • A web page that you open in any browser to see the words appear as you speak.
  • A local-only design: everything stays on your computer, and no audio ever leaves your machine.

Core capabilities (all included)

| Capability | What it does | Typical use |
| --- | --- | --- |
| Real-time transcription | Converts speech to text while you talk | Meeting notes, lecture captions |
| Speaker diarization | Labels who is speaking | Interview minutes, customer-service logs |
| Voice activity detection (VAD) | Ignores silence to save CPU | Long recordings |
| Ultra-low-latency option (SimulStreaming) | 2025 research based on the AlignAtt policy | Live-stream subtitles |
| Multi-user support | One server can serve many browsers | Small-office deployment |

2. Ten-Minute Quick Start

2.1 Install the system dependency: FFmpeg

FFmpeg converts raw microphone data into the format the model expects.

| OS | How to install |
| --- | --- |
| Ubuntu / Debian | sudo apt install ffmpeg |
| macOS | brew install ffmpeg |
| Windows | Download the binary from ffmpeg.org and add it to your PATH |
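
If you want to confirm that FFmpeg is actually visible before starting the server, a quick Python check works on any OS. This is just a convenience sketch, not part of WhisperLiveKit itself:

import shutil
import subprocess

# Look up the ffmpeg executable on the current PATH.
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise SystemExit("FFmpeg was not found on PATH; install it before starting the server.")

# Print FFmpeg's version banner as a final sanity check.
subprocess.run([ffmpeg_path, "-version"], check=True)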

2.2 Install the Python package

pip install whisperlivekit

Need speaker labels too?

pip install whisperlivekit[diarization]

2.3 Start the server

whisperlivekit-server --model tiny.en

You will see a line like Uvicorn running on http://localhost:8000.

2.4 Open your browser

Visit http://localhost:8000, allow microphone access, and start talking—text appears in real time.

First run? The tiny.en model (~75 MB) downloads automatically.


3. Going Deeper: Configuration Cheat Sheet

3.1 Frequently used command-line flags

| Flag | Default | Purpose |
| --- | --- | --- |
| --model | tiny | Larger models are slower but more accurate: tiny < base < small < medium < large |
| --language | en | Set auto for automatic language detection |
| --diarization | False | Requires Hugging Face model access (see below) |
| --backend | faster-whisper | Swap to simulstreaming for ultra-low latency |
| --host / --port | localhost / 8000 | Change the host to 0.0.0.0 if you want LAN access |

Run whisperlivekit-server --help for the full list.
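
If you prefer to keep your launch options in a script rather than your shell history, the server can also be started from Python. A minimal sketch built only from the flags listed above; adjust the values to whatever combination you actually use:

import subprocess

# Flags taken from the cheat sheet above; adjust to the combination you need.
cmd = [
    "whisperlivekit-server",
    "--model", "base",
    "--language", "auto",
    "--host", "0.0.0.0",   # serve on the LAN instead of localhost only
    "--port", "8000",
]

# Blocks until you stop the server with Ctrl+C.
subprocess.run(cmd, check=True)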

3.2 Ultra-low-latency demos with SimulStreaming

whisperlivekit-server \
  --backend simulstreaming \
  --model large-v3 \
  --frame-threshold 20

  • A lower --frame-threshold reacts faster at a small cost in accuracy.
  • Note: .en monolingual models do not work with SimulStreaming.

4. Speaker Diarization in Three Steps

The “who said what” feature uses pre-trained models hosted on Hugging Face that require you to accept their license.

  1. Log in to Hugging Face and accept the license for each model:

    • pyannote/segmentation
    • pyannote/segmentation-3.0
    • pyannote/embedding
  2. Log in from the terminal

    huggingface-cli login
    
  3. Launch the server

    whisperlivekit-server --model medium --diarization
    

When it works, every JSON message includes a speaker field:

{"text": "Let's meet at three tomorrow.", "speaker": "A"}

5. Embedding WhisperLiveKit into Your Own Python Project

The repo contains a minimal example called basic_server.py. The idea is simple:

  1. Create one global TranscriptionEngine (it is heavy; create it once).
  2. For every new WebSocket connection, create an AudioProcessor that feeds the engine.
  3. Stream the results back to the browser.

Minimal working snippet:

from fastapi import FastAPI, WebSocket
from whisperlivekit import TranscriptionEngine, AudioProcessor

app = FastAPI()

# Heavy object: load the model once and share it across all connections.
engine = TranscriptionEngine(model="medium", diarization=True)

@app.websocket("/asr")
async def asr(websocket: WebSocket):
    await websocket.accept()
    # One lightweight processor per client, all feeding the shared engine.
    audio_proc = AudioProcessor(transcription_engine=engine)
    async for result in audio_proc.start():
        await websocket.send_json(result)
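
The snippet above focuses on the result path; see the repo's basic_server.py for how the incoming audio is fed to the processor. To try it during development, you can launch the app with Uvicorn. A minimal sketch, assuming the file is saved as main.py:

import uvicorn

if __name__ == "__main__":
    # Development server only; see section 7 for a production setup.
    uvicorn.run("main:app", host="localhost", port=8000)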

6. Frontend: Ready-Made HTML

The package ships with live_transcription.html. It already handles:

  • Microphone permission prompt
  • Auto-reconnect on network hiccups
  • Real-time subtitle styling
  • Colored speaker labels when diarization is on

You can also import the HTML directly in Python:

from whisperlivekit import get_web_interface_html
html = get_web_interface_html()
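
When you embed the engine yourself (section 5), you can hand this HTML back from an ordinary HTTP route so the page and the /asr WebSocket live on the same server. A minimal, self-contained sketch:

from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from whisperlivekit import get_web_interface_html

app = FastAPI()  # in your own project, reuse the app from section 5

@app.get("/")
async def index():
    # Serve the bundled live-transcription page at the root URL.
    return HTMLResponse(get_web_interface_html())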

7. Production Deployment: From Laptop to Server

7.1 Run with Gunicorn for multiple workers

pip install uvicorn gunicorn
gunicorn -k uvicorn.workers.UvicornWorker -w 4 main:app

-w 4 spins up four workers—good for a four-core machine.
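
Gunicorn can also read these settings from a gunicorn.conf.py file, which is plain Python. A small sketch equivalent to the command above:

# gunicorn.conf.py - read automatically when gunicorn starts from this directory
worker_class = "uvicorn.workers.UvicornWorker"
workers = 4                  # roughly one worker per CPU core
bind = "127.0.0.1:8000"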

7.2 Reverse proxy with Nginx

server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

For HTTPS, add your certificate or use the built-in flags --ssl-certfile and --ssl-keyfile.


8. Run in Docker with One Command

The included Dockerfile already has FFmpeg and common extras baked in.

# Build the image
docker build -t whisperlivekit .

# Run with NVIDIA GPU
docker run --gpus all -p 8000:8000 whisperlivekit --model base

# CPU-only machines: omit --gpus all

Build arguments

| Argument | Example | Purpose |
| --- | --- | --- |
| EXTRAS | whisper-timestamped | Add extra Python dependencies |
| HF_PRECACHE_DIR | ./.cache/ | Cache models at build time for faster start-ups |
| HF_TKN_FILE | ./token | Bake your Hugging Face token into the image |

9. Frequently Asked Questions

Q1: Does it need an internet connection?
No. Once the models are downloaded, everything runs locally; you only need a connection again when you fetch a new model.

Q2: Which languages are supported?
Whisper supports 99 languages. Remove .en from the model name and set --language auto for automatic detection.

Q3: How low is the latency?

  • faster-whisper + tiny on a desktop: ~300–500 ms
  • simulstreaming + large-v3: under 200 ms

Latency depends on model size, CPU/GPU, and chunk size (--min-chunk-size).

Q4: No speech is detected—what now?
Check microphone permissions, or disable VAD with --no-vad for testing.
You may also lower --min-chunk-size to 0.5 s.

Q5: Can I install it completely offline?
Yes. Run once on a connected machine, then copy the folders ~/.cache/huggingface and ~/.cache/whisper to the offline machine.
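
A throwaway Python helper can stage those cache folders onto removable media. The destination below is only a placeholder for wherever your drive mounts:

import shutil
from pathlib import Path

home = Path.home()
dest = Path("/media/usb/whisperlivekit-cache")  # placeholder destination

for folder in (".cache/huggingface", ".cache/whisper"):
    src = home / folder
    if src.exists():
        # dirs_exist_ok lets the copy be re-run to refresh the stash.
        shutil.copytree(src, dest / folder, dirs_exist_ok=True)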

Q6: Is it free for commercial use?
The code is MIT-licensed. The SimulStreaming backend is dual-licensed; open-source projects remain free, closed-source products need a commercial license.


10. Use-Case Inspiration

| Scenario | Recommended flags | Extra tips |
| --- | --- | --- |
| Small meeting room | --model medium --diarization | Use a 360° microphone for better separation |
| Live-stream captions | --backend simulstreaming --frame-threshold 15 | Subtitles lag the audio by only ~200 ms, which is almost imperceptible |
| Customer-service QA | --language zh --diarization | Pipe the speaker field into your CRM |
| Voice-note app | Embed the HTML in a Flutter WebView | Bridge the native microphone to the WebSocket |

11. Takeaway: Why It’s Worth a Try

  • Zero friction: one command and any browser works.
  • Fully on-device: your audio never leaves your network.
  • Extensible: Python API, Docker, Nginx—mix and match.
  • State-of-the-art: 2025 research (SimulStreaming, Streaming Sortformer) ready to use today.

If you need a “hears, labels, and reacts fast” speech-to-text stack that runs entirely on your own hardware, WhisperLiveKit has already built the bridge—your only remaining task is to speak.
