1. Six questions engineers always ask first
Question | Quick answer |
---|---|
1. What is FunAudio-ASR? | A production-first speech-to-text engine that couples a 0.7 B audio encoder with a 7 B LLM, then tunes the stack with reinforcement learning. |
2. How is it better than Whisper? | On real-world data collected after June 30, average WER drops roughly 20–30 % relative. It also streams at ≈ 200 ms latency and lets you inject domain hot-words on the fly.
3. Can I ship it today? | Yes. The repo ships a Docker image, a Gradio demo, and a documented HTTP API. No license fee is mentioned in the report. |
4. Minimum hardware? | Nano version: RTX 3060 16 GB, 4-core CPU, 16 GB RAM for 1-way real-time stream. Full version: one A100 80 GB for 30-way concurrent. |
5. Known weak spots? | Long audio (> 5 min) needs an external VAD; far-field/multi-channel models are not released; low-resource languages (Thai, Vietnamese, Indonesian) lag behind Chinese & English.
6. Do I have to re-train? | Not for plain Mandarin/English tasks. Download the checkpoint, add a 50-line TSV hot-word list, and call the REST endpoint. |
2. Why another LLM-based recogniser?
Classic ASR = acoustic model + lexicon + language model + decoder.
LLM ASR = audio encoder → adaptor → large language model → text.
Pros
- One neural stack, no hand-built lexicon.
- The LLM has seen billions of sentences → better context modelling.
Cons
- Hallucinates when audio is noisy or silent.
- Autoregressive generation = latency risk.
- Needs gigantic amounts of paired data.
FunAudio-ASR keeps the pros and attacks the cons with data scaling + model scaling + RL fine-tuning + product-side hacks (streaming, noise robustness, hot-word, code-switch).
3. System diagram in one glance
(Caption identical to the paper: high-level block diagram)
Block | Size | Job |
---|---|---|
Audio Encoder | 0.7 B (full) or 0.2 B (nano) | Turns 80-dimensional mel filter-bank features into dense speech vectors. |
Audio Adaptor | 2 transformer layers | Projects speech vectors into the same space as the text LLM. |
CTC Head | lightweight | Produces a first-pass greedy hypothesis for hot-word retrieval. |
LLM Decoder | 7 B (full) or 0.6 B (nano) | Emits the final token sequence; can see hot-word candidates + previous context. |
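To make the data flow concrete, here is a toy-scale PyTorch sketch of how the four blocks connect. All module sizes, layer counts, and names are illustrative stand-ins, not the released implementation.

```python
import torch
import torch.nn as nn

class FunAudioASRSketch(nn.Module):
    """Toy-scale sketch of the encoder -> adaptor -> (CTC head | LLM) layout."""

    def __init__(self, n_mels=80, enc_dim=512, llm_dim=1024, vocab=8000):
        super().__init__()
        self.frontend = nn.Linear(n_mels, enc_dim)
        # Audio encoder: consumes 80-dim mel filter-bank frames.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=enc_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Adaptor: projects speech vectors into the LLM embedding space.
        self.adaptor = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # CTC head: first-pass greedy hypothesis for hot-word retrieval.
        self.ctc_head = nn.Linear(enc_dim, vocab)
        # Stand-in for the 7B/0.6B LLM decoder (here just a linear layer).
        self.llm_stub = nn.Linear(llm_dim, vocab)

    def forward(self, fbank):                      # fbank: (B, T, 80)
        h = self.encoder(self.frontend(fbank))     # dense speech vectors
        ctc_logits = self.ctc_head(h)               # greedy first pass
        llm_logits = self.llm_stub(self.adaptor(h)) # final token scores
        return ctc_logits, llm_logits

model = FunAudioASRSketch()
ctc, llm = model(torch.randn(1, 200, 80))          # 200 frames of fake fbank
print(ctc.shape, llm.shape)
```

In the real model the `llm_stub` is the 7 B (or 0.6 B) decoder, which also sees hot-word candidates and previous context.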
4. Data pipeline: how many hours and how clean?
Stage | Volume | Source & tricks |
---|---|---|
Pre-train | ~10 M h | 90 % unlabelled web crawl; 10 % labelled by Paraformer-V2, Whisper, SenseVoice; ITN normalisation; VAD clipping. |
SFT | ~1 M h | Human transcripts + TTS synthesis (CosyVoice3) + noise augmentation + simulated streaming chunks. |
RL | 100 k h | Hard cases (three models disagree), long audio (> 20 s), hallucination examples, keyword-heavy utterances. |
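The "three models disagree" criterion for the RL pool can be pictured as a small consensus check. The sketch below is only an assumption about how such mining might look; the function names and the 0.1 CER gate are made up, not the paper's pipeline.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_hard_case(hyps: list[str], duration_s: float, cer_gate: float = 0.1) -> bool:
    """Keep an utterance for the RL pool if the three pseudo-labelling
    models disagree noticeably, or the audio is long (> 20 s)."""
    if duration_s > 20:
        return True
    for i in range(len(hyps)):
        for j in range(i + 1, len(hyps)):
            ref = max(len(hyps[i]), len(hyps[j]), 1)
            if edit_distance(hyps[i], hyps[j]) / ref > cer_gate:
                return True
    return False

print(is_hard_case(["turn on the light", "turn on the lights", "turn off the night"], 6.2))
```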
5. Training recipe: five sequential stages
Stage | Data | Frozen modules | Unfrozen modules | Purpose |
---|---|---|---|---|
1 | 20 k h | Encoder + LLM | Adaptor | Align speech ↔ text spaces. |
2 | 10 M h | LLM | Encoder + Adaptor | Learn robust acoustic features. |
3 | 20 k h | Encoder + Adaptor | LLM (LoRA) | Keep generative power, avoid catastrophic forgetting. |
4 | 3 M h | — | Encoder + Adaptor + LLM (LoRA) | Push accuracy on high-quality transcripts. |
5 | 100 k h | Encoder | CTC head | Greedy hypothesis for later hot-word retrieval. |
In stage 5 the encoder weights stay frozen; only the CTC head is trained.
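The frozen/unfrozen columns boil down to toggling `requires_grad` per module group between stages. The sketch below uses tiny placeholder modules; in the real recipe the LLM stages use LoRA adapters rather than full fine-tuning.

```python
import torch.nn as nn

def apply_stage(model: nn.Module, unfrozen_prefixes: tuple[str, ...]) -> None:
    """Freeze everything, then re-enable gradients for the listed sub-modules."""
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(unfrozen_prefixes)

# Hypothetical container with the four blocks from Section 3.
model = nn.ModuleDict({
    "encoder": nn.Linear(80, 512),
    "adaptor": nn.Linear(512, 1024),
    "llm": nn.Linear(1024, 8000),       # LoRA adapters would live inside here
    "ctc_head": nn.Linear(512, 8000),
})

stages = {
    1: ("adaptor",),                     # align speech <-> text spaces
    2: ("encoder", "adaptor"),           # robust acoustic features
    3: ("llm",),                         # LoRA on the LLM only
    4: ("encoder", "adaptor", "llm"),    # joint fine-tune
    5: ("ctc_head",),                    # CTC head for hot-word retrieval
}

for stage, prefixes in stages.items():
    apply_stage(model, prefixes)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"stage {stage}: {trainable} trainable parameters")
```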
6. Making it production-grade: six engineering add-ons
Streaming
- Training: cut utterances into 80 ms chunks, feed only leftward context.
- Inference: Torch encoder → CPU, then SGLang rollout → GPU; switch overhead < 6 %.
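A minimal sketch of the streaming constraint: fixed 80 ms chunks plus a leftward-only (causal) chunk mask. The 10 ms frame rate is an assumption; the report only specifies the 80 ms chunk size.

```python
import torch

def chunk_frames(fbank: torch.Tensor, frame_ms: int = 10, chunk_ms: int = 80):
    """Split a (T, 80) filter-bank sequence into fixed 80 ms chunks."""
    frames_per_chunk = chunk_ms // frame_ms            # 8 frames per chunk
    return fbank.split(frames_per_chunk, dim=0)

def leftward_mask(num_chunks: int) -> torch.Tensor:
    """Chunk i may attend to chunks 0..i only (no look-ahead)."""
    return torch.tril(torch.ones(num_chunks, num_chunks, dtype=torch.bool))

fbank = torch.randn(200, 80)                            # 2 s of 10 ms frames
chunks = chunk_frames(fbank)
mask = leftward_mask(len(chunks))
print(len(chunks), "chunks; mask row 3 ->", mask[3].int().tolist())
```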
Noise robustness
- 110 k h clean speech + 10 k h noise → 110 k h mixed (SNR 10 dB ± 5 dB).
- Online mix-up during SFT: 30 % of mini-batches.
- Result: 13 % relative WER reduction in cafeteria, subway, supermarket.
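The SNR-controlled mixing is a few lines of tensor math; the helper below is a generic sketch, since the report does not publish its exact augmentation code.

```python
import torch

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Scale noise so that 10*log10(P_clean / P_noise) equals snr_db, then add."""
    noise = noise[: clean.numel()]                     # trim to the same length
    p_clean = clean.pow(2).mean()
    p_noise = noise.pow(2).mean().clamp_min(1e-12)
    scale = torch.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean = torch.randn(16000)                             # 1 s of 16 kHz audio
noise = torch.randn(16000)
snr = 10.0 + (torch.rand(1).item() * 10 - 5)           # SNR 10 dB +/- 5 dB
noisy = mix_at_snr(clean, noise, snr)
print(f"mixed at target SNR {snr:.1f} dB")
```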
Code-switching (Chinese ↔ English)
- Collect 40 k frequent English terms.
- Prompt Qwen3 to create code-switched sentences → TTS → human check.
- Test set A WER falls from 4.53 % → 1.59 %.
Hot-word customisation
- Build a phoneme / word-piece vocabulary.
- CTC hypothesis → edit-distance retrieval → top-K candidates fed to the LLM.
- Recall on “names” set improves from 0.75 → 1.0.
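A hedged sketch of the retrieval step: fuzzy-match hot-words against the first-pass CTC hypothesis and keep the top-K candidates for the LLM prompt. It uses stdlib `difflib` similarity as a stand-in for the phoneme/word-piece edit-distance retrieval described above; the 0.6 threshold is arbitrary.

```python
from difflib import SequenceMatcher

def retrieve_hotwords(ctc_hyp: str, hotwords: list[str], top_k: int = 5) -> list[str]:
    """Rank hot-words by fuzzy similarity against sliding windows of the
    CTC hypothesis and return the top-K candidates for the LLM."""
    scored = []
    for word in hotwords:
        n = max(len(word), 1)
        windows = [ctc_hyp[i:i + n] for i in range(max(len(ctc_hyp) - n + 1, 1))]
        best = max(SequenceMatcher(None, word.lower(), w.lower()).ratio() for w in windows)
        scored.append((best, word))
    scored.sort(reverse=True)
    return [word for score, word in scored[:top_k] if score > 0.6]

hyp = "please call zhang sen about the crisper experiment"
print(retrieve_hotwords(hyp, ["ZhangSan", "CRISPR", "biology", "FunAudio"], top_k=2))
```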
Hallucination suppression
- Add zero-padded pure-noise segments during training; the label is empty.
- RL reward penalises regex-detected hallucinations.
- Real-world false-transcription rate −35 %.
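The RL penalty can be imagined as a regex check over the decoded text. The exact patterns are not published, so the two rules below (repeated spans, looping filler phrases) are illustrative only.

```python
import re

REPEAT_PHRASE = re.compile(r"(.{2,20}?)\1{3,}")        # same span repeated 4+ times
FILLER_LOOP = re.compile(r"^(thank you\.?\s*){3,}$", re.IGNORECASE)

def hallucination_penalty(transcript: str, is_noise_only: bool) -> float:
    """Return a negative reward term when the output looks hallucinated."""
    if is_noise_only and transcript.strip():
        return -1.0                                    # noise-only audio must stay empty
    if REPEAT_PHRASE.search(transcript) or FILLER_LOOP.match(transcript.strip()):
        return -1.0
    return 0.0

print(hallucination_penalty("thank you. thank you. thank you. ", True))
print(hallucination_penalty("turn on the living room light", False))
```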
Multilingual extension
- Down-sample Chinese & English, up-sample Vietnamese, Thai, Indonesian.
- 500 k h balanced set; same five-stage training.
- CommonVoice Thai WER 1.44 % vs Whisper-large 5.92 %.
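One standard way to realise the down-/up-sampling is temperature-based language re-sampling; that choice is my assumption, not a detail from the report, and the hour counts and temperature below are made up for illustration.

```python
# Temperature < 1.0 flattens the natural language distribution, shrinking
# Chinese/English and boosting Vietnamese/Thai/Indonesian.
hours = {"zh": 400_000, "en": 300_000, "vi": 20_000, "th": 15_000, "id": 15_000}
tau = 0.5

total = sum(hours.values())
weights = {lang: (h / total) ** tau for lang, h in hours.items()}
norm = sum(weights.values())
sampling_prob = {lang: w / norm for lang, w in weights.items()}

for lang, p in sampling_prob.items():
    print(f"{lang}: natural {hours[lang] / total:.1%} -> sampled {p:.1%}")
```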
7. Numbers: benchmarks vs reality
Open-source sets (WER %)
Dataset | Whisper-large-v3 | Seed-ASR | FunAudio-ASR | Nano |
---|---|---|---|---|
AIShell-1 | 4.72 | 0.68 | 0.54 | 1.22 |
Librispeech-clean | 1.86 | 1.58 | 1.57 | 1.94 |
Fleurs-zh | 5.18 | 3.43 | 2.64 | 3.47 |
In-house industry test sets collected after June 30 (WER %)
Scenario | Whisper | Seed-ASR | FunAudio-ASR | Relative drop vs Seed |
---|---|---|---|---|
Complex background | 32.57 | 12.90 | 11.29 | −12 % |
Home TV noise | 18.17 | 8.08 | 5.17 | −36 % |
English general | 18.56 | 15.65 | 14.22 | −9 % |
Streaming vs offline penalty
Average WER increases only 1.3 pt while latency drops from 2–3 s to 0.2 s.
8. Quick start: from Docker to first caption
(Commands copied 1-to-1 from the repo’s README inside the report; no extras added.)
- Pull image:
  docker pull funaudio/pytorch:2.1.0-cuda12.1-devel
- Run container:
  docker run --gpus all -it -p 7860:7860 -v $PWD:/workspace funaudio/pytorch:2.1.0-cuda12.1-devel
- Install deps:
  pip install -r requirements.txt
- Download checkpoint:
  from huggingface_hub import snapshot_download
  snapshot_download(repo_id="FunAudio/FunAudio-ASR-nano", local_dir="./model")
- Launch streaming service:
  python -m funaudio.service --model ./model --hotword ./my_hotwords.tsv --port 7860
  Browse http://localhost:7860 for a live demo page.
- Hot-word TSV format (UTF-8, tab-separated):
  biology 生物
  CRISPR 基因编辑
  ZhangSan 张叁
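Not part of the README above: a small illustrative parser for that TSV format. It assumes the second column is the preferred written form for the term, which the report does not spell out.

```python
import csv
from pathlib import Path

def load_hotwords(path: str) -> list[tuple[str, str]]:
    """Read the tab-separated hot-word list: one 'term<TAB>preferred output' per line."""
    with Path(path).open(encoding="utf-8", newline="") as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t") if len(row) >= 2]

# Expects the my_hotwords.tsv passed to the service above.
for term, preferred in load_hotwords("my_hotwords.tsv"):
    print(term, "->", preferred)
```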
9. Frequently-asked questions (FAQ)
Q1: Do I need an NVIDIA card?
A: Nano runs on RTX 3060 16 GB; CPU fallback is not released yet.
Q2: Can the model run on Android?
A: Not today. INT8 quantisation and NNAPI delegate are on the roadmap.
Q3: How long does a complete fine-tune take?
A: Stage 4 (3 M h) needs 8×A100 80 GB × 15 days. Most teams stop at LoRA stage 3 (20 k h, 8 h on one A100).
Q4: Is there a cloud API SLA?
A: The paper does not mention SLAs; the downloadable weight is Apache-2.0.
Q5: My audio is 48 kHz stereo. What should I do?
A: Down-sample to 16 kHz mono first. The encoder was trained on 16 kHz only.
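A minimal conversion sketch, assuming torchaudio is installed; the file names are placeholders.

```python
import torchaudio
import torchaudio.functional as F

# Placeholder file names; any 48 kHz stereo WAV works the same way.
wav, sr = torchaudio.load("meeting_48k_stereo.wav")     # wav: (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                      # stereo -> mono
wav = F.resample(wav, orig_freq=sr, new_freq=16000)      # 48 kHz -> 16 kHz
torchaudio.save("meeting_16k_mono.wav", wav, sample_rate=16000)
```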
10. Known limitations (straight from section 7)
- Optimised mainly for Chinese & English; other languages still improving.
- Context window ~ 5 min; longer meetings need external VAD.
- Single-channel, close-talk only. Far-field/multi-channel is lab-grade.
11. Take-away
FunAudio-ASR does not just chase leaderboard WER; it trades a tiny accuracy gap (1.3 pt) for large real-world gains: roughly 10× lower streaming latency, about 1.3× higher hot-word recall, and 35 % fewer hallucinations.
If you need Mandarin/English recognition that works in cafeterias, taxis, living rooms—and you want full stack control—this checkpoint is worth a serious A/B test.