1. Six questions engineers always ask first
| Question | Quick answer | 
|---|---|
| 1. What is FunAudio-ASR? | A production-first speech-to-text engine that couples a 0.7 B audio encoder with a 7 B LLM, then tunes the stack with reinforcement learning. | 
| 2. How is it better than Whisper? | On real-world data collected after June 30, average WER drops by ≈ 20–30 % relative. It also streams with ≈ 200 ms latency and lets you inject domain hot-words on the fly. | 
| 3. Can I ship it today? | Yes. The repo ships a Docker image, a Gradio demo, and a documented HTTP API. No license fee is mentioned in the report. | 
| 4. Minimum hardware? | Nano version: RTX 3060 16 GB, 4-core CPU, 16 GB RAM for one real-time stream. Full version: one A100 80 GB for 30 concurrent streams. | 
| 5. Known weak spots? | Long audio (> 5 min) needs an external VAD; far-field/multi-channel is not released; low-resource languages (Thai, Vietnamese, Indonesian) lag behind Chinese & English. | 
| 6. Do I have to re-train? | Not for plain Mandarin/English tasks. Download the checkpoint, add a 50-line TSV hot-word list, and call the REST endpoint. | 
2. Why another LLM-based recogniser?
Classic ASR = acoustic model + lexicon + language model + decoder.
LLM ASR = audio encoder → adaptor → large language model → text.
Pros
- One neural stack, no hand-built lexicon.
- The LLM has seen billions of sentences → better context.
Cons
- Hallucinates when audio is noisy or silent.
- Autoregressive generation = latency risk.
- Needs gigantic paired data.
FunAudio-ASR keeps the pros and attacks the cons with data scaling + model scaling + RL fine-tuning + product-side hacks (streaming, noise robustness, hot-word, code-switch).
3. System diagram in one glance

(Figure: high-level block diagram; caption identical to the paper.)
| Block | Size | Job | 
|---|---|---|
| Audio Encoder | 0.7 B (full) or 0.2 B (nano) | Turns 80-dimensional mel filter-bank features into dense vectors. | 
| Audio Adaptor | 2 transformer layers | Projects speech vectors into the same space as the text LLM. | 
| CTC Head | lightweight | Produces a first-pass greedy hypothesis for hot-word retrieval. | 
| LLM Decoder | 7 B (full) or 0.6 B (nano) | Emits the final token sequence; can see hot-word candidates + previous context. | 
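To make the block diagram concrete, here is a minimal PyTorch-style sketch of the data flow; module names, shapes and call signatures are illustrative, not the released code:

```python
import torch
import torch.nn as nn

class FunAudioASRSketch(nn.Module):
    """Illustrative data flow only: encoder -> adaptor -> LLM, plus a CTC head."""

    def __init__(self, encoder, adaptor, llm, ctc_head):
        super().__init__()
        self.encoder = encoder      # ~0.7 B (full) / 0.2 B (nano) audio encoder
        self.adaptor = adaptor      # 2 transformer layers projecting into LLM space
        self.llm = llm              # 7 B (full) / 0.6 B (nano) text decoder
        self.ctc_head = ctc_head    # lightweight first-pass head over encoder states

    def forward(self, fbank: torch.Tensor, prompt_ids: torch.Tensor):
        # fbank: (batch, frames, 80) mel filter-bank features
        speech_states = self.encoder(fbank)           # (batch, T', d_enc)
        ctc_logits = self.ctc_head(speech_states)     # greedy first pass for hot-word retrieval
        speech_embeds = self.adaptor(speech_states)   # (batch, T', d_llm)
        # The LLM sees the speech embeddings, hot-word candidates and prior context
        # alongside the text prompt, then emits the final token sequence.
        return self.llm(speech_embeds, prompt_ids), ctc_logits
```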
4. Data pipeline: how many hours and how clean?
| Stage | Volume | Source & tricks | 
|---|---|---|
| Pre-train | ~10 M h | 90 % unlabelled web crawl; 10 % labelled by Paraformer-V2, Whisper and SenseVoice; inverse text normalisation (ITN); VAD clipping. | 
| SFT | ~1 M h | Human transcripts + TTS synthesis (CosyVoice3) + noise augmentation + simulated streaming chunks. | 
| RL | 100 k h | Hard cases (three models disagree), long audio (> 20 s), hallucination examples, keyword-heavy utterances. | 
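The "hard cases (three models disagree)" filter can be approximated with a pairwise-similarity check over the pseudo-labellers' outputs; a hedged sketch (the similarity measure and threshold are assumptions, not taken from the report):

```python
from difflib import SequenceMatcher
from itertools import combinations

def is_hard_case(hypotheses: list[str], threshold: float = 0.85) -> bool:
    """Flag an utterance for the RL pool when the pseudo-labellers disagree.

    hypotheses: transcripts from e.g. Paraformer-V2, Whisper and SenseVoice.
    Returns True if any pair of hypotheses falls below the similarity threshold.
    """
    for a, b in combinations(hypotheses, 2):
        if SequenceMatcher(None, a, b).ratio() < threshold:
            return True
    return False

# Example: the three labellers produce noticeably different transcripts.
print(is_hard_case(["turn on the light", "turn on the lights", "turn off the night"]))
```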
5. Training recipe: five sequential stages
| Stage | Data | Frozen modules | Unfrozen modules | Purpose | 
|---|---|---|---|---|
| 1 | 20 k h | Encoder + LLM | Adaptor | Align speech ↔ text spaces. | 
| 2 | 10 M h | LLM | Encoder + Adaptor | Learn robust acoustic features. | 
| 3 | 20 k h | Encoder + Adaptor | LLM (LoRA) | Keep generative power, avoid catastrophic forgetting. | 
| 4 | 3 M h | — | Encoder + Adaptor + LLM (LoRA) | Push accuracy on high-quality transcripts. | 
| 5 | 100 k h | Encoder | CTC head | Greedy hypothesis for later hot-word retrieval. | 
In stage 5 the encoder weights stay frozen; only the lightweight CTC head is trained on top of them.
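The frozen/unfrozen columns map directly onto requires_grad switches. A toy helper, assuming the sub-module names from the sketch in section 3; the real LoRA updates on the LLM would come from an adapter library rather than plain unfreezing:

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Freeze/unfreeze sub-modules following the five-stage recipe above."""
    trainable = {
        1: ["adaptor"],                       # align speech and text spaces
        2: ["encoder", "adaptor"],            # learn robust acoustic features
        3: ["llm"],                           # LoRA-style update of the decoder
        4: ["encoder", "adaptor", "llm"],     # joint fine-tune on high-quality data
        5: ["ctc_head"],                      # encoder locked, CTC head only
    }[stage]

    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in trainable)
```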
6. Making it production-grade: six engineering add-ons
- Streaming
  - Training: cut utterances into 80 ms chunks, feed only leftward context.
  - Inference: Torch encoder → CPU, then SGLang rollout → GPU; switch overhead < 6 %.
- Noise robustness (see the mixing sketch after this list)
  - 110 k h clean speech + 10 k h noise → 110 k h mixed (SNR 10 dB ± 5 dB).
  - Online mix-up during SFT: 30 % of mini-batches.
  - Result: 13 % relative WER reduction in cafeteria, subway and supermarket recordings.
- Code-switching (Chinese ↔ English)
  - Collect 40 k frequent English terms.
  - Prompt Qwen3 to create code-switched sentences → TTS → human check.
  - Test set A WER falls from 4.53 % → 1.59 %.
- Hot-word customisation (see the retrieval sketch after this list)
  - Build a phoneme / word-piece vocabulary.
  - CTC hypothesis → edit-distance retrieval → top-K candidates fed to the LLM.
  - Recall on the "names" set improves from 0.75 → 1.0.
- Hallucination suppression
  - Add zero-padded pure-noise segments during training; the label is empty.
  - RL reward penalises regex-detected hallucinations.
  - Real-world false-transcription rate −35 %.
- Multilingual extension
  - Down-sample Chinese & English, up-sample Vietnamese, Thai, Indonesian.
  - 500 k h balanced set; same five-stage training.
  - CommonVoice Thai WER 1.44 % vs Whisper-large 5.92 %.
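The noise mix-up step reduces to a few lines: clean utterances are combined with noise clips at an SNR drawn around 10 dB ± 5 dB. A minimal sketch, assuming 16 kHz mono numpy arrays (the report's exact sampling logic is not published):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add a noise clip to clean speech at a target signal-to-noise ratio in dB."""
    # Loop/trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = float(np.mean(speech ** 2)) + 1e-10
    p_noise = float(np.mean(noise ** 2)) + 1e-10
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# SNR drawn from 10 dB ± 5 dB, as in the report; 30 % of SFT mini-batches get mixed.
rng = np.random.default_rng()
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = mix_at_snr(speech, rng.standard_normal(8000), rng.uniform(5.0, 15.0))
```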
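The hot-word retrieval step can likewise be sketched: the first-pass CTC hypothesis is matched against the customer vocabulary by edit distance and the top-K survivors are injected into the LLM prompt. A rough word-level sketch; the real system matches phoneme / word-piece units:

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def retrieve_hotwords(ctc_hypothesis: str, hotwords: list[str], top_k: int = 5) -> list[str]:
    """Return the K hot-words closest (by edit distance) to any word in the hypothesis."""
    words = ctc_hypothesis.lower().split()
    def score(hw: str) -> int:
        return min(edit_distance(hw.lower(), w) for w in words)
    return sorted(hotwords, key=score)[:top_k]

# The selected candidates are then placed in the LLM prompt for the final pass.
print(retrieve_hotwords("the crisper system edits genes", ["CRISPR", "ZhangSan", "biology"], top_k=2))
```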
7. Numbers: benchmarks vs reality
Open-source sets (WER %)
| Dataset | Whisper-large-v3 | Seed-ASR | FunAudio-ASR | Nano | 
|---|---|---|---|---|
| AIShell-1 | 4.72 | 0.68 | 0.54 | 1.22 | 
| Librispeech-clean | 1.86 | 1.58 | 1.57 | 1.94 | 
| Fleurs-zh | 5.18 | 3.43 | 2.64 | 3.47 | 
In-house industry sets collected after June 30 (WER %)
| Scenario | Whisper | Seed-ASR | FunAudio-ASR | Relative drop vs Seed | 
|---|---|---|---|---|
| Complex background | 32.57 | 12.90 | 11.29 | −12 % | 
| Home TV noise | 18.17 | 8.08 | 5.17 | −36 % | 
| English general | 18.56 | 15.65 | 14.22 | −9 % | 
Streaming vs offline penalty
Average WER increases only 1.3 pt while latency drops from 2–3 s to 0.2 s.
8. Quick start: from Docker to first caption
(Commands copied 1-to-1 from the repo’s README inside the report; no extras added.)
- Pull image
  docker pull funaudio/pytorch:2.1.0-cuda12.1-devel
- Run container
  docker run --gpus all -it -p 7860:7860 -v $PWD:/workspace funaudio/pytorch:2.1.0-cuda12.1-devel
- Install deps
  pip install -r requirements.txt
- Download checkpoint
  from huggingface_hub import snapshot_download
  snapshot_download(repo_id="FunAudio/FunAudio-ASR-nano", local_dir="./model")
- Launch streaming service
  python -m funaudio.service --model ./model --hotword ./my_hotwords.tsv --port 7860
  Browse http://localhost:7860 for a live demo page.
- Hot-word TSV format (UTF-8, tab-separated)
  biology	生物
  CRISPR	基因编辑
  ZhangSan	张叁
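The README stops at the format above. As a convenience (not from the repo), a minimal sketch that sanity-checks a hot-word TSV before it is passed via --hotword; it assumes each line is exactly two tab-separated fields, term plus preferred written form, as in the example:

```python
import csv

def load_hotwords(path: str) -> list[tuple[str, str]]:
    """Read a UTF-8, tab-separated hot-word file: one (term, preferred written form) per line."""
    pairs = []
    with open(path, encoding="utf-8", newline="") as f:
        for line_no, row in enumerate(csv.reader(f, delimiter="\t"), start=1):
            if not row:
                continue  # ignore blank lines
            if len(row) != 2:
                raise ValueError(f"line {line_no}: expected 2 tab-separated fields, got {len(row)}")
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs

print(load_hotwords("./my_hotwords.tsv"))
```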
9. Frequently-asked questions (FAQ)
Q1: Do I need an NVIDIA card?
A: Nano runs on RTX 3060 16 GB; CPU fallback is not released yet.
Q2: Can the model run on Android?
A: Not today. INT8 quantisation and NNAPI delegate are on the roadmap.
Q3: How long does a complete fine-tune take?
A: Stage 4 (3 M h) needs 8×A100 80 GB × 15 days. Most teams stop at LoRA stage 3 (20 k h, 8 h on one A100).
Q4: Is there a cloud API SLA?
A: The paper does not mention SLAs; the downloadable weights are Apache-2.0 licensed.
Q5: My audio is 48 kHz stereo. What should I do?
A: Down-sample to 16 kHz mono first. The encoder was trained on 16 kHz only.
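A minimal sketch of that conversion, using librosa and soundfile as one common choice of libraries (any resampler works):

```python
import librosa
import soundfile as sf

# librosa.load resamples to the target rate and averages stereo channels down to mono.
audio, sr = librosa.load("meeting_48k_stereo.wav", sr=16000, mono=True)
sf.write("meeting_16k_mono.wav", audio, sr)
```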
10. Known limitations (straight from section 7)
- Optimised mainly for Chinese & English; other languages are still improving.
- Context window ~ 5 min; longer meetings need an external VAD.
- Single-channel, close-talk only; far-field / multi-channel support is lab-grade.
11. Take-away
FunAudio-ASR does not just chase leaderboard WER; it trades a small accuracy gap (1.3 pt between streaming and offline) for big real-world gains: roughly 10× lower streaming latency, about 1.3× higher hot-word recall, and a 35 % lower hallucination rate.
If you need Mandarin/English recognition that works in cafeterias, taxis, living rooms—and you want full stack control—this checkpoint is worth a serious A/B test.

