1. Six questions engineers always ask first
Question | Quick answer |
---|---|
1. What is FunAudio-ASR? | A production-first speech-to-text engine that couples a 0.7 B audio encoder with a 7 B LLM, then tunes the stack with reinforcement learning. |
2. How is it better than Whisper? | On real-world data collected after June 30, average WER drops roughly 20–30 % relative. It also streams at ≈ 200 ms latency and lets you inject domain hot-words on the fly.
3. Can I ship it today? | Yes. The repo ships a Docker image, a Gradio demo, and a documented HTTP API. No license fee is mentioned in the report. |
4. Minimum hardware? | Nano version: RTX 3060 16 GB, 4-core CPU, 16 GB RAM for 1-way real-time stream. Full version: one A100 80 GB for 30-way concurrent. |
5. Known weak spots? | Long audio (> 5 min) needs an external VAD; far-field/multi-channel models are not released; low-resource languages (Thai, Vietnamese, Indonesian) lag behind Chinese & English.
6. Do I have to re-train? | Not for plain Mandarin/English tasks. Download the checkpoint, add a 50-line TSV hot-word list, and call the REST endpoint. |
2. Why another LLM-based recogniser?
Classic ASR = acoustic model + lexicon + language model + decoder.
LLM ASR = audio encoder → adaptor → large language model → text.
Pros
- One neural stack, no hand-built lexicon.
- The LLM has seen billions of sentences → better context modelling.
Cons
- Hallucinates when audio is noisy or silent.
- Autoregressive generation = latency risk.
- Needs gigantic amounts of paired data.
FunAudio-ASR keeps the pros and attacks the cons with data scaling + model scaling + RL fine-tuning + product-side hacks (streaming, noise robustness, hot-word, code-switch).
3. System diagram in one glance
(Caption identical to the paper: high-level block diagram)
Block | Size | Job |
---|---|---|
Audio Encoder | 0.7 B (full) or 0.2 B (nano) | Turns 80-dimensional mel filter-bank features into dense speech vectors. |
Audio Adaptor | 2 transformer layers | Projects speech vectors into the same space as the text LLM. |
CTC Head | lightweight | Produces a first-pass greedy hypothesis for hot-word retrieval. |
LLM Decoder | 7 B (full) or 0.6 B (nano) | Emits the final token sequence; can see hot-word candidates + previous context. |
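To make the data flow concrete, here is a toy-scale PyTorch sketch of how the four blocks connect. All module sizes, layer counts, and names are illustrative stand-ins, not the released implementation.

```python
import torch
import torch.nn as nn

class FunAudioASRSketch(nn.Module):
    """Toy-scale sketch of the encoder -> adaptor -> (CTC head | LLM) layout."""

    def __init__(self, n_mels=80, enc_dim=512, llm_dim=1024, vocab=8000):
        super().__init__()
        self.frontend = nn.Linear(n_mels, enc_dim)
        # Audio encoder: consumes 80-dim mel filter-bank frames.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=enc_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Adaptor: projects speech vectors into the LLM embedding space.
        self.adaptor = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # CTC head: first-pass greedy hypothesis for hot-word retrieval.
        self.ctc_head = nn.Linear(enc_dim, vocab)
        # Stand-in for the 7B/0.6B LLM decoder (here just a linear layer).
        self.llm_stub = nn.Linear(llm_dim, vocab)

    def forward(self, fbank):                      # fbank: (B, T, 80)
        h = self.encoder(self.frontend(fbank))     # dense speech vectors
        ctc_logits = self.ctc_head(h)               # greedy first pass
        llm_logits = self.llm_stub(self.adaptor(h)) # final token scores
        return ctc_logits, llm_logits

model = FunAudioASRSketch()
ctc, llm = model(torch.randn(1, 200, 80))          # 200 frames of fake fbank
print(ctc.shape, llm.shape)
```

In the real model the `llm_stub` is the 7 B (or 0.6 B) decoder, which also sees hot-word candidates and previous context.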
4. Data pipeline: how many hours and how clean?
Stage | Volume | Source & tricks |
---|---|---|
Pre-train | ~10 M h | 90 % unlabelled web crawl; 10 % labelled by Paraformer-V2, Whisper, SenseVoice; ITN normalisation; VAD clipping. |
SFT | ~1 M h | Human transcripts + TTS synthesis (CosyVoice3) + noise augmentation + simulated streaming chunks. |
RL | 100 k h | Hard cases (three models disagree), long audio (> 20 s), hallucination examples, keyword-heavy utterances. |
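The "three models disagree" criterion for the RL pool can be pictured as a small consensus check. The sketch below is only an assumption about how such mining might look; the function names and the 0.1 CER gate are made up, not the paper's pipeline.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_hard_case(hyps: list[str], duration_s: float, cer_gate: float = 0.1) -> bool:
    """Keep an utterance for the RL pool if the three pseudo-labelling
    models disagree noticeably, or the audio is long (> 20 s)."""
    if duration_s > 20:
        return True
    for i in range(len(hyps)):
        for j in range(i + 1, len(hyps)):
            ref = max(len(hyps[i]), len(hyps[j]), 1)
            if edit_distance(hyps[i], hyps[j]) / ref > cer_gate:
                return True
    return False

print(is_hard_case(["turn on the light", "turn on the lights", "turn off the night"], 6.2))
```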
5. Training recipe: five sequential stages
Stage | Data | Frozen modules | Unfrozen modules | Purpose |
---|---|---|---|---|
1 | 20 k h | Encoder + LLM | Adaptor | Align speech ↔ text spaces. |
2 | 10 M h | LLM | Encoder + Adaptor | Learn robust acoustic features. |
3 | 20 k h | Encoder + Adaptor | LLM (LoRA) | Keep generative power, avoid catastrophic forgetting. |
4 | 3 M h | — | Encoder + Adaptor + LLM (LoRA) | Push accuracy on high-quality transcripts. |
5 | 100 k h | Encoder | CTC head | Greedy hypothesis for later hot-word retrieval. |
In stage 5 the encoder weights stay frozen; only the CTC head is trained.
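The frozen/unfrozen columns boil down to toggling `requires_grad` per module group between stages. The sketch below uses tiny placeholder modules; in the real recipe the LLM stages use LoRA adapters rather than full fine-tuning.

```python
import torch.nn as nn

def apply_stage(model: nn.Module, unfrozen_prefixes: tuple[str, ...]) -> None:
    """Freeze everything, then re-enable gradients for the listed sub-modules."""
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(unfrozen_prefixes)

# Hypothetical container with the four blocks from Section 3.
model = nn.ModuleDict({
    "encoder": nn.Linear(80, 512),
    "adaptor": nn.Linear(512, 1024),
    "llm": nn.Linear(1024, 8000),       # LoRA adapters would live inside here
    "ctc_head": nn.Linear(512, 8000),
})

stages = {
    1: ("adaptor",),                     # align speech <-> text spaces
    2: ("encoder", "adaptor"),           # robust acoustic features
    3: ("llm",),                         # LoRA on the LLM only
    4: ("encoder", "adaptor", "llm"),    # joint fine-tune
    5: ("ctc_head",),                    # CTC head for hot-word retrieval
}

for stage, prefixes in stages.items():
    apply_stage(model, prefixes)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"stage {stage}: {trainable} trainable parameters")
```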
6. Making it production-grade: six engineering add-ons
Streaming
- Training: cut utterances into 80 ms chunks, feed only leftward context.
- Inference: Torch encoder → CPU, then SGLang rollout → GPU; switch overhead < 6 %.
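A minimal sketch of the streaming constraint: fixed 80 ms chunks plus a leftward-only (causal) chunk mask. The 10 ms frame rate is an assumption; the report only specifies the 80 ms chunk size.

```python
import torch

def chunk_frames(fbank: torch.Tensor, frame_ms: int = 10, chunk_ms: int = 80):
    """Split a (T, 80) filter-bank sequence into fixed 80 ms chunks."""
    frames_per_chunk = chunk_ms // frame_ms            # 8 frames per chunk
    return fbank.split(frames_per_chunk, dim=0)

def leftward_mask(num_chunks: int) -> torch.Tensor:
    """Chunk i may attend to chunks 0..i only (no look-ahead)."""
    return torch.tril(torch.ones(num_chunks, num_chunks, dtype=torch.bool))

fbank = torch.randn(200, 80)                            # 2 s of 10 ms frames
chunks = chunk_frames(fbank)
mask = leftward_mask(len(chunks))
print(len(chunks), "chunks; mask row 3 ->", mask[3].int().tolist())
```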
Noise robustness
- 110 k h clean speech + 10 k h noise → 110 k h mixed (SNR 10 dB ± 5 dB).
- Online mix-up during SFT: 30 % of mini-batches.
- Result: 13 % relative WER reduction in cafeteria, subway, supermarket.
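The SNR-controlled mixing is a few lines of tensor math; the helper below is a generic sketch, since the report does not publish its exact augmentation code.

```python
import torch

def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Scale noise so that 10*log10(P_clean / P_noise) equals snr_db, then add."""
    noise = noise[: clean.numel()]                     # trim to the same length
    p_clean = clean.pow(2).mean()
    p_noise = noise.pow(2).mean().clamp_min(1e-12)
    scale = torch.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean = torch.randn(16000)                             # 1 s of 16 kHz audio
noise = torch.randn(16000)
snr = 10.0 + (torch.rand(1).item() * 10 - 5)           # SNR 10 dB +/- 5 dB
noisy = mix_at_snr(clean, noise, snr)
print(f"mixed at target SNR {snr:.1f} dB")
```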
Code-switching (Chinese ↔ English)
- Collect 40 k frequent English terms.
- Prompt Qwen3 to create code-switched sentences → TTS → human check.
- Test set A WER falls from 4.53 % → 1.59 %.
Hot-word customisation
- Build a phoneme / word-piece vocabulary.
- CTC hypothesis → edit-distance retrieval → top-K candidates fed to the LLM.
- Recall on “names” set improves from 0.75 → 1.0.
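A hedged sketch of the retrieval step: fuzzy-match hot-words against the first-pass CTC hypothesis and keep the top-K candidates for the LLM prompt. It uses stdlib `difflib` similarity as a stand-in for the phoneme/word-piece edit-distance retrieval described above; the 0.6 threshold is arbitrary.

```python
from difflib import SequenceMatcher

def retrieve_hotwords(ctc_hyp: str, hotwords: list[str], top_k: int = 5) -> list[str]:
    """Rank hot-words by fuzzy similarity against sliding windows of the
    CTC hypothesis and return the top-K candidates for the LLM."""
    scored = []
    for word in hotwords:
        n = max(len(word), 1)
        windows = [ctc_hyp[i:i + n] for i in range(max(len(ctc_hyp) - n + 1, 1))]
        best = max(SequenceMatcher(None, word.lower(), w.lower()).ratio() for w in windows)
        scored.append((best, word))
    scored.sort(reverse=True)
    return [word for score, word in scored[:top_k] if score > 0.6]

hyp = "please call zhang sen about the crisper experiment"
print(retrieve_hotwords(hyp, ["ZhangSan", "CRISPR", "biology", "FunAudio"], top_k=2))
```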
Hallucination suppression
- Add zero-padded pure-noise segments during training; the label is empty.
- RL reward penalises regex-detected hallucinations.
- Real-world false-transcription rate −35 %.
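The RL penalty can be imagined as a regex check over the decoded text. The exact patterns are not published, so the two rules below (repeated spans, looping filler phrases) are illustrative only.

```python
import re

REPEAT_PHRASE = re.compile(r"(.{2,20}?)\1{3,}")        # same span repeated 4+ times
FILLER_LOOP = re.compile(r"^(thank you\.?\s*){3,}$", re.IGNORECASE)

def hallucination_penalty(transcript: str, is_noise_only: bool) -> float:
    """Return a negative reward term when the output looks hallucinated."""
    if is_noise_only and transcript.strip():
        return -1.0                                    # noise-only audio must stay empty
    if REPEAT_PHRASE.search(transcript) or FILLER_LOOP.match(transcript.strip()):
        return -1.0
    return 0.0

print(hallucination_penalty("thank you. thank you. thank you. ", True))
print(hallucination_penalty("turn on the living room light", False))
```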
Multilingual extension
- Down-sample Chinese & English, up-sample Vietnamese, Thai, Indonesian.
- 500 k h balanced set; same five-stage training.
- CommonVoice Thai WER 1.44 % vs Whisper-large 5.92 %.
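One standard way to realise the down-/up-sampling is temperature-based language re-sampling; that choice is my assumption, not a detail from the report, and the hour counts and temperature below are made up for illustration.

```python
# Temperature < 1.0 flattens the natural language distribution, shrinking
# Chinese/English and boosting Vietnamese/Thai/Indonesian.
hours = {"zh": 400_000, "en": 300_000, "vi": 20_000, "th": 15_000, "id": 15_000}
tau = 0.5

total = sum(hours.values())
weights = {lang: (h / total) ** tau for lang, h in hours.items()}
norm = sum(weights.values())
sampling_prob = {lang: w / norm for lang, w in weights.items()}

for lang, p in sampling_prob.items():
    print(f"{lang}: natural {hours[lang] / total:.1%} -> sampled {p:.1%}")
```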
7. Numbers: benchmarks vs reality
Open-source sets (WER %)
Dataset | Whisper-large-v3 | Seed-ASR | FunAudio-ASR | Nano |
---|---|---|---|---|
AIShell-1 | 4.72 | 0.68 | 0.54 | 1.22 |
Librispeech-clean | 1.86 | 1.58 | 1.57 | 1.94 |
Fleurs-zh | 5.18 | 3.43 | 2.64 | 3.47 |
In-house industry test sets collected after June 30 (WER %)
Scenario | Whisper | Seed-ASR | FunAudio-ASR | Relative drop vs Seed |
---|---|---|---|---|
Complex background | 32.57 | 12.90 | 11.29 | −12 % |
Home TV noise | 18.17 | 8.08 | 5.17 | −36 % |
English general | 18.56 | 15.65 | 14.22 | −9 % |
Streaming vs offline penalty
Average WER increases only 1.3 pt while latency drops from 2–3 s to 0.2 s.
8. Quick start: from Docker to first caption
(Commands copied 1-to-1 from the repo’s README inside the report; no extras added.)
- Pull image:
  docker pull funaudio/pytorch:2.1.0-cuda12.1-devel
- Run container:
  docker run --gpus all -it -p 7860:7860 -v $PWD:/workspace funaudio/pytorch:2.1.0-cuda12.1-devel
- Install deps:
  pip install -r requirements.txt
- Download checkpoint:
  from huggingface_hub import snapshot_download
  snapshot_download(repo_id="FunAudio/FunAudio-ASR-nano", local_dir="./model")
- Launch streaming service:
  python -m funaudio.service --model ./model --hotword ./my_hotwords.tsv --port 7860
  Browse http://localhost:7860 for a live demo page.
- Hot-word TSV format (UTF-8, tab-separated):
  biology 生物
  CRISPR 基因编辑
  ZhangSan 张叁
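Not part of the README above: a small illustrative parser for that TSV format. It assumes the second column is the preferred written form for the term, which the report does not spell out.

```python
import csv
from pathlib import Path

def load_hotwords(path: str) -> list[tuple[str, str]]:
    """Read the tab-separated hot-word list: one 'term<TAB>preferred output' per line."""
    with Path(path).open(encoding="utf-8", newline="") as f:
        return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t") if len(row) >= 2]

# Expects the my_hotwords.tsv passed to the service above.
for term, preferred in load_hotwords("my_hotwords.tsv"):
    print(term, "->", preferred)
```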
9. Frequently-asked questions (FAQ)
Q1: Do I need an NVIDIA card?
A: Nano runs on RTX 3060 16 GB; CPU fallback is not released yet.
Q2: Can the model run on Android?
A: Not today. INT8 quantisation and NNAPI delegate are on the roadmap.
Q3: How long does a complete fine-tune take?
A: Stage 4 (3 M h) needs 8×A100 80 GB × 15 days. Most teams stop at LoRA stage 3 (20 k h, 8 h on one A100).
Q4: Is there a cloud API SLA?
A: The paper does not mention SLAs; the downloadable weight is Apache-2.0.
Q5: My audio is 48 kHz stereo. What should I do?
A: Down-sample to 16 kHz mono first. The encoder was trained on 16 kHz only.
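A minimal conversion sketch, assuming torchaudio is installed; the file names are placeholders.

```python
import torchaudio
import torchaudio.functional as F

# Placeholder file names; any 48 kHz stereo WAV works the same way.
wav, sr = torchaudio.load("meeting_48k_stereo.wav")     # wav: (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                      # stereo -> mono
wav = F.resample(wav, orig_freq=sr, new_freq=16000)      # 48 kHz -> 16 kHz
torchaudio.save("meeting_16k_mono.wav", wav, sample_rate=16000)
```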
10. Known limitations (straight from section 7)
- Optimised mainly for Chinese & English; other languages still improving.
- Context window ~ 5 min; longer meetings need external VAD.
- Single-channel, close-talk only. Far-field/multi-channel is lab-grade.
11. Take-away
FunAudio-ASR does not just chase leaderboard WER; it trades a tiny accuracy gap (1.3 pt) for large real-world gains: roughly 10× lower streaming latency, about 1.3× higher hot-word recall, and 35 % fewer hallucinations.
If you need Mandarin/English recognition that works in cafeterias, taxis, living rooms—and you want full stack control—this checkpoint is worth a serious A/B test.