1. Six questions engineers always ask first

| Question | Quick answer |
| --- | --- |
| 1. What is FunAudio-ASR? | A production-first speech-to-text engine that couples a 0.7 B audio encoder with a 7 B LLM, then tunes the stack with reinforcement learning. |
| 2. How is it better than Whisper? | On real-world data collected after June 30, average WER drops by roughly 20–30 % relative. It also streams at ≈ 200 ms latency and lets you inject domain hot-words on the fly. |
| 3. Can I ship it today? | Yes. The repo ships a Docker image, a Gradio demo, and a documented HTTP API. No license fee is mentioned in the report. |
| 4. Minimum hardware? | Nano version: RTX 3060 16 GB, 4-core CPU, 16 GB RAM for a single real-time stream. Full version: one A100 80 GB for 30 concurrent streams. |
| 5. Known weak spots? | Audio longer than 5 min needs an external VAD; far-field/multi-channel is not released; low-resource languages (Thai, Vietnamese, Indonesian) lag behind Chinese and English. |
| 6. Do I have to re-train? | Not for plain Mandarin/English tasks. Download the checkpoint, add a 50-line TSV hot-word list, and call the REST endpoint. |

2. Why another LLM-based recogniser?

Classic ASR = acoustic model + lexicon + language model + decoder.
LLM ASR = audio encoder → adaptor → large language model → text.

Pros

  • One neural set, no hand-built lexicon.
  • The LLM has seen billions of sentences → better context.

Cons

  • Hallucinates when audio is noisy or silent.
  • Autoregressive generation = latency risk.
  • Needs gigantic paired data.

FunAudio-ASR keeps the pros and attacks the cons with data scaling + model scaling + RL fine-tuning + product-side hacks (streaming, noise robustness, hot-word, code-switch).


3. System diagram in one glance

[Figure 2 from the report: high-level block diagram of the FunAudio-ASR stack]

| Block | Size | Job |
| --- | --- | --- |
| Audio Encoder | 0.7 B (full) or 0.2 B (nano) | Turns 80-dimensional mel filter-bank features into dense vectors. |
| Audio Adaptor | 2 transformer layers | Projects speech vectors into the same space as the text LLM. |
| CTC Head | Lightweight | Produces a first-pass greedy hypothesis for hot-word retrieval. |
| LLM Decoder | 7 B (full) or 0.6 B (nano) | Emits the final token sequence; can see hot-word candidates plus previous context. |
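
To make the wiring concrete, here is a minimal PyTorch sketch of the block diagram. Everything beyond what the table states (80-dim fbank input, a 2-layer adaptor, a CTC first pass feeding hot-word retrieval, an LLM decoder) is an assumption: the dimensions, head count and the stand-in linear "encoder" are placeholders, not the released architecture.

```python
import torch
import torch.nn as nn

class FunAudioASRSketch(nn.Module):
    """Illustrative wiring only; sizes and the stand-in encoder are placeholders."""

    def __init__(self, enc_dim=1024, llm_dim=3584, vocab_size=151_000):
        super().__init__()
        self.encoder = nn.Linear(80, enc_dim)           # stand-in for the ~0.7 B audio encoder
        self.adaptor = nn.ModuleList([
            nn.TransformerEncoderLayer(enc_dim, nhead=16, batch_first=True)
            for _ in range(2)                           # "2 transformer layers" per the table
        ])
        self.proj = nn.Linear(enc_dim, llm_dim)         # map into the text LLM's embedding space
        self.ctc_head = nn.Linear(enc_dim, vocab_size)  # lightweight first-pass head

    def forward(self, fbank):                           # fbank: (batch, frames, 80)
        speech = self.encoder(fbank)                    # dense speech vectors
        ctc_logits = self.ctc_head(speech)              # greedy decode -> hot-word retrieval
        x = speech
        for layer in self.adaptor:
            x = layer(x)
        prefix = self.proj(x)                           # prepended to the LLM's text prompt
        return ctc_logits, prefix                       # the 7 B / 0.6 B LLM decodes from here

model = FunAudioASRSketch()
ctc_logits, prefix = model(torch.randn(1, 300, 80))    # ~3 s of 10 ms frames
```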

4. Data pipeline: how many hours and how clean?

| Stage | Volume | Source & tricks |
| --- | --- | --- |
| Pre-train | ~10 M h | 90 % unlabelled web crawl; 10 % labelled by Paraformer-V2, Whisper and SenseVoice; ITN (inverse text normalisation); VAD clipping. |
| SFT | ~1 M h | Human transcripts + TTS synthesis (CosyVoice3) + noise augmentation + simulated streaming chunks. |
| RL | ~100 k h | Hard cases (the three labelling models disagree), long audio (> 20 s), hallucination examples, keyword-heavy utterances. |
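
The "hard cases where the labelling models disagree" filter can be approximated with a pairwise word-error check. A minimal sketch, assuming the `jiwer` package and a 15 % disagreement threshold (the report specifies neither):

```python
import jiwer  # pip install jiwer

def is_hard_case(hypotheses, disagreement_threshold=0.15):
    """Flag an utterance for the RL pool when the labelling models
    (e.g. Paraformer-V2, Whisper, SenseVoice) disagree with each other."""
    pairs = [(a, b) for i, a in enumerate(hypotheses) for b in hypotheses[i + 1:]]
    worst = max(jiwer.wer(a, b) for a, b in pairs)   # pairwise word error rate
    return worst > disagreement_threshold

print(is_hard_case([
    "please schedule the crispr review for tuesday",
    "please schedule the crisper review for tuesday",
    "please schedule a crisp review on tuesday",
]))   # True -> route this clip to the RL data pool
```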

5. Training recipe: five sequential stages

| Stage | Data | Frozen modules | Unfrozen modules | Purpose |
| --- | --- | --- | --- | --- |
| 1 | 20 k h | Encoder + LLM | Adaptor | Align speech ↔ text spaces. |
| 2 | 10 M h | LLM | Encoder + Adaptor | Learn robust acoustic features. |
| 3 | 20 k h | Encoder + Adaptor | LLM (LoRA) | Keep generative power, avoid catastrophic forgetting. |
| 4 | 3 M h | — | Encoder + Adaptor + LLM (LoRA) | Push accuracy on high-quality transcripts. |
| 5 | 100 k h | Encoder | CTC head | Greedy hypothesis for later hot-word retrieval. |

In stage 5 the encoder weights are locked; only the CTC head is trained.
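
A minimal sketch of the per-stage freeze/unfreeze schedule. The stage-to-module mapping follows the table; the attribute names (`encoder`, `adaptor`, `ctc_head`, `llm`) and the idea of attaching LoRA adapters through a library such as peft are assumptions, not the report's training code.

```python
def set_stage(model, stage):
    """Toggle requires_grad per training stage, following the table above.
    In stages 3-4 the LLM would normally carry LoRA adapters so that only
    those adapter weights actually train."""
    trainable = {
        1: {"adaptor"},                      # align speech and text spaces
        2: {"encoder", "adaptor"},           # learn robust acoustic features
        3: {"llm"},                          # LoRA-adapted LLM only
        4: {"encoder", "adaptor", "llm"},    # everything (LLM still via LoRA)
        5: {"ctc_head"},                     # greedy first-pass head
    }[stage]
    for name in ("encoder", "adaptor", "ctc_head", "llm"):
        module = getattr(model, name, None)  # skip parts a given sketch/checkpoint lacks
        if module is None:
            continue
        for p in module.parameters():
            p.requires_grad = name in trainable

set_stage(model, 3)   # e.g. the 20 k h LoRA stage, on the sketch model from section 3
```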


6. Making it production-grade: six engineering add-ons

  1. Streaming

    • Training: cut utterances into 80 ms chunks, feed only leftward context.
    • Inference: Torch encoder → CPU, then SGLang rollout → GPU; switch overhead < 6 %.
  2. Noise robustness

    • 110 k h clean speech + 10 k h noise → 110 k h mixed (SNR 10 dB ± 5 dB).
    • Online mix-up during SFT: 30 % of mini-batches.
    • Result: 13 % relative WER reduction in cafeteria, subway, supermarket.
  3. Code-switching (Chinese ↔ English)

    • Collect 40 k frequent English terms.
    • Prompt Qwen3 to create CS sentences → TTS → human check.
    • Test set A WER falls from 4.53 % → 1.59 %.
  4. Hot-word customisation

    • Build a phoneme / word-piece vocabulary.
    • CTC hypothesis → edit-distance retrieval → top-K candidates fed to the LLM (see the retrieval sketch after this list).
    • Recall on “names” set improves from 0.75 → 1.0.
  5. Hallucination suppression

    • Add zero-padded pure-noise segments during training; label is empty.
    • RL reward penalises regex-detected hallucinations.
    • Real-world false-transcription rate −35 %.
  6. Multilingual extension

    • Down-sample Chinese & English, up-sample Vietnamese, Thai, Indonesian.
    • 500 k h balanced set; same five-stage training.
    • CommonVoice Thai WER 1.44 % vs Whisper-large 5.92 %.
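
The retrieval step in item 4 is essentially fuzzy matching of the CTC first pass against the user's hot-word list. A minimal sketch, using `difflib` similarity as a stand-in for the report's edit-distance scoring; the threshold and K are assumptions:

```python
from difflib import SequenceMatcher

def retrieve_hotwords(ctc_hypothesis, hotwords, top_k=10, min_score=0.5):
    """Rank user hot-words by fuzzy similarity against n-grams of the CTC
    first-pass hypothesis; the survivors are offered to the LLM as candidates."""
    tokens = ctc_hypothesis.lower().split()
    scored = []
    for word in hotwords:
        width = max(1, len(word.split()))
        grams = [" ".join(tokens[i:i + width])
                 for i in range(len(tokens) - width + 1)] or [""]
        score = max(SequenceMatcher(None, word.lower(), g).ratio() for g in grams)
        if score >= min_score:
            scored.append((score, word))
    return [w for _, w in sorted(scored, reverse=True)[:top_k]]

# The greedy CTC pass often mangles rare names; retrieval still recovers them.
print(retrieve_hotwords("the crisper edit was reviewed by jang san",
                        ["CRISPR", "ZhangSan", "biology"]))
# ['CRISPR', 'ZhangSan'] -> injected into the LLM prompt as hot-word candidates
```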

7. Numbers: benchmarks vs reality

Open-source sets (WER %)

| Dataset | Whisper-large-v3 | Seed-ASR | FunAudio-ASR | FunAudio-ASR-nano |
| --- | --- | --- | --- | --- |
| AIShell-1 | 4.72 | 0.68 | 0.54 | 1.22 |
| LibriSpeech-clean | 1.86 | 1.58 | 1.57 | 1.94 |
| Fleurs-zh | 5.18 | 3.43 | 2.64 | 3.47 |

In-house industry sets collected after June 30 (WER %)

| Scenario | Whisper | Seed-ASR | FunAudio-ASR | Relative drop vs Seed-ASR |
| --- | --- | --- | --- | --- |
| Complex background | 32.57 | 12.90 | 11.29 | −12 % |
| Home TV noise | 18.17 | 8.08 | 5.17 | −36 % |
| English general | 18.56 | 15.65 | 14.22 | −9 % |
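
The last column is the relative WER reduction against Seed-ASR, not an absolute point drop:

```python
def relative_drop(baseline_wer, new_wer):
    """Relative WER change vs the baseline, as in the last column above."""
    return (new_wer - baseline_wer) / baseline_wer * 100

print(f"{relative_drop(8.08, 5.17):+.0f} %")   # Home TV noise: -36 %
```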

Streaming vs offline penalty
Average WER increases by only 1.3 points, while latency drops from 2–3 s to roughly 0.2 s.


8. Quick start: from Docker to first caption

(Commands copied 1-to-1 from the repo’s README inside the report; no extras added.)

  1. Pull the image

```bash
docker pull funaudio/pytorch:2.1.0-cuda12.1-devel
```

  2. Run the container

```bash
docker run --gpus all -it -p 7860:7860 -v $PWD:/workspace funaudio/pytorch:2.1.0-cuda12.1-devel
```

  3. Install dependencies

```bash
pip install -r requirements.txt
```

  4. Download the checkpoint

```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="FunAudio/FunAudio-ASR-nano", local_dir="./model")
```

  5. Launch the streaming service

```bash
python -m funaudio.service --model ./model --hotword ./my_hotwords.tsv --port 7860
```

Browse http://localhost:7860 for a live demo page.

  6. Hot-word TSV format (UTF-8, tab-separated)

```
biology	生物
CRISPR	基因编辑
ZhangSan	张叁
```
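
To sanity-check the hot-word file before launching the service, a small loader like the one below is enough. This helper is not part of the repo, and the column semantics (term plus its rendering, as in the sample above) are inferred from the example rather than documented in the report.

```python
from pathlib import Path

def load_hotwords(path="my_hotwords.tsv"):
    """Read the UTF-8, tab-separated hot-word list shown above.
    First column: the surface form to boost; second column (here the Chinese
    rendering) is kept as an alias. Blank lines are ignored."""
    entries = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        entries[cols[0].strip()] = cols[1].strip() if len(cols) > 1 else ""
    return entries

print(load_hotwords())   # {'biology': '生物', 'CRISPR': '基因编辑', 'ZhangSan': '张叁'}
```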

9. Frequently-asked questions (FAQ)

Q1: Do I need an NVIDIA card?
A: Nano runs on RTX 3060 16 GB; CPU fallback is not released yet.

Q2: Can the model run on Android?
A: Not today. INT8 quantisation and NNAPI delegate are on the roadmap.

Q3: How long does a complete fine-tune take?
A: Stage 4 (3 M h) needs 8×A100 80 GB × 15 days. Most teams stop at LoRA stage 3 (20 k h, 8 h on one A100).

Q4: Is there a cloud API SLA?
A: The paper does not mention SLAs; the downloadable weights are released under Apache-2.0.

Q5: My audio is 48 kHz stereo. What should I do?
A: Down-sample to 16 kHz mono first. The encoder was trained on 16 kHz only.
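
A minimal conversion sketch, assuming librosa and soundfile are installed (any equivalent resampler, e.g. ffmpeg or torchaudio, works just as well):

```python
import librosa
import soundfile as sf

# Fold 48 kHz stereo down to the 16 kHz mono the encoder was trained on.
audio, sr = librosa.load("meeting_48k_stereo.wav", sr=16000, mono=True)
sf.write("meeting_16k_mono.wav", audio, sr)
```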


10. Known limitations (straight from section 7)

  1. Optimised mainly for Chinese & English; other languages still improving.
  2. Context window of roughly 5 min; longer meetings need an external VAD.
  3. Single-channel, close-talk only. Far-field/multi-channel is lab-grade.

11. Take-away

FunAudio-ASR does not just chase leaderboard WER; it trades a small accuracy gap (1.3 points between streaming and offline decoding) for large real-world gains: roughly 10× lower streaming latency, about 1.3× higher hot-word recall, and 35 % fewer hallucinated transcriptions.
If you need Mandarin/English recognition that works in cafeterias, taxis and living rooms, and you want full-stack control, this checkpoint is worth a serious A/B test.