Imagine giving an AI three seconds of a podcast intro and having it continue the conversation—same host, same room tone, same energy—without ever being trained on that show.
Xiaomi’s MiMo-Audio team open-sourced a 7-billion-parameter model that does exactly this (and more) after compressing 100 million hours of raw speech. Below is the full story, translated into plain English and kept strictly to the facts published in their paper, blog, and code.
1. What problem is MiMo-Audio trying to solve?
Most voice AI tools today are one-trick ponies:
- A great text-to-speech (TTS) engine can’t transcribe.
- A solid speech-to-text (STT) model can’t clone a voice.
- If you need a new task, you re-collect data and fine-tune.
Humans don’t work that way. We hear two examples and immediately “get” the pattern. GPT-3 showed that scaling text-only training produces this few-shot ability for words. Xiaomi asked: Does the same trick work for sound?
Their answer: Yes—if you keep every piece of acoustic detail and train big enough.
2. Meet the family (all weights are public)
The release spans the MiMo-Audio-Tokenizer, the 7 B base model (used below for speech continuation and few-shot tasks), and the MiMo-Audio-7B-Instruct checkpoint. Everything is Apache 2.0, including the tokenizer and evaluation sets.
3. Core skills at a glance
- Speech Continuation
  Give a 3-second prompt → the model keeps talking in the same voice, mood, and room acoustics.
- Few-shot Voice Tasks (no gradients, 16 examples max; see the prompt sketch after this list)
  - Voice conversion
  - Emotion swap
  - Speed change
  - Denoising
  - Language translation (speech in, speech out)
- Instruct TTS
  Type a sentence + a plain-text description (“angry reporter, loud, fast”) → spoken audio in exactly that style.
- ASR & Translation
  LibriSpeech test-clean WER 3.76 %; AISHELL-1 WER 1.78 %; 16-shot En→Zh speech-to-speech translation works out of the box.
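To make “16 examples, no gradients” concrete, here is a minimal sketch of how a few-shot voice-conversion prompt could be assembled purely in context; the tokenize helper and the prompt layout are illustrative assumptions, not the released API:

```python
def build_fewshot_prompt(example_pairs, target_wav, tokenize):
    """Interleave (source, converted) audio-token sequences as in-context
    examples, then append the new source; the model is expected to continue
    with the converted version. No weights are updated at any point."""
    prompt = []
    for source_wav, converted_wav in example_pairs[:16]:  # at most 16 shots
        prompt += tokenize(source_wav) + tokenize(converted_wav)
    prompt += tokenize(target_wav)  # the model continues from here
    return prompt
```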
4. How the architecture keeps acoustic details alive
Problem: A single second of speech produces ~200 discrete tokens. Feeding those straight into an LLM makes the context length explode.
Solution: Three components in series
- Patch Encoder – groups every 4 consecutive frames into one patch → a 6.25 Hz representation.
- MiMo-7B LLM backbone – sees text OR patch embeddings, with the identical next-token training objective.
- Patch Decoder – expands 6.25 Hz patches back to 25 Hz, using delayed RVQ generation (layer 0 starts first, layer 1 lags 1 step, etc.) to keep quality high.
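As a minimal sketch of what the patching and the delayed-RVQ schedule look like (not the authors' code; tensor shapes and the RVQ layer count are illustrative assumptions):

```python
import torch

FRAME_HZ = 25      # tokenizer frame rate
PATCH_SIZE = 4     # 4 frames per patch -> 6.25 Hz seen by the LLM
NUM_LAYERS = 8     # number of RVQ codebook layers (illustrative)

def frames_to_patches(frames: torch.Tensor) -> torch.Tensor:
    """Group consecutive 25 Hz RVQ frames into 6.25 Hz patches.
    frames: (batch, time, layers) -> (batch, time // 4, 4 * layers)"""
    b, t, l = frames.shape
    t = t - t % PATCH_SIZE                       # drop any ragged tail
    return frames[:, :t].reshape(b, t // PATCH_SIZE, PATCH_SIZE * l)

def delayed_rvq_schedule(num_frames: int, num_layers: int = NUM_LAYERS):
    """Yield (frame, layer) pairs in delayed-generation order: layer 0 starts
    first and each finer layer k lags k steps behind, so coarse codes are
    already available when the finer ones are predicted."""
    for step in range(num_frames + num_layers - 1):
        for layer in range(num_layers):
            frame = step - layer
            if 0 <= frame < num_frames:
                yield frame, layer

# 3 frames, 3 layers -> (0,0), (1,0), (0,1), (2,0), (1,1), (0,2), ...
print(list(delayed_rvq_schedule(3, 3)))
```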
The tokenizer is trained from scratch on 10 M hours of audio with two losses at once:
- Reconstruction: L1 on multi-scale mel spectrograms
- Text alignment: next-token prediction on the transcript
This joint loss forces the same codebook to carry both semantic and acoustic information, removing the usual trade-off.
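A rough sketch of what such a joint objective looks like, assuming an L1 term over several mel scales plus a cross-entropy term on the aligned transcript (function and argument names are placeholders, not the paper's):

```python
import torch.nn.functional as F

def joint_tokenizer_loss(pred_mels, target_mels, text_logits, text_targets,
                         recon_weight=1.0, align_weight=1.0):
    """pred_mels / target_mels: lists of (batch, mel_bins, frames) tensors,
    one pair per mel scale; text_logits: (batch, seq, vocab);
    text_targets: (batch, seq) transcript token ids."""
    # Reconstruction: L1 distance summed over every mel scale
    recon = sum(F.l1_loss(p, t) for p, t in zip(pred_mels, target_mels))
    # Text alignment: next-token prediction on the transcript
    align = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                            text_targets.reshape(-1))
    return recon_weight * recon + align_weight * align
```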
5. Training pipeline (numbers straight from the paper)
Batch size: 16.8 M tokens per step
Context: 8 192 patches (≈ 20 min of audio)
Learning rates: Patch modules 2e-4, LLM 3e-5, cosine decay
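In PyTorch terms, the two learning rates would typically be wired up as separate parameter groups under one cosine schedule; a generic sketch, not the authors' training script (module handles and step count are placeholders):

```python
import torch

patch_modules = torch.nn.Linear(8, 8)   # stands in for the patch encoder/decoder
llm_backbone = torch.nn.Linear(8, 8)    # stands in for the MiMo-7B backbone

optimizer = torch.optim.AdamW([
    {"params": patch_modules.parameters(), "lr": 2e-4},
    {"params": llm_backbone.parameters(),  "lr": 3e-5},
])

total_steps = 10_000  # placeholder value
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
```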
After stage-2 the base model is frozen; instruction tuning uses 100 B new tokens (ASR, TTS, dialog, InstructTTS, etc.) to produce the instruct checkpoint.
6. Data recipe: 100 million hours without human labels
Collection
- Public podcasts, audiobooks, news, live streams, conference recordings
- Language-ID filter keeps 95 % Chinese/English, but no transcript is discarded
- Speaker diarization → segments ≤ 30 s

Automatic labelling (key to scaling)

- Semantic side: force-align transcripts, run an LLM scorer → knowledge density, logic, topic tags
- Acoustic side: in-house caption model writes free-text descriptions (“male, cheerful, slight echo, office background”)
Curation
Drop noisy, unsafe, or < 2 kbps bitrate clips; re-sample high-description-score clips 3× to boost quality.
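A toy version of that curation rule, assuming each clip record carries noise/safety flags, a bitrate, and the captioner's description score (field names and the score threshold are made up for illustration):

```python
def curate(clips):
    """Drop noisy, unsafe, or <2 kbps clips; repeat high-description-score clips 3x."""
    kept = []
    for clip in clips:
        if clip["noisy"] or clip["unsafe"] or clip["bitrate_kbps"] < 2:
            continue
        copies = 3 if clip["description_score"] >= 0.9 else 1  # threshold is illustrative
        kept.extend([clip] * copies)
    return kept
```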
7. Evaluation highlights (all reproducible with the public toolkit)
7.1 SpeechMMLU (general knowledge, 8 549 QA pairs)
Modality gap (text vs speech) = 3.4 points, smallest among peers.
7.2 MMAU (audio understanding across speech, sound, music)
Best open-source score; beats Google’s closed Gemini on sound & music.
7.3 Spoken dialogue (Big-Bench-Audio)
7.4 Instruct TTS (human preference on InstructTTSEval)
8. Quick start: local demo in 15 minutes
Hardware:
- GPU with ≥ 16 GB VRAM (FP16) or ≥ 32 GB system RAM for CPU off-load
- Ubuntu 20.04 / Windows 11 + Python 3.10 tested
Install
git clone https://github.com/XiaomiMiMo/MiMo-Audio.git
cd MiMo-Audio
pip install -e .
# optional but faster
pip install flash-attn --no-build-isolation
Download weights
huggingface-cli download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./tokenizer
huggingface-cli download XiaomiMiMo/MiMo-Audio-7B-Instruct --local-dir ./instruct
Launch Gradio UI
python run_mimo_audio.py --tokenizer_path ./tokenizer --model_path ./instruct
Open http://127.0.0.1:7860, drop an audio file or type text, choose the style prompt, and click Generate.
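If you would rather script the demo than click through the browser, the standard gradio_client package can talk to the same local server; the endpoint names and argument order depend on the app, so inspect them first (a sketch, not part of the MiMo-Audio repo):

```python
from gradio_client import Client

client = Client("http://127.0.0.1:7860")
client.view_api()  # lists the available endpoints and their parameters

# Once you know the endpoint, call it along these lines (api_name is hypothetical):
# result = client.predict("Hello from a script", api_name="/generate")
```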
9. Sample commands for developers
Few-shot voice conversion
python inference_example_pretrain.py \
--task fewshot_vc \
--examples_dir ./16_pairs_same_sentence/ \
--target_wav newSpeaker.wav \
--output out.wav
Instruct TTS with emotion
python inference_example_sft.py \
--task instruct_tts \
--text "Welcome to our beautiful city where history meets modernity." \
--instruction "Enthusiastic female tour-guide, fast tempo, slight British accent" \
--out welcome.wav
Speech continuation (base model)
python inference_example_pretrain.py \
--task continuation \
--prompt prompt.wav \
--length 15 \
--out continuation.wav
10. Known limitations (quoted from authors)
- In-context quality drops when background music is loud or multiple overlapping events occur.
- Dialogue mode can show timbre drift, mispronounce symbols, or ignore part of the style prompt.
- The thinking chain helps speech tasks but slightly hurts music/sound scores because of hallucinated reasoning.
Next steps: RL fine-tuning, larger music corpus, and streaming architecture.
11. Take-away in one sentence
MiMo-Audio delivers GPT-3-style few-shot generalisation to the voice domain—keep the acoustic details, scale the data, and a 7 B model can suddenly clone voices, translate speech, follow style instructions, and improvise a podcast without ever seeing those tasks at training time.
If you build audiobooks, games, dubbing pipelines, or just want an open alternative to closed cloud APIs, download the weights and try the Gradio demo today—the install commands above really are the whole setup.