Imagine giving an AI three seconds of a podcast intro and having it continue the conversation—same host, same room tone, same energy—without ever being trained on that show.
Xiaomi’s MiMo-Audio team open-sourced a 7-billion-parameter model that does exactly this (and more) after pre-training on 100 million hours of raw speech. Below is the full story, translated into plain English and kept strictly to the facts published in their paper, blog, and code.


1. What problem is MiMo-Audio trying to solve?

Most voice AI tools today are one-trick ponies:

  • A great text-to-speech (TTS) engine can’t transcribe.
  • A solid speech-to-text (STT) model can’t clone a voice.
  • If you need a new task, you re-collect data and fine-tune.

Humans don’t work that way. We hear two examples and immediately “get” the pattern. GPT-3 showed that scaling text-only training produces this few-shot ability for words. Xiaomi asked: Does the same trick work for sound?

Their answer: Yes—if you keep every piece of acoustic detail and train big enough.


2. Meet the family (all weights are public)

| Model | Parameters | Purpose | HF Repo |
| --- | --- | --- | --- |
| MiMo-Audio-Tokenizer | 1.2 B | Turns 24 kHz waveforms into 200 tokens/second | link |
| MiMo-Audio-7B-Base | 7 B | Pre-trained foundation (few-shot only) | link |
| MiMo-Audio-7B-Instruct | 7 B | Instruction-tuned for chat & TTS | link |
| MiMo-Audio-Eval | toolkit | Reproduces all paper numbers | link |

Everything is Apache 2.0, including the tokenizer and evaluation sets.


3. Core skills in one glance

  1. Speech Continuation
    Give a 3-second prompt → model keeps talking in the same voice, mood, and room acoustics.

  2. Few-shot Voice Tasks (no gradients, at most 16 in-context examples; see the prompt sketch after this list)

    • Voice conversion
    • Emotion swap
    • Speed change
    • Denoising
    • Language translation (speech in, speech out)
  3. Instruct TTS
    Type a sentence + plain-text description (“angry reporter, loud, fast”) → spoken audio with that exact style.

  4. ASR & Translation
    LibriSpeech test-clean WER 3.76 %; AISHELL-1 WER 1.78 %; 16-shot En→Zh speech-to-speech translation works out of the box.
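
The few-shot tasks under item 2 are pure in-context learning: the examples go into the prompt rather than into gradient updates. As a rough conceptual sketch in Python (not the repository's actual API; tokenize is a placeholder for the MiMo-Audio tokenizer call), a 16-shot voice-conversion prompt interleaves example input/output clips and ends with the query:

# Conceptual sketch only: assembling a few-shot voice-conversion prompt.
# `tokenize` stands in for the real MiMo-Audio tokenizer; the model is then asked to
# continue the token sequence, which yields the converted version of the query clip.
def build_fewshot_prompt(example_pairs, query_wav, tokenize):
    prompt_tokens = []
    for src_wav, tgt_wav in example_pairs[:16]:   # at most 16 (source, target) demonstrations
        prompt_tokens += tokenize(src_wav) + tokenize(tgt_wav)
    prompt_tokens += tokenize(query_wav)          # the model continues with the converted speech
    return prompt_tokens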


4. How the architecture keeps acoustic details alive

Problem: A single second of speech produces ~200 discrete tokens (25 Hz frames, each carrying several RVQ codebook entries). Feeding those straight into an LLM makes the context length explode.

Solution: Three components in series

  1. Patch Encoder – Groups every 4 consecutive frames into one patch → 6.25 Hz representation.
  2. MiMo-7B LLM backbone – Sees either text tokens or patch embeddings, trained with the same next-token objective.
  3. Patch Decoder – Expands 6.25 Hz patches back to 25 Hz, using delayed RVQ generation (layer 0 starts first, layer 1 lags 1 step, etc.) to keep quality high.
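
To make the patching arithmetic and the delay schedule concrete, here is a minimal PyTorch sketch. The 8-layer RVQ depth and the tensor shapes are assumptions inferred from the numbers above (200 tokens/s at 25 Hz), not taken from the released code:

import torch

# Assumed for illustration: 25 Hz tokenizer frames, 8 RVQ layers, patches of 4 frames.
FRAME_RATE_HZ, PATCH_SIZE, NUM_RVQ_LAYERS = 25, 4, 8

def group_into_patches(frames: torch.Tensor) -> torch.Tensor:
    """Fold (T, NUM_RVQ_LAYERS) frame tokens into (T/4, 4, NUM_RVQ_LAYERS) patches at 6.25 Hz."""
    T = frames.shape[0] - frames.shape[0] % PATCH_SIZE         # trim to a multiple of the patch size
    return frames[:T].reshape(T // PATCH_SIZE, PATCH_SIZE, -1)

def rvq_delay_mask(num_steps: int) -> torch.Tensor:
    """Delayed RVQ schedule: codebook layer k may only emit from step k onward, so layer 0 leads."""
    steps = torch.arange(num_steps).unsqueeze(1)               # (num_steps, 1)
    layers = torch.arange(NUM_RVQ_LAYERS).unsqueeze(0)         # (1, NUM_RVQ_LAYERS)
    return steps >= layers                                     # True where a layer is allowed to emit

# 10 s of audio: 250 frames x 8 layers = 2,000 tokens, but only 62 patches reach the LLM.
ten_seconds = torch.zeros(10 * FRAME_RATE_HZ, NUM_RVQ_LAYERS, dtype=torch.long)
print(group_into_patches(ten_seconds).shape)                   # torch.Size([62, 4, 8])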

The tokenizer is trained from scratch on 10 M hours of audio with two losses at once:

  • Reconstruction L1 (multi-scale mel)
  • Text-alignment (next-token prediction on transcript)

This joint loss forces the same codebook to carry both semantic and acoustic information, removing the usual trade-off.
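
In code terms, "two losses on one codebook" looks roughly like the sketch below; the function name, shapes, and the unit weight are placeholders, and the exact formulation is in the paper:

import torch.nn.functional as F

# Illustrative only: joint tokenizer objective that combines multi-scale mel reconstruction
# with next-token prediction on the transcript, both back-propagated into the same codebook.
def tokenizer_loss(pred_mels, target_mels, text_logits, transcript_ids, w_align=1.0):
    recon = sum(F.l1_loss(p, t) for p, t in zip(pred_mels, target_mels))   # multi-scale mel L1
    align = F.cross_entropy(text_logits.transpose(1, 2), transcript_ids)   # next-token prediction
    return recon + w_align * align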


5. Training pipeline (numbers straight from the paper)

| Stage | Data size | Mix | Goal | Text loss | Audio loss |
| --- | --- | --- | --- | --- | --- |
| 1. Understanding | 2.6 T tokens | 1.2 T text + 1.4 T speech | Learn cross-modal alignment | ✔ | ✗ |
| 2. Joint U+G | 5 T tokens | 2.6 T text + 2.4 T speech | Add generation ability | ✔ | ✔ (weighted 12-8-6-4-2-2-1-1 across RVQ layers) |

  • Batch size: 16.8 M tokens per step
  • Context: 8 192 patches (≈ 20 min of audio)
  • Learning rates: patch modules 2e-4, LLM 3e-5, cosine decay
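
To see what the split learning rates mean in practice, here is a minimal PyTorch sketch; the AdamW choice, the module placeholders, and the step count are assumptions, not details from the paper:

import torch
import torch.nn as nn

# Placeholders standing in for the three modules from section 4.
patch_encoder, patch_decoder, llm = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
total_steps = 100_000                                    # assumed, not a number from the paper

# Two parameter groups: fresh patch modules at 2e-4, pre-trained LLM backbone at 3e-5.
optimizer = torch.optim.AdamW([
    {"params": list(patch_encoder.parameters()) + list(patch_decoder.parameters()), "lr": 2e-4},
    {"params": llm.parameters(), "lr": 3e-5},
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)   # cosine decay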

After stage-2 the base model is frozen; instruction tuning uses 100 B new tokens (ASR, TTS, dialog, InstructTTS, etc.) to produce the instruct checkpoint.


6. Data recipe: 100 million hours without human labels

Collection

  • Public podcasts, audiobooks, news, live streams, and conference recordings
  • A language-ID filter keeps the mix roughly 95 % Chinese/English; no transcript is discarded
  • Speaker diarization splits the audio into segments ≤ 30 s

Automatic labelling (key to scaling)

  • Semantic side: force-align transcripts, then run an LLM scorer that rates knowledge density and logic and assigns topic tags
  • Acoustic side: an in-house caption model writes free-text descriptions (“male, cheerful, slight echo, office background”)

Curation
Drop noisy, unsafe, or < 2 kbps bitrate clips; re-sample high-description-score clips 3× to boost quality.
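
A toy version of that curation pass is sketched below; the field names and the 0.9 score threshold are invented, while the 2 kbps cut-off and the 3× re-sampling come from the recipe above:

# Illustrative curation pass over clip metadata; only the 2 kbps cut-off and the
# 3x re-sampling of well-described clips come from the recipe, the rest is made up.
def curate(clips, min_bitrate_kbps=2, high_score=0.9, boost=3):
    kept = []
    for clip in clips:
        if clip["noisy"] or clip["unsafe"] or clip["bitrate_kbps"] < min_bitrate_kbps:
            continue                                  # drop noisy, unsafe, or very low-bitrate audio
        copies = boost if clip["description_score"] >= high_score else 1
        kept.extend([clip] * copies)                  # re-sample high-scoring clips 3x
    return kept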


7. Evaluation highlights (all reproducible with the public toolkit)

7.1 SpeechMMLU (general knowledge, 8 549 QA pairs)

| Model | Text→Text | Speech→Speech | Speech→Text | Text→Speech |
| --- | --- | --- | --- | --- |
| MiMo-7B-Base | 72.5 | 69.1 | 69.5 | 71.5 |
| Step-Audio2-mini | 74.1 | 51.8 | 67.8 | 63.4 |
| Kimi-Audio-Base | 70.7 | 11.8 | 67.9 | 0.0 |

Modality gap (text vs speech) = 3.4 points, smallest among peers.

7.2 MMAU (audio understanding across speech, sound, music)

| Model | Speech | Sound | Music | Overall |
| --- | --- | --- | --- | --- |
| MiMo-Instruct | 68.5 | 82.6 | 73.7 | 74.9 |
| Gemini-2.5-Flash | 76.6 | 73.3 | 65.6 | 71.8 |
| Qwen2.5-Omni | 70.6 | 78.1 | 65.9 | 71.5 |

Best open-source score; beats Google’s closed Gemini on sound & music.

7.3 Spoken dialogue (Big-Bench-Audio)

| Model | Speech→Text | Speech→Speech |
| --- | --- | --- |
| gpt-4o-audio | 70.2 | 67.2 |
| MiMo-Instruct | 72.9 | 60.2 |
| Step-Audio2-mini | 50.9 | 47.5 |

7.4 Instruct TTS (human preference on InstructTTSEval)

| Model | EN Overall | ZH Overall |
| --- | --- | --- |
| MiMo-Instruct | 72.6 | 70.5 |
| gpt-4o-mini-tts | 68.5 | 51.1 |

8. Quick start: local demo in 15 minutes

Hardware:

  • GPU with ≥ 16 GB VRAM (FP16) or ≥ 32 GB system RAM for CPU offload
  • Ubuntu 20.04 / Windows 11 + Python 3.10 tested

Install

git clone https://github.com/XiaomiMiMo/MiMo-Audio.git
cd MiMo-Audio
pip install -e .
# optional but faster
pip install flash-attn --no-build-isolation

Download weights

huggingface-cli download XiaomiMiMo/MiMo-Audio-Tokenizer  --local-dir ./tokenizer
huggingface-cli download XiaomiMiMo/MiMo-Audio-7B-Instruct --local-dir ./instruct
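
If you prefer to fetch the weights from Python (for example inside a setup script), huggingface_hub offers the equivalent call; the repo IDs are the same as above:

# Python equivalent of the two huggingface-cli downloads above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="XiaomiMiMo/MiMo-Audio-Tokenizer", local_dir="./tokenizer")
snapshot_download(repo_id="XiaomiMiMo/MiMo-Audio-7B-Instruct", local_dir="./instruct")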

Launch Gradio UI

python run_mimo_audio.py --tokenizer_path ./tokenizer --model_path ./instruct

Open http://127.0.0.1:7860, drop an audio file or type text, choose the style prompt, and click Generate.
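
If you would rather script against the running demo than click through the UI, the gradio_client package can talk to the same server; the demo's endpoint names are not documented here, so query them first (nothing below is specific to MiMo-Audio beyond the URL):

# Optional: drive the local Gradio demo from Python instead of the browser.
from gradio_client import Client

client = Client("http://127.0.0.1:7860")
client.view_api()   # prints the demo's actual endpoint names and parameters
# Then call client.predict(...) with the endpoint name and arguments reported above.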


9. Sample commands for developers

Few-shot voice conversion

python inference_example_pretrain.py \
  --task fewshot_vc \
  --examples_dir ./16_pairs_same_sentence/ \
  --target_wav newSpeaker.wav \
  --output out.wav

Instruct TTS with emotion

python inference_example_sft.py \
  --task instruct_tts \
  --text "Welcome to our beautiful city where history meets modernity." \
  --instruction "Enthusiastic female tour-guide, fast tempo, slight British accent" \
  --out welcome.wav

Speech continuation (base model)

python inference_example_pretrain.py \
  --task continuation \
  --prompt prompt.wav \
  --length 15 \
  --out continuation.wav

10. Known limitations (quoted from authors)

  • In-context quality drops when background music is loud or multiple overlapping events exist.
  • Dialogue mode can show timbre drift, mis-pronounce symbols, or ignore part of the style prompt.
  • Chain-of-thought “thinking” helps speech tasks but slightly hurts music/sound scores because of hallucinated reasoning.

Planned next steps: RL fine-tuning, a larger music corpus, and a streaming architecture.

11. Take-away in one sentence

MiMo-Audio brings GPT-3-style few-shot generalisation to the voice domain: keep the acoustic details, scale the data, and a 7 B model can suddenly clone voices, translate speech, follow style instructions, and improvise a podcast without ever seeing those tasks at training time.

If you build audiobooks, games, dubbing pipelines, or just want an open alternative to closed cloud APIs, download the weights and try the Gradio demo today—the install commands above really are the whole setup.