Imagine giving an AI three seconds of a podcast intro and having it continue the conversation—same host, same room tone, same energy—without ever being trained on that show.
Xiaomi’s MiMo-Audio team open-sourced a 7-billion-parameter model that does exactly this (and more) after pre-training on 100 million hours of raw speech. Below is the full story, translated into plain English and kept strictly to the facts published in their paper, blog, and code.


1. What problem is MiMo-Audio trying to solve?

Most voice AI tools today are one-trick ponies:

  • A great text-to-speech (TTS) engine can’t transcribe.
  • A solid speech-to-text (STT) model can’t clone a voice.
  • If you need a new task, you re-collect data and fine-tune.

Humans don’t work that way. We hear two examples and immediately “get” the pattern. GPT-3 showed that scaling text-only training produces this few-shot ability for words. Xiaomi asked: Does the same trick work for sound?

Their answer: Yes—if you keep every piece of acoustic detail and train big enough.


2. Meet the family (all weights are public)

| Model | Parameters | Purpose | HF Repo |
| --- | --- | --- | --- |
| MiMo-Audio-Tokenizer | 1.2 B | Turns 24 kHz waveforms into 200 tokens/second | link |
| MiMo-Audio-7B-Base | 7 B | Pre-trained foundation (few-shot only) | link |
| MiMo-Audio-7B-Instruct | 7 B | Instruction-tuned for chat & TTS | link |
| MiMo-Audio-Eval | toolkit | Reproduces all paper numbers | link |

Everything is Apache 2.0, including the tokenizer and evaluation sets.


3. Core skills in one glance

  1. Speech Continuation
    Give a 3-second prompt → model keeps talking in the same voice, mood, and room acoustics.

  2. Few-shot Voice Tasks (no gradients, at most 16 in-context examples; see the prompt sketch after this list)

    • Voice conversion
    • Emotion swap
    • Speed change
    • Denoising
    • Language translation (speech in, speech out)
  3. Instruct TTS
    Type a sentence + plain-text description (“angry reporter, loud, fast”) → spoken audio with that exact style.

  4. ASR & Translation
    LibriSpeech test-clean WER 3.76 %; AISHELL-1 WER 1.78 %; 16-shot En→Zh speech-to-speech translation works out of the box.
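
The few-shot tasks under item 2 are pure in-context learning: the examples go into the prompt rather than into gradient updates. As a rough conceptual sketch in Python (not the repository's actual API; tokenize is a placeholder for the MiMo-Audio tokenizer call), a 16-shot voice-conversion prompt interleaves example input/output clips and ends with the query:

# Conceptual sketch only: assembling a few-shot voice-conversion prompt.
# `tokenize` stands in for the real MiMo-Audio tokenizer; the model is then asked to
# continue the token sequence, which yields the converted version of the query clip.
def build_fewshot_prompt(example_pairs, query_wav, tokenize):
    prompt_tokens = []
    for src_wav, tgt_wav in example_pairs[:16]:   # at most 16 (source, target) demonstrations
        prompt_tokens += tokenize(src_wav) + tokenize(tgt_wav)
    prompt_tokens += tokenize(query_wav)          # the model continues with the converted speech
    return prompt_tokens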


4. How the architecture keeps acoustic details alive

Problem: A single second of speech produces ~200 discrete tokens (25 Hz frames, each carrying several RVQ codebook entries). Feeding those straight into an LLM makes the context length explode.

Solution: Three components in series

  1. Patch Encoder – Groups every 4 consecutive frames into one patch → 6.25 Hz representation.
  2. MiMo-7B LLM backbone – Sees either text tokens or patch embeddings, trained with the same next-token objective.
  3. Patch Decoder – Expands 6.25 Hz patches back to 25 Hz, using delayed RVQ generation (layer 0 starts first, layer 1 lags 1 step, etc.) to keep quality high.
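
To make the patching arithmetic and the delay schedule concrete, here is a minimal PyTorch sketch. The 8-layer RVQ depth and the tensor shapes are assumptions inferred from the numbers above (200 tokens/s at 25 Hz), not taken from the released code:

import torch

# Assumed for illustration: 25 Hz tokenizer frames, 8 RVQ layers, patches of 4 frames.
FRAME_RATE_HZ, PATCH_SIZE, NUM_RVQ_LAYERS = 25, 4, 8

def group_into_patches(frames: torch.Tensor) -> torch.Tensor:
    """Fold (T, NUM_RVQ_LAYERS) frame tokens into (T/4, 4, NUM_RVQ_LAYERS) patches at 6.25 Hz."""
    T = frames.shape[0] - frames.shape[0] % PATCH_SIZE         # trim to a multiple of the patch size
    return frames[:T].reshape(T // PATCH_SIZE, PATCH_SIZE, -1)

def rvq_delay_mask(num_steps: int) -> torch.Tensor:
    """Delayed RVQ schedule: codebook layer k may only emit from step k onward, so layer 0 leads."""
    steps = torch.arange(num_steps).unsqueeze(1)               # (num_steps, 1)
    layers = torch.arange(NUM_RVQ_LAYERS).unsqueeze(0)         # (1, NUM_RVQ_LAYERS)
    return steps >= layers                                     # True where a layer is allowed to emit

# 10 s of audio: 250 frames x 8 layers = 2,000 tokens, but only 62 patches reach the LLM.
ten_seconds = torch.zeros(10 * FRAME_RATE_HZ, NUM_RVQ_LAYERS, dtype=torch.long)
print(group_into_patches(ten_seconds).shape)                   # torch.Size([62, 4, 8])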

The tokenizer is trained from scratch on 10 M hours of audio with two losses at once:

  • Reconstruction L1 (multi-scale mel)
  • Text-alignment (next-token prediction on transcript)

This joint loss forces the same codebook to carry both semantic and acoustic information, removing the usual trade-off.
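
In code terms, "two losses on one codebook" looks roughly like the sketch below; the function name, shapes, and the unit weight are placeholders, and the exact formulation is in the paper:

import torch.nn.functional as F

# Illustrative only: joint tokenizer objective that combines multi-scale mel reconstruction
# with next-token prediction on the transcript, both back-propagated into the same codebook.
def tokenizer_loss(pred_mels, target_mels, text_logits, transcript_ids, w_align=1.0):
    recon = sum(F.l1_loss(p, t) for p, t in zip(pred_mels, target_mels))   # multi-scale mel L1
    align = F.cross_entropy(text_logits.transpose(1, 2), transcript_ids)   # next-token prediction
    return recon + w_align * align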


5. Training pipeline (numbers straight from the paper)

| Stage | Data size | Mix | Goal | Text loss | Audio loss |
| --- | --- | --- | --- | --- | --- |
| 1. Understanding | 2.6 T tokens | 1.2 T text + 1.4 T speech | Learn cross-modal alignment | ✔ | ✗ |
| 2. Joint U+G | 5 T tokens | 2.6 T text + 2.4 T speech | Add generation ability | ✔ | ✔ (weighted 12-8-6-4-2-2-1-1 across RVQ layers) |

  • Batch size: 16.8 M tokens per step
  • Context: 8 192 patches (≈ 20 min of audio)
  • Learning rates: patch modules 2e-4, LLM 3e-5, cosine decay
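
To see what the split learning rates mean in practice, here is a minimal PyTorch sketch; the AdamW choice, the module placeholders, and the step count are assumptions, not details from the paper:

import torch
import torch.nn as nn

# Placeholders standing in for the three modules from section 4.
patch_encoder, patch_decoder, llm = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
total_steps = 100_000                                    # assumed, not a number from the paper

# Two parameter groups: fresh patch modules at 2e-4, pre-trained LLM backbone at 3e-5.
optimizer = torch.optim.AdamW([
    {"params": list(patch_encoder.parameters()) + list(patch_decoder.parameters()), "lr": 2e-4},
    {"params": llm.parameters(), "lr": 3e-5},
])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)   # cosine decay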

After stage-2 the base model is frozen; instruction tuning uses 100 B new tokens (ASR, TTS, dialog, InstructTTS, etc.) to produce the instruct checkpoint.


6. Data recipe: 100 million hours without human labels

Collection

  • Public podcasts, audiobooks, news, live streams, and conference recordings
  • A language-ID filter keeps the mix roughly 95 % Chinese/English; no transcript is discarded
  • Speaker diarization splits the audio into segments ≤ 30 s

Automatic labelling (key to scaling)

  • Semantic side: force-align transcripts, then run an LLM scorer that rates knowledge density and logic and assigns topic tags
  • Acoustic side: an in-house caption model writes free-text descriptions (“male, cheerful, slight echo, office background”)

Curation
Drop noisy, unsafe, or < 2 kbps bitrate clips; re-sample high-description-score clips 3× to boost quality.
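
A toy version of that curation pass is sketched below; the field names and the 0.9 score threshold are invented, while the 2 kbps cut-off and the 3× re-sampling come from the recipe above:

# Illustrative curation pass over clip metadata; only the 2 kbps cut-off and the
# 3x re-sampling of well-described clips come from the recipe, the rest is made up.
def curate(clips, min_bitrate_kbps=2, high_score=0.9, boost=3):
    kept = []
    for clip in clips:
        if clip["noisy"] or clip["unsafe"] or clip["bitrate_kbps"] < min_bitrate_kbps:
            continue                                  # drop noisy, unsafe, or very low-bitrate audio
        copies = boost if clip["description_score"] >= high_score else 1
        kept.extend([clip] * copies)                  # re-sample high-scoring clips 3x
    return kept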


7. Evaluation highlights (all reproducible with the public toolkit)

7.1 SpeechMMLU (general knowledge, 8 549 QA pairs)

| Model | Text→Text | Speech→Speech | Speech→Text | Text→Speech |
| --- | --- | --- | --- | --- |
| MiMo-7B-Base | 72.5 | 69.1 | 69.5 | 71.5 |
| Step-Audio2-mini | 74.1 | 51.8 | 67.8 | 63.4 |
| Kimi-Audio-Base | 70.7 | 11.8 | 67.9 | 0.0 |

Modality gap (text vs speech) = 3.4 points, smallest among peers.

7.2 MMAU (audio understanding across speech, sound, music)

| Model | Speech | Sound | Music | Overall |
| --- | --- | --- | --- | --- |
| MiMo-Instruct | 68.5 | 82.6 | 73.7 | 74.9 |
| Gemini-2.5-Flash | 76.6 | 73.3 | 65.6 | 71.8 |
| Qwen2.5-Omni | 70.6 | 78.1 | 65.9 | 71.5 |

Best open-source score; beats Google’s closed Gemini on sound & music.

7.3 Spoken dialogue (Big-Bench-Audio)

| Model | Speech→Text | Speech→Speech |
| --- | --- | --- |
| gpt-4o-audio | 70.2 | 67.2 |
| MiMo-Instruct | 72.9 | 60.2 |
| Step-Audio2-mini | 50.9 | 47.5 |

7.4 Instruct TTS (human preference on InstructTTSEval)

| Model | EN Overall | ZH Overall |
| --- | --- | --- |
| MiMo-Instruct | 72.6 | 70.5 |
| gpt-4o-mini-tts | 68.5 | 51.1 |

8. Quick start: local demo in 15 minutes

Hardware:

  • GPU with ≥ 16 GB VRAM (FP16) or ≥ 32 GB system RAM for CPU offload
  • Ubuntu 20.04 / Windows 11 + Python 3.10 tested

Install

git clone https://github.com/XiaomiMiMo/MiMo-Audio.git
cd MiMo-Audio
pip install -e .
# optional but faster
pip install flash-attn --no-build-isolation

Download weights

huggingface-cli download XiaomiMiMo/MiMo-Audio-Tokenizer  --local-dir ./tokenizer
huggingface-cli download XiaomiMiMo/MiMo-Audio-7B-Instruct --local-dir ./instruct
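
If you prefer to fetch the weights from Python (for example inside a setup script), huggingface_hub offers the equivalent call; the repo IDs are the same as above:

# Python equivalent of the two huggingface-cli downloads above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="XiaomiMiMo/MiMo-Audio-Tokenizer", local_dir="./tokenizer")
snapshot_download(repo_id="XiaomiMiMo/MiMo-Audio-7B-Instruct", local_dir="./instruct")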

Launch Gradio UI

python run_mimo_audio.py --tokenizer_path ./tokenizer --model_path ./instruct

Open http://127.0.0.1:7860, drop an audio file or type text, choose the style prompt, and click Generate.
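
If you would rather script against the running demo than click through the UI, the gradio_client package can talk to the same server; the demo's endpoint names are not documented here, so query them first (nothing below is specific to MiMo-Audio beyond the URL):

# Optional: drive the local Gradio demo from Python instead of the browser.
from gradio_client import Client

client = Client("http://127.0.0.1:7860")
client.view_api()   # prints the demo's actual endpoint names and parameters
# Then call client.predict(...) with the endpoint name and arguments reported above.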


9. Sample commands for developers

Few-shot voice conversion

python inference_example_pretrain.py \
  --task fewshot_vc \
  --examples_dir ./16_pairs_same_sentence/ \
  --target_wav newSpeaker.wav \
  --output out.wav

Instruct TTS with emotion

python inference_example_sft.py \
  --task instruct_tts \
  --text "Welcome to our beautiful city where history meets modernity." \
  --instruction "Enthusiastic female tour-guide, fast tempo, slight British accent" \
  --out welcome.wav

Speech continuation (base model)

python inference_example_pretrain.py \
  --task continuation \
  --prompt prompt.wav \
  --length 15 \
  --out continuation.wav

10. Known limitations (quoted from authors)

  • In-context quality drops when background music is loud or multiple overlapping events exist.
  • Dialogue mode can show timbre drift, mis-pronounce symbols, or ignore part of the style prompt.
  • Chain-of-thought “thinking” helps speech tasks but slightly hurts music/sound scores because of hallucinated reasoning.

Planned next steps: RL fine-tuning, a larger music corpus, and a streaming architecture.

11. Take-away in one sentence

MiMo-Audio brings GPT-3-style few-shot generalisation to the voice domain: keep the acoustic details, scale the data, and a 7 B model can suddenly clone voices, translate speech, follow style instructions, and improvise a podcast without ever seeing those tasks at training time.

If you build audiobooks, games, dubbing pipelines, or just want an open alternative to closed cloud APIs, download the weights and try the Gradio demo today—the install commands above really are the whole setup.