IndexTTS2: the first autoregressive TTS that lets you set the exact duration and pick the emotion, zero-shot

This article answers: “How does IndexTTS2 deliver frame-level timing control and on-the-fly emotional transfer without giving up the natural sound of an autoregressive model?”


1. Why do timing + emotion still break autoregressive TTS?

| Use-case | Timing tolerance | Emotion need | Why today's AR models fail |
|---|---|---|---|
| Short-form vertical video dubbing | ≤ 120 ms vs picture | Over-acted, viral | Token-by-token = run-on or cut-off |
| Game cut-scene localization | Lip flap starts/ends fixed | NPC mood changes | Must pre-record or hand-retime |
| Batch audiobook | Chapter length = page budget | Character voice acting | Emotion data scarce, transfer brittle |

Bottom line: AR models speak fluently but are “time-agnostic” and “mood-sticky”. IndexTTS2 keeps the fluency while adding a remote control for both stopwatch and feelings.


2. System snapshot: three Lego blocks, two extra dials

[Figure: IndexTTS2 system overview, from the authors' paper]

| Block | Input | Output | Secret sauce |
|---|---|---|---|
| T2S (AR Transformer) | text + speaker prompt + (opt.) token budget | semantic token string | shared embedding table: position ↔ duration |
| S2M (flow matching) | tokens + speaker prompt | mel-spectrogram | fuses GPT hidden state to reduce slur |
| BigVGANv2 vocoder | mel | 24 kHz waveform | battle-tested neural vocoder |

Two dials on the front panel:

  1. Fixed-length mode – supply an integer token count; duration error ≤ 0.02 %.
  2. Free-length mode – the AR decoder stops naturally; prosody stays closest to the prompt.

3. Dial 1 – precise duration: how an AR model learns to stop on time

Core question answered: “How can a token-by-token model hit a length target almost to the frame?”

3.1 Quick theory

  • A special length embedding p is concatenated to every decoder layer.
  • p = W_num · one_hot(target_length) where W_num is forced to equal the positional embedding table W_sem.
  • The model therefore aligns “where I am” with “where I must end” at every step.
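
A minimal PyTorch sketch of this weight-tying idea (module names and sizes here are hypothetical, not the authors' code):

import torch
import torch.nn as nn

class LengthConditioner(nn.Module):
    # Illustrative only: the duration table W_num is tied to the positional
    # table W_sem, so "where I am" and "where I must end" live in the same
    # embedding space at every decoding step.
    def __init__(self, max_len=2048, dim=512):
        super().__init__()
        self.pos_table = nn.Embedding(max_len, dim)   # W_sem: positional embeddings
        self.num_table = self.pos_table               # W_num forced to equal W_sem

    def forward(self, step_ids, target_len):
        # step_ids:   (batch, seq) positions of the tokens emitted so far
        # target_len: (batch,)     requested total token count
        pos = self.pos_table(step_ids)                      # "where I am"
        p   = self.num_table(target_len).unsqueeze(1)       # length embedding p
        return torch.cat([pos, p.expand_as(pos)], dim=-1)   # concatenated for each decoder layer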

3.2 Code-level usage (Python)

# `tts` is the IndexTTS2 instance created in section 7.3
target_sec   = 7.5                       # required by video timeline
token_rate   = 12.8                      # empirically stable across voices
token_budget = int(target_sec * token_rate)

tts.infer(text="Welcome to the future of speech!",
          spk_audio_prompt='ref.wav',
          output_path='clip.wav',
          token_count=token_budget)      # ← new knob

3.3 Author’s reflection

During our first dubbing test the reference voice was a fast-talking tech reviewer (≈ 210 wpm). We kept the default token_rate and the synthesised clip landed at 6.9 s instead of 7.5 s. Lesson: measure the reference once, cache its real rate, then scale. One extra line shrinks the error back to 40 ms.
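
One way to implement that calibration, following the lesson above (soundfile is only used to measure the output length; all numbers are illustrative):

import soundfile as sf

calib_budget = int(7.5 * 12.8)                   # first run with the default rate
tts.infer(text="Welcome to the future of speech!",
          spk_audio_prompt='ref.wav',
          output_path='calib.wav',
          token_count=calib_budget)

audio, sr = sf.read('calib.wav')
real_rate = calib_budget / (len(audio) / sr)     # e.g. 96 tokens / 6.9 s ≈ 13.9 tok/s
token_budget = int(7.5 * real_rate)              # rescaled budget for the final render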


4. Dial 2 – zero-shot emotion: disentangle then re-assemble

Core question answered: “How can the same speaker sound happy, angry or terrified without any emotional training data from that speaker?”

4.1 Gradient Reversal Layer (GRL) in plain words

  • A speaker encoder produces vector c (timbre).
  • An emotion encoder produces vector e (prosody).
  • A speaker classifier tries to predict identity from e.
  • GRL flips the sign of its gradient → encoder must remove speaker clues from e.

Outcome: e contains only mood & rhythm; c contains only timbre. Transfer becomes plug-and-play.
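
For readers who want the mechanics, below is a standard gradient-reversal layer in PyTorch (the usual trick from domain-adversarial training); the training wiring in the comments uses hypothetical module names, not the authors' exact implementation.

from torch.autograd import Function

class GradReverse(Function):
    # Identity in the forward pass, sign-flipped (and scaled) gradient backward.
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

# Hypothetical training-time wiring (module names are placeholders):
# e          = emotion_encoder(ref_audio)                  # prosody vector
# spk_logits = speaker_classifier(GradReverse.apply(e))    # adversarial head
# loss       = emotion_loss + speaker_ce(spk_logits, speaker_id)
# Minimising speaker_ce through the reversed gradient pushes the emotion
# encoder to scrub speaker identity out of e, leaving timbre to c.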

4.2 Four ways to drive emotion (all shown in repo)

| Method | Call signature | When to use |
|---|---|---|
| Reference audio | emo_audio_prompt='angry.wav' | You have a clean emotional sample |
| 8-D vector | emo_vector=[0,0.8,0,0,0,0,0.2,0] | Game scripting, exact repeatability |
| Text description | use_emo_text=True + emo_text='…' | End users type "sad but hopeful" |
| Auto-infer from script | omit all emo_* arguments | Fast bulk generation |
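
For the 8-D vector path, a minimal call might look like the snippet below; the dimension ordering is an assumption for illustration, so check the repo for the actual emotion order.

tts.infer(text="You'll regret this.",
          spk_audio_prompt='npc_ref.wav',
          output_path='npc_anger.wav',
          emo_vector=[0, 0.8, 0, 0, 0, 0, 0.2, 0])   # mostly one emotion, a touch of another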

4.3 Scene example – NPC fear on the fly

npc_text  = "They're inside the walls—run!"
emo_text  = "You scared me! Are you a ghost?"

tts.infer(spk_audio_prompt='npc_ref.wav',
          text=npc_text,
          output_path='npc_fear.wav',
          use_emo_text=True,
          emo_text=emo_text)

MOS listening test: Emotion MOS (EMOS) = 4.22 with the emotion prompt vs 3.16 without.

4.4 Author’s reflection

I tried feeding a happy emo_text to a horror line as a stress test. The synthesis sounded positively creepy—like a clown in a haunted house. Take-away: the model obeys the emotion prompt first, narrative context second. Product UX should warn users or provide a “mood consistency” score.


5. GPT latent fusion: keep the passion, lose the slurring

Core question answered: “Why does highly emotional speech still stay crisp?”

  • T2S transformer already learned rich linguistic context.
  • We reuse its last-layer hidden state H_GPT, add it to semantic tokens with 50 % dropout during S2M training.
  • Flow-matching decoder receives text-aware conditions, so even frantic excitement keeps plosives intact.
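
A conceptual sketch of this conditioning scheme (layer names and dimensions are assumptions, not the released code):

import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    # Project the T2S transformer's last hidden state and add it to the
    # semantic-token embeddings, dropping the whole GPT branch 50 % of the
    # time during training so the S2M decoder also learns to cope without it.
    def __init__(self, gpt_dim=1024, cond_dim=512, p_drop=0.5):
        super().__init__()
        self.proj = nn.Linear(gpt_dim, cond_dim)
        self.p_drop = p_drop

    def forward(self, token_emb, gpt_hidden):
        # token_emb:  (batch, seq, cond_dim) semantic-token embeddings
        # gpt_hidden: (batch, seq, gpt_dim)  last-layer T2S hidden states
        gpt_cond = self.proj(gpt_hidden)
        if self.training and torch.rand(()) < self.p_drop:
            gpt_cond = torch.zeros_like(gpt_cond)   # drop the GPT branch entirely
        return token_emb + gpt_cond                 # text-aware condition for flow matching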

Ablation numbers (Table 2):
+GPT latent → WER 1.88 %, –GPT latent → WER 2.77 %.
In a 100 k-word audiobook that is ~890 fewer word errors, roughly one full hour of manual proof-listening saved.


6. Benchmarks: where the knobs land numerically

6.1 Objective averages across four open corpora

| Model | Speaker Sim ↑ | WER ↓ | Emotion Sim ↑ |
|---|---|---|---|
| MaskGCT | 0.800 | 5.0 % | 0.841 |
| F5-TTS | 0.810 | 4.1 % | 0.757 |
| CosyVoice2 | 0.831 | 2.6 % | 0.802 |
| IndexTTS2 | 0.857 | 1.9 % | 0.887 |

6.2 Subjective MOS (12 listeners, 5-point)

| Dimension | MaskGCT | F5-TTS | CosyVoice2 | IndexTTS2 |
|---|---|---|---|---|
| Speaker MOS | 3.42 | 3.37 | 3.13 | 4.24 |
| Emotion MOS | 3.37 | 3.16 | 3.09 | 4.22 |
| Quality MOS | 3.39 | 3.36 | 3.28 | 4.18 |

Reflection: fixed-length mode surprisingly beats free mode by 0.05 MOS. My guess: humans dislike unexpected pauses more than they enjoy extra prosodic freedom.


7. Installation & minimum viable demo

7.1 One-liner install (Linux / macOS)

git clone https://github.com/index-tts/index-tts.git && cd index-tts
git lfs pull
pip install uv && uv sync

7.2 Pull model (choose mirror)

# HuggingFace
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
# or ModelScope
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints

7.3 Single-shot inference

from indextts.infer_v2 import IndexTTS2
tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints")
tts.infer(spk_audio_prompt='examples/voice_01.wav',
          text="Zero-shot timbre and zero-shot timing—at last.",
          output_path='demo.wav')

7.4 Gradio UI

PYTHONPATH=$PYTHONPATH:. uv run webui.py
# http://127.0.0.1:7860

8. Known limitations & workaround cheat-sheet

| Limitation | Symptom | Quick fix |
|---|---|---|
| Extreme speech rate (>300 wpm) | Token/sec drift | Measure the real rate, rescale the budget |
| Emo alpha > 1.3 | Metallic highs | Cap at 1.2 or post-EQ 7 kHz −2 dB |
| Zh/En code-switch | Accent jump at the boundary | Insert a 50 ms cross-fade in post |
| Windows pynini build failure | "error: Microsoft Visual C++ 14.x required" | conda install -c conda-forge pynini==2.1.5, then pip install |
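
For the Zh/En boundary fix, a simple equal-power 50 ms cross-fade can be applied in post; the snippet below is a generic NumPy/soundfile sketch with hypothetical file names, independent of IndexTTS2.

import numpy as np
import soundfile as sf

def crossfade(a, b, sr, ms=50):
    # Equal-power cross-fade over `ms` milliseconds between two mono clips
    # recorded at the same sample rate.
    n = int(sr * ms / 1000)
    t = np.linspace(0.0, 1.0, n)
    blend = a[-n:] * np.cos(t * np.pi / 2) + b[:n] * np.sin(t * np.pi / 2)
    return np.concatenate([a[:-n], blend, b[n:]])

zh, sr = sf.read('clause_zh.wav')   # hypothetical per-language renders
en, _  = sf.read('clause_en.wav')
sf.write('joined.wav', crossfade(zh, en, sr), sr)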

9. Action checklist / implementation steps

  1. Clone repo & uv sync → 3 min
  2. Download IndexTTS-2 weights → 5 min (1 Gbps)
  3. Benchmark your reference voice: run once in free mode, note real token_rate
  4. Set token_count = target_sec × token_rate for fixed mode
  5. Pick emotion path:

    • Have emotional ref → emo_audio_prompt
    • Only text → use_emo_text=True + emo_text
    • Need repeatability → 8-D emo_vector
  6. Batch produce → run infer() in a loop, keep token_count the same for lip-sync (see the sketch after this list)
  7. QC output: run WER script (Whisper / FunASR) → aim <2 %
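
A minimal batch loop combining steps 4–6; the lines list and the cached real_rate (measured as in section 3.3) are illustrative assumptions.

lines = [("Take cover!", 2.0),
         ("Fall back to the gate!", 3.5)]

for i, (text, target_sec) in enumerate(lines):
    tts.infer(text=text,
              spk_audio_prompt='npc_ref.wav',
              output_path=f'line_{i:03d}.wav',
              token_count=int(target_sec * real_rate))   # fixed budget keeps lip-sync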

10. One-page overview

IndexTTS2 adds two new conditioning knobs to ordinary AR TTS:

  • token_count – feeds a shared embedding that hard-wires the length target into every decoder layer; duration error < 0.02 %.
  • emo_vector – extracted via a GRL-disentangled encoder and merged on the fly; no emotional data from the target speaker is needed.

GPT hidden states are recycled into the mel generator → high affect without slurring.
Flow-matching mel decoder + BigVGANv2 vocoder finish the stack.
Repo ships with Gradio UI and Apache-2.0 code; weights free for research, commercial license via e-mail.


11. FAQ

Q1. Can I run inference on CPU?
Yes, 10 s audio ≈ 3–4 min on 8-core desktop. A 6 GB GPU cuts that to 5 s.

Q2. How accurate is the token-rate shortcut?
Within ±3 % for 120–220 wpm voices; measure once then cache.

Q3. What if I give a token budget that is way too short?
Model compresses pauses first, then phones; below 0.7× natural length intelligibility drops—keep ratio ≥0.75.
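
A small guard that encodes this rule of thumb; natural_sec (the clip length observed once in free-length mode for this text) and real_rate (the voice's measured tokens/sec from section 3.3) are assumptions.

token_budget = int(target_sec * real_rate)
floor_budget = int(0.75 * natural_sec * real_rate)
token_budget = max(token_budget, floor_budget)   # keep the ratio >= 0.75 of natural length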

Q4. Does fixed-length mode hurt emotion?
MOS shows 0.05 gain; humans prefer predictable timing over tiny prosodic extras.

Q5. Is the emotional range limited to the seven basic emotions?
You can blend the 8-D vector continuously; the seven are just anchors for labeling.

Q6. Will the code support fine-tuning my own voice?
Not yet—authors plan to release LoRA fine-tune scripts in Q4 2025.