IndexTTS2: the first autoregressive TTS that lets you set the exact duration and pick the emotion in zero-shot

This article answers: “How does IndexTTS2 deliver frame-level timing control and on-the-fly emotional transfer without giving up the natural sound of an autoregressive model?”

1. Why does timing + emotion still break autoregressive TTS?

| Use-case | Timing tolerance | Emotion need | Why today’s AR models fail |
| --- | --- | --- | --- |
| Short-form vertical video dubbing | ≤ 120 ms vs picture | Over-acted, viral | Token-by-token = run-on or cut-off |
| Game cut-scene localization | Lip flap starts/ends fixed | NPC mood changes | Must pre-record or hand-retime |
| Batch audiobook | Chapter length = page budget | Character …