GLM-TTS: The New Open-Source Benchmark for Emotional Zero-Shot Chinese TTS

The core question most developers are asking in late 2025: is there finally a fully open-source TTS that can clone any voice from 3–10 seconds of audio, sound genuinely emotional, stream in real time, and handle Chinese polyphones accurately?
The answer is yes, and it launched today.

On December 11, 2025, Zhipu AI open-sourced GLM-TTS: a production-ready, zero-shot, emotionally expressive text-to-speech system that is currently the strongest open-source Chinese TTS available.

Why GLM-TTS Changes Everything — In Four Bullet Points

  • Zero-shot voice cloning: 3–10 s reference audio is enough
  • Streaming inference: real-time capable on a single GPU
  • Emotion & prosody: boosted by multi-reward reinforcement learning (RL)
  • Accurate pronunciation: Phoneme-in control for polyphones and rare characters

Let’s break down each part.

How Does Zero-Shot Cloning Work with Only 3–10 Seconds of Audio?

GLM-TTS uses the clean two-stage design that has become the industry standard, and executes it exceptionally well:

  1. Stage 1 – LLM generates discrete speech tokens
    A Llama-based autoregressive model takes text + speaker embedding → speech token sequence
  2. Stage 2 – Flow Matching decoder
    A Diffusion Transformer (DiT) with Continuous Flow Matching turns tokens into high-quality mel-spectrograms
  3. Vocoder
    Currently a high-fidelity vocoder; a 2D-Vocos upgrade is coming soon

The zero-shot magic comes from the speaker embedding extracted by the CAMPPlus model in the frontend. Feed it any 3–10 second WAV, and the system instantly reproduces that voice — no per-speaker finetuning required.
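
To make the flow concrete, here is a minimal, purely illustrative sketch of the pipeline. The function names, embedding size, token vocabulary, and frame counts are assumptions for illustration, not the actual GLM-TTS API:

import numpy as np

def extract_speaker_embedding(ref_wav: np.ndarray) -> np.ndarray:
    # Stand-in for the CAMPPlus frontend: 3-10 s of audio -> one fixed-size vector
    return np.random.randn(192)          # CAM++ embeddings are commonly 192-dim

def llm_generate_speech_tokens(text: str, spk: np.ndarray) -> list[int]:
    # Stand-in for Stage 1: the Llama-based LLM autoregressively emits discrete speech tokens
    return [hash((text, i)) % 4096 for i in range(len(text) * 4)]

def flow_matching_decode(tokens: list[int], spk: np.ndarray) -> np.ndarray:
    # Stand-in for Stage 2: DiT + flow matching turns tokens into a mel-spectrogram
    return np.random.randn(80, len(tokens) * 2)      # (n_mels, frames)

def vocode(mel: np.ndarray) -> np.ndarray:
    # Stand-in for the vocoder: mel-spectrogram -> waveform (hop size 256 is an assumption)
    return np.random.randn(mel.shape[1] * 256)

ref = np.random.randn(5 * 16000)                      # any 3-10 s reference clip
spk = extract_speaker_embedding(ref)
tokens = llm_generate_speech_tokens("你好，GLM-TTS", spk)
wav = vocode(flow_matching_decode(tokens, spk))
print(f"{len(tokens)} speech tokens -> {wav.shape[0]} waveform samples")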

Real-world use case
An audiobook app lets users record a 5-second greeting. The entire novel is then read aloud in the user’s own voice, on the fly.

Why Is Its Emotional Expressiveness a Generation Ahead?

Because it is one of the first open-source TTS systems optimized with multi-reward reinforcement learning via GRPO (Group Relative Policy Optimization).

Traditional TTS models are trained only with a supervised loss: they learn to “pronounce correctly,” not to “perform.”
GLM-TTS adds several reward models that score each generated sample:

  • Speaker similarity
  • Character error rate (CER)
  • Emotional intensity
  • Natural laughter
  • Prosodic naturalness

A distributed reward server scores every candidate, and GRPO updates the LLM policy accordingly.
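
Conceptually, the policy update looks like the sketch below: each sampled candidate gets a blended reward, and GRPO standardizes it against the other candidates generated for the same prompt. The reward names and weights are illustrative assumptions, not the repo’s actual configuration:

import numpy as np

def combined_reward(scores: dict[str, float]) -> float:
    # Blend the individual reward-model scores into one scalar (weights assumed)
    weights = {"speaker_sim": 1.0, "neg_cer": 1.0, "emotion": 0.5,
               "laughter": 0.25, "prosody": 0.5}
    return sum(w * scores[name] for name, w in weights.items())

def grpo_advantages(group_rewards: list[float]) -> np.ndarray:
    # GRPO: a candidate's advantage is its reward standardized within its sampling group
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four candidates sampled for the same text, each scored by the reward server
group = [
    combined_reward({"speaker_sim": 0.81, "neg_cer": -0.010,
                     "emotion": 0.62, "laughter": 0.0, "prosody": 0.71}),
    combined_reward({"speaker_sim": 0.78, "neg_cer": -0.025,
                     "emotion": 0.55, "laughter": 0.0, "prosody": 0.64}),
    combined_reward({"speaker_sim": 0.83, "neg_cer": -0.008,
                     "emotion": 0.70, "laughter": 0.1, "prosody": 0.75}),
    combined_reward({"speaker_sim": 0.80, "neg_cer": -0.060,
                     "emotion": 0.40, "laughter": 0.0, "prosody": 0.58}),
]
print(grpo_advantages(group))   # positive -> reinforced, negative -> suppressed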

Objective results on the Seed-TTS-Eval Chinese test set (no phoneme mode)

Model             CER ↓    SIM ↑    Fully Open-Source
MiniMax           0.83     78.3     No
VoxCPM            0.93     77.2     Yes
GLM-TTS (base)    1.03     76.1     Yes
GLM-TTS_RL        0.89     76.4     Yes

The RL variant drops CER from 1.03 to 0.89 while keeping (and even slightly improving) speaker similarity, a clear gain in pronunciation accuracy without sacrificing voice fidelity.

Personal take
Scale is no longer the bottleneck in 2025. Teaching a model to optimize against judges (via RL reward models) is the real differentiator. GLM-TTS proves the approach works for speech.

How Does It Solve Chinese Polyphones and Rare Characters?

Through the Phoneme-in (hybrid phoneme + text) mechanism.

During training, random characters are replaced with their pinyin so the model learns both modalities.
At inference, you can force exact pronunciation:

我是要去行[xíng]李,还是乘航[háng]班?
("Am I going for my luggage, or catching the flight?")

Just add the desired pinyin in square brackets right after the character.
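
Annotations like this are easy to generate programmatically. Below is a toy helper, not part of the repo, that inserts bracketed pinyin after chosen character positions:

def annotate_pinyin(text: str, fixes: dict[int, str]) -> str:
    # Insert [pinyin] immediately after the characters at the given indices
    return "".join(
        ch + (f"[{fixes[i]}]" if i in fixes else "")
        for i, ch in enumerate(text)
    )

# 行 is at index 4 and 航 at index 10 in this sentence
print(annotate_pinyin("我是要去行李,还是乘航班?", {4: "xíng", 10: "háng"}))
# -> 我是要去行[xíng]李,还是乘航[háng]班?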

One-line activation

python glmtts_inference.py --data=example_zh --phoneme

Use case
Educational assessment platforms or children’s story apps can guarantee that the polyphone 长 is read cháng in “长城” (the Great Wall) and zhǎng in “生长” (to grow), with no more embarrassing mistakes.

System architecture diagram (image credit: official repository)

How to Run It Locally in Under 10 Minutes

Environment (Python 3.10–3.12)

git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip install -r requirements.txt

Download the full checkpoint (~15 GB)

# Option 1 – Hugging Face (recommended)
huggingface-cli download zai-org/GLM-TTS --local-dir ckpt

# Option 2 – ModelScope
pip install modelscope
modelscope download --model ZhipuAI/GLM-TTS --local-dir ckpt

Quick inference

# One-click script
bash glmtts_inference.sh

# Or manual
python glmtts_inference.py --data=example_zh --exp_name=test --use_cache
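
To batch many utterances, a thin wrapper around the same CLI is enough. The sketch below only uses the flags shown above; the data and experiment names are placeholders to replace with your own:

import subprocess

jobs = [
    {"data": "example_zh", "exp_name": "run_a"},   # placeholder values
    {"data": "example_zh", "exp_name": "run_b"},
]

for job in jobs:
    subprocess.run(
        [
            "python", "glmtts_inference.py",
            f"--data={job['data']}",
            f"--exp_name={job['exp_name']}",
            "--use_cache",
        ],
        check=True,   # stop on the first failed job
    )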

Interactive Web UI (Gradio)

python tools/gradio_app.py

Upload your reference audio, type text, tweak temperature/top_p, and listen instantly.

Streaming Performance

Thanks to Flow Matching, streaming is built in. On an RTX 4090 you get roughly 3–5× real-time generation speed, fast enough for live assistants, game characters, or streaming commentary.
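
To put “3–5× real-time” in concrete terms: the real-time factor (RTF) is generation time divided by audio duration, and anything below 1.0 outruns playback. A quick check with assumed numbers:

def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    # RTF < 1 means audio is produced faster than it plays back
    return generation_seconds / audio_seconds

# Assumed example: a 12 s clip synthesized in 3 s -> RTF 0.25, i.e. 4x real-time
rtf = real_time_factor(3.0, 12.0)
print(f"RTF = {rtf:.2f} ({1 / rtf:.1f}x real-time)")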

Quick Codebase Map

GLM-TTS/
├── glmtts_inference.py      # Main inference entry
├── llm/glmtts.py            # Llama-based TTS backbone
├── flow/                    # DiT + Flow Matching
├── cosyvoice/cli/frontend.py # Text normalization & speaker embedding
├── grpo/                    # Full RL training code (GRPO, rewards, server)
└── tools/gradio_app.py      # Web demo

One-Page Summary

Feature                   Supported            Notes
Zero-shot cloning         Yes                  3–10 s reference
Streaming inference       Yes                  3–5× real-time on an RTX 4090
Multi-reward RL           Yes                  Best open-source CER: 0.89
Phoneme control           Yes                  Reliable polyphone handling
Chinese + English mixed   Yes                  Native support
Gradio Web UI             Yes                  One-command launch
License                   Apache 2.0 (model)   Reference audio for research only

FAQ

  1. Is 3 seconds of reference really enough?
    Yes. 5–10 seconds gives even more stable timbre.

  2. Does it work with pure English text?
    Yes, mixed Chinese-English is fully supported; pure English works but Chinese is stronger.

  3. When will the RL-finetuned weights be released?
    Marked as “Coming Soon” in the official roadmap.

  4. Can I deploy it offline?
    Absolutely — everything is downloadable.

  5. Commercial use restrictions?
    Model weights are Apache 2.0. Included prompt audio is research-only; replace with your own for production.

  6. Relationship with CosyVoice?
    Shares frontend and vocoder code, but backbone LLM, Flow, and RL are completely new.

  7. VRAM requirements?
    ~20–24 GB for comfortable inference; streaming runs easily on a single 4090.

  8. Will training code be released?
    RL training code is already public; pre-training code not yet.

GLM-TTS is not just “another open-source TTS.”
It is the first fully open model that Chinese developers can take straight to production with emotional, real-time, zero-shot capabilities.

Go star the repo and try it today — 2025 just raised the bar for open Chinese speech synthesis.

https://github.com/zai-org/GLM-TTS

Happy synthesizing! 🚀