Supertonic: The Lightning-Fast, Fully On-Device TTS That Actually Works in 2025
Core Question: What exactly is Supertonic, and why is it running 100–167× faster than real-time on a laptop or phone — completely offline?
Supertonic is a 66-million-parameter text-to-speech (TTS) model released by Supertone in 2025. Built for extreme on-device performance and powered by ONNX Runtime, it runs 100% locally on everything from smartphones to browsers — no cloud, no API keys, no privacy trade-offs. With just 2 inference steps it already sounds production-ready, and on Apple M4 Pro it hits an insane 167× real-time speed.
Why Supertonic Changes Everything: Real-World Use Cases
Core Question: When would you actually need a fully local TTS this fast?
Here are four situations where existing solutions simply fail:
- Building an offline reading or audiobook app that must work in airplane mode while keeping user data private.
- Creating in-game NPCs that speak player-written dialogue (including numbers, prices, dates) with near-zero latency.
- Developing browser extensions for screen readers that cannot phone home.
- Upgrading smart-speaker firmware with a TTS model under 100 MB that still sounds natural.
Supertonic solves all of them at once: tiny footprint, instant generation, native handling of numbers/dates/currencies/abbreviations, and zero text preprocessing required.
Personal Takeaway
After years of testing open-source TTS (Piper, Coqui, Kokoro, etc.), I've found that most either crawl on laptops or butcher “$99.99” and “2025-11-21”. The first time I heard Supertonic read a raw sentence on my M4 Pro at a real-time factor of 0.006, I literally paused for five seconds. That “it just works” moment is incredibly rare in local AI.
Supported Platforms: Literally Everywhere You Code
Core Question: Which platforms can run Supertonic natively today?
The official repo ships ready-to-run examples for 10 ecosystems:
| Platform | Folder | Typical Use Case |
|---|---|---|
| Python | py/ | Scripts, backends, notebooks |
| Node.js | nodejs/ | Servers, Electron apps |
| Browser | web/ | Web apps, Chrome extensions, PWAs |
| Java | java/ | Android native apps |
| C++ | cpp/ | High-performance desktop/embedded |
| C# | csharp/ | Unity games, .NET apps |
| Go | go/ | CLI tools, cloud-native services |
| Swift | swift/ | macOS native apps |
| iOS | ios/ | iPhone/iPad apps |
| Rust | rust/ | Systems programming, WASM targets |
No matter your stack, you’re covered.
Performance Numbers That Speak for Themselves
Core Question: How fast is Supertonic compared to ElevenLabs, OpenAI, and other open models?
Official 2-step inference benchmarks (2025 hardware):
Characters per Second (higher = better)
| System | Short (59 chars) | Medium (152 chars) | Long (266 chars) |
|---|---|---|---|
| Supertonic (M4 Pro CPU) | 912 | 1,048 | 1,263 |
| Supertonic (M4 Pro WebGPU) | 996 | 1,801 | 2,509 |
| Supertonic (RTX 4090) | 2,615 | 6,548 | 12,164 |
| ElevenLabs Flash v2.5 (API) | 144 | 209 | 287 |
| OpenAI TTS-1 (API) | 37 | 55 | 82 |
| Kokoro (open-source) | 104 | 107 | 117 |
Real-time Factor (lower = better)
| System | Short | Medium | Long |
|---|---|---|---|
| Supertonic (M4 Pro CPU) | 0.015 | 0.013 | 0.012 |
| Supertonic (M4 Pro WebGPU) | 0.014 | 0.007 | 0.006 |
| Supertonic (RTX 4090) | 0.005 | 0.002 | 0.001 |
| ElevenLabs Flash v2.5 | 0.133 | 0.077 | 0.057 |
| OpenAI TTS-1 | 0.471 | 0.302 | 0.201 |
On an RTX 4090, long text hits RTF 0.001 — that’s 1,000× real-time. You could turn live chat messages into spoken commentary faster than humans can talk.
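A quick note on reading these tables: real-time factor (RTF) is generation time divided by audio duration, and its inverse is the "× real-time" speedup, while chars/sec is simply characters synthesized per second of generation time. If you want to compute the same metrics for your own runs, here is a tiny helper (not part of the repo, just the arithmetic):

```python
# Convert raw timings into the two metrics used in the tables above.
# RTF = generation time / audio duration (lower is better);
# chars/sec = characters synthesized / generation time (higher is better).
def tts_metrics(text: str, gen_seconds: float, audio_seconds: float) -> dict:
    return {
        "rtf": gen_seconds / audio_seconds,
        "chars_per_sec": len(text) / gen_seconds,
        "x_realtime": audio_seconds / gen_seconds,
    }

# Example: 20 s of audio generated in 0.12 s -> RTF 0.006, ~167x real time.
print(tts_metrics("x" * 250, gen_seconds=0.12, audio_seconds=20.0))
```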
Get Your First Voice in Under 30 Seconds
Core Question: How do I hear Supertonic right now?
Step 1: Clone code + model
```bash
git clone https://github.com/supertone-inc/supertonic.git
cd supertonic

# You need Git LFS for the model files
git lfs install
git clone https://huggingface.co/Supertone/supertonic assets
```
Step 2: Quickest Python demo
```bash
pip install onnxruntime soundfile tqdm

python py/inference.py \
  --text "Hello world! Today is November 21, 2025. I paid $99.99 for coffee." \
  --voice preset/ko-KR \
  --output hello_supertonic.wav
```
Play hello_supertonic.wav — numbers, dates, and currency pronounced perfectly, generated in a blink.
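If you want to script many utterances rather than type one command at a time, a simple option is to drive the same py/inference.py CLI from Python. This is a minimal sketch, assuming the --text/--voice/--output flags behave exactly as in the quick-start command above and that you run it from the repo root:

```python
import subprocess
from pathlib import Path

# Batch-synthesize several lines by calling the py/inference.py CLI shown above.
LINES = [
    "Hello world! Today is November 21, 2025.",
    "I paid $99.99 for coffee.",
]

out_dir = Path("batch_out")
out_dir.mkdir(exist_ok=True)

for i, line in enumerate(LINES):
    subprocess.run(
        [
            "python", "py/inference.py",
            "--text", line,
            "--voice", "preset/ko-KR",
            "--output", str(out_dir / f"line_{i:02d}.wav"),
        ],
        check=True,  # raise immediately if synthesis fails
    )
```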
Step 3: Instant browser demo (WebGPU)
Just visit the official Space:
https://huggingface.co/spaces/Supertone/supertonic
Or run the web example locally — works in Chrome/Edge 121+ out of the box.
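If you prefer not to set up any frontend tooling, one minimal way to serve the web example is Python's built-in static file server (this assumes the web/ folder is plain static files; if the repo ships its own dev server, use that instead). Run it from the repo root and open the printed URL in a WebGPU-capable browser:

```python
# Serve the repo's web/ folder on localhost so a WebGPU-capable browser
# (Chrome/Edge 121+) can load it. localhost counts as a secure context,
# which WebGPU requires.
import functools
import http.server
import socketserver

PORT = 8080
Handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory="web")

with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print(f"Open http://localhost:{PORT} in your browser")
    httpd.serve_forever()
```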

Advanced Tuning: Squeeze Every Last Millisecond
Key command-line flags you’ll actually use:
```text
--steps        # 2 = fastest, 5 = slightly more natural
--batch_size   # 4–16 for very long texts
--temperature  # 0.6–0.8 sweet spot
--cfg_weight   # default 1.0 works great
```
Real-world sweet spots I found (a timing sketch for checking them on your own machine follows this list):

- M4 Pro WebGPU + --steps 2 → stable RTF 0.006
- Android (Java example) + --steps 3 → best speed/quality trade-off
- RTX 4090 + --batch_size 16 → 12k+ chars/sec
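To find your own sweet spot, you can time the quick-start CLI at different --steps values. The sketch below measures wall-clock time for the whole process (including Python startup and model load), so absolute numbers will look worse than the published RTFs; treat it as a relative comparison only. It assumes the flags behave as in the quick-start example:

```python
import subprocess
import time

import soundfile as sf  # already installed for the quick-start demo

TEXT = "The meeting is on November 21, 2025, and the budget is $99.99."

for steps in (2, 3, 5):
    out = f"steps_{steps}.wav"
    start = time.perf_counter()
    subprocess.run(
        ["python", "py/inference.py",
         "--text", TEXT,
         "--voice", "preset/ko-KR",
         "--output", out,
         "--steps", str(steps)],
        check=True,
    )
    gen = time.perf_counter() - start
    audio = sf.info(out).duration  # synthesized audio length in seconds
    print(f"steps={steps}: RTF={gen / audio:.3f}, {len(TEXT) / gen:.0f} chars/sec")
```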
Deep Dive: Four Killer Applications
1. Offline Audiobook / Read-Aloud Apps: a 5,000-character article synthesizes in under 4 seconds on M4 Pro, entirely offline.
2. Real-Time Game Dialogue: a player types “Give me 3.14 apple pies for €42.00” → the NPC responds instantly in a natural voice.
3. Privacy-First Browser Screen Readers: works on any webpage without ever sending text to a server.
4. Always-On Smart Speakers: local weather reports, reminders, or jokes even when Wi-Fi is down.
My Two Cents
What blew me away wasn’t just the raw speed — it was the zero-preprocessing text handling. No more writing regex hell to make “$99.99” sound right. That alone makes Supertonic production-ready for most developers.
Licensing & Commercial Use
- Code: MIT → do whatever you want
- Model weights: OpenRAIL-M → commercial OK, just don't use it for illegal/harmful stuff
One-Page Cheat Sheet
| Feature | Detail |
|---|---|
| Parameters | 66 million |
| Fastest RTF | 0.001 (RTX 4090) / 0.006 (M4 Pro WebGPU) |
| Platforms | Python, Node.js, Browser, Java, C++, C#, Go, Swift, iOS, Rust |
| Network required? | Never |
| Numbers/Dates/Currency | Native support, no preprocessing |
| Recommended steps | 2 (speed) or 5 (max quality) |
| Model Hub | https://huggingface.co/Supertone/supertonic |
| GitHub Repo | https://github.com/supertone-inc/supertonic |
Quick-Start Checklist
- Install Git LFS
- Clone repo + assets
- pip install dependencies (Python) or open the web folder
- Run a demo → hear magic
- Port to your target platform
FAQ
Q: Does Supertonic support Chinese voices yet?
Pre-trained presets are currently Korean/English-focused, but the architecture is multilingual and fine-tuning scripts are coming.
Q: Will it run smoothly on phones?
Yes — official Java (Android) and Swift (iOS) examples run 3–5 step inference fluently on 2023+ flagship devices.
Q: How does quality compare to ElevenLabs?
2-step is close to ElevenLabs Flash; 5-step reaches much of their Prime tier — entirely on-device.
Q: Which browsers support the WebGPU demo?
Chrome 121+, Edge 121+, Safari on macOS 15+.
Q: Is commercial use allowed?
Absolutely — MIT code + OpenRAIL-M weights.
Q: Why is a 66M model this fast?
Aggressive distillation + ONNX quantization + a novel 2-step diffusion design.
Q: Any voice cloning?
Current release is zero-shot multi-speaker only. Cloning tools may appear later.
Q: How can I contribute or train my own voice?
Watch the official GitHub — training code is on the roadmap.
Supertonic isn’t just another TTS model. It’s the first open on-device solution that finally feels like the future we were promised: instant, private, and actually usable. Go try the demo right now — you’ll hear the difference in under ten seconds.

