Supertonic: The Lightning-Fast, Fully On-Device TTS That Actually Works in 2025

Core Question: What exactly is Supertonic, and why is it running 100–167× faster than real-time on a laptop or phone — completely offline?

Supertonic is a 66-million-parameter text-to-speech (TTS) model released by Supertone in 2025. Built for extreme on-device performance and powered by ONNX Runtime, it runs 100% locally on everything from smartphones to browsers — no cloud, no API keys, no privacy trade-offs. With just 2 inference steps it already sounds production-ready, and on Apple M4 Pro it hits an insane 167× real-time speed.

Why Supertonic Changes Everything: Real-World Use Cases

Core Question: When would you actually need a fully local TTS this fast?

Here are four situations where existing solutions simply fail:

  • Building an offline reading or audiobook app that must work in airplane mode while keeping user data private.
  • Creating in-game NPCs that speak player-written dialogue (including numbers, prices, dates) with near-zero latency.
  • Developing browser extensions for screen readers that cannot phone home.
  • Upgrading smart-speaker firmware with a TTS model under 100 MB that still sounds natural.

Supertonic solves all of them at once: tiny footprint, instant generation, native handling of numbers/dates/currencies/abbreviations, and zero text preprocessing required.
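To make the "zero text preprocessing" point concrete, here is a minimal sketch of wrapping the repo's bundled CLI from Python. The script path and flag names (`py/inference.py`, `--text`, `--steps`, `--output`) are taken from the quick-start and tuning sections of this article; treat them as assumptions until you check the repo. The key idea: raw text with prices and dates goes in untouched, with no normalization pass.

```python
import subprocess

def build_tts_command(text: str, output: str, steps: int = 2) -> list[str]:
    """Build the argv for Supertonic's bundled inference script.

    The raw text ("$99.99", "2025-11-21", abbreviations) is passed
    through verbatim; Supertonic normalizes it internally, so there
    is no regex preprocessing step here.
    """
    return [
        "python", "py/inference.py",
        "--text", text,
        "--steps", str(steps),
        "--output", output,
    ]

cmd = build_tts_command("The total is $99.99, due 2025-11-21.", "out.wav")
# subprocess.run(cmd, check=True)  # run from the cloned repo root
print(cmd[3])  # the text argument, passed through verbatim
```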

Personal Takeaway
After years of testing open-source TTS engines (Piper, Coqui, Kokoro, etc.), I've found that most either crawl on laptops or butcher “$99.99” and “2025-11-21”. The first time I heard Supertonic read a raw sentence on my M4 Pro at a real-time factor of 0.006, I literally paused for five seconds. That “it just works” moment is incredibly rare in local AI.

Supported Platforms: Literally Everywhere You Code

Core Question: Which platforms can run Supertonic natively today?

The official repo ships ready-to-run examples for 10 ecosystems:

| Platform | Folder  | Typical Use Case                   |
| -------- | ------- | ---------------------------------- |
| Python   | py/     | Scripts, backends, notebooks       |
| Node.js  | nodejs/ | Servers, Electron apps             |
| Browser  | web/    | Web apps, Chrome extensions, PWAs  |
| Java     | java/   | Android native apps                |
| C++      | cpp/    | High-performance desktop/embedded  |
| C#       | csharp/ | Unity games, .NET apps             |
| Go       | go/     | CLI tools, cloud-native services   |
| Swift    | swift/  | macOS native apps                  |
| iOS      | ios/    | iPhone/iPad apps                   |
| Rust     | rust/   | Systems programming, WASM targets  |

No matter your stack, you’re covered.

Performance Numbers That Speak for Themselves

Core Question: How fast is Supertonic compared to ElevenLabs, OpenAI, and other open models?

Official 2-step inference benchmarks (2025 hardware):

Characters per Second (higher = better)

| System                      | Short (59 chars) | Medium (152 chars) | Long (266 chars) |
| --------------------------- | ---------------- | ------------------ | ---------------- |
| Supertonic (M4 Pro CPU)     | 912              | 1,048              | 1,263            |
| Supertonic (M4 Pro WebGPU)  | 996              | 1,801              | 2,509            |
| Supertonic (RTX 4090)       | 2,615            | 6,548              | 12,164           |
| ElevenLabs Flash v2.5 (API) | 144              | 209                | 287              |
| OpenAI TTS-1 (API)          | 37               | 55                 | 82               |
| Kokoro (open-source)        | 104              | 107                | 117              |

Real-time Factor (lower = better)

| System                     | Short | Medium | Long  |
| -------------------------- | ----- | ------ | ----- |
| Supertonic (M4 Pro CPU)    | 0.015 | 0.013  | 0.012 |
| Supertonic (M4 Pro WebGPU) | 0.014 | 0.007  | 0.006 |
| Supertonic (RTX 4090)      | 0.005 | 0.002  | 0.001 |
| ElevenLabs Flash v2.5      | 0.133 | 0.077  | 0.057 |
| OpenAI TTS-1               | 0.471 | 0.302  | 0.201 |

On an RTX 4090, long text hits RTF 0.001 — that’s 1,000× real-time. You could turn live chat messages into spoken commentary faster than humans can talk.
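As a sanity check on the two tables, the real-time factor and the "× real-time" figures are simply reciprocals: RTF is synthesis time divided by audio duration, so 1/RTF gives the speedup quoted in the text.

```python
def speedup(rtf: float) -> float:
    # RTF = synthesis_time / audio_duration, so the "x real-time"
    # multiple is its reciprocal.
    return 1.0 / rtf

print(round(speedup(0.001)))  # RTX 4090, long text   -> 1000
print(round(speedup(0.006)))  # M4 Pro WebGPU, long   -> 167
print(round(speedup(0.201)))  # OpenAI TTS-1, long    -> 5
```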

Get Your First Voice in Under 30 Seconds

Core Question: How do I hear Supertonic right now?

Step 1: Clone code + model

git clone https://github.com/supertone-inc/supertonic.git
cd supertonic

# You need Git LFS for the model files
git lfs install
git clone https://huggingface.co/Supertone/supertonic assets

Step 2: Quickest Python demo

pip install onnxruntime soundfile tqdm

python py/inference.py \
  --text "Hello world! Today is November 21, 2025. I paid $99.99 for coffee." \
  --voice preset/ko-KR \
  --output hello_supertonic.wav

Play hello_supertonic.wav — numbers, dates, and currency pronounced perfectly, generated in a blink.
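If you want to verify the output programmatically instead of just listening, Python's stdlib wave module can read the header (this assumes the script writes a standard PCM WAV, which the soundfile dependency typically produces):

```python
import wave

def wav_info(path: str) -> tuple[float, int]:
    """Return (duration in seconds, sample rate) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return w.getnframes() / rate, rate

# duration, rate = wav_info("hello_supertonic.wav")
# print(f"{duration:.2f}s at {rate} Hz")
```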

Step 3: Instant browser demo (WebGPU)

Just visit the official Space:
https://huggingface.co/spaces/Supertone/supertonic

Or run the web example locally — works in Chrome/Edge 121+ out of the box.


Advanced Tuning: Squeeze Every Last Millisecond

Key command-line flags you’ll actually use:

--steps         # 2 = fastest, 5 = slightly more natural
--batch_size    # 4–16 for very long texts
--temperature   # 0.6–0.8 sweet spot
--cfg_weight    # default 1.0 works great

Real-world sweet spots I found:

  • M4 Pro WebGPU + --steps 2 → stable RTF 0.006
  • Android (Java example) + --steps 3 → best speed/quality trade-off
  • RTX 4090 + --batch_size 16 → 12k+ chars/sec
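To find your own sweet spot, you can time a run and divide by the audio duration. The RTF arithmetic below is exact; the CLI invocation assumes the flags listed above and must be run from the cloned repo root.

```python
import subprocess
import time
import wave

def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    # Real-time factor: below 1.0 means faster than real time.
    return synthesis_seconds / audio_seconds

def measure(text: str, steps: int, out: str = "bench.wav") -> float:
    """Time one CLI run and return its real-time factor."""
    t0 = time.perf_counter()
    subprocess.run(["python", "py/inference.py", "--text", text,
                    "--steps", str(steps), "--output", out], check=True)
    elapsed = time.perf_counter() - t0
    with wave.open(out, "rb") as w:
        return rtf(elapsed, w.getnframes() / w.getframerate())

# print(measure("Benchmark sentence for step comparison.", steps=2))
```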

Deep Dive: Four Killer Applications

  1. Offline Audiobook / Read-Aloud Apps
    A 5,000-character article synthesizes in under 4 seconds on M4 Pro — entirely offline.

  2. Real-Time Game Dialogue
    Player types “Give me 3.14 apple pies for €42.00” → NPC responds instantly in natural voice.

  3. Privacy-First Browser Screen Readers
    Works on any webpage without ever sending text to a server.

  4. Always-On Smart Speakers
    Local weather reports, reminders, or jokes even when Wi-Fi is down.
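The audiobook estimate in item 1 follows directly from the long-text benchmark above (about 1,263 characters per second on M4 Pro CPU):

```python
chars = 5000
chars_per_sec = 1263            # M4 Pro CPU, long-text benchmark above
seconds = chars / chars_per_sec
print(f"{seconds:.1f}s")        # just under 4 seconds for the article
```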

My Two Cents
What blew me away wasn’t just the raw speed — it was the zero-preprocessing text handling. No more writing regex hell to make “$99.99” sound right. That alone makes Supertonic production-ready for most developers.

Licensing & Commercial Use

  • Code: MIT → do whatever you want
  • Model weights: OpenRAIL-M → commercial OK, just don’t use for illegal/harmful stuff

One-Page Cheat Sheet

| Feature                | Detail                                                       |
| ---------------------- | ------------------------------------------------------------ |
| Parameters             | 66 million                                                    |
| Fastest RTF            | 0.001 (RTX 4090) / 0.006 (M4 Pro WebGPU)                      |
| Platforms              | Python, Node.js, Browser, Java, C++, C#, Go, Swift, iOS, Rust |
| Network required?      | Never                                                         |
| Numbers/Dates/Currency | Native support, no preprocessing                              |
| Recommended steps      | 2 (speed) or 5 (max quality)                                  |
| Model Hub              | https://huggingface.co/Supertone/supertonic                   |
| GitHub Repo            | https://github.com/supertone-inc/supertonic                   |

Quick-Start Checklist

  1. Install Git LFS
  2. Clone repo + assets
  3. pip install dependencies (Python) or open the web folder
  4. Run a demo → hear magic
  5. Port to your target platform

FAQ

Q: Does Supertonic support Chinese voices yet?
Pre-trained presets are currently Korean/English-focused, but the architecture is multilingual and fine-tuning scripts are coming.

Q: Will it run smoothly on phones?
Yes — official Java (Android) and Swift (iOS) examples run 3–5 step inference fluently on 2023+ flagship devices.

Q: How does quality compare to ElevenLabs?
2-step output is close to ElevenLabs Flash; 5-step approaches their higher-quality tiers, entirely on-device.

Q: Which browsers support the WebGPU demo?
Chrome 121+, Edge 121+, Safari on macOS 15+.

Q: Is commercial use allowed?
Absolutely — MIT code + OpenRAIL-M weights.

Q: Why is a 66M model this fast?
Aggressive distillation + ONNX quantization + a novel 2-step diffusion design.

Q: Any voice cloning?
Current release is zero-shot multi-speaker only. Cloning tools may appear later.

Q: How can I contribute or train my own voice?
Watch the official GitHub — training code is on the roadmap.

Supertonic isn’t just another TTS model. It’s the first open on-device solution that finally feels like the future we were promised: instant, private, and actually usable. Go try the demo right now — you’ll hear the difference in under ten seconds.