Developers Might Finally Have a Local TTS Model That Doesn’t Suck

Transcript

00:00:00This is Supertonic. It's a local text-to-speech model that gets surprisingly close to ElevenLabs for

00:00:06a lot of developer use cases, except it runs on your machine, works offline, and costs nothing

00:00:11every time your app says a sentence. There's no API key, no cloud request, no GPU. And the real

00:00:18test is not whether it can say some really good script, it's whether it can handle the ugly stuff

00:00:22our app actually spits out. So I'm going to run it locally, throw weird text at it, and see if

00:00:29Supertonic 3 is actually something devs can ship with.

00:00:36We all know how TTS is and the problem that comes with it. Cloud TTS is easy at first. You call an

00:00:43API and you get audio back. It's done. But that simple setup has three hidden costs: money, latency,

00:00:49and privacy. Every request leaves the device. Every user action becomes an API call. And every

00:00:57time your app grows, your speech bill grows with it. That might be fine for a simple project, but

00:01:03honestly, it just becomes a big pain. But if you are building a voice agent sending text to a third

00:01:08party server, that's going to become a serious problem. So then what do we do? Well, we try local

00:01:13TTS. And now we get these different types of problems. Some models are huge, some need a GPU,

00:01:20and some start really slow. And some sound okay on a clean demo, but break the second you feed them what apps

00:01:27actually produce. Let's say your balance is 12,575 cents due on June 15th. Call this number by 5:30 p.m.

00:01:36Those are a bunch of numbers. That's not some benchmark. That's what a normal app text might produce.

00:01:41Money, dates, phone numbers, time, just weird formatting. Hello, I'm Josh. That's easy. Production text is a lot more

00:01:49messy. So here's the question I'm trying to answer here. Can Supertonic 3 handle all that ugly real

00:01:55world stuff we actually need? Let's find out. If you enjoy coding tools to speed up your workflow,

00:02:00be sure to subscribe. We have videos coming out all the time.

00:02:03Now here's a Python script that I wrote up, and all I needed to do was pip install supertonic.

00:02:09I made a simple TTS object and some data structures for the voice, voice styles, and the demos that I

00:02:14want to run. To get it to run, I just took the TTS object and called the synthesize method, passing

00:02:20in those keyword arguments. I also set these here to run automatically. Now first is just normal English.

00:02:27Let's play it. This is Supertonic running here on my Mac. If you like this,

00:02:32subscribe to the Better Stack channel. Yeah, that's exactly what we'd expect.
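
The script itself is only described, not shown in full, so here is a minimal sketch of that structure: a dictionary of demos, each a bundle of keyword arguments passed to one synthesize call. `StandInTTS`, the voice names, the `lang` parameter, and the sample strings are all hypothetical placeholders; the real TTS object comes from the supertonic package, and its exact signature should be checked against the repo.

```python
from dataclasses import dataclass, field


@dataclass
class StandInTTS:
    """Hypothetical placeholder for the real engine: records calls, makes no audio."""

    calls: list = field(default_factory=list)

    def synthesize(self, text: str, voice: str = "default", lang: str = "en") -> bytes:
        self.calls.append({"text": text, "voice": voice, "lang": lang})
        return b""  # a real engine would return audio samples here


# Each demo is one bundle of keyword arguments for a single synthesize call.
# Voice names and sample strings are illustrative, not taken from the video's script.
DEMOS = {
    "plain_english": {
        "text": "This is Supertonic running here on my Mac.",
        "voice": "voice_a",
        "lang": "en",
    },
    "ugly_numbers": {
        "text": "Your balance is $125.75, due on June 15th. Call 555-0100 by 5:30 p.m.",
        "voice": "voice_a",
        "lang": "en",
    },
    "french": {
        "text": "Bonjour, ceci est un test.",
        "voice": "voice_b",
        "lang": "fr",
    },
}


def run_demos(tts, demos):
    """Call tts.synthesize(**kwargs) once per demo and collect results by name."""
    results = {}
    for name, kwargs in demos.items():
        results[name] = tts.synthesize(**kwargs)
    return results


if __name__ == "__main__":
    engine = StandInTTS()
    run_demos(engine, DEMOS)
    for call in engine.calls:
        print(call["lang"], "->", call["text"])
```

Swapping `StandInTTS` for the real engine should only require an object with a compatible `synthesize` method, which keeps the demo list independent of the SDK.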

00:02:37That's the easy one. Now let's make it annoying with prices, phone numbers, and dates. I'm going

00:02:43to run it here again. The total invoice is 12 tons and zones for 58 75 due on June 15, 2026.

00:02:51No, right. Major lag when it comes to prices. That was actually pretty bad. This is where a lot of TTS

00:02:58systems start making weird choices and Supertonic was not an exception here.
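
One common workaround for this kind of failure, not something shown in the video, is to spell prices out as words before the text reaches the TTS model. The sketch below is an assumption about how that could look: it only handles US dollar amounts under one million, and every name in it is hypothetical.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Matches "$125", "$125.75", "$12,458.75"; cents must be exactly two digits.
PRICE = re.compile(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2}))?")


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + int_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return int_to_words(thousands) + " thousand" + (" " + int_to_words(rest) if rest else "")


def _speak_price(match: re.Match) -> str:
    dollars = int(match.group(1).replace(",", ""))
    cents = int(match.group(2)) if match.group(2) else 0
    if dollars >= 1_000_000:
        return match.group(0)  # out of range: leave the original text untouched
    spoken = f"{int_to_words(dollars)} dollar{'' if dollars == 1 else 's'}"
    if cents:
        spoken += f" and {int_to_words(cents)} cent{'' if cents == 1 else 's'}"
    return spoken


def normalize_prices(text: str) -> str:
    """Replace dollar amounts in text with their spoken form."""
    return PRICE.sub(_speak_price, text)


if __name__ == "__main__":
    print(normalize_prices("Your balance is $125.75, due on June 15th."))
```

Dates, phone numbers, and times would need their own rules, and whether pre-normalized text actually sounds better on Supertonic is something to verify by ear.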

00:03:03Also expressions are not going to work here either. This is on the local version, which is good as

00:03:08we're seeing, but if you want expressions, they're going to charge you for an API key and that's where

00:03:13they get us. I want a good local TTS that does expressions really well and that is still free.

00:03:20And those are hard to come by. Now let's test multiple languages. I'm going to start here with Arabic.

00:03:25Now my Arabic level is basic, but it sounded overall pretty clean. Here's some French we're

00:03:35going to output. Okay, again, sounded good. And then finally, here's some Korean.

00:03:47Okay, good. Right. Those multiple languages, they sounded really good. I don't speak those languages,

00:03:52but they sounded clean. Everything I just ran was local. And honestly, it was insanely fast.

00:03:57No internet, no API key, no hidden cloud request. But the deal is this. It handles normal text and

00:04:03other languages incredibly well. It was super fast. So I loved that. But when it came to numbers and

00:04:09expressions, the local version was not good or great by any means. So what is Supertonic 3 at a high

00:04:16level? It's an on-device text-to-speech model from Supertone. It has 99 million parameters. It runs

00:04:21locally on CPU through ONNX Runtime. It supports 31 languages, and I don't need a GPU, a cloud server,

00:04:28or an API key, unless you want those expressions. Now it's small enough to actually think about

00:04:33shipping in real local tools. Not every app, obviously, but desktop apps, controlled environments,

00:04:39and cached local setups. This starts to make sense. Version 3 also expands language support,

00:04:45improves reading stability compared to Supertonic 2, and it does support expression style tags like laugh,

00:04:51breath, sigh. But again, what are we doing? We have to pay for those. I don't want that.

00:04:57Now this is the part devs actually care about. It's not just a model file dumped on the internet.

00:05:02There are examples for Python, browser, Java, C++, C#, a bunch of other languages. So it's not just,

00:05:09here's the research model. Good luck with that. The pitch here is, here is local TTS you can actually

00:05:15wire into your app. And honestly, the scripting, everything was really fast. There are two big

00:05:21reasons Supertonic 3 stands out. Speed and deployment. A lot of TTS models sound impressive.

00:05:28This sounded good, but then you try to use them in a product and suddenly you're dealing with big

00:05:33downloads, slow generations, cold starts, or hardware requirements your users don't have.

00:05:38Then deployment is incredibly simple, right? pip install supertonic. There is a Python SDK,

00:05:44CLI usage, and a local HTTP server. And the local server includes an OpenAI-compatible /v1/audio/speech

00:05:51alias. So OpenAI, boom. That means if your app already expects an OpenAI-style speech API,

00:05:57you don't have to redesign everything. I can point the app at the local server and start testing.
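
As a rough illustration of pointing an app at the local server, the sketch below builds an OpenAI-style speech request using only the Python standard library. The `/v1/audio/speech` path comes from the video; the host, port, model name, and voice name are assumptions and should be replaced with whatever the local server actually reports.

```python
import json
import urllib.request

# Assumption: address of the locally running server. Check its startup output.
BASE_URL = "http://127.0.0.1:8000"


def build_speech_request(text, voice="voice_a", model="supertonic-3", base_url=BASE_URL):
    """Build a POST request in the OpenAI speech API shape (model, input, voice).

    The voice and model values are hypothetical placeholders.
    """
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)


if __name__ == "__main__":
    # Requires the local server to be running; otherwise urlopen raises URLError.
    size = synthesize_to_file("Hello from a local server.", "hello.wav")
    print(f"wrote {size} bytes")
```

Because only the base URL differs from a hosted endpoint, an app that already talks to an OpenAI-style speech API can usually be switched by changing one setting.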

00:06:02That is not just a nice detail. It's actually pretty great. Now let's compare it without pretending

00:06:07this tool wins every category. Cloud TTS services from OpenAI, ElevenLabs, and others are great. If you want

00:06:14really good voices, hosted infrastructure, emotions, and zero model management, they're hard to beat,

00:06:20but the trade-off is clear. It costs money per use. It needs the internet, it adds network latency,

00:06:26and the user text leaves the device. So local TTS gives you privacy and control,

00:06:32but local models can bring their own problems. The setup pain, big files, inconsistent quality,

00:06:37and sometimes deployment can be tough. Supertonic is interesting because it handles

00:06:43most of this really well. It's not the fanciest cloud voice. Well, it's not even a cloud voice,

00:06:48right? But it's not the fanciest by any means, but it's small enough, fast enough, and easy enough

00:06:53to test in a real app. But honestly, it kind of failed the tests that I actually cared about for

00:06:58this on the local version, which were emotions and prices or numbers. So run your own test on this.

00:07:04You could try invoices, support tickets, markdown, long paragraphs. That is how you find out if this TTS

00:07:10model works for you where you need it. All right. So my take is this. This may be one of the most

00:07:15practical local TTS options for devs who just want to ship faster, but that API key is no bueno. I

00:07:22wanted emotion and numbers handled well, and this did not do that. So should you use Supertonic 3? Well,

00:07:28yeah, sure. Why not try it? If you're building a local voice agent, sure, give it a test. But

00:07:34skip it if your top priority is good narration. You want those emotions. You want the easiest possible

00:07:39voice cloning workflow. Maybe not that, right? For that, a cloud platform is still going to be a better

00:07:45choice. If you want to ship faster, you want to keep it private, local. This is really good. This

00:07:51is worth testing. If you enjoy coding tools like this, be sure to subscribe to the Better Stack channel.

00:07:56We'll see you in another video.

Description

In this video, I test Supertonic 3, a fast local text-to-speech model for developers that runs fully offline with no API key, no cloud request, and no GPU required. If you are building local AI voice agents, privacy-first apps, offline e-readers, or high-volume products where cloud TTS costs, latency, and privacy become a problem, this is worth paying attention to. I run Supertonic 3 on real developer text, including money, dates, phone numbers, expression tags, English, Spanish, French, and Arabic, to see if it can handle the messy strings that normal apps actually generate.

🔗 Relevant Links
Supertonic Repo - https://github.com/supertone-inc/supertonic
Supertonic HuggingFace - https://huggingface.co/spaces/Supertone/supertonic-3

❤️ More about us
Radically better observability stack: https://betterstack.com/
Written tutorials: https://betterstack.com/community/
Example projects: https://github.com/BetterStackHQ

📱 Socials
Twitter: https://twitter.com/betterstackhq
Instagram: https://www.instagram.com/betterstackhq/
TikTok: https://www.tiktok.com/@betterstack
LinkedIn: https://www.linkedin.com/company/betterstack

📌 Chapters:
0:00 Supertonic 3 local TTS demo
0:37 Why cloud TTS is expensive for developers
1:58 Running Supertonic 3 offline on an M4 Mac
4:00 What is Supertonic 3?
4:30 Why Supertonic 3 is different from other TTS models
6:08 Supertonic 3 vs cloud TTS and local TTS models
6:50 Supertonic 3 pros and cons
7:14 Should developers use Supertonic 3 in 2026?
