Developers Might Finally Have a Local TTS Model That Doesn’t Suck

Transcript

00:00:00This is Supertonic. It's a local text-to-speech model that gets surprisingly close to ElevenLabs for
00:00:06a lot of developer use cases, except it runs on your machine, works offline, and costs nothing
00:00:11every time your app says a sentence. There's no API key, no cloud request, no GPU. And the real
00:00:18test is not whether it can say some really good script, it's whether it can handle the ugly stuff
00:00:22our app actually spits out. So I'm going to run it locally, throw weird text at it, and see if
00:00:29Supertonic 3 is actually something devs can ship with.
00:00:36We all know how TTS is and the problem that comes with it. Cloud TTS is easy at first. You call an
00:00:43API and you get audio back. It's done. But that simple setup has three hidden costs, money, latency,
00:00:49and privacy. Every request leaves our device, sure. Every user action becomes an API call. And every
00:00:57time your app grows, your speech bill grows with it. That might be fine for a simple project, but
00:01:03honestly, it just becomes a big pain. But if you are building a voice agent sending text to a third-party
00:01:08server, that's going to become a serious problem. So then what do we do? Well, we try local
00:01:13TTS. And now we get these different types of problems. Some models are huge, some need a GPU,
00:01:20and some start really slow. And some sound okay on a clean demo, but break the second you feed them what apps
00:01:27actually produce. Let's say your balance is 12,575 cents due on June 15th. Call this number by 5:30 p.m.
00:01:36Those are a bunch of numbers. That's not some benchmark. That's what a normal app text might produce.
00:01:41Money, dates, phone numbers, time, just weird formatting. Hello, I'm Josh. That's easy. Production text is a lot more
00:01:49messy. So here's the question I'm trying to answer here. Can Supertonic 3 handle all that ugly real
00:01:55world stuff we actually need? Let's find out. If you enjoy coding tools to speed up your workflow,
00:02:00be sure to subscribe. We have videos coming out all the time.
00:02:03Now here's a Python script that I wrote up and all I needed to do was pip install supertonic.
00:02:09I made a simple TTS object and some data structures for the voice, voice styles, and the demos that I
00:02:14want to run. To get it to run, I just took the TTS object and called the synthesize method, passing
00:02:20in those keyword arguments. I also set these here to run automatically. Now first is just normal English.
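The script itself is not shown in the transcript, so the following is only a minimal sketch of the structure described: a table of demos, plus one call per demo that forwards keyword arguments to a synthesize function. The demo names, the `voice_style` and `speed` keywords, and the `fake_synthesize` stand-in are all hypothetical; the real Supertonic method signature should be taken from its own documentation.

```python
from typing import Any, Callable, Dict

# Hypothetical demo table: names and texts are illustrative only.
DEMOS: Dict[str, str] = {
    "english": "This is Supertonic running here on my Mac.",
    "numbers": "Your balance is due on June 15th. Call this number by 5:30 p.m.",
}


def run_demos(
    synthesize: Callable[..., Any], demos: Dict[str, str], **kwargs: Any
) -> Dict[str, Any]:
    """Call synthesize(text, **kwargs) once per demo and collect results by name."""
    return {name: synthesize(text, **kwargs) for name, text in demos.items()}


def fake_synthesize(
    text: str, voice_style: str = "default", speed: float = 1.0
) -> Dict[str, Any]:
    """Stand-in for a real TTS call so this sketch runs without any model installed."""
    return {"text": text, "voice_style": voice_style, "speed": speed}


# Assumption: the real TTS object's synthesize method would be passed in here instead.
results = run_demos(fake_synthesize, DEMOS, voice_style="voice-a", speed=1.05)
```

Passing the synthesize function in as an argument keeps the demo table independent of any one TTS library, which makes it easy to run the same texts through a second engine for comparison.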
00:02:27Let's play it. This is Supertonic running here on my Mac. If you like this,
00:02:32subscribe to the Better Stack channel. Yeah, that's exactly what we'd expect.
00:02:37That's the easy one. Now let's make it annoying with prices, phone numbers, and dates. I'm going
00:02:43to run it here again. The total invoice is 12 tons and zones for 58 75 due on June 15, 2026.
00:02:51No, right. Major lag when it comes to prices. That was actually pretty bad. This is where a lot of TTS
00:02:58systems start making weird choices and Supertonic was not an exception here.
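One common workaround when a TTS model stumbles on prices is to spell them out before synthesis. This is not something the video does or that Supertonic is confirmed to need; it is a minimal sketch, assuming US-dollar amounts written like `$12,458.75` and values under one million. The helper names are made up for illustration.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + int_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return int_to_words(thousands) + " thousand" + (" " + int_to_words(rest) if rest else "")


# Dollar sign, digits with optional thousands separators, optional two-digit cents.
PRICE = re.compile(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2}))?")


def spell_price(match: "re.Match[str]") -> str:
    dollars = int(match.group(1).replace(",", ""))
    cents = int(match.group(2) or 0)
    if dollars >= 1_000_000:
        return match.group(0)  # out of range for this sketch: leave the text untouched
    words = int_to_words(dollars) + (" dollar" if dollars == 1 else " dollars")
    if cents:
        words += " and " + int_to_words(cents) + (" cent" if cents == 1 else " cents")
    return words


def normalize_prices(text: str) -> str:
    """Replace every matched price with its spelled-out form."""
    return PRICE.sub(spell_price, text)
```

Dates, phone numbers, and times would need their own rules; the point is only that the text handed to the model can be cleaned up in plain Python first.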
00:03:03Also expressions are not going to work here either. This is on the local version, which is good as
00:03:08we're seeing, but if you want expressions, they're going to charge you for an API key and that's where
00:03:13they get us. I want a good local TTS that does expressions really well and that is still free.
00:03:20And those are hard to come by. Now let's test multiple languages. I'm going to start here with Arabic.
00:03:25Now my Arabic level is basic, but it sounded overall pretty clean. Here's some French we're
00:03:35going to output. Okay, again, sounded good. And then finally, here's some Korean.
00:03:47Okay, good. Right. Those multiple languages, they sounded really good. I don't speak those languages,
00:03:52but they sounded clean. Everything I just ran was local. And honestly, it was insanely fast.
00:03:57No internet, no API key, no hidden cloud request. But the deal is this. It handles normal text and
00:04:03other languages incredibly well. It was super fast. So I loved that. But when it came to numbers and
00:04:09expressions, the local version was not good or great by any means. So what is Supertonic 3 at a high
00:04:16level? It's an on-device text-to-speech model from Supertone. It has 99 million parameters. It runs
00:04:21locally on CPU through ONNX Runtime. It supports 31 languages, and I don't need a GPU, a cloud server,
00:04:28or an API key, unless you want those expressions. Now it's small enough to actually think about
00:04:33shipping in real local tools. Not every app, obviously, but desktop apps, controlled environments,
00:04:39and cached local setups. This starts to make sense. Version 3 also expands language support,
00:04:45improves reading stability compared to Supertonic 2, and it does support expression style tags like laugh,
00:04:51breath, sigh. But again, what are we doing? We have to pay for those. I don't want that.
00:04:57Now this is the part devs actually care about. It's not just a model file dumped on the internet.
00:05:02There are examples for Python, browser, Java, C++, C#, a bunch of other languages. So it's not just,
00:05:09here's the research model. Good luck with that. The pitch here is, here is local TTS you can actually
00:05:15wire into your app. And honestly, the scripting, everything was really fast. There are two big
00:05:21reasons Supertonic 3 stands out. Speed and deployment. A lot of TTS models sound impressive.
00:05:28This sounded good, but then you try to use them in a product and suddenly you're dealing with big
00:05:33downloads, slow generations, cold starts, or hardware requirements your users don't have.
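Cold starts and slow generation are easy to measure before committing to a model. The sketch below times model loading plus the first call (cold) separately from later calls (warm). `load_model` is a hypothetical callable that returns a synthesize function; the fake loader only simulates load time with a sleep, so the numbers it produces say nothing about Supertonic itself.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict


def measure(
    load_model: Callable[[], Callable[[str], Any]], text: str, warm_runs: int = 3
) -> Dict[str, float]:
    """Time load + first synthesis (cold), then average the following calls (warm)."""
    start = time.perf_counter()
    synthesize = load_model()
    synthesize(text)
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        synthesize(text)
        warm.append(time.perf_counter() - start)
    return {"cold_s": cold, "warm_mean_s": mean(warm)}


def fake_load_model() -> Callable[[str], Any]:
    """Stand-in loader: sleeps to simulate reading model weights from disk."""
    time.sleep(0.05)
    return lambda text: len(text)


stats = measure(fake_load_model, "Call this number by 5:30 p.m.")
```

Swapping `fake_load_model` for a function that constructs the real TTS object gives a rough cold versus warm comparison on the target hardware.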
00:05:38Then deployment is incredibly simple, right? pip install supertonic. There is a Python SDK,
00:05:44CLI usage, and a local HTTP server. And the local server includes an OpenAI-compatible /v1/audio/speech
00:05:51alias. So OpenAI, boom. That means if your app already expects an OpenAI-style speech API,
00:05:57you don't have to redesign everything. I can point the app at the local server and start testing.
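As a rough illustration of what pointing an app at the local server could look like, here is a sketch that builds an OpenAI-style speech request using only the standard library. The `model`, `input`, and `voice` fields follow the OpenAI speech API shape; the host, port, model name, and voice name are assumptions, and which fields the local server actually honors should be checked against the Supertonic documentation.

```python
import json
import urllib.request

# Assumption: host and port of the local server; the real default may differ.
BASE_URL = "http://localhost:8000"


def build_speech_request(
    text: str,
    voice: str = "default",      # hypothetical voice name
    model: str = "supertonic",   # hypothetical model name
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a POST request in the shape of the OpenAI speech endpoint."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_speech(text: str, out_path: str, **kwargs: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())


# Building the request does not touch the network; only fetch_speech does.
request = build_speech_request("Hello from a local server.")
```

Because only the base URL differs from a hosted endpoint, switching between cloud and local speech can come down to a single configuration value.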
00:06:02That is not just a nice detail. It's actually pretty great. Now let's compare it without pretending
00:06:07this tool wins every category. Cloud TTS from OpenAI, ElevenLabs, and other ones are great. If you want
00:06:14really good voices, hosted infrastructure, emotions, and zero model management, they're hard to beat,
00:06:20but the trade-off is clear. It costs money per use. It needs the internet, it adds network latency,
00:06:26and the user text leaves the device. So local TTS gives you privacy and control,
00:06:32but local models can bring their own problems. Setup pain, big files, inconsistent quality,
00:06:37and sometimes deployment can be tough. Supertonic is interesting because it handles
00:06:43most of this really well. It's not the fanciest cloud voice. Well, it's not even a cloud voice,
00:06:48right? But it's not the fanciest by any means, but it's small enough, fast enough, and easy enough
00:06:53to test in a real app. But honestly, it kind of failed the tests that I actually cared about for
00:06:58this on a local version, which was emotions and prices or numbers. So run your own test on this.
00:07:04You could try invoices, support tickets, markdown, long paragraphs. That is how you find out if this TTS
00:07:10model works for you where you need it. All right. So my take is this. This may be one of the most
00:07:15practical local TTS options for devs who just want to ship faster, but that API key is no bueno. I
00:07:22wanted emotion and numbers handled well, and this did not do that. So should you use Supertonic 3? Well,
00:07:28yeah, sure. Why not try it? If you're building a local voice agent, sure, give it a test. But
00:07:34skip it if your top priority is good narration. You want those emotions. You want the easiest possible
00:07:39voice cloning workflow. Maybe not that, right? For that, a cloud platform is still going to be a better
00:07:45choice. If you want to ship faster, you want to keep it private, local. This is really good. This
00:07:51is worth testing. If you enjoy coding tools like this, be sure to subscribe to the Better Stack channel.
00:07:56We'll see you in another video.

Description

In this video, I test Supertonic 3, a fast local text-to-speech model for developers that runs fully offline with no API key, no cloud request, and no GPU required. If you are building local AI voice agents, privacy-first apps, offline e-readers, or high-volume products where cloud TTS costs, latency, and privacy become a problem, this is worth paying attention to. I run Supertonic 3 on real developer text, including money, dates, phone numbers, expression tags, English, Spanish, French, and Arabic, to see if it can handle the messy strings that normal apps actually generate.

🔗 Relevant Links
Supertonic Repo - https://github.com/supertone-inc/supertonic
Supertonic HuggingFace - https://huggingface.co/spaces/Supertone/supertonic-3

❤️ More about us
Radically better observability stack: https://betterstack.com/
Written tutorials: https://betterstack.com/community/
Example projects: https://github.com/BetterStackHQ

📱 Socials
Twitter: https://twitter.com/betterstackhq
Instagram: https://www.instagram.com/betterstackhq/
TikTok: https://www.tiktok.com/@betterstack
LinkedIn: https://www.linkedin.com/company/betterstack

📌 Chapters:
0:00 Supertonic 3 local TTS demo
0:37 Why cloud TTS is expensive for developers
1:58 Running Supertonic 3 offline on an M4 Mac
4:00 What is Supertonic 3?
4:30 Why Supertonic 3 is different from other TTS models
6:08 Supertonic 3 vs cloud TTS and local TTS models
6:50 Supertonic 3 pros and cons
7:14 Should developers use Supertonic 3 in 2026?
