00:00:00An 82 million parameter model just beat much larger TTS systems, and it runs locally on
00:00:06a laptop faster than most paid APIs.
00:00:09Last month I paid for a cloud TTS, but still got some lag.
00:00:13That made no sense to me.
00:00:14How are some of these open source models beating this?
00:00:17This is Kokoro-82M, and it's already being shipped by some devs.
00:00:22Let's see how this works and better yet, how it sounds.
00:00:30Okay, now if you're building with text-to-speech, you're usually choosing between two bad options.
00:00:36First option is obviously cloud APIs, right?
00:00:39They're easy to start, but now you've got these bills, latency spikes, and one more dependency
00:00:44every time your app speaks.
00:00:46Then the next option would be something like these big open models, but now you need a lot
00:00:51more hardware, more memory, and it's still, let's face it, not that fast.
00:00:56So the thing that's supposed to feel smooth ends up feeling slow, expensive, or it just
00:01:00plain breaks.
00:01:02This is where Kokoro fits in.
00:01:04It was trained on less than 100 hours of data, but still ranks at the top of leaderboards.
00:01:09It beats much larger models with a fraction of the size, it's Apache 2.0, runs on a CPU,
00:01:15and it flies on Apple Silicon, and generates speech honestly insanely fast.
00:01:19So now local voice apps and real-time agents actually start to make more sense.
00:01:24If you enjoy coding tools and tips like this, be sure to subscribe.
00:01:27We have videos coming out all the time.
00:01:29Alright now let me show you this.
00:01:31I'm running all this locally on a Mac M4 Pro.
00:01:34The setup takes like 30 seconds. I'll just run this pip command here.
00:01:39I'm in a conda environment, but that's pretty much it.
00:01:42I've got this whole Python script from their official repo. I didn't have to change anything
00:01:47to test this out; it's just drag and drop, and we get all these outputs.
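The script I'm running follows the usage pattern from the official repo. Here's a minimal sketch of that pattern; the language codes and voice names (like af_heart) are taken from the Kokoro docs, so treat them as assumptions to verify against the repo:

```python
# Minimal Kokoro sketch based on the official repo's usage pattern.
# Lang codes and voice names are assumed from the Kokoro docs.

LANG_CODES = {
    "american english": "a",
    "british english": "b",
    "spanish": "e",
    "french": "f",
    "hindi": "h",
    "italian": "i",
    "japanese": "j",
    "brazilian portuguese": "p",
    "mandarin chinese": "z",
}

def synthesize(text: str, language: str = "american english",
               voice: str = "af_heart", out_prefix: str = "out"):
    """Generate speech with Kokoro and save each chunk as a WAV file."""
    from kokoro import KPipeline   # pip install kokoro
    import soundfile as sf         # pip install soundfile

    pipeline = KPipeline(lang_code=LANG_CODES[language.lower()])
    # The pipeline yields (graphemes, phonemes, audio) per text chunk.
    for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice=voice)):
        sf.write(f"{out_prefix}_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz

if __name__ == "__main__":
    synthesize("Better Stack is the leading observability platform.")
```

Swapping language is just a different lang_code plus a matching voice, which is all the "flip the switch" moment later in the demo amounts to.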
00:01:51I can choose a voice and a language right here, but for the first round I'm just gonna leave
00:01:56it set as it is because honestly it sounds really good.
00:02:00I'm gonna run it and then let's listen.
00:02:02Better Stack is the leading observability platform
00:02:05that makes monitoring simple.
00:02:07It has AI SRE, logs, metrics, traces, error tracking,
00:02:12and incident response, all in one place.
00:02:14Not gonna lie, that was pretty good, and it came out really fast.
00:02:19Now if I flip the switch, let's do French and switch to the French voice.
00:02:24Change the text a little bit and again let's run it.
00:02:26Better Stack is the leading platform for observability.
00:02:29It simplifies monitoring.
00:02:31Okay now my French is rusty so don't translate that word for word, but that sounded pretty
00:02:36good as well.
00:02:37You guys can be the judge of that though.
00:02:39Each output saves as a WAV file, so I can keep them however I want.
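Those outputs are standard 24 kHz mono WAV files. As a quick sanity check of what that container looks like, here's a stdlib-only sketch that writes a short sine tone at the same sample rate and reads the header back; the tone is synthetic, not Kokoro output:

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz mono audio

def write_tone(path: str, freq: float = 440.0, seconds: float = 0.1):
    """Write a 16-bit mono WAV containing a sine tone."""
    n = int(SAMPLE_RATE * seconds)
    samples = (int(32767 * 0.5 * math.sin(2 * math.pi * freq * t / SAMPLE_RATE))
               for t in range(n))
    with wave.open(path, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit PCM
        w.setframerate(SAMPLE_RATE)
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

def wav_info(path: str):
    """Return (channels, sample_width_bytes, frame_rate, n_frames)."""
    with wave.open(path, "rb") as r:
        return r.getnchannels(), r.getsampwidth(), r.getframerate(), r.getnframes()
```

Anything that plays WAV, from a browser tag to a game engine, consumes these files directly, which is part of why local TTS is so easy to wire into an app.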
00:02:43There's no cloud.
00:02:44There's no GPU.
00:02:45That was pretty crazy.
00:02:47So what actually is Kokoro 82M?
00:02:49At a high level, it's a StyleTTS 2-based model with a lightweight vocoder.
00:02:55All that means is it's built to sound good without being huge, and that's really the key
00:02:59difference here.
00:03:00Most other options go bigger.
00:03:01So XTTS, CosyVoice, F5-TTS: hundreds of millions to over a billion parameters.
00:03:08Then cloud tools like ElevenLabs or OpenAI, they do solve the hardware problem, but now we're
00:03:13paying per request and sending our data out.
00:03:16Kokoro goes the other direction.
00:03:19It's small, it's fast to start, and it runs locally, plus it uses way less memory.
00:03:24But there are downsides: it doesn't do zero-shot voice cloning out of the box; instead,
00:03:29it focuses on efficiency and quality that we can actually ship a lot faster.
00:03:33We still get 8 languages, 54 voices, and pretty good control with their Misaki G2P library.
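Misaki is the grapheme-to-phoneme engine Kokoro uses, and "control" here means you can see or tweak the phonemes before synthesis. A hedged sketch of that, assuming misaki's `en.G2P` interface (installed via `pip install misaki[en]`; verify the exact API against the repo before relying on it):

```python
def text_to_phonemes(text: str):
    """Convert English text to the phoneme string Kokoro consumes.

    Assumes misaki's en.G2P interface (pip install misaki[en]);
    the call signature is taken from the misaki docs.
    """
    from misaki import en  # Kokoro's G2P engine (assumed API)
    g2p = en.G2P()
    phonemes, tokens = g2p(text)
    return phonemes
```

Inspecting the phoneme string is how you'd fix a mispronounced product name or acronym without touching the model itself.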
00:03:39I can see where all this is going to fit really well in different types of agents, but you
00:03:42do not get any type of emotion, which is what I really wanted to see here.
00:03:47An AI without emotion is still going to sound heavily like AI, which I guess can be good
00:03:52at times, right?
00:03:53But it would be fun to play around with that emotion.
00:03:56So why are devs actually using this?
00:03:58Well, in case the demo didn't show you, let's touch on it, because it fixes the stuff that usually
00:04:02breaks voice features.
00:04:04First is the speed.
00:04:05If your agent pauses too long and stops feeling real, Kokoro cuts that delay way down.
00:04:11Then there's offline use.
00:04:13No internet, no API keys, no random network failures.
00:04:16That's great.
00:04:17The privacy is pretty big because Kokoro keeps everything local, so for me, for a lot of you,
00:04:22that might be a huge win.
00:04:23And finally, cost at scale.
00:04:26Because it's so lightweight, you can run way more instances on one machine.
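The "more instances per machine" claim is mostly parameter math. A rough back-of-envelope sketch, counting weights only and ignoring activations and runtime overhead, so treat the numbers as lower bounds:

```python
PARAMS = 82_000_000  # Kokoro-82M

def weight_footprint_mb(params: int, bytes_per_param: int) -> float:
    """Approximate model weight size in MiB for a given precision."""
    return params * bytes_per_param / (1024 ** 2)

fp32 = weight_footprint_mb(PARAMS, 4)  # float32: ~313 MiB
fp16 = weight_footprint_mb(PARAMS, 2)  # float16: ~156 MiB

# Compare: a 1B-parameter model needs ~3.7 GiB at fp32 for weights
# alone, so roughly a dozen Kokoro instances fit in the same RAM.
```

That gap, not just raw speed, is what makes running a fleet of local voice workers on one box plausible.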
00:04:30So, what's great and what's not? What I loved: it's fast and small.
00:04:33It sounds natural for long form content.
00:04:35That was really cool.
00:04:36I've played around with a bunch of these.
00:04:38It is Apache 2.0, so you could ship it, and after setup, it's basically free.
00:04:43All of these are really, really nice.
00:04:46But there are things that I didn't like.
00:04:47First, no native voice cloning. Whether that matters depends on whether you need it,
00:04:51but it could have been there.
00:04:52Emotion is pretty neutral.
00:04:54Great for narration, not great for anything dramatic.
00:04:56I mean, there really is no ability to change emotion here, plus non-English voices are
00:05:02still improving.
00:05:03So maybe that needs to be added, maybe not; it depends on how you view this.
00:05:07So is it perfect?
00:05:08No.
00:05:09But for the problems most of us actually have, cost, latency, privacy, and deployment,
00:05:14it does seem to solve the right ones right now.
00:05:18Play around with it and let me know.
00:05:19Kokoro-82M proves you don't need a massive model to get really good TTS.
00:05:24Smaller means faster, faster means usable, and then usable usually means you can actually
00:05:29ship it.
00:05:30If you're building voice agents or local tools, this is worth trying out.
00:05:34If you enjoy coding tools and tips like this, be sure to subscribe to the Better Stack channel.
00:05:38We'll see you in another video.