00:00:00This could have been done with an email.
00:00:02This could have been done with an email.
00:00:04Same sentence, two completely different performances.
00:00:07I just typed: start normal, then slowly turn into a frustrated rant.
00:00:11That's it.
00:00:12No markup, no API sending your data somewhere else.
00:00:15This is Qwen3 TTS.
00:00:17Their new open source voice model that lets you direct tone and actually listens.
00:00:22Let's see how this stacks up to ElevenLabs or even Chatterbox.
00:00:30Many of the open source voice models lack any type of emotion.
00:00:34I've tried Chatterbox, and that was actually decent.
00:00:37So knowing Qwen has this, I wanted to not only see the voice cloning,
00:00:41but also how its emotion handling stacks up against the others.
00:00:44And honestly, I was pleasantly surprised.
00:00:47Chatterbox has an emotion slider, whereas here in Qwen,
00:00:50you literally type out how you want it to sound, so it gives us a bit more freedom.
00:00:55The lighter model has three-second voice cloning, which we're going to check out.
00:00:59Then when we beef it up to the 1.7B, we lose voice cloning,
00:01:02but we get real time streaming with 97 milliseconds latency,
00:01:0510 languages with natural code switching, and it's 100% local.
00:01:09It's free.
00:01:09It's Apache 2.0.
00:01:11That means faster prototyping, private voice agents, accessibility tools.
00:01:16If you're always looking for the latest tools, be sure to subscribe.
00:01:19We have videos coming out all the time.
00:01:21Now cloning is easy.
00:01:22Emotion is harder.
00:01:23So let's try to break this.
00:01:25We will test cloning first.
00:01:28So I'll first upload my voice that I already recorded as the reference here.
00:01:32Then in reference text, I need to type out what I recorded in that audio.
00:01:37Over here in target text is where I will type what I want the output to become.
00:01:42That's it.
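The three inputs in that cloning tab can be summed up as a small request record. To be clear, these field names are purely illustrative, not the actual Qwen interface; it's just a sketch of what the UI is asking for:

```python
from dataclasses import dataclass

@dataclass
class CloneRequest:
    reference_audio: str  # path to the recorded reference clip you upload
    reference_text: str   # transcript of exactly what the reference clip says
    target_text: str      # what you want the cloned voice to say

# Example values — placeholders, not the files used in the video
req = CloneRequest(
    reference_audio="my_voice.wav",
    reference_text="This is a sample of my speaking voice.",
    target_text="This could have been done with an email.",
)
print(req.target_text)
```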
00:01:43Now this actually took a lot longer than I thought to run.
00:01:46So I was hoping the quality would match, but let's take a listen.
00:01:49How does this sound using this model?
00:01:51Now, I mean, that was okay for a lighter model, especially from Qwen,
00:01:55but you can clearly hear some areas that sound generated.
00:01:59So it wasn't amazing by any means.
00:02:01The best voice-clone audio I've found was VibeVoice from Microsoft, which was insane.
00:02:07This was just decent.
00:02:08Okay.
00:02:09So voice cloning is done.
00:02:10Check.
00:02:11But now let's beef it up with the 1.7b model and switch it over to start adding emotion
00:02:16into text to just see how Quen handles this.
00:02:19Let me show you something that actually feels useful.
00:02:22I will type in the instruct box here: tell this like a suspenseful narrator,
00:02:26slow buildup, and then a relieved laugh at the end.
00:02:28And over here, I want it to say some basic info about Qwen because we're doing that.
00:02:32Why not?
00:02:33Let's take a listen.
00:02:34Alibaba's new open source text to speech model that
00:02:37finally feels like you're talking to a real voice actor.
00:02:42Okay.
00:02:42So we did hear a little discrepancy.
00:02:44This didn't pick up every tone, but it did get a lot right.
00:02:47There's no dropdowns, no presets.
00:02:49You are guiding it to how you want it to sound.
00:02:51Now let's make a voice that feels like someone we might actually interact with.
00:02:55Maybe we're building a project.
00:02:57Let's drop in some stuff here.
00:02:58I'm going to say something about writing tests.
00:03:01And then in the instruct box, let's say young,
00:03:03enthusiastic developer voice, a bit sarcastic, but friendly.
00:03:07Now this isn't me picking voice preset 12.
00:03:10I described exactly how I want that personality to sound.
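Since the instruct box is just free text, there's no fixed schema for these directions. A tiny helper like this, which is purely illustrative and not part of any Qwen API, shows one way you might keep style directions consistent across runs:

```python
def build_instruct(role: str, traits: list[str]) -> str:
    """Compose a free-text style direction for an instruct-driven TTS model."""
    return f"{role} voice, " + ", ".join(traits)

# Recreating the direction used in the demo above
print(build_instruct("young, enthusiastic developer", ["a bit sarcastic", "but friendly"]))
# young, enthusiastic developer voice, a bit sarcastic, but friendly
```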
00:03:13Let's take a listen.
00:03:14Writing code tests means carefully checking that your program does what it is supposed to do.
00:03:20Now you might be thinking, how does this compare to others?
00:03:22Well, ElevenLabs is still king, but it costs you money and your data leaves your machine.
00:03:26Chatterbox is excellent.
00:03:28One of the better ones I've used and it does have good emotion.
00:03:31If you're still after voice cloning, then I'm going to stand by VibeVoice, which was scary good.
00:03:36Qwen3 TTS wins when you want to describe the voice naturally and iterate fast.
00:03:41Obviously there are some good things here.
00:03:43I like the natural-language control for fast iteration.
00:03:47It's fully local and private, streaming-ready for
00:03:50real-time agents, and the voice design here feels somewhat more intuitive.
00:03:55Then what we don't like about this, or what I should say
00:03:57I don't like about this: it is a newer model, right?
00:04:00So it's still maturing in some languages.
00:04:03Like with any TTS, a GPU is recommended for best performance.
00:04:06CPU works, though.
00:04:07It's just going to be slower.
00:04:09And emotion really depends on how well you prompt it, how well you instruct it.
00:04:13If your direction is vague, the output is going to be vague too.
00:04:16So the big question: is the setup painful?
00:04:19No, absolutely not.
00:04:20Super straightforward.
00:04:22Clone the repo, install dependencies, launch the web UI, open localhost.
00:04:26That is all I did here from zero to working demo in literally just a few minutes.
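For reference, that setup flow looks roughly like this. The repo URL and entry script here are placeholders, not the actual paths; grab the real links from the video description:

```shell
# Placeholder repo path — use the actual link from the description below
git clone https://github.com/QwenLM/<qwen-tts-repo>.git
cd <qwen-tts-repo>

# Install the project's dependencies
pip install -r requirements.txt

# Launch the web UI, then open the printed localhost address in a browser
python app.py
```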
00:04:32There's no API keys.
00:04:33There's no billing.
00:04:34And it's just on your machine.
00:04:35That is what open source voice should feel like.
00:04:38That's why playing with these open source voice tools is really cool: you get to see what each one offers.
00:04:43Qwen3 TTS: fast, private, and more dev-controlled.
00:04:46So try it yourself.
00:04:48I dropped the links below.
00:04:49And if you want more local tools like this, be sure to subscribe.
00:04:52We'll see you in another video.