Qwen TTS Just Changed Open-Source Voice

Better Stack

Transcript

00:00:00This could have been done with an email.
00:00:02This could have been done with an email.
00:00:04Same sentence, two completely different performances.
00:00:07I just typed start normal then slowly turned into a frustrated rant.
00:00:11That's it.
00:00:12No markup, no API sending your data somewhere else.
00:00:15This is Qwen3-TTS.
00:00:17Their new open source voice model that lets you direct tone and actually listens.
00:00:22Let's see how this stacks up to Eleven Labs or even Chatterbox.
00:00:30Many of the open source voice models lack any type of emotion.
00:00:34I've tried Chatterbox, and that was actually decent.
00:00:37So knowing Qwen has this, I wanted to see not only the voice cloning,
00:00:41but also how its emotion control stacks up against the others.
00:00:44And honestly, I was pleasantly surprised.
00:00:47Chatterbox has an emotion slider, whereas here in Qwen
00:00:50you literally type out how you want it to sound, which gives us a bit more freedom.
00:00:55On the lighter model, it has three-second voice cloning, which we're going to check out.
00:00:59Then when we beef it up to the 1.7B, we lose voice cloning,
00:01:02but we get real time streaming with 97 milliseconds latency,
00:01:0510 languages with natural code switching, and it's 100% local.
00:01:09It's free.
00:01:09It's Apache 2.0.
00:01:11That means faster prototyping, private voice agents, accessibility tools.
00:01:16If you're always looking for the latest tools, be sure to subscribe.
00:01:19We have videos coming out all the time.
00:01:21Now cloning is easy.
00:01:22Emotion is harder.
00:01:23So let's try to break this.
00:01:25We will test cloning first.
00:01:28So I'll first upload my voice that I already recorded as the reference here.
00:01:32Then in reference text, I need to type out what I recorded in that audio.
00:01:37Over here in target text is where I will type what I want the output to become.
00:01:42That's it.
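The three boxes in the cloning UI boil down to a simple three-field payload. Here's a minimal sketch in Python, purely illustrative: the field names and file path are assumptions for clarity, not Qwen's actual API schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical payload mirroring the three UI inputs described above.
# Field names are illustrative assumptions, not Qwen's real schema.
@dataclass
class CloneRequest:
    reference_audio: str  # path to the pre-recorded reference clip
    reference_text: str   # transcript of what the reference clip says
    target_text: str      # what the cloned voice should say

req = CloneRequest(
    reference_audio="my_voice.wav",  # assumed file name
    reference_text="This is what I said in the recording.",
    target_text="This could have been done with an email.",
)
payload = asdict(req)
print(payload)
```

However you wire it up (web UI, script, or local HTTP call), those three fields are the whole cloning workflow.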
00:01:43Now this actually took a lot longer than I thought to run.
00:01:46So I was hoping the quality would match, but let's take a listen.
00:01:49How does this sound using this model?
00:01:51Now, I mean, that was okay for a lighter model, especially from Qwen,
00:01:55but you can clearly hear some areas that sounded generated.
00:01:59So it wasn't amazing by any means.
00:02:01The best voice clone audio that I found was VibeVoice from Microsoft, which was insane.
00:02:07This was just decent.
00:02:08Okay.
00:02:09So voice cloning is done.
00:02:10Check.
00:02:11But now let's beef it up with the 1.7B model and switch over to adding emotion
00:02:16into the text to see how Qwen handles this.
00:02:19Let me show you something that actually feels useful.
00:02:22I will type in the instruct box here, tell this like a suspenseful narrator,
00:02:26slow buildup, and then relieved laugh at the end.
00:02:28And over here, I want it to say some basic info about Qwen because we're doing that.
00:02:32Why not?
00:02:33Let's take a listen.
00:02:34Alibaba's new open source text to speech model that
00:02:37finally feels like you're talking to a real voice actor.
00:02:42Okay.
00:02:42So we did hear a little discrepancy.
00:02:44This didn't pick up every tone, but it did get a lot right.
00:02:47There's no dropdowns, no presets.
00:02:49We're guiding it to how we want it to sound.
00:02:51Now let's make a voice that feels like someone we might actually interact with.
00:02:55Maybe we're building a project.
00:02:57Let's drop in some stuff here.
00:02:58I'm going to say something about writing tests.
00:03:01And then in the instruct box, let's say young,
00:03:03enthusiastic developer voice, a bit sarcastic, but friendly.
00:03:07Now this isn't me picking voice preset 12.
00:03:10I described exactly how I want that personality to sound.
00:03:13Let's take a listen.
00:03:14Writing code tests means carefully checking that your program does what it is supposed to do.
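Both instruct prompts above are just free-form strings. A tiny helper, purely illustrative (Qwen's instruct box takes the raw string directly; this function is not part of any Qwen API), shows how the style cues from the two demos compose:

```python
# Illustrative only: the instruct box accepts any natural-language string,
# so "building" a prompt is just joining style cues with commas.
def build_instruct(*cues: str) -> str:
    """Join style cues into one instruct-box string."""
    return ", ".join(cues)

narrator = build_instruct(
    "suspenseful narrator", "slow buildup", "relieved laugh at the end"
)
developer = build_instruct(
    "young, enthusiastic developer voice", "a bit sarcastic, but friendly"
)
print(narrator)  # -> suspenseful narrator, slow buildup, relieved laugh at the end
```

The point: there's no enum of voice IDs to memorize; the prompt itself is the interface.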
00:03:20Now you might be thinking, how does this compare to others?
00:03:22Well, Eleven Labs is still king, but it costs you money and your data leaves your machine.
00:03:26Chatterbox is excellent.
00:03:28One of the better ones I've used and it does have good emotion.
00:03:31If you're still after voice cloning, then I'm going to stand by VibeVoice, which was scary good.
00:03:36Qwen3-TTS wins when you want to describe the voice naturally and iterate fast.
00:03:41Obviously there are some good things here.
00:03:43I like the natural language control for fast iteration.
00:03:47It's fully local and private, streaming-ready for
00:03:50real-time agents, and voice design here feels more intuitive.
00:03:55Then what we don't like about this, or I should say,
00:03:57what I don't like about this, is that it's a newer model, right?
00:04:00So it's still maturing in some languages.
00:04:03Like with any TTS, a GPU is recommended for best performance,
00:04:06though CPU works.
00:04:07It's just going to be slower.
00:04:09And emotion really depends on how well you prompt it, how well you instruct it.
00:04:13If your direction is vague, the output is going to be vague too.
00:04:16So the big question: is the setup painful?
00:04:19No, absolutely not.
00:04:20Super straightforward.
00:04:22Clone the repo, install dependencies, launch the web UI, open localhost.
00:04:26That is all I did here from zero to working demo in literally just a few minutes.
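For reference, those four steps look something like this. The repo URL, directory, and entry-point script are placeholders (grab the real links from the video description), and the port is an assumption based on the common default for this kind of local web UI:

```shell
# Placeholder URL and script names -- use the actual links from the description.
git clone <qwen3-tts-repo-url>
cd <repo-dir>
pip install -r requirements.txt   # install dependencies
python app.py                     # launch the local web UI
# then open http://localhost:7860 in your browser (port may differ per repo)
```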
00:04:32There's no API keys.
00:04:33There's no billing.
00:04:34And it's just on your machine.
00:04:35That is what open source voice should feel like.
00:04:38That's why playing with these open-source voice tools is really cool: you see what each one offers.
00:04:43Qwen3-TTS: fast, private, and more dev-controlled.
00:04:46So try it yourself.
00:04:48I dropped the links below.
00:04:49And if you want more local tools like this, be sure to subscribe.
00:04:52We'll see you in another video.

Key Takeaway

Qwen3-TTS marks a significant shift in open-source AI by allowing users to direct vocal emotions and styles through simple natural-language prompts rather than complex technical parameters.

Highlights

Qwen3-TTS offers natural-language instruction for voice performance instead of rigid sliders or presets.

The model is 100% local and open-source under the Apache 2.0 license, ensuring data privacy and no API costs.

The 1.7B parameter version features 97ms low-latency streaming and supports natural code-switching across 10 languages.

Voice cloning is available on the lighter model with as little as three seconds of reference audio.

Comparison with industry leaders reveals that while Eleven Labs is superior in quality, Qwen wins on speed, privacy, and natural iteration.

Installation is developer-friendly, requiring only a repository clone and dependency install to run a local web UI.

Timeline

Introduction to Natural Language Voice Direction

The speaker demonstrates the power of Qwen3-TTS by showing how the same sentence can be delivered with vastly different emotional performances. Unlike traditional models that require markup or specific APIs, this tool allows users to simply type instructions like “slowly turn into a frustrated rant.” This section introduces the core value proposition of a model that actually “listens” to stylistic directions. The speaker emphasizes that this is an open-source alternative to proprietary giants. It sets the stage for a deep dive into how Qwen compares to existing solutions like Eleven Labs and Chatterbox.

Technical Specs and Model Variants

This segment breaks down the differences between the lighter model and the beefed-up 1.7B parameter version. The lighter model excels at rapid three-second voice cloning, while the 1.7B version prioritizes performance with 97-millisecond latency for real-time streaming. It supports 10 languages and natural code-switching, making it ideal for international applications. Because it is released under the Apache 2.0 license, it is free for prototyping and building private voice agents. The speaker notes that being 100% local is a massive advantage for privacy-conscious developers.

Testing Voice Cloning Performance

The speaker puts the lighter model to the test by attempting to clone their own voice using a short reference recording. The process involves uploading audio, providing reference text, and then entering the target text for the output. While the setup is simple, the speaker admits that the processing time was longer than expected and the final quality had some “generated” artifacts. They compare the results unfavorably to Microsoft's VibeVoice, which they consider the current gold standard for cloning. This section highlights that while Qwen is versatile, its cloning capabilities are currently just “decent” rather than revolutionary.

Directing Emotion and Personality

The focus shifts to the 1.7B model where the speaker explores “instruct-based” text-to-speech. They demonstrate this by asking the AI to act as a “suspenseful narrator” with a “relieved laugh” at the end, followed by a “sarcastic but friendly” developer persona. This method eliminates the need for dropdown menus or preset IDs, giving the user total creative freedom over the performance. The model successfully captures the nuances of the requested personalities, illustrating a more intuitive way to design voices. The speaker emphasizes that describing a personality is much more efficient than toggling technical sliders.

Market Comparison and Final Verdict

In the concluding section, the speaker weighs the pros and cons of Qwen against competitors like Eleven Labs and Chatterbox. While Eleven Labs remains the “king” of sheer quality, Qwen is preferred for fast iteration, local privacy, and developer control. The main drawbacks identified are the need for a decent GPU and the fact that output quality is heavily dependent on the quality of the user's prompts. Setup is praised for being extremely straightforward, involving a simple repository clone and local host launch without the need for API keys. Ultimately, the speaker recommends Qwen for anyone looking for a private, fast, and highly customizable open-source voice tool.
