00:00:00This is Vibe Voice by Microsoft, and I used it to clone my own voice.
00:00:04An open-source speech stack that's already getting compared to ElevenLabs, Chatterbox, and Whisper.
00:00:10But it runs offline, and it can generate 90 minutes of multi-speaker audio in one pass.
00:00:1590 minutes or anything close to that sounds a bit wild. So is this actually usable for developers,
00:00:20or is it another research repo that quietly kills our GPUs? I'll run through some demos,
00:00:26and then we're going to see how it compares to others. We have videos coming out all the time,
00:00:29be sure to subscribe.
00:00:31You can get all this from their repo or on Hugging Face. Now before we compare anything,
00:00:40let's just look at the outputs. This is all prepped, set up, and running up front,
00:00:45so we can focus on what matters. I've used others, so I'm actually interested to see how Vibe Voice
00:00:51sounds, how it holds up, and how we can get something useful from the outputs.
00:00:56I'm going to do all this as a multi-speaker output, a real-time TTS, then the voice cloning.
00:01:02Here's a short podcast-style script with three speakers, clean turn-taking, and emotion tags.
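As an aside, scripts like that are typically just plain text with one tagged line per speaker turn. A minimal sketch of assembling one in Python (the `Speaker N:` prefix is an assumed convention here, not confirmed VibeVoice syntax — check the repo's demo scripts for the exact format):

```python
# Build a turn-tagged script for a multi-speaker TTS run.
# The "Speaker N:" prefix is an assumed convention, not confirmed syntax.

def build_script(turns: list[tuple[int, str]]) -> str:
    """turns: (speaker_index, text) pairs in playback order."""
    return "\n".join(f"Speaker {idx}: {text.strip()}" for idx, text in turns)

podcast = build_script([
    (1, "Welcome back to the show."),
    (2, "Thanks for having me."),
    (3, "Let's get right into it."),
])
print(podcast)
```

Keeping the script as plain text like this also makes it trivial to diff, template, or generate with an LLM before handing it to the TTS step.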
00:01:08Now what you'd expect from most TTS demos is it sounds decent and then it starts drifting,
00:01:14but just listen to what happens here. Speaker consistency seems to stay solid,
00:01:18and transitions don't actually collapse. Let's take a listen.
00:01:26I mean, it sounds alright, right? It doesn't sound like it's making up context after 20 seconds,
00:01:41right? There we go. That's the big point. Microsoft hasn't just made this for short toy projects.
00:01:46It's made for longer-context audio generation, and it runs offline too. But when you add emotion tags,
00:01:52it starts to fall apart. Unlike Chatterbox, for example, it infers emotion automatically from the words,
00:01:58and that's not actually that great. I didn't like that. Chatterbox still kind of won here.
00:02:02But if you're building things like AI podcasts, narrated docs, long-form agents,
00:02:07or just training data, this might actually do a decent job at those.
00:02:11Now let's switch gears here into real-time mode. This runs a lot faster than multi-speaker,
00:02:16which honestly took a long time to generate. This is now incremental streaming, so think
00:02:22chatbot responses, voice agents, and assistants. First-chunk latency is around 300 milliseconds,
00:02:28which is usable. It's not the fastest I've used. Let's take a listen here.
00:02:32Imagine drinking hot chocolate in Japan under cherry blossoms.
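If you want to sanity-check a first-chunk latency number like that on your own setup, timing the gap to the first streamed chunk is enough. A minimal sketch with a stand-in generator (swap `fake_tts_stream` for the real streaming call; the 10 ms per-chunk delay is made up):

```python
import time
from typing import Iterator

def fake_tts_stream(text: str) -> Iterator[bytes]:
    """Stand-in for a real streaming TTS call; yields audio chunks."""
    for word in text.split():
        time.sleep(0.01)          # simulate per-chunk synthesis time
        yield word.encode()       # a real stream would yield PCM bytes

def first_chunk_latency(stream: Iterator[bytes]) -> float:
    """Seconds from the request until the first audio chunk arrives."""
    start = time.monotonic()
    next(stream)                  # block until the first chunk shows up
    return time.monotonic() - start

latency = first_chunk_latency(fake_tts_stream("hello there world"))
print(f"first chunk after {latency * 1000:.0f} ms")
```

Measuring against the first chunk rather than the full generation is what matters for voice agents: the user hears audio as soon as that first chunk lands.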
00:02:35Okay. And yes, they say it can sing or even generate background music if you push it.
00:02:40I pushed it. It didn't work. But the point here is:
00:02:43is this production-ready real-time? I don't think so. But for experimentation and agents,
00:02:48yeah, this is pretty good. Now the fun stuff. Let's talk about the voice cloning because that
00:02:53was really, really cool. Here was my setup for that. First, I recorded myself on voice memos.
00:02:58I'm on a Mac. I then converted that file into a WAV file, and I launched Gradio with this command.
00:03:04From this interactive interface, I can then choose my own voice as the target voice.
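One small gotcha with that conversion step: it's worth sanity-checking the WAV before using it as a reference clip. A quick stdlib check (the mono and minimum-length requirements here are my assumptions, not documented VibeVoice constraints — match whatever the model's docs actually require):

```python
import wave

def check_reference_clip(path: str, min_seconds: float = 3.0) -> None:
    """Raise if the WAV looks unsuitable as a cloning reference.
    The constraints below are assumed, not taken from VibeVoice docs."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        if wav.getnchannels() != 1:
            raise ValueError("expected a mono recording")
        if duration < min_seconds:
            raise ValueError(f"clip too short: {duration:.1f}s")

# Example: write a 4-second silent mono clip at 24 kHz and validate it.
with wave.open("ref.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)          # 16-bit samples
    out.setframerate(24000)
    out.writeframes(b"\x00\x00" * 24000 * 4)
check_reference_clip("ref.wav")
print("reference clip looks OK")
```

Catching a stereo or too-short clip here is much cheaper than debugging a weird-sounding clone afterwards.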
00:03:10That's it. Just a normal recording. And what you'd expect is something close to my voice,
00:03:14but obviously fake. Take a listen to this. This is my voice cloned using VibeVoice.
00:03:19It honestly sounds really good. Almost too good because I didn't say any of this. Now that did
00:03:25sound similar to me, but if you know me, then you'd probably still tell it's a fake. At least I hope
00:03:30so. Now it's not perfect, but it's consistent and stable across longer outputs. That's
00:03:36great. Microsoft says this stack can handle long form generation in a single pass and in practice
00:03:41stays noticeably more stable than Whisper-style pipelines once the audio gets longer, right? And
00:03:47if you've ever tried cloning a voice from more than a short clip, you know why this matters. So yeah,
00:03:52the demos were impressive, I guess. I had fun with those, especially the voice cloning, but I went through the
00:03:56docs, the issues, some threads, and it's a mixed bag from other devs. Now the pros first, then the stuff
00:04:02that you're going to run into. The pros here are solid for the most part. It's long form for sure,
00:04:08right? Most TTS systems drift, flatten, or break after a few minutes. VibeVoice is made for
00:04:14longer audio, and it showed in my longer demos. Then there's efficiency plus expressiveness.
00:04:20It uses low-frame-rate audio tokenizers, which keeps the context more manageable. Add diffusion plus an
00:04:27LLM backbone and you get expressive speech without absurd compute. It felt a bit more dev-friendly
00:04:33by design, right? This was nice. It's MIT-licensed. It runs offline. It runs on consumer GPUs, around seven
00:04:40GB of VRAM for real time. And fine-tuning code is included, especially for ASR. This is not
00:04:47locked down in any sense, and it's really good. Finally, like some other open-source stacks, it gives you structured
00:04:53ASR output. Huge win. Speaker diarization plus timestamps out of the box saves a lot of time
00:04:59downstream. If you've built transcription pipelines, you know that's not a small thing. Now the
00:05:04drawbacks I definitely felt here and I saw them as well. This is still research software at heart.
00:05:11Microsoft pulled some TTS code paths due to deepfake concerns, and that tells you kind of everything. The SDK,
00:05:17it's not a grand slam. It's not polished, right? There are obviously some audio quirks as I found
00:05:23with other software. You might hear some robotic intonation. Sometimes pacing is going to feel off,
00:05:28and multi-speaker scenes degrade beyond two or three speakers. Devs seem to love the tokenizer and
00:05:33hate the VRAM spikes. And language coverage is limited. Chinese and English, they're
00:05:40great. But if you need any other language, VibeVoice is not going to be it. Lastly, the
00:05:46drawback of zero semantic understanding: this thing reads text, but it doesn't understand it.
00:05:51Emotion tags can help, but they still glitch out a lot when you add them. So the honest
00:05:56thing here: it's an incredible tool for experimentation, but long-term, I'm not
00:06:02sure if this is going to hold up. Now, the answer that you actually want to know, is this worth your
00:06:06time compared to what you're already using in your workflows? How well does VibeVoice stack up against
00:06:11the other usual competitors? Let's start with VibeVoice versus Chatterbox. I did a video and played
00:06:16around with Chatterbox in the past. That was honestly really sweet. Chatterbox had sub-200
00:06:22millisecond latency, stronger emotional punch, and better short agent replies. So you'd think
00:06:28Chatterbox just wins, but VibeVoice destroys it on long form. Chatterbox is built for monologues of 30
00:06:35minutes or less, or podcast-style outputs, and VibeVoice handles that long form much better. So it's a give
00:06:42and a take with that. Then of course we have VibeVoice versus ElevenLabs. This one's simple, right? ElevenLabs
00:06:48wins because you have your polished pronunciation, your zero-shot voice cloning, the UX. But where VibeVoice
00:06:54wins is the cost. It's free. It's offline. It's open source, right? That's a huge win here.
00:07:00We're not paying for software. Then you have VibeVoice versus Whisper, or even CosyVoice. It beats Whisper
00:07:06once audio gets long and structured. It's more expressive than CosyVoice, and Qwen-based TTS models
00:07:13are catching up in dialects, but VibeVoice still leads in content length. If you're a dev who builds
00:07:18locally, you like open source, and you care about long-form audio, I think VibeVoice is worth your
00:07:23time. If you want something that's more plug and play production ready, honestly, you can probably
00:07:28skip this for now. It's just a really cool project to play around with, including that voice cloning.
00:07:33VibeVoice is messy, but it's powerful, and it's exciting. It's one of the strongest open
00:07:37source audio stacks we've seen for long-form AI speech in a long time. Try the Hugging Face demo,
00:07:43read some docs, and we'll see you in another video.