I Cloned My Own Voice Using Microsoft’s Open-Source Model

Better Stack

Transcript

00:00:00This is Vibe Voice by Microsoft, and I used it to clone my own voice.
00:00:04An open source speech stack that's already getting compared to 11 Labs, Chatterbox, and Whisper.
00:00:10But it runs offline, and it can generate 90 minutes of multi-speaker audio in one pass.
00:00:1590 minutes or anything close to that sounds a bit wild. So is this actually usable for developers,
00:00:20or is it another research repo that quietly kills our GPUs? I'll run through some demos,
00:00:26and then we're going to see how it compares to others. We have videos coming out all the time,
00:00:29be sure to subscribe.
00:00:31You can get all this from their repo or on Hugging Face. Now before we compare anything,
00:00:40let's just look at the outputs. This is all prepped, set up, running in front,
00:00:45so we can focus on what matters. I've used others, so I'm actually interested to see how Vibe Voice
00:00:51sounds, how it can hold up, and how we get something useful from the outputs.
00:00:56I'm going to do all this as a multi-speaker output, a real-time TTS, then the voice cloning.
00:01:02Here's a short podcast-style script with three speakers, clean turn-taking, and audio emotions.
00:01:08Now what you'd expect from most TTS demos is it sounds decent and then it starts drifting,
00:01:14but just listen to what happens here. Speaker consistency seems to stay solid,
00:01:18and transitions don't actually collapse. Let's take a listen.
00:01:26I mean, it sounds alright, right? It doesn't sound like it's making up context after 20 seconds,
00:01:41right? There we go. That's the big point. Microsoft hasn't just made this for short play projects.
00:01:46It's made for longer context audio generation and offline too. But when adding emotion tags,
00:01:52it starts to fall apart. Unlike Chatterbox, for example, it does auto-emotion based on the words,
00:01:58and that's not actually that great. I didn't like that. Chatterbox still kind of won here.
00:02:02But if you're building things like AI podcasts, narrated docs, long-form agents,
00:02:07or just training data, this might actually do a decent job at those.
00:02:11Now let's switch gears here into real-time mode. This runs a lot faster than multi-speaker,
00:02:16which honestly took a long time to generate. This now is incremental streaming, so think
00:02:22chatbot responses, voice agents, and assistants. First-chunk latency is around 300 milliseconds,
00:02:28which is usable. It's not the fastest I've used. Let's take a listen here.
00:02:32Imagine drinking hot chocolate in Japan under cherry blossoms.
00:02:35Okay. And yes, they say it can sing or even generate background music. If you push it,
00:02:40that didn't work. I pushed it. It didn't work. But the point is here,
00:02:43is this production ready real-time? I don't think so. But for experimentation and agents,
00:02:48yeah, this is pretty good. Now the fun stuff. Let's talk about the voice cloning because that
00:02:53was really, really cool. Here was my setup for that. First, I recorded myself on voice memos.
00:02:58I'm on a Mac. I then converted that file into a WAV file, and I launched Gradio with this command.
00:03:04From this interactive interface, I can then choose my own voice as the target.
00:03:10That's it. Just a normal recording. And what you'd expect is something close to my voice,
00:03:14but obviously fake. Take a listen to this. This is my voice cloned using Vibe Voice.
00:03:19It honestly sounds really good. Almost too good because I didn't say any of this. Now that did
00:03:25sound similar to me, but if you know me, then you'd probably still tell it's a fake. At least I hope
00:03:30so. Now it's not perfect, but it's consistent and stable across longer outputs. That's
00:03:36great. Microsoft says this stack can handle long form generation in a single pass and in practice
00:03:41stays noticeably more stable than Whisper-style pipelines once the audio gets longer, right? And
00:03:47if you've ever tried cloning a voice more than a short clip, you know why this matters. So yeah,
00:03:52the demos were impressive, I guess. I had fun with those, the voice cloning, but I went through the
00:03:56docs, the issues, some threads, and it is a mix from other devs. Now the pros first, then the stuff
00:04:02that you're going to run into. The pros here are solid for the most part. It's long form for sure,
00:04:08right? Most TTS systems drift, they flatten or they break after a few minutes. Vibe Voice is made for
00:04:14longer audio and it showed here and it showed in my longer demos. Then efficiency plus expressiveness.
00:04:20It uses low frame-rate audio tokenizers, which keeps the context more manageable. And diffusion plus an
00:04:27LLM backbone and you get expressive speech without absurd compute. It felt a bit more dev friendly
00:04:33by design, right? This was nice. It's MIT licensed. It runs offline. It runs on consumer GPUs, around seven
00:04:40GB VRAM for real time. And fine tuning code is included, especially for ASR. This is not a
00:04:47locked-down release of any sort, and it's really good. Finally, like some other open-source stacks, it has structured
00:04:53ASR output. Huge win. Speaker diarization plus timestamps out of the box saves a lot of time
00:04:59downstream. If you've built transcription pipelines, you know that that's not a small thing. Now the
00:05:04drawbacks I definitely felt here and I saw them as well. This is kind of just research software.
00:05:11Microsoft pulled some TTS code paths due to deepfake concerns, and that tells you kind of everything. The SDK,
00:05:17it's not a grand slam. It's not polished, right? There are obviously some audio quirks as I found
00:05:23with other software. You might hear some robotic intonation. Sometimes pacing is going to feel off,
00:05:28and multi-speaker scenes degrade beyond two or three people. Devs seem to love the tokenizer and
00:05:33hate the VRAM spikes. And there is only limited language coverage. Chinese and English are
00:05:40great, but if you need any other languages, Vibe Voice is not going to cover them. Lastly, the
00:05:46drawback of zero semantic understanding: this thing reads text, but it doesn't understand it.
00:05:51Emotion tags can help, but they still glitch out a lot if we're adding in those tags. So the honest
00:05:56thing here, it's an incredible tool for experimentation and things, but long-term, I'm not
00:06:02sure if this is going to hold up. Now, the answer that you actually want to know, is this worth your
00:06:06time compared to what you're already using in your workflows? How well does Vibe Voice stack up against
00:06:11the other usual competitors? Let's start with Vibe Voice versus Chatterbox. I did a video and played
00:06:16around with Chatterbox in the past. That was honestly really sweet. Chatterbox had sub-200
00:06:22milliseconds latency, stronger emotional punch and better short agent replies. So you'd think
00:06:28Chatterbox just wins, but Vibe Voice destroys it on long form. Chatterbox is built for 30-minutes-or-less
00:06:35monologues or podcast-style outputs, and Vibe Voice handles the longer form much better. So it's a give
00:06:42and a take with that. Then of course we have Vibe Voice and 11 Labs. This one's simple, right? 11 Labs
00:06:48wins because you have your polished pronunciation, your zero-shot voice cloning, the UX. But where Vibe
00:06:54Voice wins is the cost. It's free. It's offline. It's open source, right? That's a huge win here.
00:07:00We're not paying for software. Then you have Vibe Voice and Whisper, or even CosyVoice. It beats Whisper
00:07:06once audio gets long and structured. It's more expressive than CosyVoice, and Qwen-based TTS models
00:07:13are catching up in dialects, but Vibe Voice still leads in content length. If you're a dev who builds
00:07:18locally, you like open source and you care about long-form audio, I think Vibe Voice is worth your
00:07:23time. If you want something that's more plug and play production ready, honestly, you can probably
00:07:28skip this for now. It's just a really cool project to play around with, including that voice cloning.
00:07:33Vibe voice is messy. It's powerful, but it's also exciting. It's one of the strongest open
00:07:37source audio stacks we've seen for long form AI speech in a long time. Try the Hugging Face demo,
00:07:43read some docs, and we'll see you in another video.

Key Takeaway

Microsoft's Vibe Voice offers a powerful, MIT-licensed open-source solution for developers seeking stable, long-form voice cloning and text-to-speech that runs entirely offline on consumer hardware.

Highlights

Vibe Voice is an open-source speech stack by Microsoft designed for long-form audio generation and offline use.

It supports multi-speaker audio and can generate up to 90 minutes of content in a single pass without the typical 'drift' found in other models.

The system includes a voice cloning feature that allows users to replicate their own voice using simple WAV file recordings and a Gradio interface.

Performance metrics show a first-chunk latency of approximately 300 milliseconds for real-time streaming, which is suitable for experimentation but not yet industry-leading.

Vibe Voice runs on consumer-grade GPUs requiring roughly 7 GB of VRAM and carries an MIT license, making it highly accessible for developers.

The model excels in long-form stability compared to Whisper or 11 Labs, though it currently struggles with complex emotion tagging and a lack of semantic understanding.

Timeline

Introduction to Vibe Voice and Core Features

The speaker introduces Vibe Voice, a new open-source speech stack from Microsoft that is being compared to 11 Labs and Whisper. Its standout feature is the ability to generate up to 90 minutes of multi-speaker audio in a single pass while running entirely offline. This section explores whether the tool is truly usable for developers or if it is just a resource-heavy research project. The speaker outlines the upcoming roadmap for the video, which includes multi-speaker demos, real-time TTS tests, and voice cloning experiments. This introduction sets the stage for a technical evaluation of the tool's performance on consumer GPUs.
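A quick sanity check on that 90-minute, single-pass claim: the context math works out if the acoustic tokenizer runs at a very low frame rate. The 7.5 Hz figure below is the rate reported for Vibe Voice (treat it as an assumption here), and 50 Hz stands in for a typical higher-rate codec:

```python
# Rough context-length math for long-form TTS.
# ASSUMPTION: a ~7.5 Hz acoustic tokenizer (the low frame rate reported
# for Vibe Voice); 50 Hz is just a typical higher-rate codec for contrast.

def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Audio tokens the LLM backbone must keep in context."""
    return int(minutes * 60 * frame_rate_hz)

low = frames_for(90, 7.5)    # 90 minutes at 7.5 Hz -> 40,500 tokens
high = frames_for(90, 50.0)  # same audio at 50 Hz  -> 270,000 tokens
print(low, high)
```

At roughly 40k tokens, 90 minutes of audio fits comfortably inside a modern LLM context window, which is what makes the single pass plausible.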

Multi-Speaker Performance and Long-Form Stability

This segment focuses on a podcast-style script involving three distinct speakers to test turn-taking and consistency. Unlike many TTS systems that lose coherence or 'drift' after 20 seconds, Vibe Voice maintains speaker identity and transitions effectively over longer durations. The speaker notes that while the audio quality is solid for long-form projects like AI podcasts or narrated documents, the automatic emotion tagging is a weak point. Compared to Chatterbox, which handles emotions more naturally based on word context, Vibe Voice's emotional output can feel disjointed or fall apart. Overall, it is presented as a specialized tool for length and stability rather than pure emotional nuance.
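Clean turn-taking in the output starts with a well-formed input script. As a sketch, assuming a simple "Speaker N: line" convention (a common format for multi-speaker TTS demos, not a documented Vibe Voice requirement), you can validate the turns before spending GPU time on generation:

```python
import re

# Parse a podcast-style script into (speaker, text) turns.
# ASSUMPTION: the "Speaker N:" line convention is ours, not an
# official Vibe Voice input schema.
TURN = re.compile(r"^Speaker (\d+):\s*(.+)$")

def parse_script(script: str) -> list[tuple[int, str]]:
    turns = []
    for line in script.strip().splitlines():
        m = TURN.match(line.strip())
        if not m:
            raise ValueError(f"unrecognized line: {line!r}")
        turns.append((int(m.group(1)), m.group(2)))
    return turns

demo = """
Speaker 1: Welcome back to the show.
Speaker 2: Glad to be here.
Speaker 3: Let's get into it.
Speaker 1: First topic: open-source TTS.
"""
turns = parse_script(demo)
```

Failing fast on a malformed line is cheaper than discovering mid-generation that two turns got merged into one speaker.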

Real-Time Streaming and Technical Demos

The video transitions to 'real-time mode,' which utilizes incremental streaming to power chatbot responses and voice assistants. The speaker measures a first-chunk latency of about 300 milliseconds, noting it is usable but not the fastest on the market. An attempt to test the model's ability to sing or generate background music fails, suggesting limits to its versatility. This section highlights that while the tool is excellent for building experimental agents, it may not yet be 'production ready' for high-stakes real-time applications. It serves as a bridge between high-quality offline generation and the needs of interactive AI assistants.
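The ~300 ms number is a time-to-first-chunk measurement, which is easy to reproduce against any streaming endpoint. A minimal sketch with a stand-in generator (`fake_stream_tts` is a placeholder for illustration, not a Vibe Voice API):

```python
import time
from typing import Iterator

def fake_stream_tts(text: str) -> Iterator[bytes]:
    """Stand-in for a streaming TTS call; real code would yield audio
    chunks from the model. The 0.05 s sleep simulates model latency."""
    for word in text.split():
        time.sleep(0.05)
        yield word.encode()

def first_chunk_latency(stream: Iterator[bytes]) -> float:
    """Seconds until the first audio chunk arrives (time to first byte)."""
    start = time.perf_counter()
    next(stream)
    return time.perf_counter() - start

latency = first_chunk_latency(fake_stream_tts("hello there world"))
print(f"first-chunk latency: {latency * 1000:.0f} ms")
```

Measuring first-chunk latency rather than total generation time is what matters for voice agents, since playback can begin as soon as the first chunk lands.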

Voice Cloning Process and Results

The speaker demonstrates the voice cloning workflow, starting with a simple voice memo recording on a Mac. This file is converted to a WAV format and loaded into a Gradio interface where the user can select their own voice as the target. The resulting audio is described as surprisingly good and stable, though the speaker admits a close friend might still identify it as a fake. This section emphasizes the 'single pass' advantage, which prevents the audio from breaking down during extended clips. It highlights the practical utility of Vibe Voice for creators who need consistent, long-form clones without the cost of proprietary platforms.
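The only preprocessing in that workflow is producing a clean WAV reference clip. Here is a sketch using only the Python standard library, with a synthetic tone standing in for the Voice Memos recording; the mono/16-bit/24 kHz target format is an assumption, so check the model's docs for the exact requirement:

```python
import math
import struct
import wave

# Write a mono 16-bit PCM WAV, the format most TTS demos accept.
# ASSUMPTION: 24 kHz target rate. To convert a Voice Memos .m4a you
# would normally shell out to ffmpeg first, e.g.:
#   ffmpeg -i memo.m4a -ac 1 -ar 24000 reference.wav

RATE = 24_000

def write_wav(path: str, samples: list[float]) -> None:
    """samples are floats in [-1, 1]; scaled to 16-bit signed PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(RATE)
        f.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples))

# One second of a 440 Hz tone as placeholder "recording" data.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / RATE) for n in range(RATE)]
write_wav("reference.wav", tone)
```

A clean, correctly sampled reference clip matters more for clone quality than clip length, which is why a plain voice memo works at all.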

Pros, Cons, and Technical Requirements

This critical evaluation covers the technical advantages and drawbacks encountered by developers using the stack. On the positive side, it features an MIT license, runs offline with 7 GB of VRAM, and includes fine-tuning code for Automatic Speech Recognition (ASR). However, the speaker warns that the software is still 'research-grade' and lacks a polished SDK, with some code paths removed by Microsoft due to deepfake concerns. Significant drawbacks include limited language support (mostly English and Chinese) and a lack of semantic understanding that causes issues with emotional inflection. This section provides the necessary context for developers to understand the hardware and software limitations before diving into the repo.
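One concrete downstream win from getting diarization plus timestamps out of the box: merging consecutive same-speaker segments into readable turns. The `Segment` shape below is a hypothetical stand-in, not Vibe Voice's actual output schema:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float
    text: str

def merge_turns(segments: list[Segment], max_gap: float = 0.5) -> list[Segment]:
    """Merge consecutive segments from the same speaker when the pause
    between them is at most max_gap seconds."""
    turns: list[Segment] = []
    for seg in segments:
        last = turns[-1] if turns else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            last.end = seg.end
            last.text += " " + seg.text
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns

segs = [
    Segment("A", 0.0, 1.2, "Hey,"),
    Segment("A", 1.3, 2.0, "welcome back."),
    Segment("B", 2.4, 3.1, "Thanks!"),
]
merged = merge_turns(segs)
```

When the model emits speakers and timestamps directly, this kind of post-processing is all that stands between raw output and a usable transcript.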

Comparative Analysis and Final Verdict

The final section compares Vibe Voice against competitors like Chatterbox, 11 Labs, and Whisper. While 11 Labs offers better UX and pronunciation, Vibe Voice wins on cost (free) and the privacy of offline operation. It outperforms Whisper and Cozy Voice specifically in the context of audio length and structural stability. The speaker concludes that Vibe Voice is a 'messy but powerful' project best suited for developers who prioritize open-source flexibility and long-form content. The video ends with a recommendation to try the Hugging Face demo and explore the documentation for further experimentation.
