I Tried the Open Source ElevenLabs Alternative (Voicebox)
Better Stack
Transcript
00:00:00They say this is the Ollama of voice AI. It clones voices, generates speech, dictates into any app,
00:00:07and talks to agents in voices you actually own. This is Voicebox, and that's what it says
00:00:13right here. It's a free, local alternative to ElevenLabs, and honestly, this was insane.
00:00:19It has around 30,000 stars on GitHub. It runs locally, and in the next 60 seconds,
00:00:24I'm going to show you cloning, local voice generation, and dictation inside an editor.
00:00:29How useful is this for us, and how easy is it to get going in the first place? We're about to find out.
00:00:39Now, Voicebox is an open-source local AI voice studio. The simple way to think about it is this.
00:00:46Ollama is for local text models. Voicebox is trying to be that for voice. So it's not just text-to-speech.
00:00:54It does voice cloning, system-wide dictation, creative editing, and it even has stories and
00:01:00timelines, and it connects to AI agents. So this gives us real control and even more privacy.
00:01:06I want to build things without asking, how many credits did I just use to test this? Voicebox
00:01:12doesn't ask that, because Voicebox runs on our machine. So there's no subscription. There are no
00:01:17character limits. Plus, it brings together cloning, Whisper-powered dictation, a multi-track editor,
00:01:23a Tauri desktop app, MCP support, and a local REST API. So instead of five separate tools,
00:01:29you get one desktop app with everything right here. I'm going to do three things here in this video.
00:01:36I'm going to clone a voice, I'm going to make it speak, and then I'm going to use dictation inside the
00:01:41editor. After that, I'll show you why the agent integration is actually super sick, or at least
00:01:46we're going to talk about it. If you enjoy coding tools that speed up your workflow, be sure to
00:01:50subscribe. We have videos coming out all the time. All right, now I'm running this on my Mac M4.
00:01:55Here is Voicebox. I already have a voice profile ready, but the flow was really simple. Now you can
00:02:02spin this up with Docker, yes, but I did that, and it took nearly 30 minutes to get the containers going.
00:02:08So for this, I opted instead to get the desktop app, which was way faster, and it's honestly really
00:02:13good. I can name the audio here. I can add a description and even tell it how to act with the
00:02:19models. Then I can either record myself speaking or upload a short file for it to analyze while also
00:02:26dropping in the transcription of that audio. Now I'll type a line that I would actually want to use. So
00:02:32maybe, "As a developer, this gives me complete control over voice AI without cloud costs and all that privacy
00:02:38stuff." I'll choose my voice profile. I can choose the model I want and hit
00:02:44generate. Now the first run of this is going to have to download the model. So it might actually take
00:02:50some time, but after all that, once we've run it, we get waveforms. Let's take a listen.
00:02:57"As a developer, this gives me complete control over voice AI without cloud costs and all that privacy
00:03:02stuff." That audio was generated locally on my machine, and I cloned my own voice. There was no browser tab.
00:03:09I didn't need API keys. But here's the part that makes this feel like a real workflow: the system-wide
00:03:16dictation. I could hit a global hotkey and say whatever I'm thinking in the moment: "If you like
00:03:22finding coding tools and tricks like this, check out our channel." Now it lands directly inside my editor.
00:03:29So, I mean, that was pretty useful for notes, comments, or anything like that.
00:03:33But all these little moments where talking is actually faster than typing, that's huge. This
00:03:38is not only for you talking to the computer. Your agents could actually talk back now.
00:03:43Claude Code, Cursor, or your own local agent can trigger speech through Voicebox
00:03:49instead of only just dumping text into your terminal. We're already getting feedback from our AIs.
00:03:55Why not have it speak to us? Now let's compare this with tools we already know.
00:03:59For obvious reasons, right, we have ElevenLabs. ElevenLabs is great. Bravo. I've done comparisons on that
00:04:05before. It's hosted. We know the quality is amazing. But then again, right, it's cloud-based. It's
00:04:11subscription-driven. So we're paying for that. We're putting our stuff up in the cloud.
00:04:16Voicebox is the complete opposite of that. Why? Well, it's local. It's free. It's unlimited. We
00:04:22control all that data going into it. ElevenLabs may still win if you're using it all day,
00:04:27but I think I'll be keeping Voicebox, as I loved how easy it was. And honestly, it sounds really decent
00:04:33too. For us devs, the best tool is not always the one with the prettiest output. We don't actually
00:04:38care about that a lot of the time. Sometimes it's the one you can actually control. Then there's the
00:04:43whole open source side. You could already use tools like Piper, Whisper, and a bunch of separate scripts.
00:04:50But again, the key thing there, guys, is they're all separate, right? We have one tool for transcription,
00:04:56one for cloning, one for TTS, one for UI, all this stuff that we're really just smushing together.
00:05:03Voicebox packages the whole workflow into one studio app. Input, output, editing, profiles,
00:05:09dictation, agent integration, and heck, you could also use the MCP server. Like I said,
00:05:14that means Claude or Cursor can call Voicebox like a tool. Instead of your agent only replying
00:05:20with text, it now speaks back to you. But do you want to hear yourself speak back to you? I don't
00:05:25know. Maybe change the voice for that. But imagine your coding agent saying, "Build failed. Three
00:05:30tests broke in the auth module." That sounds unreal until you realize how many times a day you're already
00:05:36getting feedback from your tools. Voicebox just gives those updates an actual voice.
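To make the agent idea concrete, here is a minimal sketch of how a script or agent hook might turn a build result into a spoken update through a local REST API. The endpoint URL, the JSON field names (`text`, `profile`), and the profile name `my-voice` are hypothetical placeholders for illustration, not Voicebox's documented API, so check the project's own API reference before relying on any of them. The sketch only prepares the request and does not send it.

```python
import json
import urllib.request

# ASSUMPTION: this URL and the JSON field names below are hypothetical
# placeholders, not Voicebox's documented API. Replace them with the real
# values from the project's API reference.
VOICEBOX_URL = "http://localhost:8000/generate"


def summarize_build(failed_tests: int, module: str) -> str:
    """Turn a build result into a short sentence suitable for speech."""
    if failed_tests == 0:
        return "Build passed."
    noun = "test" if failed_tests == 1 else "tests"
    return f"Build failed. {failed_tests} {noun} broke in the {module} module."


def build_speech_request(text: str, profile: str) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST asking a local server to speak."""
    body = json.dumps({"text": text, "profile": profile}).encode("utf-8")
    return urllib.request.Request(
        VOICEBOX_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_speech_request(summarize_build(3, "auth"), "my-voice")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # Sending is left out on purpose; with a running local server it would be:
    # urllib.request.urlopen(request)
```

Keeping the message formatting separate from the request construction means the same summary could be routed through an MCP tool call instead of REST without touching the wording.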
00:05:42So why did I like this one so much compared to others? Well, okay, privacy and cost. Honestly,
00:05:48those are the really big wins, at least for me. Those are easy wins. For voice samples, audio,
00:05:53internal content, or anything really sensitive, local first is what we want. It's great.
00:05:57Then there's the agent integration, which I didn't put into the full test here, but devs are already
00:06:02talking about it as they're integrating it into Claude Code and Cursor. Voicebox gives those systems
00:06:08a voice layer without needing a hosted speech provider. The workflow was pretty neat. I like
00:06:14that it's all in a UI that we can control. It's really easy. And especially if you're on Apple Silicon,
00:06:18local performance is one of the reasons that this felt so good. But here's the
00:06:23thing to keep in mind with all of this. It dropped this year. It's still early. So there's
00:06:28going to be problems. Some users are going to hit rough spots, especially on Windows, around
00:06:33GPU detection, model setup, and exports. If this happens, just restart the app. I had the issue
00:06:39on my Mac, and restarting fixed it. Long-form consistency can also still fall behind ElevenLabs.
00:06:46Emotion control is improving, but that depends on the model you choose. If you choose
00:06:50Chatterbox TTS Turbo, we then have those emotions built in.
00:06:55So should you install Voicebox? Honestly, it was super easy. It's absolutely worth trying
00:07:00because it takes away a lot of that friction that we have from workflows that we're just
00:07:04really piecing together. The main value is not just voice quality. It's really the control
00:07:09that we're given here. It's control over data, control over costs, over integration. That's
00:07:15why this all really matters. Now, getting started was dead simple. A monkey could do it. Go to
00:07:20the Voicebox website or GitHub releases, download the installer for your platform, launch the app,
00:07:25and then pull the local models that you need. But the whole core idea here is really strong,
00:07:30and it's already useful enough to actually install. If you enjoy coding tools like this,
00:07:35be sure to subscribe to the Better Stack channel. We'll see you in another video.