I Tried the Open Source ElevenLabs Alternative (Voicebox)

Transcript

00:00:00They say this is the Ollama of voice AI. It clones voices, generates speech, dictates into any app,

00:00:07and talks to agents in voices you actually own. This is VoiceBox, and that's what it says

00:00:13right here. It's a free, local alternative to ElevenLabs, and honestly, this was insane.

00:00:19It has around 30,000 stars on GitHub. It runs locally, and in the next 60 seconds,

00:00:24I'm going to show you cloning, local voice generation, and dictation inside an editor.

00:00:29How useful is this for us, and how easy is it to get going in the first place? We're about to find out.

00:00:39Now, VoiceBox is an open-source local AI voice studio. The simple way to think about it is this.

00:00:46Ollama is for local text models. VoiceBox is trying to be that for voice. So it's not just text-to-speech.

00:00:54It does voice cloning, system-wide dictation, creative editing, and it even has stories and

00:01:00timelines, and it connects to AI agents. So this gives us real control and even more privacy.

00:01:06I want to build things without asking, how many credits did I just use to test this? VoiceBox

00:01:12doesn't ask that, because VoiceBox runs on our machine. So there's no subscription. There are no

00:01:17character limits. Plus, it brings together cloning, Whisper-powered dictation, a multi-track editor,

00:01:23a Tauri desktop app, MCP support, and a local REST API. So instead of five separate tools,

00:01:29you get one desktop app with everything right here. I'm going to do three things here in this video.

00:01:36I'm going to clone a voice, I'm going to make it speak, and then I'm going to use dictation inside the

00:01:41editor. After that, I'll show you why the agent integration is actually super sick, or at least

00:01:46we're going to talk about it. If you enjoy coding tools that speed up your workflow, be sure to

00:01:50subscribe. We have videos coming out all the time. All right, now I'm running this on my Mac M4.

00:01:55Here is VoiceBox. I already have a voice profile ready, but the flow was really simple. Now you can

00:02:02spin this up with Docker, yes, but I did that, and it took nearly 30 minutes to get the containers going.

00:02:08So for this, I opted instead to get the desktop app, which was way faster, and it's honestly really

00:02:13good. I can name the audio here. I can add a description and even tell it how to act with the

00:02:19models. Then I can either record myself speaking or upload a short file for it to analyze while also

00:02:26dropping in the transcription of that audio. Now I'll type a line that I would actually want to use. So

00:02:32maybe as a developer, this gives me complete control over voice AI without cloud costs and all that privacy

00:02:38stuff. I'll choose my voice profile. I can choose the model I want and hit

00:02:44generate. Now the first run of this is going to have to download the model. So it might actually take

00:02:50some time, but after all that, and we've run it, we get waveforms. Let's take a listen.

00:02:57As a developer, this gives me complete control over voice AI without cloud costs and all that privacy

00:03:02stuff. That audio was generated locally from my machine and I cloned my own voice. There was no browser tab.

00:03:09I didn't need API keys, but here's the part that feels like this is a real workflow. The system-wide

00:03:16dictation. I could hit a global hotkey and I could say whatever I'm thinking in the moment. If you like

00:03:22finding coding tools and tricks like this, check out our channel. Now it lands directly inside my editor.

00:03:29So, I mean, that was pretty useful for notes, comments, or anything like that.

00:03:33But all these little moments where talking is actually faster than typing, that's huge. This

00:03:38is not only for you talking to the computer. Your agents could actually talk back now.

00:03:43Claude Code, Cursor, or your own local agent can trigger speech through VoiceBox

00:03:49instead of only just dumping it into your terminal. We're already getting feedback from our AIs.

00:03:55Why not have it speak to us? Now let's compare this with tools we already know.

00:03:59For obvious reasons, right, we have ElevenLabs. ElevenLabs is great. Bravo. I've done comparisons on that

00:04:05before. It's hosted. We know the quality is amazing. But then again, right, it's cloud-based. It's

00:04:11subscription-driven. So we're paying for that. We're putting our stuff up in the cloud.

00:04:16VoiceBox is the complete opposite of that. Why? Well, it's local. It's free. It's unlimited. We

00:04:22control all that data going into it. ElevenLabs may still win if you're using it all day,

00:04:27but I think I'll be keeping VoiceBox as I loved how easy it was. And honestly, it sounds really decent

00:04:33too. For us devs, the best tool is not always the one with the prettiest output. We don't actually

00:04:38care about that a lot of the time. Sometimes it's the one you can actually control. Then there's the

00:04:43whole open source side. You could already use tools like Piper, Whisper, and a bunch of separate scripts.

00:04:50But again, the key thing there, guys, is they're all separate, right? We have one tool for transcription,

00:04:56one for cloning, one for TTS, one for UI, all this stuff that we're really just smushing together.

00:05:03VoiceBox packages the whole workflow into one studio app. Input, output, editing, profiles,

00:05:09dictation, agent integration, and heck, you could also use the MCP server. Like I said,

00:05:14that means Claude or Cursor can call VoiceBox like a tool instead of your agent only replying

00:05:20with text. It now speaks back to you. But do you want to hear yourself speak back to you? I don't

00:05:25know. Maybe change the voice for that. But imagine your coding agent saying, build failed. Three test

00:05:30modules broke the auth module. That sounds not real until you realize how many times a day you're already

00:05:36getting feedback from your tools. VoiceBox just gives those updates an actual voice.

00:05:42So why did I like this one so much compared to others? Well, okay, privacy and cost. Honestly,

00:05:48those are the really big wins, at least for me. Those are easy wins. For voice samples, audio,

00:05:53internal content, or anything really sensitive, local first is what we want. It's great.

00:05:57Then there's the agent integration, which I didn't put into the full test here, but devs are already

00:06:02talking about it as they're integrating it into Claude Code and Cursor. VoiceBox gives those systems

00:06:08a voice layer without needing a hosted speech provider. The workflow was pretty neat. I like

00:06:14that it's all in a UI that we can control. It's really easy. And if you're on Apple Silicon

00:06:18especially, local performance is one of the reasons that this felt so good. But here's the

00:06:23thing to keep in mind with all of this. It dropped this year. It's still early. So there's

00:06:28going to be problems. Some users are going to hit rough spots, especially on Windows, around

00:06:33GPU detection, model setup, and exports. If this happens, just restart the app. I had the issue

00:06:39on my Mac. Restarting it fixed this. Long-form consistency can also still fall behind ElevenLabs.

00:06:46Emotion control is improving, but that depends on the model you choose. If you choose

00:06:50Chatterbox TTS Turbo, we then have those emotions built in.

00:06:55So should you install VoiceBox? Honestly, it was super easy. It's absolutely worth trying

00:07:00because it takes away a lot of that friction that we have from workflows that we're just

00:07:04really piecing together. The main value is not just voice quality. It's really the control

00:07:09that we're given here. It's control over data, control over costs, over integration. That's

00:07:15why this all really matters. Now, getting started was dead simple. A monkey could do it. Go to

00:07:20the VoiceBox website or GitHub releases, download the installer for your platform, launch the app,

00:07:25and then pull the local models that you need. But the whole core idea here is really strong,

00:07:30and it's already useful enough to actually install. If you enjoy coding tools like this,

00:07:35be sure to subscribe to the BetterStack channel. We'll see you in another video.

Key Takeaway

VoiceBox provides a private, free, and local AI voice studio that centralizes cloning, transcription, and agent integration without relying on cloud-based subscriptions.

Highlights

VoiceBox serves as a local, open-source alternative to cloud-based voice AI services like ElevenLabs.
The desktop application integrates voice cloning, system-wide dictation, multi-track editing, and local API support into a single interface.
Running models locally eliminates subscription fees, character limits, and external cloud data dependencies.
Integration via MCP (Model Context Protocol) allows coding agents like Claude or Cursor to provide audible, voice-based feedback.
The software runs on devices with Apple Silicon, leveraging local hardware for performance.

Timeline

VoiceBox Features and Purpose

VoiceBox functions as a local, open-source AI voice studio.
It replaces multiple separate tools with one centralized desktop application.
The software operates entirely on local hardware, ensuring data privacy.

VoiceBox aims to do for voice AI what Ollama does for local text models. It consolidates voice cloning, Whisper-powered dictation, and multi-track editing. Because it runs locally, users avoid subscription costs and character limits while maintaining control over sensitive audio data.

Practical Workflow and Performance

Setting up via the desktop app is faster than using Docker containers.
Voice profiles are created by recording or uploading audio and providing a corresponding transcript.
System-wide dictation allows users to convert speech into text directly within editors using a global hotkey.

On a Mac M4, the desktop application proves efficient for cloning voices and generating speech. Once a model downloads, the software produces waveforms locally without needing API keys or browser tabs. The dictation feature captures thoughts quickly for code comments or notes.
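
The generate step described above (enter text, pick a voice profile, pick a model) could in principle also be driven through the local REST API mentioned in the transcript. The sketch below only builds such a request and does not send it. The port, the `/generate` path, and the JSON field names are assumptions made for illustration, not VoiceBox's documented API, so check the project's own API reference before relying on any of them.

```python
import json
import urllib.request

# Assumption: the base URL, path, and field names below are hypothetical
# placeholders, not taken from VoiceBox's documentation.
BASE_URL = "http://127.0.0.1:8000"


def build_generate_request(text: str, profile: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the UI's generate step."""
    payload = {"text": text, "profile": profile, "model": model}
    return urllib.request.Request(
        url=f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request(
    text="As a developer, this gives me complete control over voice AI.",
    profile="my-voice",      # hypothetical profile name
    model="example-model",   # hypothetical model identifier
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` once the real endpoint and fields are known. Keeping the request construction separate makes the assumed parts easy to swap out.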

Agent Integration and Comparisons

VoiceBox provides an alternative to ElevenLabs for those prioritizing local control over cloud-hosted convenience.
MCP support enables coding agents to speak feedback directly to the user.
The software streamlines workflows that previously required stitching together disparate tools like Piper and Whisper.

While ElevenLabs offers high-quality, hosted results, VoiceBox provides unlimited, private usage. Developers can use the MCP server to make tools like Claude Code or Cursor trigger voice output, turning terminal-only feedback into spoken updates.
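
For context on what "calling VoiceBox like a tool" means: MCP clients invoke server tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message for a build-failure update in the spirit of the one imagined in the transcript. The tool name `speak` and its `text` and `voice` arguments are hypothetical. The real names are whatever VoiceBox's MCP server advertises when the client lists its tools.

```python
import json
from typing import Any


def build_status_message(failed_tests: int, module: str) -> str:
    """Turn a build result into a short sentence suitable for speech."""
    noun = "test" if failed_tests == 1 else "tests"
    return f"Build failed. {failed_tests} {noun} broke in the {module} module."


def build_tool_call(request_id: int, text: str, voice: str) -> dict[str, Any]:
    """Build an MCP tools/call request (JSON-RPC 2.0).

    Assumption: the tool name "speak" and its argument names are
    hypothetical, chosen only for illustration.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "speak",
            "arguments": {"text": text, "voice": voice},
        },
    }


message = build_tool_call(1, build_status_message(3, "auth"), "narrator")
print(json.dumps(message, indent=2))
```

In practice the agent host (Claude Code, Cursor) constructs and sends this message itself once the MCP server is configured. The sketch only shows the shape of what travels over the wire.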

Setup and Considerations

The software is free and requires only a download from the website or GitHub releases.
Windows users may encounter occasional GPU detection or export issues.
Restarting the application is the suggested fix for these issues; it resolved the one the presenter hit on a Mac.

VoiceBox is in its early stages, so users might hit rough spots during setup, and long-form consistency can still fall behind ElevenLabs. Emotion control depends on the model chosen; Chatterbox TTS Turbo, for example, has emotions built in. For those seeking privacy and integration, it is a viable tool.
