I Tried the Open Source ElevenLabs Alternative (Voicebox)

Better Stack
Computing/Software, Consumer Electronics, Internet Technology

Transcript

00:00:00They say this is the Ollama of voice AI. It clones voices, generates speech, dictates into any app,
00:00:07and talks to agents in voices you actually own. This is VoiceBox, and that's what it says
00:00:13right here. It's a free, local alternative to ElevenLabs, and honestly, this was insane.
00:00:19It has around 30,000 stars on GitHub. It runs locally, and in the next 60 seconds,
00:00:24I'm going to show you cloning, local voice generation, and dictation inside an editor.
00:00:29How useful is this for us, and how easy is it to get going in the first place? We're about to find out.
00:00:39Now, VoiceBox is an open-source local AI voice studio. The simple way to think about it is this.
00:00:46Ollama is for local text models. VoiceBox is trying to be that for voice. So it's not just text-to-speech.
00:00:54It does voice cloning, system-wide dictation, creative editing, and it even has stories and
00:01:00timelines, and it connects to AI agents. So this gives us real control and even more privacy.
00:01:06I want to build things without asking, how many credits did I just use to test this? VoiceBox
00:01:12doesn't ask that, because VoiceBox runs on our machine. So there's no subscription. There's no
00:01:17character limits. Plus, it brings together cloning, Whisper-powered dictation, a multi-track editor,
00:01:23a Tauri desktop app, MCP support, and a local REST API. So instead of five separate tools,
00:01:29you get one desktop app with everything right here. I'm going to do three things here in this video.
00:01:36I'm going to clone a voice, I'm going to make it speak, and then I'm going to use dictation inside the
00:01:41editor. After that, I'll show you why the agent integration is actually super sick, or at least
00:01:46we're going to talk about it. If you enjoy coding tools that speed up your workflow, be sure to
00:01:50subscribe. We have videos coming out all the time. All right, now I'm running this on my Mac M4.
00:01:55Here is VoiceBox. I already have a voice profile ready, but the flow was really simple. Now you can
00:02:02spin this up with Docker, yes, but I did that, and it took nearly 30 minutes to get the containers going.
00:02:08So for this, I opted instead to get the desktop app, which was way faster, and it's honestly really
00:02:13good. I can name the audio here. I can add a description and even tell it how to act with the
00:02:19models. Then I can either record myself speaking or upload a short file for it to analyze while also
00:02:26dropping in the transcription of that audio. Now I'll type a line that I would actually want to use. So
00:02:32maybe as a developer, this gives me complete control over voice AI without cloud costs and all that privacy
00:02:38stuff. I'll choose my voice profile. I can choose the model I want and hit
00:02:44generate. Now the first run of this is going to have to download the model. So it might actually take
00:02:50some time, but after all that, and we've run it, we get waveforms. Let's take a listen.
00:02:57As a developer, this gives me complete control over voice AI without cloud costs and all that privacy
00:03:02stuff. That audio was generated locally from my machine and I cloned my own voice. There was no browser tab.
00:03:09I didn't need API keys, but here's the part that feels like this is a real workflow. The system-wide
00:03:16dictation. I could hit a global hotkey and I could say whatever I'm thinking in the moment. If you like
00:03:22finding coding tools and tricks like this, check out our channel. Now it lands directly inside my editor.
00:03:29So, I mean, that was pretty useful for notes, comments, or anything like that.
00:03:33But all these little moments where talking is actually faster than typing, that's huge. This
00:03:38is not only for you talking to the computer. Your agents could actually talk back now.
00:03:43Claude Code, Cursor, or your own local agent can trigger speech through VoiceBox
00:03:49instead of only just dumping it into your terminal. We're already getting feedback from our AIs.
00:03:55Why not have it speak to us? Now let's compare this with tools we already know.
00:03:59For obvious reasons, right, we have ElevenLabs. ElevenLabs is great. Bravo. I've done comparisons on that
00:04:05before. It's hosted. We know the quality is amazing. But then again, right, it's cloud-based. It's
00:04:11subscription-driven. So we're paying for that. We're putting our stuff up in the cloud.
00:04:16VoiceBox is the complete opposite of that. Why? Well, it's local. It's free. It's unlimited. We
00:04:22control all that data going into it. ElevenLabs may still win if you're using it all day,
00:04:27but I think I'll be keeping VoiceBox as I loved how easy it was. And honestly, it sounds really decent
00:04:33too. For us devs, the best tool is not always the one with the prettiest output. We don't actually
00:04:38care about that a lot of the time. Sometimes it's the one you can actually control. Then there's the
00:04:43whole open source side. You could already use tools like Piper, Whisper, and a bunch of separate scripts.
00:04:50But again, the key thing there, guys, is they're all separate, right? We have one tool for transcription,
00:04:56one for cloning, one for TTS, one for UI, all this stuff that we're really just smushing together.
00:05:03VoiceBox packages the whole workflow into one studio app. Input, output, editing, profiles,
00:05:09dictation, agent integration, and heck, you could also use the MCP server. Like I said,
00:05:14that means Claude or Cursor can call VoiceBox like a tool instead of your agent only replying
00:05:20with text. It now speaks back to you. But do you want to hear yourself speak back to you? I don't
00:05:25know. Maybe change the voice for that. But imagine your coding agent saying, build failed. Three test
00:05:30modules broke the auth module. That sounds not real until you realize how many times a day you're already
00:05:36getting feedback from your tools. VoiceBox just gives those updates an actual voice.
00:05:42So why did I like this one so much compared to others? Well, okay, privacy and cost. Honestly,
00:05:48those are the really big wins, at least for me. Those are easy wins. For voice samples, audio,
00:05:53internal content, or anything really sensitive, local first is what we want. It's great.
00:05:57Then there's the agent integration, which I didn't put into the full test here, but devs are already
00:06:02talking about it as they're integrating it into Claude Code and Cursor. VoiceBox gives those systems
00:06:08a voice layer without needing a hosted speech provider. The workflow was pretty neat. I like
00:06:14that it's all in a UI that we can control. It's really easy. And if you're on Apple Silicon
00:06:18especially, local performance is one of the reasons that this felt so good. But here's the
00:06:23thing to keep in mind with all of this. It dropped this year. It's still early. So there's
00:06:28going to be problems. Some users are going to hit rough spots if you're on Windows, especially around
00:06:33GPU detection, model setup, and exports. If this happens, just restart the app. I had the issue
00:06:39on my Mac. Restarting it fixed this. Long-form consistency can also still fall behind ElevenLabs.
00:06:46Emotion control is improving, but that depends on the model you choose. If you choose
00:06:50Chatterbox TTS Turbo, we then have those emotions built in.
00:06:55So should you install VoiceBox? Honestly, it was super easy. It's absolutely worth trying
00:07:00because it takes away a lot of that friction that we have from workflows that we're just
00:07:04really piecing together. The main value is not just voice quality. It's really the control
00:07:09that we're given here. It's control over data, control over costs, over integration. That's
00:07:15why this all really matters. Now, getting started was dead simple. A monkey could do it. Go to the
00:07:20VoiceBox website or GitHub releases, download the installer for your platform, launch the app,
00:07:25and then pull the local models that you need. But the whole core idea here is really strong,
00:07:30and it's already useful enough to actually install. If you enjoy coding tools like this,
00:07:35be sure to subscribe to the Better Stack channel. We'll see you in another video.

Key Takeaway

VoiceBox provides a private, free, and local AI voice studio that centralizes cloning, transcription, and agent integration without relying on cloud-based subscriptions.

Highlights

  • VoiceBox serves as a local, open-source alternative to cloud-based voice AI services like ElevenLabs.

  • The desktop application integrates voice cloning, system-wide dictation, multi-track editing, and local API support into a single interface.

  • Running models locally eliminates subscription fees, character limits, and external cloud data dependencies.

  • Integration via MCP (Model Context Protocol) allows coding agents like Claude or Cursor to provide audible, voice-based feedback.

  • Local performance is a particular strength on Apple Silicon; the demo in the video runs on an M4 Mac.
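
To make the "local API support" highlight more concrete, here is a minimal sketch of how a script might prepare a speech-generation request for a server running on the same machine. The video does not show the API, so the base URL, the `/generate` path, and the `text` and `profile` field names are all assumptions made for illustration; the project's own API documentation is the place to find the real ones. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# ASSUMPTION: the base URL, path, and JSON field names below are hypothetical
# placeholders, not VoiceBox's documented API.
BASE_URL = "http://127.0.0.1:8000"


def build_generate_request(
    text: str, profile: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST asking a local server for speech."""
    if not text.strip():
        raise ValueError("text must not be empty")
    payload = {"text": text, "profile": profile}  # hypothetical field names
    return urllib.request.Request(
        url=f"{base_url}/generate",  # hypothetical endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_generate_request("Build finished.", "my-voice")
    print(req.get_method(), req.full_url)
    # To actually send it, once the real endpoint is known:
    # urllib.request.urlopen(req, timeout=60)
```

Because the request targets a loopback address, no API key or cloud account is involved, which is the point the highlights make about local usage.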

Timeline

VoiceBox Features and Purpose

  • VoiceBox functions as a local, open-source AI voice studio.
  • It replaces multiple separate tools with one centralized desktop application.
  • The software operates entirely on local hardware, ensuring data privacy.

VoiceBox aims to do for voice AI what Ollama does for local text models. It consolidates voice cloning, Whisper-powered dictation, and multi-track editing. Because it runs locally, users avoid subscription costs and character limits while maintaining control over sensitive audio data.

Practical Workflow and Performance

  • Setting up via the desktop app is faster than using Docker containers.
  • Voice profiles are created by recording or uploading audio and providing a corresponding transcript.
  • System-wide dictation allows users to convert speech into text directly within editors using a global hotkey.

On a Mac M4, the desktop application proves efficient for cloning voices and generating speech. Once a model downloads, the software produces waveforms locally without needing API keys or browser tabs. The dictation feature captures thoughts quickly for code comments or notes.

Agent Integration and Comparisons

  • VoiceBox provides an alternative to ElevenLabs for those prioritizing local control over cloud-hosted convenience.
  • MCP support enables coding agents to speak feedback directly to the user.
  • The software streamlines workflows that previously required stitching together disparate tools like Piper and Whisper.

While ElevenLabs offers high-quality, hosted results, VoiceBox provides unlimited, private usage. Developers can use the MCP server to make tools like Claude Code or Cursor trigger voice output, turning terminal-only feedback into spoken updates.
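
As a rough illustration of what "calling VoiceBox like a tool" could look like on the wire, the sketch below builds the kind of JSON-RPC 2.0 message an MCP client sends to invoke a server tool. The `tools/call` method and the `name`/`arguments` parameter shape come from the Model Context Protocol; the tool name `speak` and its `text` and `voice` arguments are hypothetical, since the video does not list the tools VoiceBox actually exposes.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests each carry a unique id


def build_speak_call(text: str, voice: str) -> dict:
    """Return an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {
            "name": "speak",  # ASSUMPTION: hypothetical tool name
            "arguments": {"text": text, "voice": voice},  # hypothetical arguments
        },
    }


if __name__ == "__main__":
    message = build_speak_call("Build failed in the auth module.", "narrator")
    print(json.dumps(message, indent=2))
```

In practice an agent such as Claude Code or Cursor constructs and sends this message itself once the MCP server is registered; the sketch only shows the shape of the request.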

Setup and Considerations

  • The software is free and requires only a download from the website or GitHub releases.
  • Windows users in particular may hit rough spots around GPU detection, model setup, and exports.
  • Restarting the application is the suggested first fix; it resolved the issue the presenter hit on a Mac.

VoiceBox is in its early stages, meaning users might experience some friction with complex setups or long-form consistency. However, emotion control depends on the model chosen, and selecting one such as Chatterbox TTS Turbo provides built-in emotions. For those seeking privacy and integration, it is a highly viable tool.
