00:00:00This is SpeechBrain, an open-source PyTorch-native toolkit that lets us build and ship speech
00:00:05AI features using pre-trained models, for things like noise removal, speaker verification,
00:00:10and ASR. No training and no fine-tuning. Some quick audio verification here. You're probably
00:00:15expecting some better audio. Well, yes, that does happen naturally here. According to this,
00:00:19I'm not the same person, and that's because I'm using a voice transformer in the second clip.
00:00:23So voice verification does work. Now let's see what else this can do. We have videos coming out
00:00:28all the time. Be sure to subscribe. Quick breakdown before I run the first few demos.
00:00:38SpeechBrain has ASR, enhancement, separation, speaker ID, TTS, really just the whole stack.
00:00:44And here's the part that matters if you actually build stuff. 9000+ GitHub stars, tight Hugging Face
00:00:51integration, a one-line install, and loading a model takes just a few more lines. This is built for people who want
00:00:56to ship faster, not waste time reading docs. So here's the starting code I expanded on to get
00:01:02this running. A lot of the code I found on the documentation site itself. I chose to use
00:01:08Gradio for this to build out the UI. Gradio is just a Python ML app library that works really
00:01:14well for this kind of stuff. Okay, this part looks fake if you haven't seen it. Most enhancement demos
00:01:20cheat with perfect audio. I'm going to do the opposite here. I'm going to blast some background
00:01:24noise right now. Mostly just music. Here we go. I'm talking normally, recording myself speaking
00:01:31over this music. Here's the raw audio. Yeah, it sounds pretty bad. Now watch the enhanced output.
00:01:37I'm talking normally. Same voice, noise stripped out, no post-processing hacks. And here's the
00:01:44takeaway. This runs in seconds. Drop it into call apps, podcast cleanups, edge devices,
00:01:51anything with a mic and bad acoustics. The code: load the model, call enhance_batch, that's it.
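[Editor's sketch] The load-then-enhance flow described here might look roughly like the following. This is a minimal sketch, not the code shown in the video: the model id speechbrain/metricgan-plus-voicebank, the 16 kHz output rate, and the helper names default_savedir and enhance_file are assumptions for illustration, and the import path assumes SpeechBrain 1.0 or later (older releases expose the same class under speechbrain.pretrained).

```python
def default_savedir(model_source: str) -> str:
    """Hypothetical helper: map a Hugging Face model id ('org/name')
    to a local cache folder ('pretrained_models/name')."""
    return "pretrained_models/" + model_source.rstrip("/").split("/")[-1]


def enhance_file(noisy_path: str, out_path: str,
                 model_source: str = "speechbrain/metricgan-plus-voicebank") -> None:
    """Denoise one audio file with a pre-trained SpeechBrain enhancement model.

    The model id above is an assumption; the video does not say which
    enhancement model was used.
    """
    # Heavy imports stay inside the function so this module can be
    # imported even when torch / speechbrain are not installed.
    import torch
    import torchaudio
    from speechbrain.inference.enhancement import SpectralMaskEnhancement

    # Downloads the model on first use and caches it in savedir.
    enhancer = SpectralMaskEnhancement.from_hparams(
        source=model_source,
        savedir=default_savedir(model_source),
    )

    # load_audio returns a 1-D waveform; unsqueeze adds the batch dimension.
    noisy = enhancer.load_audio(noisy_path).unsqueeze(0)

    # lengths is the relative length of each item in the batch (1.0 = full).
    enhanced = enhancer.enhance_batch(noisy, lengths=torch.tensor([1.0]))

    # Assumes the model works at 16 kHz, as this checkpoint does.
    torchaudio.save(out_path, enhanced.cpu(), 16000)
```

Calling enhance_file("noisy.wav", "enhanced.wav") would then write the cleaned clip next to the original; both file names are placeholders.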
00:01:57But the docs were honestly a bit rough, so I had to expand the code out to work better as I'm on a Mac.
00:02:02It kept running into some issues. Next up we have speaker verification, which I did touch on at the
00:02:07start here. And just to set expectations, people hear voice auth and assume it's complicated. News
00:02:13flash, it's actually not, at least not with this. I'm going to enroll my voice here. Hey, this is my
00:02:20voice. That was on the first recording. Then I'm going to do the same thing again on the second here.
00:02:26Hey, this is my voice. Now verify, same speaker. The score is high. The match confirmed that. We have
00:02:36that score. We have that ranking in the output. If I do a double take without using a voice transformer,
00:02:42let's see how this is now. What did you have for breakfast? Okay, now let me change the tone. Don't
00:02:48laugh at me too much here. What did you have for breakfast? The similarity score tanks a little more,
00:02:56but it still outputs that I am the same speaker indeed. This is pre-trained on
00:03:01VoxCeleb. Again, quick with the voice transformer here. This is my normal voice. Now if I switch
00:03:08on my voice transformer, this is my normal voice. Just to play it back for you guys, the second clip
00:03:17sounds a little bit like this. This is my normal voice. All right, that's a bit rough, right? You
00:03:22can hear that transformer. Yeah, they do not match at all, and this does check out here in the output.
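[Editor's sketch] The score-and-decision output in this demo can be pictured as a cosine similarity between two speaker embeddings compared against a threshold. The sketch below is an illustration under assumptions, not the video's code: the 0.25 threshold, the toy two-dimensional embeddings, the model id speechbrain/spkrec-ecapa-voxceleb, and the helper names are all assumed, and the import path assumes SpeechBrain 1.0 or later.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb1, emb2, threshold=0.25):
    """Return (score, decision); the threshold value is an assumption."""
    score = cosine_similarity(emb1, emb2)
    return score, score > threshold


def verify_with_speechbrain(wav1: str, wav2: str):
    """Compare two recordings with a pre-trained SpeechBrain verifier.

    The model id is an assumption; the video only says the model was
    pre-trained on VoxCeleb.
    """
    # Imported lazily so the pure-Python helpers above work on their own.
    from speechbrain.inference.speaker import SpeakerRecognition

    verifier = SpeakerRecognition.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        savedir="pretrained_models/spkrec-ecapa-voxceleb",
    )
    # verify_files returns a similarity score and a same-speaker decision.
    score, prediction = verifier.verify_files(wav1, wav2)
    return float(score), bool(prediction)
```

With real recordings, verify_with_speechbrain("enroll.wav", "test.wav") would return the kind of score and match flag shown on screen; the file names are placeholders. A changed tone lowers the score but can stay above the threshold, while a heavily transformed voice can drop below it.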
00:03:27If you're building voice auth, multi-user apps, or anything that needs who's talking answered,
00:03:32this is exactly for that. In my final demo here, yeah, this is meant to be the backbone: live
00:03:37transcription. ASR demos usually sound impressive until you try with this speech. Now I'm just going
00:03:43to talk normally. This feature doesn't work that well, actually, and the documentation didn't help
00:03:48much, so I don't know how I actually feel about this. This honestly just feels like normal speech
00:03:53to text. It should have auto-transcribed but ran into countless issues, and it doesn't even do
00:03:58that. So yes, it does transcribe, but so do countless other libraries. This feature here wasn't
00:04:04impressive, at least for me getting it to auto-transcribe. It just didn't work. So
00:04:08there are some really cool things here, right? We saw the voice verification, the background noise
00:04:13cancellation, but certain things are just not tweaked yet. That's really SpeechBrain wrapped
00:04:18up. Overall, it's still fast. It's still open. It's still built for developers. You guys can
00:04:22check it out for yourselves. I put the links in the description, and we will see you guys in another
00:04:26video.