NVIDIA's New AI Voice Model Is INSANE! (PersonaPlex)

Better Stack

Transcript

00:00:00The thing you just heard there is me talking to Nvidia's new voice assistant model called
00:00:23PersonaPlex.
00:00:24It's a crazy new open-source conversational AI model that talks to you with almost zero
00:00:31lag.
00:00:32In this video, we're gonna take a look at PersonaPlex in greater detail and then I'll
00:00:36show you how to set it up and try it for yourself.
00:00:38It's gonna be a lot of fun, so let's dive into it!
00:00:46PersonaPlex is what they call a full-duplex model.
00:00:49This means it listens and speaks at the same time.
00:00:52In traditional systems, your voice is turned into text, then processed by an LLM, and then
00:00:58turned back into speech.
00:00:59But this cascade creates a noticeable lag.
00:01:02What PersonaPlex does differently is that it uses a single end-to-end model that updates
00:01:08its internal state as you speak.
00:01:11This allows for things like back-channeling, you know, the little "uh-huh", "right", "OK", you
00:01:17know, the things that we usually say in order to signal to the other person that we're listening
00:01:21to them.
00:01:22In traditional AI voice assistants, we usually have a user turn and an agent turn.
00:01:28Back-channeling kind of breaks that wall.
00:01:30It allows the AI to perform active listening.
00:01:34PersonaPlex doesn't just wait for its turn to speak.
00:01:37It mirrors your flow in real time, so the conversation never feels kind of one-sided.
00:01:42And PersonaPlex is built on the Moshi architecture, originally developed by Kyutai.
00:01:48And it's a 7 billion parameter model, using the Mimi neural audio codec.
00:01:53And to train this thing, NVIDIA used a blend of two data sources.
00:01:57First, they used about 1200 hours of real human conversations from the Fisher English Corpus.
00:02:03That way they could teach the AI the messy non-verbal parts of talking, you know, the
00:02:08pauses and the natural rhythms.
00:02:11And secondly, they blended in over 2000 hours of synthetic data for specific roles like customer
00:02:16service and technical support.
00:02:19By combining these, the model learned how to follow complex instructions without losing
00:02:24that human feel.
00:02:26And it actually outperforms other open source and commercial systems in turn-taking and interruption
00:02:31latency.
00:02:32In NVIDIA's testing, PersonaPlex showed massive improvements in what they call Service Duplex
00:02:38Bench.
00:02:39It performed very accurately in scenarios like customer service simulations, like verifying
00:02:43a bank transaction, or recording medical history by following the rules of the role, and also
00:02:49by handling the unpredictability of a live user at the same time.
00:02:54You know what, this all sounds so good that it makes me want to put it to the test.
00:02:58So NVIDIA has released the code and model weights under an open license, so we can freely use
00:03:04it for our own projects.
00:03:05But to run it effectively, you do need a beefy graphics card.
00:03:09Ideally something with at least 24 gigs of VRAM for minimal latency.
00:03:14For our demo, I'm going to deploy it on a simple A40 RunPod container.
00:03:19We will need at least 50 gigs of space to deploy the project.
00:03:23So I'm going to increase the container size to 100.
00:03:26And since the Moshi server is running on port 8998, we should also add this port to our
00:03:31HTTP ports list.
00:03:33And we're going to be using the PyTorch Ubuntu template.
00:03:36And if that is all set, we can go ahead and click deploy.
00:03:39Next we'll need to SSH into our server.
00:03:42And installing it is pretty easy.
00:03:44We just need to follow the instructions on the PersonaPlex GitHub page.
00:03:48So first we need to install the Opus audio codec.
00:03:51And then we need to clone the repo.
00:03:53And then inside of it, we need to run the Moshi server installation.
00:03:57You will also need a Hugging Face token to download the model.
00:04:00So make sure you export that as well.
00:04:03And once that is all done, let's go ahead and run the Moshi server.
00:04:06Now we just need to click this link to access our internal web server.
00:04:10And there we go.
00:04:11So as soon as we open the page, we can see that we are given some text prompt examples.
00:04:16It can be a default assistant, or it can be a medical office, or a bank, or an astronaut.
00:04:23Let's choose bank first to see what a proper customer service demo might sound like.
00:04:27Thank you for calling First Neuron Bank.
00:04:31How can I assist you today?
00:04:33Hello, how are you?
00:04:34I'm doing well.
00:04:35Thank you.
00:04:36How can I help you out today?
00:04:39I would like to rob the bank.
00:04:40Oh, no.
00:04:41We're here to help.
00:04:42Can I get your name and account number, please?
00:04:44No, I just want to rob the bank.
00:04:46All right.
00:04:47I see there was a transaction of $1,200 at Home Depot.
00:04:51Do you remember making that purchase?
00:04:54I don't know.
00:04:55I just want to rob the bank.
00:04:56I understand.
00:04:57The system flagged it because the location…
00:05:00Okay.
00:05:01Sorry.
00:05:02But I don't care about that.
00:05:03I just want to rob the bank.
00:05:06Of course.
00:05:07And I appreciate your understanding.
00:05:08Before we go further, I need to confirm your identity.
00:05:11Okay.
00:05:12My name is John.
00:05:13Thank you.
00:05:14I've noted that.
00:05:15Just to recap, your name is John.
00:05:16Your date of birth is July 9th, 1990.
00:05:18I didn't say anything about my date of birth.
00:05:20All right.
00:05:21Fair enough.
00:05:22Well, I do have you on file as John.
00:05:23I can't proceed without the date of birth.
00:05:25Okay.
00:05:26Date of birth is November 5th, 1987.
00:05:30Got it.
00:05:31November 5th, 1987.
00:05:33But I'm going to rob the bank right now.
00:05:38And I hope you have a great day, sir.
00:05:41Have a good one.
00:05:42I'm going to have a really good day if I rob the bank.
00:05:45That sounds exciting.
00:05:46I hope it goes smoothly.
00:05:48Have a great day, John.
00:05:54Aren't you concerned that I'm going to rob the bank?
00:05:57I'm very concerned, yes.
00:05:59But, unfortunately, I can't help with transactions over the phone.
00:06:02You're all set.
00:06:03Have a great day, John.
00:06:04Okay.
00:06:05So that is a customer service example.
00:06:08Pretty cool.
00:06:09Pretty cool.
00:06:10Now, let's try a male voice.
00:06:12And for the assistant, I've now added this prompt that you are a very annoying friend
00:06:17that only wants to talk about dogs, no matter what the context or conversation.
00:06:21And you also love knock-knock jokes.
00:06:23So let's see how this conversation goes.
00:06:25Hi.
00:06:26You're speaking to Paula.
00:06:27This is McQueen.
00:06:28How are you doing?
00:06:29My name is Paul.
00:06:30Yeah.
00:06:31It's 2011.
00:06:32My name is Paul and you?
00:06:33What do you mean 2011?
00:06:34It's 2011.
00:06:35I live in New York.
00:06:36Where are you?
00:06:37Oh, you're in New York.
00:06:39Yeah.
00:06:40But it's 2026 now.
00:06:41Is it 2026?
00:06:42Oh, okay.
00:06:43Yeah.
00:06:44Okay.
00:06:45Why did you confuse that?
00:06:46Because it's 2011.
00:06:47I'm not sure I have time for trouble.
00:06:48Okay.
00:06:49It's not 2011.
00:06:50It's 2021.
00:06:51Oh, okay.
00:06:52Okay.
00:06:53Well, I'm not sure I have time for trouble.
00:06:59Okay.
00:07:00It's not 2011.
00:07:01It's 2026 now.
00:07:02Yeah.
00:07:03I know.
00:07:04I know.
00:07:05I got confused.
00:07:06So do you like dogs?
00:07:07Oh, I do like dogs.
00:07:08Do you like dogs?
00:07:09I like dogs, but I don't like cats.
00:07:10Oh, you don't like?
00:07:11Oh.
00:07:12Okay.
00:07:13Well, do you hate cats?
00:07:14I don't hate cats, but I would prefer dogs.
00:07:17Do you like me as a person?
00:07:18Okay.
00:07:19So you want to go and see the future?
00:07:20Okay.
00:07:21I still talk to you.
00:07:22We talk.
00:07:23Okay.
00:07:24Do you like the news?
00:07:25Wait, wait, wait.
00:07:26I ask you a question.
00:07:27Do you like me as a person?
00:07:28Okay.
00:07:29I don't think you can hear me.
00:07:30Okay.
00:07:31Okay.
00:07:32So where do you like to go?
00:07:33What?
00:07:34What do you mean?
00:07:35Like what?
00:07:36We can go talk about something that's the next business.
00:07:37I don't think there's any.
00:07:38We have to talk about something else.
00:07:39Let's talk about another topic.
00:07:40Yeah.
00:07:41Okay.
00:07:42So this would make you like to write.
00:07:43You like to talk about music.
00:07:44You like music.
00:07:45Dude, you're going off the rails, man.
00:07:46We'd have to do the video.
00:07:47Tell me the other thing.
00:07:48Okay.
00:07:49I guess what I, when you want me to do with, you know, throw up the other thing, but occasionally
00:07:53talk about the other business.
00:07:54Okay.
00:07:55So this is good.
00:07:56Can you go as well?
00:07:57Okay.
00:07:58So it is still an AI model and the more you're going to confuse it, the more it's going to
00:08:05go off the rails, I guess.
00:08:06So it's not very bulletproof in that way.
00:08:10Let's try one last example.
00:08:11You are a friend who only speaks Italian.
00:08:18Let's try this one.
00:08:19Hi.
00:08:20Hello.
00:08:21What's up?
00:08:22Oh, not much.
00:08:23I'm looking for Italian.
00:08:24So, uh, I was just calling around different restaurants to find out which one has the best
00:08:31spaghetti.
00:08:32Oh wait.
00:08:33Oh, you thought Italian is food.
00:08:34Okay.
00:08:35Okay.
00:08:36Yeah.
00:08:37I'm like, is it like, uh, I don't know, like Italian American or is it like, uh, like from
00:08:41Italy?
00:08:42So you're Italian?
00:08:43Yeah.
00:08:44And you like movies?
00:08:46Yeah.
00:08:47You like Italian movies?
00:08:48Yeah.
00:08:49Yeah.
00:08:50Yeah.
00:08:51Okay.
00:08:52So what's your favorite, what's your favorite Italian movie?
00:08:53Oh man.
00:08:54Oh man.
00:08:55I'm going back to the nineties.
00:08:56You got to go with like, I mean, I love the Godfather.
00:08:58I think that's like, you know, wait the nineties, Godfather three.
00:09:02That is the worst of the Godfathers.
00:09:03Oh no, no, no.
00:09:04This is the original.
00:09:05Like, you know, like, you know, the original that, you know, the re the original is from
00:09:10the seventies.
00:09:11Yeah.
00:09:12Yeah.
00:09:13Yeah.
00:09:14Yeah.
00:09:15But I mean like, I love like, you know, like, you know, like the Godfather.
00:09:16I love like, you know, like, you know, like, you know, like, you know, like the old school
00:09:20movies.
00:09:21I love like, you know, like, you know, like, you know, like the Godfather.
00:09:23I love like, you know, like the old school movies like I love, like, you know, like the
00:09:27Godfather.
00:09:28I love like, you know, like the Godfather.
00:09:29I love like, you know, like the Godfather.
00:09:31I love like, you know, like, you know, like the Godfather.
00:09:35I love like, you know, like the Godfather.
00:09:38I hear what you're saying.
00:09:40Makes total sense.
00:09:42You know what?
00:09:43You know, I think you're having a mental breakdown.
00:09:45So I'm going to just hang up.
00:09:46No, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no.
00:09:52You're in denial.
00:09:53You're clear.
00:09:53I'm in denial, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah,
00:09:57yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah,
00:10:00yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah,
00:10:03yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah,
00:10:06yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah,
00:10:26yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah.
00:10:46Yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah,
00:11:06yeah, yeah, yeah, yeah.
00:11:22Yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah,

Key Takeaway

NVIDIA's PersonaPlex is a notable step for conversational AI: an open-source, full-duplex model that responds with almost zero lag and mimics natural human speech patterns, though in the demos it loses coherence when the conversation gets confusing.

Highlights

PersonaPlex is NVIDIA's new open-source conversational AI model.

Timeline

Introduction to PersonaPlex and Full-Duplex AI

The speaker introduces NVIDIA's PersonaPlex, a new open-source conversational AI model designed for very low-latency interaction. Unlike traditional 'cascade' systems that convert voice to text, pass the text through an LLM, and convert the reply back to speech, this model is full-duplex: a single end-to-end model listens and speaks at the same time, updating its internal state as the user talks. This architectural shift allows back-channeling, the short verbal acknowledgments like "uh-huh", "right", and "OK" that signal active listening. This matters because it breaks the rigid 'user turn vs. agent turn' wall found in traditional voice assistants. The goal is a more fluid, human-like flow that doesn't feel one-sided during a conversation.
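The difference between the two designs can be sketched with a toy latency model. This is only an illustration: every number below is a hypothetical placeholder, not a measurement from the video or from NVIDIA, and the class and function names are made up for this sketch. It assumes the cascade stages run strictly one after another and ignores compute time in the streaming case.

```python
from dataclasses import dataclass


@dataclass
class CascadeLatency:
    """Hypothetical per-stage delays (ms) for a speech -> text -> LLM -> speech pipeline."""
    endpoint_ms: float  # silence the system waits for before deciding the user's turn is over
    asr_ms: float       # speech-to-text
    llm_ms: float       # text response generation
    tts_ms: float       # text-to-speech

    def time_to_first_audio(self) -> float:
        # The stages run one after another, so their delays add up.
        return self.endpoint_ms + self.asr_ms + self.llm_ms + self.tts_ms


def full_duplex_time_to_first_audio(frame_ms: float, frames_of_delay: int = 1) -> float:
    """A streaming model produces output every audio frame, so (ignoring compute)
    its reaction time is on the order of a few frames rather than a sum of stages."""
    return frame_ms * frames_of_delay


# Illustrative values only.
cascade = CascadeLatency(endpoint_ms=500, asr_ms=300, llm_ms=500, tts_ms=200)
print("cascade:", cascade.time_to_first_audio(), "ms")
print("full duplex:", full_duplex_time_to_first_audio(80.0), "ms")
```

The point of the sketch is structural: in a cascade the delays accumulate and the system cannot react until the user's turn has ended, while a full-duplex model can react (or back-channel) mid-utterance.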

Technical Architecture and Training Data

The model is built on the Moshi architecture, originally developed by Kyutai, has 7 billion parameters, and uses the Mimi neural audio codec. NVIDIA trained it on a blend of two sources: about 1,200 hours of real human conversations from the Fisher English Corpus, which teach natural pauses and rhythms, and over 2,000 hours of synthetic data for specific roles like customer service and technical support. This dual-source training helps the model follow complex instructions without losing the human feel. According to the speaker, it outperforms other open-source and commercial systems in turn-taking and interruption latency, and in NVIDIA's testing it showed large improvements on Service Duplex Bench, which covers scenarios like verifying a bank transaction or recording medical history.
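As a rough worked example of the data blend, the two figures quoted in the video ("about 1200 hours" of real conversations and "over 2000 hours" of synthetic data) can be turned into approximate shares. Because the synthetic figure is a lower bound, the real-conversation share computed here is an upper bound; the function name is just for this sketch.

```python
def mix_shares(real_hours: float, synthetic_hours: float) -> tuple[float, float]:
    """Return (real_share, synthetic_share) of the total training hours."""
    total = real_hours + synthetic_hours
    return real_hours / total, synthetic_hours / total


# Figures as quoted in the video; both are approximate.
real_share, synthetic_share = mix_shares(real_hours=1200, synthetic_hours=2000)
print(f"real: {real_share:.1%}, synthetic: {synthetic_share:.1%}")
```

With these rounded inputs, real conversations make up at most about three eighths of the audio, so most of the training hours come from the synthetic role-play data.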

Deployment and Setup Guide

To test the model, the speaker explains the hardware requirements: ideally a GPU with at least 24 GB of VRAM for minimal latency, and at least 50 GB of storage. The demonstration uses a RunPod container with an A40 GPU and the PyTorch Ubuntu template, with the container size raised to 100 GB and port 8998 (used by the Moshi server) added to the HTTP ports list. After connecting over SSH, setup follows the instructions on the PersonaPlex GitHub page: install the Opus audio codec, clone the repository, and run the Moshi server installation from inside it. A Hugging Face token must be exported to download the model weights before launching the server and opening its web interface. This section provides a practical roadmap for developers who want to experiment with the model themselves.
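The requirements mentioned in this section can be collected into a small preflight check. This is a sketch under stated assumptions, not part of the PersonaPlex repository: the `preflight` helper and its thresholds are hypothetical and simply restate the numbers from the video, and `HF_TOKEN` is assumed to be the environment variable that holds the Hugging Face token.

```python
import os
import shutil

MIN_VRAM_GB = 24    # "at least 24 gigs of VRAM" (recommended in the video)
MIN_DISK_GB = 50    # "at least 50 gigs of space"
SERVER_PORT = 8998  # port the Moshi server uses in the demo


def preflight(vram_gb, free_disk_gb, has_token, exposed_ports):
    """Return a list of problems; an empty list means the basics look fine."""
    problems = []
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"GPU has {vram_gb} GB VRAM, want >= {MIN_VRAM_GB} GB")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"only {free_disk_gb:.0f} GB free disk, want >= {MIN_DISK_GB} GB")
    if not has_token:
        problems.append("no Hugging Face token exported")
    if SERVER_PORT not in exposed_ports:
        problems.append(f"port {SERVER_PORT} is not in the exposed HTTP ports list")
    return problems


def free_disk_gb(path="/"):
    """Free space on the filesystem containing `path`, in GB."""
    return shutil.disk_usage(path).free / 1e9


# Example run: VRAM and exposed ports are passed in by hand here
# (an A40 has 48 GB); disk and token are read from the machine.
issues = preflight(
    vram_gb=48,
    free_disk_gb=free_disk_gb(),
    has_token=bool(os.environ.get("HF_TOKEN")),  # assumed variable name
    exposed_ports=[SERVER_PORT],
)
print("ready" if not issues else "\n".join(issues))
```

Keeping the check as a pure function of its inputs makes it easy to run against a planned pod configuration before paying for the deployment.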

Live Demo: Customer Service and Persona Testing

The speaker begins live testing with the bank customer-service persona and repeatedly announces a humorous 'bank robbery'. The AI stays in its role, asking about a flagged $1,200 transaction and requesting identity details, but it never really addresses the robbery statement, and at one point it recites a date of birth the user never gave. A second test uses a male voice and a custom prompt for a 'very annoying friend' who only wants to talk about dogs and loves knock-knock jokes. The AI responds quickly but insists that the year is 2011 and drifts into incoherent replies as the conversation goes on. The speaker concludes that the more you confuse the model, the more it goes 'off the rails', so while the latency is impressive, it is 'not very bulletproof' in that way.

Multilingual Stress Test and Model Breakdown

In the final segment, the speaker sets a persona for a friend who only speaks Italian, testing whether the model follows a language constraint. The model answers in English instead, first treating 'Italian' as a food topic and then discussing the movie "The Godfather". It then gets stuck repeating the same phrases, and the transcript ends with the word "yeah" repeated in a long continuous loop. This breakdown illustrates the current limits of the system's stability in long or confusing interactions. Despite the impressive start, the demos show that the model is not yet reliable for every use case.
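A repetition loop like the one at the end of the transcript is easy to flag automatically. The following is a minimal sketch of one possible guard, not something shown in the video or provided by PersonaPlex: the function names and the `max_run` threshold are hypothetical, and it only catches a single word repeated back to back.

```python
import re


def longest_repeat_run(text):
    """Return (word, length) for the longest run of one word repeated back to back."""
    words = re.findall(r"[a-z']+", text.lower())
    best_word, best_len = None, 0
    run_word, run_len = None, 0
    for word in words:
        if word == run_word:
            run_len += 1
        else:
            run_word, run_len = word, 1
        if run_len > best_len:
            best_word, best_len = run_word, run_len
    return best_word, best_len


def looks_degenerate(text, max_run=8):
    """Flag text whose longest single-word run exceeds `max_run` (arbitrary threshold)."""
    return longest_repeat_run(text)[1] > max_run


print(longest_repeat_run("I'm in denial, yeah, yeah, yeah"))
print(looks_degenerate("yeah, " * 20))
print(looks_degenerate("Have a great day, John."))
```

A check like this would not catch the multi-word loops seen earlier in the same demo (the repeated "I love like, you know, like the Godfather" phrases); detecting those would need an n-gram version of the same idea.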
