Google’s New AI Generates 1,000+ Tokens Per Second (DiffusionGemma)
BBetter Stack
Computing/SoftwareVideo & Computer Games
Transcript
00:00:00Google has been on fire lately. Last week I did a video on their groundbreaking encoder-free
00:00:05Gemma 4 model and this week they dropped another shockingly innovative model. It's called Diffusion
00:00:11Gemma and this model is blazingly fast. It is capable of generating more than a thousand tokens
00:00:18per second and the reason why it's able to do that is because it generates text in a completely
00:00:23different way than any other model you've ever used before. So in this video we'll take a look
00:00:29at Diffusion Gemma, see how it works and I'll show you how you can test it out for yourself as well.
00:00:35It's gonna be a lot of fun so let's dive into it.
00:00:41So every language model you've ever talked to works the same fundamental way. They're auto-aggressive
00:00:48and that's a fancy word for saying they generate one token at a time left to right. They write a word
00:00:54then they look at everything written so far and then they predict the next word and the cycle just
00:00:59repeats. And the way it works for large commercial models like Claude or GPT is that when a server
00:01:06generates a token, most of the time isn't spent on computing, it's spent loading the model's weights
00:01:12out of memory. And that's kind of wasteful if you're serving just one user. So the servers batch hundreds
00:01:19of users together load the weights once and run them against everybody in the same time. And that way,
00:01:25you can serve 256 users with one memory load. But when you run a model locally, you're just one user,
00:01:33so there's nobody to batch you with. The GPU loads the massive portion of weights, does a tiny little
00:01:39computation to produce one token, and then it sits there idle before doing it all again. In technical terms,
00:01:46this is called being memory bound. Your expensive GPU spends most of its life waiting for the next
00:01:52token instead of actually computing. So Google DeepMind looked at this problem and asked a clever
00:01:58question. If the cloud fills the idle time by serving 256 users at once, what if we filled that idle time
00:02:07for a single user instead? So instead of one token for 256 people, what if we generated 256 tokens for one
00:02:16person all at once? And that's the entire idea behind Diffusion Gemma. Instead of writing word by word,
00:02:23the model starts with a canvas, which is a row of 256 completely random placeholder tokens. So it's just
00:02:31noise. And its job is to fix that canvas all positions at once and turn it into real text. So by predicting all
00:02:38256 tokens in one shot, you're giving your GPU a big chunk of real work instead of letting it idle. In that way,
00:02:46you flip the model from being memory bound to compute bound, and all that wasted firepower finally gets used.
00:02:53But this is not as straightforward as it sounds. Predicting 256 tokens at once is actually really hard.
00:03:01Because how does the model guess token number 254 when it has no idea what tokens 1 through 253 turned
00:03:09out to be? And that's exactly what happens. The first few tokens come out good, but the further down it goes,
00:03:15the more it falls apart into nonsense. But what if instead of just doing one pass, what if the model does
00:03:21multiple passes? And this is the key trick. The model passes over the canvas again and again, but now it
00:03:28can see its own previous guesses. The tokens it predicted with confidence become context clues that
00:03:35help fix the messier ones. And the coolest thing is that it only needs a few passes. Way less passes than
00:03:42the total token count of 256. And that's exactly where the model's speed comes from. And you've probably seen
00:03:49this trick before. It's called diffusion. You start with noise and then you refine it step by step. And
00:03:55that is the exact same idea that powers AI image generators. And the way the model learns it is by
00:04:01deliberately adding noise to real images in training and then learning to predict and subtract that noise
00:04:07back out. But how do you apply that same concept to text? That is the tricky part. Because with an image,
00:04:14noise is easy. Make a pixel a little more red or blue. But with text, how do you make the word
00:04:19the be a little bit less the? What does that noise even mean for a word? Well, DeepMind came up with
00:04:27something called uniform state diffusion. So instead of fiddling with letters, you treat the randomly
00:04:32swapped out word as the noise. And to corrupt your training text, you replace some real words with random
00:04:38ones. And the model's job is to figure out which words are garbage and eventually fix them with multiple passes.
00:04:45There's actually a simpler version to do this called mask diffusion that just blanks out tokens.
00:04:51But that one has a big flaw. Once the model commits to a word, it's locked in forever. It has the same
00:04:57problem auto aggressive models have. But uniform state diffusion fixes this by always holding some kind of token in
00:05:04every position. So a model can look at a word it accepted three steps ago, decide if it doesn't fit
00:05:10anymore and swap it out. So we can basically self correct it all the way through. But this solution
00:05:15also has a catch. Diffusion needs an encoder to understand your prompt and a denoiser to clean the
00:05:23canvas. So DeepMind developed an encoder denoiser patch. It's built on top of their existing 26 billion
00:05:30GEMMA4 model and it switches between the two modes when it's generating your response. In encoder mode,
00:05:36the model reads your prompt, tries to extract some context and guidance for it. It collects all of that
00:05:42in KV cache and then passes that directly to the denoiser. And the denoiser's job is essentially to
00:05:49clean the canvas. And it does that by doing two things. First, remember how a normal LLM produces a
00:05:56confidence score or a logit for every position but throws all of them away except the last one? By the
00:06:02way, if you're getting confused here, I also made a video a while back explaining how LLMs work in more
00:06:07detail. So check that video out if you're interested. So essentially Diffusion GEMMA doesn't throw out
00:06:13the scores. It keeps all of those confidence scores because every canvas position needs its own prediction.
00:06:19And secondly, this denoiser doesn't use causal attention, which is the rule that a word can only
00:06:25look backward, which is how auto-aggressive models work. So instead, it swaps it with a bi-directional
00:06:31attention. So now every token can see every other token in all directions. So for every position,
00:06:38you apply those confidence scores, look at other tokens, and clean out the canvas slowly, step by step.
00:06:44And this is how Diffusion GEMMA is able to achieve its incredible speed of 1000 plus tokens per second
00:06:51on an H100 GPU. Now I have to be straight with you. This isn't a silver bullet. With these new tactics,
00:06:58Diffusion GEMMA is basically trading quality for speed. For maximum quality work, standard GEMMA 4 is
00:07:05still a better pick. This model is built specifically for critical local stuff like inline editing or code
00:07:13filling or rapid iteration. And it is especially strong at non-linear tasks like filling in the middle
00:07:19of a code block or even solving a Sudoku puzzle, which normal left to right models are genuinely quite bad
00:07:26at. So all of that sounds fascinating, but let's test it out for ourselves and see how it works in action.
00:07:33So Google has open sourced the weights under Apache 2.0 license on Hugging Face.
00:07:38So if you have a beefy GPU like an RTX 5090, you could try to run it locally. And there's also a
00:07:44special recipe for VLLM you can run on Docker to streamline that process. But I am really curious to
00:07:51see if this model can really reach 1000 plus tokens per second. So for this test, I will actually try to
00:07:58run it on an H100 GPU using a run pod container and see how it goes. And by the way, I have also
00:08:04published a Diffusion GEMMA template for running it on run pod. So if you want to replicate this test,
00:08:10all you have to do is run that template when creating a new pod. So to do this test on run pod,
00:08:15I'm going to choose the H100 container. And as I mentioned before, I created a Diffusion GEMMA
00:08:22template you can reuse. So you can just click that we click on a volume disk and then just click deploy
00:08:28on demand. And it will take a few minutes until it downloads the container and launches everything.
00:08:34And if we go to the logs, if you see application startup complete, that means that VLLM is ready
00:08:40and it is now accessible through port 8000. If we open this, you will see detail not found,
00:08:46but don't worry about it. This means it is actually working. We just need to copy this URL. So to
00:08:52configure Diffusion GEMMA to run in an AI agent terminal, something like open code, you need to
00:08:58configure your open code settings to access the remote server. So you can do that with this simple
00:09:04command and this will open up the config file. And in here, I'm just specifying our run pod server and
00:09:11it has the Diffusion GEMMA model selected. And you can just save this file and fire up open code.
00:09:17So in this test, I'm going to prompt it to generate a personal finance tracking dashboard called ledger.
00:09:24And let's see how fast it can generate that. Look at that. Instantly, it starts streaming right away.
00:09:34Look how blazingly fast that is. Holy moly. Wow. That is insane. And here in the logs,
00:09:43we can see that it's averaging 700 tokens per second. So for the output phase, it dropped a bit,
00:09:50but during the reasoning phase, it did go up to 700 tokens per second. That is insane. So let's
00:09:58open it up. Okay. So this looks like a dashboard. That's nice. Okay. We actually get some categories
00:10:06and stuff going on here. If we add something over here. Oh, it actually adds it as an expense. So the
00:10:13expenses are not actually updating. So it's not fully functional, but at least some parts are interactive.
00:10:20For this next task, let's see if it can actually make an arcade style game.
00:10:26So let's fire it up. Once again, the speed is just insane. Okay. This one is taking a bit longer.
00:10:36We actually got two files here. Interesting, interesting. Okay. So it noticed a typo and then it
00:10:44reprocessed the HTML file again, which is pretty good. Okay. All right. Let's open up this one. Restart.
00:10:52Oh, wow. This one is it's working. Oh, wow. This is cool. Wow. Very nice. That is impressive. So the game is
00:11:03fully functional and it took 14 seconds to generate this game. 14 seconds to generate a game like this.
00:11:11So although their marketing page said that we could expect a thousand token per second speeds on the H
00:11:18100. That was not my observation. Um, I don't know. Maybe there's something that I should tweak in the
00:11:26template or in my prompts, but nonetheless, I am truly impressed. It is a beast. So there you have it,
00:11:33folks. That is diffusion Gemma in a nutshell. I think this one is one of the most interesting releases
00:11:38of the year because it proves you can take a totally different generation paradigm from the image world,
00:11:44slap it onto an existing model you already trained and unlock real speed gains for single local user
00:11:51setups. And I think this also opens the door for a whole new family of fast interactive local models
00:11:58that utilizes the full potential of your hardware instead of leaving it idle. So what do you think
00:12:04about diffusion Gemma? Have you tried it? Will you use it? Let us know in the comment section down below.
00:12:09And folks, if you like these types of technical breakdowns, please let me know by smashing that
00:12:14like button underneath the video. And also don't forget to subscribe to our channel. This has been
00:12:19Andrus from Betterstack and I will see you in the next videos.
Community Posts
No posts yet. Be the first to write about this video!
Write about this video