Google’s New AI Generates 1,000+ Tokens Per Second (DiffusionGemma)

Englishالعربية Deutsch Español Français हिन्दी Bahasa Indonesia 日本語 한국어 Português Русский 中文

Computing/SoftwareVideo & Computer Games

Transcript

00:00:00Google has been on fire lately. Last week I did a video on their groundbreaking encoder-free

00:00:05Gemma 4 model and this week they dropped another shockingly innovative model. It's called Diffusion

00:00:11Gemma and this model is blazingly fast. It is capable of generating more than a thousand tokens

00:00:18per second and the reason why it's able to do that is because it generates text in a completely

00:00:23different way than any other model you've ever used before. So in this video we'll take a look

00:00:29at Diffusion Gemma, see how it works and I'll show you how you can test it out for yourself as well.

00:00:35It's gonna be a lot of fun so let's dive into it.

00:00:41So every language model you've ever talked to works the same fundamental way. They're auto-aggressive

00:00:48and that's a fancy word for saying they generate one token at a time left to right. They write a word

00:00:54then they look at everything written so far and then they predict the next word and the cycle just

00:00:59repeats. And the way it works for large commercial models like Claude or GPT is that when a server

00:01:06generates a token, most of the time isn't spent on computing, it's spent loading the model's weights

00:01:12out of memory. And that's kind of wasteful if you're serving just one user. So the servers batch hundreds

00:01:19of users together load the weights once and run them against everybody in the same time. And that way,

00:01:25you can serve 256 users with one memory load. But when you run a model locally, you're just one user,

00:01:33so there's nobody to batch you with. The GPU loads the massive portion of weights, does a tiny little

00:01:39computation to produce one token, and then it sits there idle before doing it all again. In technical terms,

00:01:46this is called being memory bound. Your expensive GPU spends most of its life waiting for the next

00:01:52token instead of actually computing. So Google DeepMind looked at this problem and asked a clever

00:01:58question. If the cloud fills the idle time by serving 256 users at once, what if we filled that idle time

00:02:07for a single user instead? So instead of one token for 256 people, what if we generated 256 tokens for one

00:02:16person all at once? And that's the entire idea behind Diffusion Gemma. Instead of writing word by word,

00:02:23the model starts with a canvas, which is a row of 256 completely random placeholder tokens. So it's just

00:02:31noise. And its job is to fix that canvas all positions at once and turn it into real text. So by predicting all

00:02:38256 tokens in one shot, you're giving your GPU a big chunk of real work instead of letting it idle. In that way,

00:02:46you flip the model from being memory bound to compute bound, and all that wasted firepower finally gets used.

00:02:53But this is not as straightforward as it sounds. Predicting 256 tokens at once is actually really hard.

00:03:01Because how does the model guess token number 254 when it has no idea what tokens 1 through 253 turned

00:03:09out to be? And that's exactly what happens. The first few tokens come out good, but the further down it goes,

00:03:15the more it falls apart into nonsense. But what if instead of just doing one pass, what if the model does

00:03:21multiple passes? And this is the key trick. The model passes over the canvas again and again, but now it

00:03:28can see its own previous guesses. The tokens it predicted with confidence become context clues that

00:03:35help fix the messier ones. And the coolest thing is that it only needs a few passes. Way less passes than

00:03:42the total token count of 256. And that's exactly where the model's speed comes from. And you've probably seen

00:03:49this trick before. It's called diffusion. You start with noise and then you refine it step by step. And

00:03:55that is the exact same idea that powers AI image generators. And the way the model learns it is by

00:04:01deliberately adding noise to real images in training and then learning to predict and subtract that noise

00:04:07back out. But how do you apply that same concept to text? That is the tricky part. Because with an image,

00:04:14noise is easy. Make a pixel a little more red or blue. But with text, how do you make the word

00:04:19the be a little bit less the? What does that noise even mean for a word? Well, DeepMind came up with

00:04:27something called uniform state diffusion. So instead of fiddling with letters, you treat the randomly

00:04:32swapped out word as the noise. And to corrupt your training text, you replace some real words with random

00:04:38ones. And the model's job is to figure out which words are garbage and eventually fix them with multiple passes.

00:04:45There's actually a simpler version to do this called mask diffusion that just blanks out tokens.

00:04:51But that one has a big flaw. Once the model commits to a word, it's locked in forever. It has the same

00:04:57problem auto aggressive models have. But uniform state diffusion fixes this by always holding some kind of token in

00:05:04every position. So a model can look at a word it accepted three steps ago, decide if it doesn't fit

00:05:10anymore and swap it out. So we can basically self correct it all the way through. But this solution

00:05:15also has a catch. Diffusion needs an encoder to understand your prompt and a denoiser to clean the

00:05:23canvas. So DeepMind developed an encoder denoiser patch. It's built on top of their existing 26 billion

00:05:30GEMMA4 model and it switches between the two modes when it's generating your response. In encoder mode,

00:05:36the model reads your prompt, tries to extract some context and guidance for it. It collects all of that

00:05:42in KV cache and then passes that directly to the denoiser. And the denoiser's job is essentially to

00:05:49clean the canvas. And it does that by doing two things. First, remember how a normal LLM produces a

00:05:56confidence score or a logit for every position but throws all of them away except the last one? By the

00:06:02way, if you're getting confused here, I also made a video a while back explaining how LLMs work in more

00:06:07detail. So check that video out if you're interested. So essentially Diffusion GEMMA doesn't throw out

00:06:13the scores. It keeps all of those confidence scores because every canvas position needs its own prediction.

00:06:19And secondly, this denoiser doesn't use causal attention, which is the rule that a word can only

00:06:25look backward, which is how auto-aggressive models work. So instead, it swaps it with a bi-directional

00:06:31attention. So now every token can see every other token in all directions. So for every position,

00:06:38you apply those confidence scores, look at other tokens, and clean out the canvas slowly, step by step.

00:06:44And this is how Diffusion GEMMA is able to achieve its incredible speed of 1000 plus tokens per second

00:06:51on an H100 GPU. Now I have to be straight with you. This isn't a silver bullet. With these new tactics,

00:06:58Diffusion GEMMA is basically trading quality for speed. For maximum quality work, standard GEMMA 4 is

00:07:05still a better pick. This model is built specifically for critical local stuff like inline editing or code

00:07:13filling or rapid iteration. And it is especially strong at non-linear tasks like filling in the middle

00:07:19of a code block or even solving a Sudoku puzzle, which normal left to right models are genuinely quite bad

00:07:26at. So all of that sounds fascinating, but let's test it out for ourselves and see how it works in action.

00:07:33So Google has open sourced the weights under Apache 2.0 license on Hugging Face.

00:07:38So if you have a beefy GPU like an RTX 5090, you could try to run it locally. And there's also a

00:07:44special recipe for VLLM you can run on Docker to streamline that process. But I am really curious to

00:07:51see if this model can really reach 1000 plus tokens per second. So for this test, I will actually try to

00:07:58run it on an H100 GPU using a run pod container and see how it goes. And by the way, I have also

00:08:04published a Diffusion GEMMA template for running it on run pod. So if you want to replicate this test,

00:08:10all you have to do is run that template when creating a new pod. So to do this test on run pod,

00:08:15I'm going to choose the H100 container. And as I mentioned before, I created a Diffusion GEMMA

00:08:22template you can reuse. So you can just click that we click on a volume disk and then just click deploy

00:08:28on demand. And it will take a few minutes until it downloads the container and launches everything.

00:08:34And if we go to the logs, if you see application startup complete, that means that VLLM is ready

00:08:40and it is now accessible through port 8000. If we open this, you will see detail not found,

00:08:46but don't worry about it. This means it is actually working. We just need to copy this URL. So to

00:08:52configure Diffusion GEMMA to run in an AI agent terminal, something like open code, you need to

00:08:58configure your open code settings to access the remote server. So you can do that with this simple

00:09:04command and this will open up the config file. And in here, I'm just specifying our run pod server and

00:09:11it has the Diffusion GEMMA model selected. And you can just save this file and fire up open code.

00:09:17So in this test, I'm going to prompt it to generate a personal finance tracking dashboard called ledger.

00:09:24And let's see how fast it can generate that. Look at that. Instantly, it starts streaming right away.

00:09:34Look how blazingly fast that is. Holy moly. Wow. That is insane. And here in the logs,

00:09:43we can see that it's averaging 700 tokens per second. So for the output phase, it dropped a bit,

00:09:50but during the reasoning phase, it did go up to 700 tokens per second. That is insane. So let's

00:09:58open it up. Okay. So this looks like a dashboard. That's nice. Okay. We actually get some categories

00:10:06and stuff going on here. If we add something over here. Oh, it actually adds it as an expense. So the

00:10:13expenses are not actually updating. So it's not fully functional, but at least some parts are interactive.

00:10:20For this next task, let's see if it can actually make an arcade style game.

00:10:26So let's fire it up. Once again, the speed is just insane. Okay. This one is taking a bit longer.

00:10:36We actually got two files here. Interesting, interesting. Okay. So it noticed a typo and then it

00:10:44reprocessed the HTML file again, which is pretty good. Okay. All right. Let's open up this one. Restart.

00:10:52Oh, wow. This one is it's working. Oh, wow. This is cool. Wow. Very nice. That is impressive. So the game is

00:11:03fully functional and it took 14 seconds to generate this game. 14 seconds to generate a game like this.

00:11:11So although their marketing page said that we could expect a thousand token per second speeds on the H

00:11:18100. That was not my observation. Um, I don't know. Maybe there's something that I should tweak in the

00:11:26template or in my prompts, but nonetheless, I am truly impressed. It is a beast. So there you have it,

00:11:33folks. That is diffusion Gemma in a nutshell. I think this one is one of the most interesting releases

00:11:38of the year because it proves you can take a totally different generation paradigm from the image world,

00:11:44slap it onto an existing model you already trained and unlock real speed gains for single local user

00:11:51setups. And I think this also opens the door for a whole new family of fast interactive local models

00:11:58that utilizes the full potential of your hardware instead of leaving it idle. So what do you think

00:12:04about diffusion Gemma? Have you tried it? Will you use it? Let us know in the comment section down below.

00:12:09And folks, if you like these types of technical breakdowns, please let me know by smashing that

00:12:14like button underneath the video. And also don't forget to subscribe to our channel. This has been

00:12:19Andrus from Betterstack and I will see you in the next videos.

Key Takeaway

By applying diffusion-based parallel generation instead of traditional auto-regressive methods, Diffusion Gemma achieves high-speed token generation suitable for local interactive tasks.

Highlights

Diffusion Gemma achieves speeds of up to 700 tokens per second on an H100 GPU by generating text in parallel rather than sequentially.
Unlike standard auto-regressive models that generate tokens one by one, Diffusion Gemma uses a diffusion-based approach to fill a 256-token canvas simultaneously.
The model utilizes uniform state diffusion, which treats randomly swapped words as noise and refines them through multiple passes to improve accuracy.
Diffusion Gemma employs bi-directional attention, allowing every token to observe every other token in all directions during the generation process.
The model is optimized for local, non-linear tasks such as inline code filling and iterative editing, rather than for generating maximum-quality, long-form content.

Timeline

Limitations of Auto-regressive Models

Traditional language models function auto-aggressively by generating one token at a time from left to right.
Commercial servers achieve efficiency by batching 256 users together to share memory loading costs.
Local execution for a single user causes GPUs to become memory-bound, resulting in significant idle time between token generation.

Standard large language models require loading massive weight sets into memory to predict the next word. When serving only one user, the GPU spends most of its operational life waiting for the next instruction rather than performing computation. This creates a bottleneck where the hardware's processing potential is underutilized.

The Diffusion Generation Paradigm

Diffusion Gemma generates 256 tokens simultaneously by starting with a canvas of random noise.
The model uses multiple passes over the canvas, utilizing previous token guesses as context clues to iteratively correct the text.
Uniform state diffusion allows the model to replace words and self-correct throughout the generation process instead of locking in tokens prematurely.

To eliminate idle time, Google DeepMind adapted diffusion techniques from image generation to text. By predicting 256 tokens in one shot, the GPU is kept busy, flipping the model from memory-bound to compute-bound. The process involves identifying and removing 'noise'—which, in this context, are randomly swapped words—until the output achieves coherence.

Architecture and Implementation

The model uses an encoder-denoiser architecture built on the 26 billion parameter Gemma 4 model.
Bi-directional attention replaces causal attention, enabling tokens to consider information from all directions.
Confidence scores for all canvas positions are retained throughout the process to guide the iterative refinement of the output.

The system switches between encoder mode to parse prompts and denoiser mode to refine the canvas. Unlike typical LLMs that discard all but the final confidence score, this model keeps all scores to inform the denoising process. This architectural change is what allows for the high-speed output observed on hardware like the H100.

Performance and Practical Testing

Diffusion Gemma prioritizes speed over maximum generation quality compared to standard models.
Practical tests show the model excels at interactive, non-linear tasks like code filling and building functional application prototypes.
The model is accessible via open-source weights under an Apache 2.0 license on Hugging Face.

While the model does not always reach the theoretical 1,000 token per second threshold in every application, it demonstrates impressive speed during reasoning-heavy tasks. A test generating a functional arcade-style game completed in 14 seconds, confirming its utility for rapid iteration. The technology proves that novel generation paradigms can effectively utilize local hardware resources.

Community Posts

No posts yet. Be the first to write about this video!

Write about this video