Transcript
00:00:00Google just unveiled their newest Gemma 4 12 billion parameter model and this is a game changer.
00:00:06No, seriously, this is not clickbait. This model is in fact a game changer in the way it is built.
00:00:13The thing that separates this from all other AI models is the fact that it is entirely
00:00:18encoder free. Now, what does that mean and how does it work and why is this such a big deal?
00:00:24Well, those are all good questions that we will explore in today's video. It's going to be a lot
00:00:29of fun. So let's dive into it. So the Gemma 4 12 billion model has a new architecture that
00:00:39completely breaks away from how every other multimodal model works. Multimodal model. Oh my
00:00:46God, that's such a tongue twister. So to understand why this is such a big deal, we have
00:00:51to look at how every other multimodal model handles things right now. Language models are built to read
00:00:57tokens, basically chunks of text turned into numbers. They don't naturally know what a pixel is or what a
00:01:05sound wave looks like. So usually we tape different models together. If you give AI an image, a massive
00:01:11vision encoder intercepts it first. It spends tons of processing power translating those raw pixels into a
00:01:19language the LLM can actually understand. And same goes for audio. A separate speech encoder has to
00:01:25translate the sound waves first. By the time the actual brain of the AI gets the data, you're running three
00:01:32separate networks at the same time. On a standard laptop, this completely hogs your VRAM and slows
00:01:38everything down. But Google DeepMind looked at this issue and thought, what if we can just cut out the
00:01:44middleman? So in the Gemma 4 12 billion model, they completely deleted the heavy vision encoder. Instead,
00:01:51when you feed it an image, the model chops it into small 48 by 48 pixel patches. And instead of passing
00:01:58those patches through dozens of layers of a separate vision network, the raw pixels pass through a single
00:02:04thin mathematical step called linear projection. And this linear projection is just a massive grid of numbers
00:02:11that takes 2304 pixel values, because that corresponds to a 48 by 48 pixel square, multiplies them in a
00:02:19single step, and stretches them out into a single row that perfectly matches the LLM's text token
00:02:26format. So it doesn't just analyze what's in the image yet, it just reformats the raw data so it can fit
00:02:32through the model. And if you look at standard models, their vision encoders are massive. Like for example,
00:02:38this one has 550 million parameters. That's because a traditional encoder needs a lot of data to reshape,
00:02:45map, and understand the image. It has dozens of internal attention layers calculating relationships
00:02:50between pixels, trying to figure out where the edges are, what the shapes are, and what the objects might be
00:02:57before it even hands it to the text model. But DeepMind shrunk it by completely deleting all of that heavy
00:03:04brain power. They realized that the main language backbone is already incredibly smart and has plenty
00:03:10of layers to do the actual visual reasoning. So by removing all those thinking layers, they were left with
00:03:17just 35 million parameters, and that is literally just the raw physical count of connection weights needed
00:03:24to map those pixel grids into a text format. So it's a static single layer map that works for every image.
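A minimal sketch of the patch-and-project step described above, in plain Python. The 48 by 48 patch size and the 2304 values per patch come from the video; everything else is an assumption for illustration: the toy hidden width of 8, the random weights, the fake 96 by 144 image, and the helper names split_into_patches and linear_projection. The image is treated as single-channel so that each patch carries exactly 2304 values, matching the figure quoted in the video; a color image would carry more values per patch. This is not Google's implementation, only one way the idea could look.

```python
import random

PATCH_SIZE = 48                          # patch edge length quoted in the video
PATCH_VALUES = PATCH_SIZE * PATCH_SIZE   # 2304 values per patch, as quoted
HIDDEN_DIM = 8                           # hypothetical toy width; the real hidden dimension is not stated


def split_into_patches(image, patch_size=PATCH_SIZE):
    """Cut a 2D grid of pixel values into flattened patch_size x patch_size patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches


def linear_projection(patch, weights):
    """One matrix multiply: a flat patch becomes a single row of len(weights) numbers."""
    return [sum(p * w for p, w in zip(patch, column)) for column in weights]


rng = random.Random(0)

# weights[j] holds the 2304 weights feeding output dimension j (random stand-ins, not trained values)
weights = [[rng.uniform(-0.01, 0.01) for _ in range(PATCH_VALUES)]
           for _ in range(HIDDEN_DIM)]

# A fake 96 x 144 single-channel image filled with random intensities
image = [[rng.random() for _ in range(144)] for _ in range(96)]

patches = split_into_patches(image)
tokens = [linear_projection(p, weights) for p in patches]

print(len(patches), len(patches[0]), len(tokens[0]))
```

Each row in tokens has the same width as the (toy) hidden dimension, which is the only thing the projection is responsible for: no attention, no feature extraction, just a change of shape.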
00:03:30Because it does zero internal thinking, it takes up practically no processing power, freeing up the VRAM
00:03:37and letting the main LLM handle the actual intelligence natively. And to understand how that single step works,
00:03:44you have to look at what's actually happening inside a language model backbone. Every language model has an
00:03:50internal formatting rule called a hidden dimension. Think of it like a standardized tray size. Whether it's
00:03:56the word apple or a piece of code or a punctuation mark, everything that gets fed into the LLM must be converted
00:04:04into this specific massive list of numbers because it has to match the dimensions of the matrices. And this raw
00:04:1148 by 48 pixel patch is just a grid of 2304 individual color numbers. If you try to feed that raw chunk
00:04:19directly into the LLM, the model will reject it because the dimensions don't actually match. And that is
00:04:26exactly why that 35 million parameter mapping layer exists. It is literally a single massive grid of
00:04:33connection weights that multiplies those 2304 pixel values and stretches them out into a single row that
00:04:40perfectly matches the LLM's text token format. It does zero analytical thinking, it just acts as a format
00:04:48converter so the data can slide right into the main transformer where the actual visual reasoning happens
00:04:54natively. And the model does something similar for audio reasoning as well, but for audio it's even simpler.
00:05:01So the way they managed to get rid of the audio encoder is by taking a raw 16 kilohertz audio signal and
00:05:07slicing it into continuous 40 millisecond frames. Each little frame contains exactly 640 floating point
00:05:15numbers describing the sound wave. The model takes those 640 floats and runs them through a similar
00:05:21simple projection layer that maps them straight into the language model's input space. To the transformer
00:05:28backbone, a 40 millisecond audio block looks identical to a continuous stream of text tokens. Because sound
00:05:35is already a chronological sequence, just like a sentence is a sequence of words, the LLM treats audio
00:05:42exactly like text. So this deep native integration lets the 12 billion parameter model handle live transcription,
00:05:49translation and text formatting in one single forward pass without forcing you to load separate speech
00:05:56networks into your memory. So this clever tactic is a massive win for running models locally on your own
00:06:02hardware. By stripping away all the encoder bloat, DeepMind managed to pack incredible reasoning
00:06:08power into a tiny footprint. And looking at the benchmark, it gets close to the performance of their massive 26
00:06:15billion parameter models, but it easily fits on a standard laptop with 16 gigabytes of VRAM
00:06:21or more. Plus Google included native multi-token prediction drafters right out of the box, meaning it predicts
00:06:28multiple tokens at a time for fast local inference speeds without forcing you to compress the model.
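Going back to the audio path for a moment, the frame arithmetic from earlier can be checked with a short sketch. The 16 kilohertz sample rate, the 40 millisecond frame length, and the 640 floats per frame come from the video. The rest is assumed for illustration: frames are taken as back-to-back and non-overlapping, the hidden width of 8 and the random weights are toy stand-ins, the test signal is a synthetic sine wave, and the helper names frame_audio and project_frame are hypothetical.

```python
import math
import random

SAMPLE_RATE_HZ = 16_000                               # quoted in the video
FRAME_MS = 40                                         # quoted in the video
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000     # 16000 * 0.040 = 640 samples per frame
HIDDEN_DIM = 8                                        # hypothetical toy width


def frame_audio(samples, frame_samples=FRAME_SAMPLES):
    """Slice a waveform into back-to-back frames, dropping any trailing partial frame."""
    usable = len(samples) - len(samples) % frame_samples
    return [samples[i:i + frame_samples] for i in range(0, usable, frame_samples)]


def project_frame(frame, weights):
    """Map one frame of raw floats to a single row of len(weights) numbers."""
    return [sum(s * w for s, w in zip(frame, column)) for column in weights]


rng = random.Random(0)

# Random stand-in weights: one list of 640 weights per output dimension
weights = [[rng.uniform(-0.01, 0.01) for _ in range(FRAME_SAMPLES)]
           for _ in range(HIDDEN_DIM)]

# One second of a synthetic 440 Hz tone sampled at 16 kHz
samples = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
           for n in range(SAMPLE_RATE_HZ)]

frames = frame_audio(samples)
audio_tokens = [project_frame(f, weights) for f in frames]

print(FRAME_SAMPLES, len(frames), len(audio_tokens[0]))
```

Under these assumptions one second of audio becomes 25 rows, one per 40 millisecond frame, each with the same width as a text token embedding.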
00:06:34So all of that sounds impressive. So now let's test it out and see how it works on my local M2 MacBook Pro.
00:06:41And some of the people in my previous OMLX video were asking how much VRAM I actually have on my
00:06:48machine? So to answer that question, I have 24 gigabytes of VRAM. So that's what we're working with
00:06:53today. I also have to say this edge gallery application is so buggy. Like for example, if I try to add an
00:07:01image and ask, please analyze this image, it will instantly fail and give me this random error. And this
00:07:13is on the latest version. So unfortunately we couldn't test the vision capabilities using the official AI edge
00:07:20gallery application, but there is another way we can actually test it out. Okay. So since I couldn't
00:07:26reliably test out the image processing with the Gemma 4 12 billion model on Google AI edge gallery
00:07:34application, I decided to test it out on OMLX. And I also did a video about OMLX. It's an incredible
00:07:42framework for running AI models locally, specifically on Apple Silicon. And as you can see here, I have
00:07:47downloaded the eight bit quantized version of this model. So now I'm going to go to the chat section
00:07:54and let's see how fast it can actually do image reasoning in real time. So here I have a test folder
00:08:01with two images. One of them is just a screenshot of airport departures. So we'll use this image
00:08:09and ask what do you see in this image. And I want you to note that I'm not speeding up this video.
00:08:18This is all real time. I want you to pay attention to how fast it is able to do reasoning
00:08:24on such an image. So it is starting here, it's loading up the model, generating and boom, look at that.
00:08:33Look how fast it is able to parse through this picture and extract valuable information from it.
00:08:41First time I saw this on OMLX, I was genuinely blown away by the speed of it. It is absolutely insane.
00:08:50So I do have to say this is the best model that I tested out locally for image reasoning. And I also
00:08:57want you to pay attention to the fact that I'm running this model offline. I don't have my Wi-Fi turned on.
00:09:03So now let's try another example. This one is just a blurry image of the TV show Vikings showing some
00:09:10characters. So once again, let's open up this image and ask the same thing. What do you see in this
00:09:21image? It's generating.
00:09:27And boom, look at that.
00:09:30I mean, that is just insane. This is so quick. I was so surprised.
00:09:37So yeah, I am honestly very, very impressed with the image processing performance of this new model.
00:09:43So there you have it, folks. That is the new encoder free Gemma 4 12 billion model in a nutshell.
00:09:50I was quite frustrated that I couldn't confidently test it out in their official AI edge gallery
00:09:56application. But as we saw, there are alternative and maybe even better ways to run it
00:10:01locally. So I do think that this is a very nice model and it completely changes the future of running
00:10:07local AI models. Google DeepMind just basically proved that a single language backbone is smart enough
00:10:13to handle vision and sound natively. So this new technique will probably open doors to develop even
00:10:19more efficient multimodal reasoning models that can easily run on edge devices. So what do you think
00:10:26about the new Gemma model? Have you tried it? Will you use it? Let us know in the comments section down
00:10:32below. And folks, if you like these types of technical breakdowns, please let me know by smashing that like
00:10:37button underneath the video. And also don't forget to subscribe to our channel. This has been Andres
00:10:43from BetterStack and I will see you in the next videos.