Did Google Just Make The ULTIMATE Edge AI Model? (Gemma 4)

Better Stack

Transcript

00:00:00Last week, Google did something unexpected.
00:00:02They released a truly open-source model under Apache 2.0 license.
00:00:08It's called Gemma 4 and it features specialized edge versions as small as 2.3 billion parameters
00:00:14that are designed to run entirely offline on devices like your iPhone, Android flagship
00:00:21phones, or even on a Raspberry Pi.
00:00:23It seems like the race to build the ultimate small model is really heating up.
00:00:28Just a few weeks ago I did some tests on Qwen 3.5 to see how it was pushing the limits of
00:00:33local AI, but now Google is promising even higher intelligence density.
00:00:39So in this video, we're gonna perform similar tests on Gemma 4 to see if this model is truly
00:00:44the best small model out there.
00:00:47It's gonna be a lot of fun, so let's dive into it.
00:00:53So what's so unique about these new Gemma 4 models?
00:00:57Well, the real technical shift here is something Google calls per-layer embeddings.
00:01:03In traditional transformers, a token gets one embedding at the start that has to carry
00:01:08all its meaning through every layer.
00:01:11But in Gemma 4, each layer has its own set of embeddings, allowing the model to introduce
00:01:16new information exactly where it's needed.
00:01:19This is why you see the E in the E2B and E4B model names.
00:01:24It stands for effective parameters.
00:01:27While the model acts with the reasoning depth of a 5 billion parameter model, it only uses
00:01:32about 2.3 billion active parameters during inference.
00:01:36This results in a much higher intelligence density, allowing it to handle complex logic
00:01:42while using less than 1.5 gigabytes of RAM.
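To make the per-layer embedding idea concrete, here's a minimal, purely illustrative sketch in a toy transformer. None of the names, dimensions, or the tanh "block" come from Gemma's actual implementation; it only shows the structural difference between one input-time embedding lookup and a per-layer lookup that re-injects token information deeper in the stack.

```python
import math
import random

random.seed(0)
VOCAB, DIM, N_LAYERS = 50, 8, 4

def make_table(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Traditional transformer: a single embedding table, looked up once at the input.
input_emb = make_table(VOCAB, DIM)

# Per-layer embeddings (illustrative): every layer owns its own table and can
# re-inject token-specific information partway through the stack.
layer_embs = [make_table(VOCAB, DIM) for _ in range(N_LAYERS)]

def toy_block(vec):
    # Stand-in for a real attention + MLP block.
    return [math.tanh(x) for x in vec]

def forward(token_ids):
    hidden = [list(input_emb[t]) for t in token_ids]
    for layer in range(N_LAYERS):
        for i, t in enumerate(token_ids):
            mixed = toy_block(hidden[i])
            # New information introduced exactly at the layer that needs it:
            hidden[i] = [m + e for m, e in zip(mixed, layer_embs[layer][t])]
    return hidden

out = forward([3, 7, 42])
print(len(out), len(out[0]))  # 3 tokens, 8 dimensions each
```

One plausible upside of splitting embeddings this way is that the per-layer tables only need to be resident while their layer runs, which is one way a design like this could keep peak RAM low; that rationale is an inference here, not a detail from the video.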
00:01:46And beyond the text performance, Gemma 4 is natively multimodal.
00:01:50This means vision, text, and even audio are processed within the same unified architecture
00:01:56rather than being bolted on as separate modules.
00:01:59This architecture enables a new thinking mode that uses an internal reasoning chain to verify
00:02:05its own logic before giving you an answer.
00:02:08This is specifically designed to prevent the infinite loops and logic errors that often
00:02:13plague small models.
00:02:15It also ships with 128K context window and support for over 140 languages, which should
00:02:22make it significantly more capable at tasks like complex OCR or localized language identification.
00:02:29And to showcase these abilities, Google released some eye-opening benchmarks.
00:02:34In their internal tests, the E4B model achieved a score of 42.5% on the AIME 2026 mathematics
00:02:43benchmark, which is more than double the score of much larger previous generation models.
00:02:49They also demonstrated the model's agentic potential on the τ²-bench, where it showed
00:02:54a massive jump in tool use accuracy.
00:02:57They also showcased this agentic potential through a feature called agent skills.
00:03:02Instead of just generating static text, the model was shown using native function calling
00:03:07to handle multi-step workflows like querying Wikipedia for live data or building an end-to-end
00:03:13animal calls widget.
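As a sketch of what native function calling looks like from the host side, here's a minimal, hypothetical dispatch loop: the model emits a structured tool call as JSON, and the host parses it and runs the matching function. The tool name, argument schema, and stubbed Wikipedia lookup are all made up for illustration; a real integration would follow the inference framework's actual tool-calling API.

```python
import json

# Tools the "agent" may call; names and signatures are illustrative only.
def query_wikipedia(topic: str) -> str:
    # In a real workflow this would hit the Wikipedia API; stubbed here.
    return f"Summary of '{topic}' (stub)"

TOOLS = {"query_wikipedia": query_wikipedia}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching tool."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# A function-calling model would emit something shaped like this:
raw = '{"name": "query_wikipedia", "arguments": {"topic": "Corgi"}}'
print(dispatch(raw))  # Summary of 'Corgi' (stub)
```

The point of the pattern is that the model never executes anything itself; the host stays in control of which functions exist and how their results are fed back into the next turn.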
00:03:15Now all of that sounds impressive, but let's try it on our own and see how it works.
00:03:20In my previous Qwen 3.5 video, I tested the small models by running them locally without
00:03:25an internet connection using LM Studio and Cline.
00:03:28I will use the same setup for testing Gemma 4.
00:03:32First we have to download the models in LM Studio, then increase the available context window
00:03:37and start the server.
00:03:39We can then jump into Cline and hook up our local LM Studio server, choose the E2B model,
00:03:45turn off our internet connection and begin our tests.
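For anyone scripting against the same setup: LM Studio exposes an OpenAI-compatible HTTP API, by default on port 1234. The sketch below only builds the request rather than sending it, since it needs the server running; the model identifier is a placeholder for whatever name LM Studio shows for the Gemma build you downloaded.

```python
import json
import urllib.request

# LM Studio's local server speaks the OpenAI chat-completions format.
URL = "http://localhost:1234/v1/chat/completions"

payload = {
    "model": "gemma-4-e2b",  # hypothetical identifier; use the name LM Studio shows
    "messages": [{"role": "user", "content": "Build a simple cafe website."}],
    "temperature": 0.7,
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request) would return the completion once the
# LM Studio server is started; here we only construct the POST request.
print(request.get_method(), request.full_url)
```

Cline talks to the same endpoint under the hood, which is why pointing it at the local server is all the configuration the test needs.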
00:03:49Last time we saw that Qwen 3.5 was quite decent at generating a simple café website using HTML,
00:03:55CSS and JavaScript with two of their smallest parameter models.
00:04:00Let's reuse the same prompt and see if Gemma 4 is just as good at this coding task.
00:04:05So it took the E2B model roughly 1.5 minutes to complete this task.
00:04:10And for a model with 2.3 billion active parameters, the results were honestly a bit underwhelming
00:04:16compared to Qwen's output, which used only 0.8 billion parameters.
00:04:22The most annoying thing was that Gemma appended the task list at the end of the HTML file as
00:04:28well as at the end of the CSS file so I had to manually delete it from both files before
00:04:33opening the page.
00:04:34And it also claimed it had written a JavaScript file, when in fact there was no JS file produced
00:04:40in the final output, so the E2B test results were a bit disappointing.
00:04:45But this situation did improve quite a lot when switching to the E4B model version.
00:04:50It took this version roughly 3.5 minutes to finish the task, but the end result was notably
00:04:55better.
00:04:56Maybe not in terms of design, it still looks very bland, but this version actually had
00:05:00working cart functionality, which none of the previous tests, for both Qwen and Gemma, were
00:05:06able to produce successfully.
00:05:08So the E4B version is already a big step up from the E2B version, but obviously no one
00:05:15would seriously consider using such small models for complex or serious coding.
00:05:20I just conducted these tests out of curiosity to see if such a small parameter count can
00:05:25still produce a meaningful result for a given coding task.
00:05:29Alright, now let's see how Gemma 4 performs on edge devices like an iPhone.
00:05:34So in my Qwen 3.5 video, I built a custom iOS app that was capable of running the model
00:05:40on the native Metal GPU using the MLX Swift framework.
00:05:44Although Gemma 4 is open source, there are unfortunately no MLX bindings available for this
00:05:49model as of now that could run it on iOS with multimodal capabilities.
00:05:56And Google themselves are running Gemma 4 in their AI Edge Gallery app using their own
00:06:01inference framework, called LiteRT-LM, which sadly also doesn't offer iOS bindings at
00:06:07the moment.
00:06:08So to try it out on an iPhone, our best option right now is to use their Edge Gallery app.
00:06:13So we're going to conduct our tests on their own app and see how it performs.
00:06:18So let's go to the AI chat section.
00:06:20And here we will be prompted to download the E2B version of Gemma 4.
00:06:25And you also have the option to download the E4B version, but for some reason the app says
00:06:29I don't have sufficient space to download it, which I'm sure is not true, so maybe that's
00:06:34a bug in the app.
00:06:36But anyway, now that I've downloaded the model, we can finally start using it.
00:06:41And let's start by typing a simple hello.
00:06:43Wow, did you see how fast the response was?
00:06:46A lot faster than Qwen 3.5.
00:06:48Maybe this is the magic of the LiteRT-LM framework they're using.
00:06:53So now let's try the famous car wash test and see if Gemma gets it right.
00:06:57Wow, it gives me a really long response.
00:07:00And at the end of it, we see that the final recommendation is to drive, which is correct,
00:07:06but I do have to take into account that it's reasoning from convenience and comfort
00:07:10rather than the actual logic of the puzzle.
00:07:13So I don't know, it kind of passes the test, but it kind of doesn't at the same time.
00:07:18All right, now let's hop over to the Ask Image section and see if Gemma can identify
00:07:24the dog in this picture.
00:07:26So it did identify that it is indeed a dog and it gives some other details about the image.
00:07:31So that's pretty cool.
00:07:32But if I ask it, what's the breed of the dog?
00:07:35It replies saying that it's a Border Collie, which is not true.
00:07:39It is actually a Corgi.
00:07:40But I do have to say, for just over 2 billion active parameters, this response is pretty
00:07:45good nonetheless.
00:07:46Lastly, let's try the OCR test.
00:07:48So if you watched my previous video on Qwen 3.5, you will recall that I tested it with
00:07:54an image that had text in it, which was in Latvian, which is also my native language.
00:07:59Now Gemma is touted as understanding over 140 languages.
00:08:05So I assume it should pass this test easily.
00:08:08And yes, indeed, it does identify that the language is Latvian.
00:08:13And I'm surprised that most of the text is actually pretty spot on.
00:08:16With some minor exceptions, I see that some words are nonexistent and some of the grammatical
00:08:22structures are just very bizarre.
00:08:24But it's still very impressive.
00:08:26So I'll give this test a pass.
00:08:28Now, this actually raises the question: can I chat with this model in Latvian?
00:08:32So let me try that next.
00:08:33So I see that the response is actually in Latvian.
00:08:36But once again, the grammatical structures are very bizarre.
00:08:39And nobody talks like that.
00:08:41But still, Latvian is a very small language.
00:08:44So this is already impressive that it has all that knowledge in such a small model.
00:08:48And while I'm at it, I'm going to ask it who the current US president is, to see what
00:08:53the knowledge cutoff of Gemma 4 is.
00:08:56And it replies that it is Joe Biden.
00:08:58And then if I actually ask, what is your knowledge cutoff?
00:09:02It will tell me that it's January 2025, which checks out.
00:09:06So there you have it.
00:09:10That is Gemma 4, the newest open-source model by Google.
00:09:10And I got to be honest, this model does seem pretty good.
00:09:14It does what it advertises, although it lacks some creativity in web design.
00:09:19But other than that, the small models, as we just saw, are more than capable of successfully
00:09:24completing all the tasks I gave them.
00:09:27It's a shame we still don't have the MLX bindings for this model, because I would really love
00:09:32to use Gemma 4 locally in a custom iOS app.
00:09:36But I'm sure that it won't take long for Google to get this release out to the public.
00:09:41And in the meantime, I'm keeping a close eye on community projects like SwiftLM, which are
00:09:46already working on unofficial native bindings for these models.
00:09:50So those are my two cents on the model.
00:09:54What do you think about Gemma 4?
00:09:54Have you tried it?
00:09:55Will you use it?
00:09:56Let us know in the comment section down below.
00:09:59And folks, if you like these types of technical breakdowns, please let me know by smashing
00:10:03that like button underneath the video.
00:10:05And also don't forget to subscribe to our channel.
00:10:07This has been Andres from Better Stack, and I will see you in the next video.

Key Takeaway

Gemma 4 achieves high intelligence density on edge devices by using per-layer embeddings and a native multimodal architecture, allowing a 2.3 billion parameter model to perform logic and OCR tasks locally with less than 1.5 GB of RAM.

Highlights

Gemma 4 is a truly open-source model released under the Apache 2.0 license, featuring edge-optimized versions as small as 2.3 billion parameters.

The architecture utilizes per-layer embeddings, allowing each layer to introduce new information rather than relying on a single initial token embedding.

The E2B model achieves reasoning depth comparable to a 5 billion parameter model while using only 2.3 billion active parameters and less than 1.5 gigabytes of RAM.

Internal benchmarks show the E4B model scoring 42.5% on the AIME 2026 mathematics test, doubling the performance of much larger previous generation models.

The model includes a 128K context window and native support for over 140 languages, enabling complex OCR and localized language tasks.

Native multimodality integrates vision, text, and audio into a unified architecture rather than using separate modular bolt-ons.

The knowledge cutoff for Gemma 4 is January 2025, and it correctly identifies current political figures like the US President as of that date.

Timeline

Technical Architecture and Per-Layer Embeddings

  • Gemma 4 uses per-layer embeddings to introduce information at specific points in the transformer stack.
  • The model achieves the reasoning depth of a 5 billion parameter model while activating only 2.3 billion parameters during inference.
  • System requirements for the smallest model remain under 1.5 gigabytes of RAM for complex logic tasks.

Standard transformers typically use a single embedding for a token at the start of the process. Gemma 4 differentiates itself by providing unique embeddings for every layer; the 'E' in the model names stands for 'effective parameters'. This design allows the model to run entirely offline on hardware as limited as a Raspberry Pi or flagship smartphones.

Multimodal Capabilities and Agentic Benchmarks

  • Native multimodality integrates vision, text, and audio into one unified architecture.
  • An internal reasoning chain verifies logic before providing answers to prevent infinite loops common in small models.
  • The E4B version doubles previous generation performance on the AIME 2026 mathematics benchmark with a 42.5% score.

The unified architecture avoids the performance loss associated with bolting separate vision or audio modules onto a text model. It supports agent skills and native function calling for multi-step workflows like querying live data from Wikipedia. These capabilities are paired with a 128K context window and support for 140 languages to improve OCR and translation accuracy.

Local Performance Testing and Coding Benchmarks

  • The E2B model completes a simple web development task in 1.5 minutes but leaves technical errors in the output.
  • The E4B model takes 3.5 minutes to generate a website but successfully implements a working shopping cart functionality.
  • Small parameter models frequently struggle with metadata, such as incorrectly appending task lists to the end of generated code files.

Tests conducted using LM Studio and Cline show that while the E2B model is fast, it can fail to produce required JavaScript files and may add irrelevant text to HTML and CSS files. The E4B model demonstrates a significant jump in capability by producing functional logic that smaller models like Qwen 0.8B could not achieve. These tests indicate that while capable of producing meaningful results, these models are not yet suitable for complex production-level coding.

Edge Device Testing on iPhone

  • Gemma 4 runs in the AI Edge Gallery app using the LiteRT-LM inference framework.
  • The E2B model identifies objects in images but can struggle with specific details like distinguishing a Corgi from a Border Collie.
  • The model identifies Latvian text correctly in OCR tests despite having some grammatical irregularities in its generated Latvian prose.

Inference speeds on the iPhone using LiteRT-LM are notably faster than competing small models. The model passes logical challenges like the car wash test and accurately identifies its own knowledge cutoff as January 2025. While official MLX bindings for iOS are currently missing, community projects like SwiftLM are developing unofficial native support for these edge-optimized models.
