Qwen 3.5 Small Models Are INCREDIBLE! (Testing 0.8B & 2B On Edge Devices)

Better Stack

Transcript

00:00:00 The internet is losing its mind right now, and this time it's over Qwen 3.5,
00:00:05 specifically their small model series. Alibaba just released native multimodal
00:00:10 versions of Qwen 3.5 which are as small as 2 billion and even 0.8 billion parameters.
00:00:17 They outperform some models 4 times their size in reasoning and vision.
00:00:22 And they are so tiny that we can now run them locally on 6-year-old laptops and smartphones
00:00:28 with no internet connection. In this video, we're gonna take a look specifically at Qwen 3.5's new
00:00:34 small series models like the 0.8 billion and 2 billion. We're also gonna test them out on an
00:00:40 M2 MacBook Pro as well as on an iPhone 14 Pro and find out how powerful they actually are.
00:00:48 It's gonna be a lot of fun, so let's dive into it.
00:00:55 So why is everyone obsessed with these new Qwen 3.5 models? After all, we've had small models for
00:01:01 a while now. I even covered IBM's Granite 4.0 Nano models in a previous video, and their model
00:01:08 was just 300 million parameters in size. So what makes these small Qwen models so different?
00:01:14 Well, it's all about something called intelligence density. You see, for a long time the rule was if
00:01:20 you want a model that can see, reason, and code, it has to be huge. But these new Qwen 3.5 small models
00:01:27 prove that that doesn't need to be the case. They somehow managed to compress their big models into
00:01:33 tinier versions that still support a unified multimodal architecture. That means their
00:01:39 0.8 billion model doesn't just answer text; it also has vision and coding abilities baked into it.
00:01:46 Let's look at their benchmarks real quick, because they are quite interesting. On the MMLU benchmark,
00:01:51 which measures general knowledge and reasoning, the 2 billion model achieves a score of 66.5,
00:01:57 while the 0.8 billion model reaches 42.3, which might not sound too impressive. But keep in mind
00:02:04 that for context, the original Llama 2 with 7 billion parameters, which came out back in 2023,
00:02:11 scored 45.3 on the same benchmark. This just goes to show how much we've managed to shrink
00:02:17 the parameter size and still maintain a decent comprehension score. But check this out: their
00:02:23 real standout is their multimodal performance. In specialized vision tests like OCRBench,
00:02:29 the 2 billion model scores 85.4 and the 0.8 billion hits 79.1, indicating that they are
00:02:37 highly capable at tasks like reading complex documents and analyzing images with text.
00:02:43 Oh, and they both support a massive 262K context window, so you can feed them entire PDFs or use
00:02:51 them to analyze large codebases. That is kind of impressive. But now, let's look at how they
00:02:56 actually perform. Since both the 0.8 billion and 2 billion models can run locally on almost any
00:03:02 modern laptop, I'm going to conduct these tests in full airplane mode with no internet connection
00:03:08 whatsoever on my local laptop. For the first test, we'll spin up a local server in LM Studio
00:03:14 and hook it up to Cline in VS Code to see if these tiny models can actually handle a real-world coding
00:03:21 task. So first you have to go to the models tab and download the GGUF versions of the 0.8 billion and
00:03:28 the 2 billion parameter models. And since we'll be using these models for coding tasks, we will also
00:03:33 need to increase the available context length quite a bit. And once we've done that, we can go ahead
00:03:38 and start the server. Now let's jump into Cline. First of all, as I mentioned, I will turn off
00:03:43 my Wi-Fi so we can conduct these tests completely offline. Then, in Cline's API configuration
00:03:50 section, I will make sure to point it to our custom LM Studio server URL. And let's also make sure that
00:03:56 we choose the 0.8 billion model. For the prompt, I will ask the model to build a simple
00:04:01 company website for a small cafe. I also noticed that if we don't specify any particular framework
00:04:07 and we let Qwen choose on its own, it will choose to install React, which will not work for our demo
00:04:14 in offline mode. So I modified the prompt a bit to specifically ask for HTML, CSS, and JavaScript
00:04:20 without any external libraries. So let's run the test. It took the model roughly one minute to
00:04:25 finish this task. And here's our final result. As you can see, the site is very bland, the design is
00:04:32 not very aesthetically pleasing, and the text is very dark. I also noticed that in the CSS, the
00:04:37 model tried to hard-code specific images from Unsplash that would fit our theme. So that's an
00:04:43 interesting observation. And if we turn Wi-Fi back on for a moment, we can see that one of those
00:04:48 images actually loads up. And it appears to be an image of a doctor holding a phone. So that's pretty
00:04:54 random. But the other images contain invalid URLs. I also tried to prompt the model again to fix
00:05:00 the broken text and improve other areas, but it could not reliably do so. So overall, I would
00:05:06 say that although this model is capable of coding and tool calling, I don't think it's actually a
00:05:12 good idea to use it in real-world scenarios, because the parameter count is just too low. But
00:05:17 now let's test out the 2 billion parameter model with the same prompt and see how well it does.
00:05:23 This model actually gave me a lot of headaches, because very often it would get stuck in a loop,
00:05:28 writing the same section again and again, so I had to stop the task and restart it. I'm not sure
00:05:34 if this is a problem with the model itself, or the way LM Studio runs the server, or the way Cline
00:05:40 processes the prompt. But with this specific configuration, this was an ongoing struggle
00:05:45 for me. Another thing I noticed is that while the 0.8 billion parameter model went straight into
00:05:51 coding, the 2 billion parameter version preferred structuring a plan first and then proceeding with
00:05:57 the actual coding. The 2 billion parameter model finished this task in roughly three minutes,
00:06:02 so considerably longer. Let's see what the final result is. As we can see, it's already
00:06:08 a step up, because the design looks a lot cleaner and it uses a brownish theme, which is closer to
00:06:14 what a coffee shop's visual identity would be. Another thing I noticed is that if we turn on Wi-Fi,
00:06:20 it actually loads up some external icons, which makes the whole site look even better.
00:06:24 This version also tried to implement the cart functionality that I initially asked for,
00:06:29 because we now get this nice cart sidebar, although I don't see an add-to-cart button on the item
00:06:35 cards. And when I tried to prompt it to fix these issues, once again I ran into the same technical
00:06:41 issue where the model went into an infinite loop. So I figured this just might be an issue with
00:06:46 LM Studio in conjunction with Cline, or something of that sort. But let's be honest: obviously,
00:06:51 no one would seriously consider using such small models for complex and serious coding.
00:06:56 I just conducted these tests out of curiosity, to see if such a small parameter count can still
00:07:02 produce a meaningful result for a given coding task. So now let's do something more exciting.
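As an aside for readers who want to reproduce the offline setup without Cline: LM Studio's local server exposes an OpenAI-compatible chat completions endpoint, so the same prompt can be scripted directly. A minimal sketch, assuming LM Studio's default port 1234 and a placeholder model identifier (use whatever ID LM Studio shows for your downloaded GGUF):

```python
import json
from urllib import request

# LM Studio's default local server address; adjust if you changed the port.
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 2048) -> dict:
    """Build an OpenAI-style chat payload for the local server."""
    return {
        "model": model,
        "messages": [
            # Mirror the video's constraint: plain HTML/CSS/JS, no external libraries.
            {"role": "system",
             "content": "Use only HTML, CSS, and JavaScript; no external libraries."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }

def ask_local_model(model: str, prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    payload = build_chat_request(model, prompt)
    req = request.Request(
        LMSTUDIO_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the LM Studio server to be running):
# print(ask_local_model("qwen3.5-2b", "Build a simple cafe landing page."))
```

Because the endpoint follows the OpenAI wire format, the same sketch works with any client library that lets you override the base URL.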
00:07:07 Let's try to run these models on an iPhone 14 Pro. To do this, I built a native iOS app using
00:07:14 Swift and the MLX Swift framework. MLX is Apple's open-source library that allows you to run
00:07:22 models directly on Apple silicon's unified memory architecture. By leveraging the Metal GPU, we can
00:07:29 get these Qwen models running with hardware acceleration right on the device. I will also
00:07:34 put a link in the description to the repo for this Swift project, so you can download it and compile it
00:07:40 on your own device. As soon as we open the app, it will immediately start downloading the 0.8
00:07:46 billion model. And once that is done, we are ready to use it. But before prompting anything,
00:07:52 let me switch on airplane mode on my iPhone. So now let's start with a simple hello. For some
00:07:58 reason, it replies that its name is Alex. Okay, that's very random, but okay. But did you notice
00:08:04 how fast the response was streamed? I'm honestly blown away by how quickly this model
00:08:10 answers you in real time. Now let's try the famous carwash test, which most models usually get wrong.
00:08:17 And would you look at that, Qwen 3.5 actually answers correctly. So that is already impressive.
00:08:23 Now, the coolest thing about these models is that they also have vision capabilities. So now I'm
00:08:29 going to show it an image of a banana, and let's see if it understands what it is and what condition
00:08:35 it is in. It does correctly identify that it is indeed a banana, although it says it's a "dog
00:08:40 banana". I've honestly never heard of this term. A dog banana? What is that? What is Qwen talking
00:08:47 about here? Alright, but anyway, it thinks that it is overripe. And it cautions me that it might not
00:08:52 be safe to eat, which is not true. I had that banana this morning, and it was delicious. But anyway,
00:08:58 once again, I'm just blown away by how fast it processes my prompt and gives me back the
00:09:04 response. Now let's try another picture. Let's see if it can identify the breed of the dog in
00:09:09 this picture. Here we can see that it is not quite accurate, because it thinks that it sees two
00:09:15 dogs, which is not true. And it does not mention the breed. So let's ask it specifically what kind
00:09:20 of dog it is. It thinks it's a golden retriever, which is obviously very far from the truth. So
00:09:27 although some of the responses are not entirely accurate, and some of them are just really funny,
00:09:34 I'm still genuinely impressed by the fact that such a small model can reason about the contents of an
00:09:39 image, and do it so quickly. The last thing I want to test is this model's OCR abilities,
00:09:45 as touted in the benchmarks. Specifically, I want to see if the model can identify the
00:09:50 language of the text content presented in this image. To give you some context, the language
00:09:55 displayed in this image is Latvian, which is actually my native language, because I am
00:10:00 originally from Latvia. Unfortunately, Qwen fails this test, because this is not Slovenian,
00:10:05 nor is our language even similar to Slovenian. I also find it funny how confidently it
00:10:11 translates a word to the same word, which I'm not even sure is a real word. So clearly there are some
00:10:19 heavy hallucinations going on in this response. All right, let's now move to the 2 billion
00:10:25 parameter model. When you switch the dropdown, the app will first download it. And once that is
00:10:30 done, we can run the same tests on this version to see if we get some meaningful improvements. So
00:10:36 let's start with the simple hello again. Okay, at least this time it's not Alex responding, so
00:10:42 that is already an improvement. Now let's do the carwash test again. And once again, the model passes
00:10:47 the carwash test. So well done there. Now let's proceed with the banana image. And this time,
00:10:53 we get a more meaningful answer. It does detect that it is indeed a banana. And as for the
00:11:00 condition, it says that it's fully ripe and ready to eat, which is true. Now let's try the dog picture
00:11:06 again. This one says it's a Pomeranian. I mean, I don't think these breeds are even
00:11:11 remotely similar. So unfortunately, even the 2 billion model is bad at identifying dog breeds.
00:11:18 And lastly, let's try the picture with the text again and see if it can identify the language.
00:11:22 And look at that, the 2 billion parameter model correctly identified that this text is indeed
00:11:29 Latvian. That is pretty cool. So there you have it: those are the Qwen 3.5 small model series. I
00:11:36 honestly think that despite the little inconsistencies, these are indeed the most powerful tiny models
00:11:42 I've ever used. The fact that we can now have an open-source, natively multimodal LLM running on an
00:11:49 iPhone 14 Pro offline and producing meaningful results with relatively fast inference speed
00:11:55 is super impressive. Qwen has really outdone themselves this time. Well done. But there is a
00:12:01 bit of a somber update to share. As I was finishing this video, reports surfaced that Alibaba is
00:12:07 undergoing a major restructuring of the Qwen team. Key leadership figures and top engineers behind
00:12:13 these models have reportedly departed, some to pursue their own AI startups. This has left the
00:12:18 community wondering if the Qwen era of rapid breakthroughs might be slowing down. It makes
00:12:24 these current models even more significant, as they might actually be the last major release from this
00:12:30 specific team for a while. But what do you think about these small series models? Have you tried
00:12:35 them? Will you use them? Let us know in the comments down below. And folks, if you like these
00:12:39 types of technical breakdowns, please let me know by smashing that like button underneath the video.
00:12:45 And also don't forget to subscribe to our channel. This has been Andres from Better Stack, and I will
00:12:50 see you in the next videos.

Key Takeaway

The Qwen 3.5 small models represent a significant leap in intelligence density, enabling powerful multimodal AI tasks to run natively and privately on standard consumer hardware.

Highlights

Alibaba released Qwen 3.5 small models (0.8B and 2B) featuring native multimodal and reasoning capabilities.

The 2B model outperforms larger predecessors like Llama 2 (7B) on MMLU, and even the 0.8B model comes within a few points despite its tiny footprint.

Both models support a massive 262K context window, making them suitable for long-document analysis and coding.

Local testing on edge devices (MacBook Pro and iPhone 14 Pro) demonstrates high-speed, offline inference via LM Studio and MLX.

Vision and OCR testing showed that while the 0.8B model hallucinates, the 2B model successfully identifies languages like Latvian.

The 2B model provides cleaner code output and better design compared to the 0.8B version, though it suffered from infinite loops.

Internal restructuring at Alibaba has led to the departure of key Qwen team members, potentially slowing future releases.

Timeline

Introduction to Qwen 3.5 Small Models

The speaker introduces the newly released Qwen 3.5 small models from Alibaba, focusing on the 0.8 billion and 2 billion parameter versions. These models are revolutionary because they bring native multimodal capabilities to extremely small architectures that can run on six-year-old hardware. The concept of "intelligence density" is highlighted as the core reason why these models outperform larger ones in reasoning and vision. This section sets the stage for a hands-on performance test on a MacBook Pro and an iPhone 14 Pro. The speaker emphasizes that the goal is to see how much power can be packed into such tiny, local-first models.

Benchmarking and Technical Specifications

This segment delves into the technical performance metrics of the Qwen 3.5 series, specifically the MMLU and OCRBench scores. On the MMLU benchmark, the 2B model achieved a 66.5, significantly higher than the original 7B Llama 2 model from 2023. Multimodal performance is a standout feature, with the 2B model scoring 85.4 on OCRBench, indicating high proficiency in reading complex documents. Both models remarkably support a 262K context window, allowing for the analysis of massive PDFs or entire codebases. These statistics illustrate how much model efficiency has improved, delivering high comprehension scores at far smaller parameter counts.
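To put the 262K figure in perspective, a rough back-of-the-envelope check shows how much text fits in the window. This sketch assumes "262K" means 262,144 tokens and uses the common ~4 characters-per-token heuristic for English text (actual tokenizer counts vary):

```python
CONTEXT_WINDOW = 262_144  # assuming "262K" means 256 * 1024 tokens
CHARS_PER_TOKEN = 4       # rough heuristic for English text; varies by tokenizer

def fits_in_context(text: str, reserve_for_output: int = 4_096) -> bool:
    """Estimate whether `text` fits in the window, leaving room for the reply."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens <= CONTEXT_WINDOW - reserve_for_output

# A ~300-page book is very roughly 600,000 characters, i.e. ~150,000 tokens:
print(fits_in_context("x" * 600_000))    # → True
print(fits_in_context("x" * 2_000_000))  # → False
```

By this estimate an entire book-length PDF or a mid-sized codebase really does fit, which is what makes the long-document claims plausible.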

Local Coding Tests with LM Studio and Cline

The speaker conducts a real-world coding test by attempting to build a cafe website using the 0.8B and 2B models in full airplane mode. The 0.8B model produced a functional but aesthetically bland site in about one minute, though it struggled with broken image links and hallucinations. When testing the 2B model, it took three minutes and provided a much cleaner design with specialized brown themes and working external icons. However, the 2B model faced technical issues, such as getting stuck in infinite loops during the code generation process. Ultimately, the speaker concludes that while these tiny models are not ready for complex production coding, their ability to handle basic tool calling is impressive.

Edge Device Performance on iPhone 14 Pro

The focus shifts to mobile performance using a native iOS app built with Apple's MLX Swift framework for hardware acceleration. The 0.8B model demonstrates incredibly fast inference speeds, answering simple prompts and passing the "carwash test" almost instantly. Vision tests yield mixed results, as the model correctly identifies a banana but hallucinates strange terms like "dog banana" and fails to identify dog breeds accurately. OCR testing reveals that the 0.8B model cannot recognize the Latvian language, confidently misidentifying it as Slovenian. Despite these errors, the speaker is stunned by the fluid, real-time response speed on a handheld device without an internet connection.
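Although the video drives the vision tests through a native Swift app, the same image prompts can be scripted against any OpenAI-compatible multimodal endpoint (for example LM Studio's local server, assuming it exposes vision for these models) by embedding the image as a base64 data URL. A minimal sketch of the standard message payload, with a placeholder model name:

```python
import base64

def build_vision_request(model: str, image_bytes: bytes, question: str) -> dict:
    """Build an OpenAI-style multimodal chat payload with an inline JPEG."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                # Inline the image as a data URL so no hosting (or internet) is needed.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

# Usage (hypothetical model ID and file):
# payload = build_vision_request("qwen3.5-0.8b",
#                                open("banana.jpg", "rb").read(),
#                                "What is this, and what condition is it in?")
```

The data-URL content shape is the standard OpenAI vision message format; only the endpoint and model identifier are assumptions here.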

2B Model Mobile Testing and Conclusion

The 2B model is tested on the iPhone, showing significant improvements over the smaller 0.8B version in both vision and language recognition. It correctly identifies the condition of a banana as "fully ripe" and successfully recognizes the Latvian language in the OCR test. The video concludes with a somber update regarding the Alibaba AI team, noting that many key engineers have left to start their own ventures. This restructuring suggests that these Qwen 3.5 models might be the last major release from the original team for the foreseeable future. The speaker encourages viewers to try these models, calling them the most powerful tiny models he has ever encountered.
