00:00:00The internet is losing its mind right now, and this time it's over Qwen 3.5,
00:00:05specifically their small model series. Alibaba just released native multimodal
00:00:10versions of Qwen 3.5 which are as small as 2 billion and even 0.8 billion parameters.
00:00:17They outperform some models 4 times their size in reasoning and vision.
00:00:22And they are so tiny that we can now run them locally on 6 year old laptops and smartphones
00:00:28with no internet connection. In this video, we're gonna take a look specifically at Qwen 3.5's new
00:00:34small series models like the 0.8 billion and 2 billion. We're also gonna test them out on an
00:00:40M2 MacBook Pro as well as on an iPhone 14 Pro and find out how powerful they actually are.
00:00:48It's gonna be a lot of fun, so let's dive into it.
00:00:55So why is everyone obsessed with these new Qwen 3.5 models? After all, we've had small models for
00:01:01a while now. I even covered IBM's Granite 4.0 nano models in a previous video and their model
was just 300 million parameters in size. So what makes these small Qwen models so different?
00:01:14Well, it's all about something called intelligence density. You see, for a long time the rule was if
you want a model that can see, reason and code, it has to be huge. But these new Qwen 3.5 small models
00:01:27prove that that doesn't need to be the case. They somehow managed to compress their big models into
tinier versions that still support a unified multimodal architecture. That means their
00:01:390.8 billion model doesn't just handle text, it also has vision and coding abilities baked into it.
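To make that concrete: a unified multimodal model is prompted by mixing text and image parts in one chat message. Here's a minimal sketch of building such a request in the OpenAI-style chat-completions format, which local servers like LM Studio (used later in this video) generally follow. The model name is a placeholder assumption, not an official identifier.

```python
# Sketch: build an OpenAI-style multimodal chat payload that sends
# both a text question and an image to a vision-capable model.
# The model name is a placeholder; the base64 data-URL convention
# follows the OpenAI chat-completions format.
import base64


def image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"}}


def vision_prompt(question: str, image_bytes: bytes,
                  model: str = "qwen3.5-2b") -> dict:
    """Chat-completions payload asking one question about one image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [{"type": "text", "text": question},
                        image_part(image_bytes)],
        }],
    }


payload = vision_prompt("What fruit is this, and is it ripe?", b"\x89PNG...")
```

The same payload shape works whether the image is a banana photo or a document scan; only the text part changes.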
00:01:46Let's look at their benchmarks real quick, because they are quite interesting. On the MMLU benchmark,
00:01:51which measures general knowledge and reasoning, the 2 billion model achieves a score of 66.5,
00:01:57while the 0.8 billion model reaches 42.3, which might not sound too impressive. But for context,
00:02:04the original Llama 2 with 7 billion parameters, which came out back in 2023,
00:02:11scored 45.3 on the same benchmark. This just goes to show how much we've managed to shrink
00:02:17the parameter size and still maintain a decent comprehension score. But check this out, their
00:02:23real standout is their multimodal performance. In specialized vision tests like OCRBench,
the 2 billion model scores 85.4 and the 0.8 billion hits 79.1, indicating that they are
00:02:37highly capable at tasks like reading complex documents and analyzing images with text.
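For document-reading tasks like these, what matters in practice is whether a whole document actually fits in the model's context window. Here's a rough fit check, assuming the models' 262K-token window means 262,144 tokens and that English text averages about 4 characters per token; both figures are rough assumptions, not exact tokenizer math.

```python
# Back-of-the-envelope check of whether a document fits in the
# advertised 262K-token context window. The 262,144 figure and the
# ~4 chars/token heuristic are both rough assumptions.
CONTEXT_WINDOW = 262_144  # tokens (256 * 1024)


def estimated_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English."""
    return max(1, len(text) // 4)


def fits_in_context(text: str, reserve_for_output: int = 4_096) -> bool:
    """True if the text plus an output budget fits in the window."""
    return estimated_tokens(text) + reserve_for_output <= CONTEXT_WINDOW


doc = "word " * 100_000  # ~500K characters, roughly 125K tokens
print(estimated_tokens(doc), fits_in_context(doc))
```

By this estimate, a document of around a million characters is near the limit, which lines up with the "entire PDFs" claim for typical documents.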
00:02:43Oh, and they both support a massive 262K context window, so you can feed them entire PDFs or use
00:02:51them to analyze large codebases. That is kind of impressive. But now, let's look at how they
00:02:56actually perform. Since both the 0.8 billion and 2 billion models can run locally on almost any
00:03:02modern laptop, I'm going to conduct these tests in full airplane mode with no internet connection
00:03:08whatsoever on my local laptop. For the first test, we'll spin up a local server on LM Studio
and hook it up to Cline in VS Code to see if these tiny models can actually handle a real-world coding
00:03:21task. So first you have to go to the models tab and download the GGUF versions of the 0.8 billion and
00:03:28the 2 billion parameter models. And since we'll be using these models for coding tasks, we will also
00:03:33need to increase the available context length quite a bit. And once we've done that, we can go ahead
and start the server. And now let's jump into Cline. And first of all, as I mentioned, I will turn off
00:03:43my Wi-Fi so we can conduct these tests completely offline. And then in Cline, at the API configuration
00:03:50section, I will make sure to point it to our custom LM Studio server URL. And let's also make sure that
00:03:56we choose the 0.8 billion model. And for the prompt, I will ask the model to build a simple
00:04:01company website for a small cafe. And I also noticed that if we don't specify any particular framework
and we let Qwen choose on its own, it will choose to install React, which will not work for our demo
00:04:14in offline mode. So I modified the prompt a bit to specifically ask to use HTML, CSS, and JavaScript
00:04:20without any external libraries. So let's run the test. So it took the model roughly one minute to
00:04:25finish this task. And here's our final result. As you can see, the site is very bland, the design is
00:04:32not very aesthetically pleasing, and the text is very dark. And I also noticed that in the CSS, the
00:04:37model tried to hard code specific images from Unsplash that would fit our theme. So that's an
interesting observation. And if we turn Wi-Fi back on for a moment, we can see that one of those
00:04:48images actually loads up. And it appears to be an image of a doctor holding a phone. So that's pretty
00:04:54random. But the other images contain invalid URLs. And I also tried to prompt the model again to fix
00:05:00the broken text and also improve other areas, but it could not reliably do so. So overall, I would
00:05:06say that although this model is capable of coding and tool calling, I don't think it's actually a
00:05:12good idea to use this in real world scenarios, because the parameter count is just too low. But
00:05:17now let's test out the 2 billion parameter model with the same prompt and see how well it does. And
this model actually gave me a lot of headaches because very often it would get stuck in a loop,
00:05:28writing the same section again and again, so I had to stop the task and restart it. I'm not sure
00:05:34if this is a problem with the model itself, the way LM Studio runs the server, or the way Cline
00:05:40processes the prompt. But with this specific configuration, this was an ongoing struggle
00:05:45for me. And another thing I noticed is that while the 0.8 billion parameter model went straight into
00:05:51coding, the 2 billion parameter version preferred structuring a plan first and then proceeding with
00:05:57the actual coding. So the 2 billion parameter model finished this task in roughly three minutes,
00:06:02so considerably longer. And let's see what the final result is. So as we can see, it's already
00:06:08a step up because the design looks a lot cleaner and it uses a brownish theme, which is closer to
00:06:14what a coffee shop visual identity would be. And another thing I noticed is that if we turn on Wi-Fi,
00:06:20it actually loads up some external icons, which makes the whole site look even better.
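Since both models hard-coded remote assets that only load once Wi-Fi is back on, a quick scan of the generated files for absolute URLs is an easy way to flag anything that will break an offline demo. A minimal sketch; the regex is a rough heuristic, not a full URL parser.

```python
# Scan generated HTML/CSS/JS for external http(s) references that
# would break in offline mode (e.g. hard-coded Unsplash image URLs).
# The regex is a rough heuristic, not a spec-compliant URL parser.
import re

URL_PATTERN = re.compile(r"https?://[^\s\"')]+")


def external_refs(source: str) -> list[str]:
    """Return every absolute http(s) URL found in the source text."""
    return URL_PATTERN.findall(source)


css = """
.hero { background: url("https://images.unsplash.com/photo-123"); }
body { color: #333; }
"""
print(external_refs(css))  # flags the Unsplash URL
```

Running this over the model's output before going offline would have caught the broken image links immediately.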
00:06:24And this version actually tried to implement the cart functionality that I initially asked for
00:06:29because we now get this nice cart sidebar, although I don't see an add to cart button on the item
00:06:35cards. And when I tried to prompt to fix these issues, once again, I got into the same technical
00:06:41issue where the model went into an infinite loop. So I figured this just might be an issue with
LM Studio in conjunction with Cline or something of that sort. But let's be honest, obviously,
00:06:51no one would seriously consider using such small models for complex and serious coding.
00:06:56I just conducted these tests out of curiosity to see if such a small parameter count can still
00:07:02produce a meaningful result for a given coding task. So now let's do something more exciting.
00:07:07Let's try to run these models on an iPhone 14 Pro. And to do this, I built a native iOS app using
00:07:14Swift and the MLX Swift framework. And MLX is Apple's open source library that allows you to run
models directly on Apple silicon's unified memory architecture. By leveraging the Metal GPU, we can
00:07:29get these Qwen models running with hardware acceleration right on the device. I will also
00:07:34put a link in the description to the repo for this Swift project so you can download it and compile it
00:07:40on your own device. So as soon as we open the app, it will immediately start downloading the 0.8
00:07:46billion model. And once that is done, we are now ready to use it. But before prompting anything,
let me switch on airplane mode on my iPhone. So now let's start with a simple hello. For some
00:07:58reason, it replies that its name is Alex. Okay, that's very random, but okay. But did you notice
how fast the response was streamed? I'm honestly blown away by how quickly this model
00:08:10answers you in real time. Now let's try the famous carwash test, which most models usually get wrong.
And would you look at that, Qwen 3.5 actually answers correctly. So that is already impressive.
00:08:23Now the coolest thing about these models is that they can also use vision capabilities. So now I'm
00:08:29going to show it an image of a banana. And let's see if it understands what it is and in what condition
00:08:35it is. So it does correctly identify that it is indeed a banana, although it says it's a dog
banana. I've honestly never heard of this term. A dog banana? What is that? What is Qwen talking
00:08:47about here? Alright, but anyway, it thinks that it is overripe. And it cautions me that it might not
00:08:52be safe to eat, which is not true. I had that banana this morning, and it was delicious. But anyway,
00:08:58once again, I'm just blown away by how fast it's processing my prompt and giving me back the
00:09:04response. Now let's try another picture. Let's see if it can identify the breed of the dog in
00:09:09this picture. So here we can see that it is not quite accurate because it thinks that it sees two
00:09:15dogs, which is not true. And it does not mention the breed. So let's ask it specifically what kind
00:09:20of dog it is. So it thinks it's a golden retriever, which is obviously very far from the truth. So
00:09:27although some of the responses are not entirely accurate, and some of them are just really funny,
00:09:34I'm still genuinely impressed by the fact that such a small model can reason about contents of an
00:09:39image and do it in such a quick manner. And last thing I want to test is this model's OCR abilities,
as it was touted in the benchmarks. Specifically, I want to see if this model can identify the
00:09:50language of the text content presented in this image. To give you some context, the language
00:09:55displayed in this image is Latvian, which is actually my native language, because I am
originally from Latvia. And unfortunately, Qwen fails this test because this is not Slovenian,
00:10:05nor is our language even similar to Slovenian. And I also find it funny how confidently it
00:10:11translates a word to the same word, which I'm not even sure is a real word. So clearly there are some
00:10:19heavy hallucinations going on in this prompt response. All right, let's now move to the 2 billion
parameter model. When you switch the dropdown, it will first download it. And once that is
00:10:30done, we can now run the same tests on this version to see if we get some meaningful improvements. So
00:10:36let's start with the simple hello again. Okay, and at least this time, it's not Alex responding. So
00:10:42that is already an improvement. Now let's do the carwash test again. And once again, the model passes
00:10:47the carwash test. So well done there. Now let's proceed with the banana image. And this time,
00:10:53we get a more meaningful answer. It does detect that it is indeed a banana. And as for the
00:11:00condition, it says that it's fully ripe and ready to eat, which is true. Now let's try the dog picture
again. And this one says it's a Pomeranian. I mean, I don't think these breeds are even
00:11:11remotely similar. So unfortunately, even the 2 billion model is bad at identifying dog breeds.
00:11:18And lastly, let's try the picture with the text again and see if it can identify the language.
And look at that, the 2 billion parameter model correctly identified that this text is indeed
00:11:29Latvian. That is pretty cool. So there you have it. Those are the Qwen 3.5 small model series. I
00:11:36honestly think that despite the little inconsistencies, these are indeed the most powerful tiny models
00:11:42I've ever used. The fact that we can now have an open source native multimodal LLM running on an
00:11:49iPhone 14 Pro offline and producing meaningful results with a relatively fast inference speed
is super impressive. So Qwen really has outdone themselves this time. Well done. But there is a
00:12:01bit of a somber update to share. As I was finishing this video, reports surfaced that Alibaba is
00:12:07undergoing a major restructuring of the Qwen team. Key leadership figures and top engineers behind
00:12:13these models have reportedly departed, some to pursue their own AI startups. This has left the
community wondering if the Qwen era of rapid breakthroughs might be slowing down. It makes
00:12:24these current models even more significant as they might actually be the last major release from this
00:12:30specific team for a while. But what do you think about these small series models? Have you tried
00:12:35them? Will you use them? Let us know in the comments down below. And folks, if you like these
00:12:39types of technical breakdowns, please let me know by smashing that like button underneath the video.
00:12:45And also don't forget to subscribe to our channel. This has been Andres from Better Stack and I will
00:12:50see you in the next videos.