Qwen 3.5 35B vs Sonnet 4.5: Is The Gap CLOSING?

Better Stack

Transcript

00:00:00Earlier this month Alibaba released Qwen 3.5 with a 400 billion parameter model and
00:00:05a Max Thinking one that claims better benchmarks than Opus 4.5, with beefy requirements
00:00:11to run locally.
00:00:12But just this week they released the Medium Series Qwen 3.5 models, which are almost as
00:00:17powerful as their Max ones and can run locally on a modern MacBook Pro, also claiming
00:00:22better benchmarks than Sonnet 4.5, which I don't believe, so hit subscribe
00:00:27and let's put these two models to the test.
00:00:31Most developers will admit that Sonnet 4.5 is a great model, working well with Claude
00:00:35Code, Co-Work and the whole Anthropic suite, making the experience feel premium.
00:00:40But you have to be online for these models to work and they're not that cheap.
00:00:44Qwen 3.5's Medium Series aims to change all of that by making it possible to run a
00:00:49model as good as Sonnet 4.5 locally, and people on Twitter are going crazy.
00:00:54But I'm not convinced it's actually as good as Sonnet 4.5.
00:00:58So I'm going to test both these models on an easy, medium and hard task and see which
00:01:02one performs better.
00:01:04But before we get into the testing, I have a small confession to make.
00:01:07I'm not actually going to run Qwen 3.5 locally because my measly M1 MacBook Pro doesn't
00:01:12have the unified memory to run inference properly.
00:01:15So I'm going to be using Qwen 3.5 35B on OpenRouter connected to OpenCode, and I'm
00:01:21going to be running Sonnet 4.5 in Claude Code in clean mode, so it's not using any of my
00:01:25skills, plugins or MCP tools.
00:01:27We'll start simple and ask the models to build a to-do list from scratch using React and Vite.
00:01:32So if we look at what Sonnet 4.5 produced, we can see it has this AI-purple styling.
00:01:36I can add a to-do item and I can mark it as completed, I have the ability to clear and
00:01:40if I refresh the page, it all stays there because it's used local storage.
00:01:44If you look at Qwen 3.5, they both have similar styling and haven't overwritten the
00:01:48default styling that comes with Vite.
00:01:51But again, I can add a to-do item.
00:01:53And here we have a few other options.
00:01:54So we can choose the category that it goes into, we can choose what I think is a severity
00:01:59level, and a due date.
00:02:02So I can say something like do shopping and it shows the due date and the severity and
00:02:06the category that it's in, which is really cool.
00:02:08Let's take a look at the code.
00:02:09So this is from Sonnet, and over here it's using a useEffect, which I think is to do
00:02:13with the local storage down here.
00:02:15I guess it's fine, but I'd rather handle it a different way.
00:02:17We have an add to-do being used here and we have some functions over here to perform actions.
00:02:22So toggle the to-do, here we have delete to-do.
00:02:25All of this looks good.
00:02:26And one thing that I'm a bit shocked about is the bit up here that mentions the JSON parsing.
00:02:32So it looks like it's saving it in local storage as JSON and then parsing it.
00:02:35And it would have been nice to have this code in a separate function so that if you want
00:02:38to add more things to it, it wouldn't clog up the top of the code over here.
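To illustrate the kind of extraction I mean, here is a minimal sketch (my own, not Sonnet's actual output; the helper names and storage key are assumptions) that moves the localStorage JSON serialisation out of the component body into standalone functions:

```javascript
// Hypothetical helpers that isolate the localStorage JSON round trip.
// Taking the storage object as a parameter also makes them easy to test.
const STORAGE_KEY = 'todos';

function loadTodos(storage) {
  // Fall back to an empty list when nothing is stored or parsing fails.
  try {
    return JSON.parse(storage.getItem(STORAGE_KEY)) ?? [];
  } catch {
    return [];
  }
}

function saveTodos(storage, todos) {
  storage.setItem(STORAGE_KEY, JSON.stringify(todos));
}
```

A component could then call `loadTodos(localStorage)` in a lazy `useState` initialiser and `saveTodos(localStorage, todos)` on updates, keeping the persistence details out of the top of the file.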
00:02:42Now, if we look at Qwen, we have some categories, and it doesn't look like any useEffect is
00:02:46being used, which is good.
00:02:48And if we scroll down, we have handle submit, which is a name I would prefer to use.
00:02:51And we also have handle updates, handle delete and handle toggle completed.
00:02:55And one thing I really like about this is it put the to-do items in a separate component.
00:02:59So instead of clogging up the main to-do app component, it created
00:03:03a new component over here, which is used down here in the app section since there are multiple
00:03:07to-do items.
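The toggle and delete handlers both apps implement boil down to pure list updates. A sketch of that shape (names are my own, not taken from either model's output):

```javascript
// Illustrative pure update helpers of the kind a handleToggleCompleted /
// handleDelete would call. Each returns a new array, leaving the old
// state untouched, which is what React expects from state updates.
function toggleTodo(todos, id) {
  return todos.map(t => (t.id === id ? { ...t, completed: !t.completed } : t));
}

function deleteTodo(todos, id) {
  return todos.filter(t => t.id !== id);
}
```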
00:03:08So the win goes to Qwen because it produced a to-do list with many more features.
00:03:13But after I ran these tests, I realised that Qwen had the superpowers skill enabled in
00:03:18OpenCode.
00:03:19So I ran it again without the skill, and this is the result we got.
00:03:23So I guess the win goes to Sonnet.
00:03:25Let's move on to the second test, which is to build an interactive solar system using
00:03:29React, Vite and Three.js.
00:03:31Claude did a much better job in one shot.
00:03:33Okay, it is missing a few planets, but I can click on the ones that exist.
00:03:37I click on the sun and get some information about it.
00:03:39I click on Uranus down here and also get some information about it.
00:03:44The movement on the site is also flawless, so I can pan, rotate, zoom in and out and so
00:03:48on.
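For a sense of what drives a scene like this, here is a minimal sketch (mine, not the generated code) of the per-frame maths a Three.js solar system typically uses: each planet sits on a circle around the sun and advances at its own angular speed, with the camera movement handled separately by OrbitControls:

```javascript
// Hypothetical helper: position of a planet on a circular orbit in the
// XZ plane at time t (seconds). In the render loop you would copy this
// into mesh.position each frame.
function orbitPosition(radius, angularSpeed, t) {
  const angle = angularSpeed * t;
  return {
    x: radius * Math.cos(angle),
    z: radius * Math.sin(angle),
  };
}
```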
00:03:49And here is what Qwen produced.
00:03:50Yes, a blank page.
00:03:51If we take a look at the console, we can see there's an error here that I did pass to Qwen
00:03:56multiple times, but it wasn't able to solve.
00:03:58In fact, the whole process of creating this was quite cumbersome.
00:04:01Qwen did go to sleep a few times and I had to wake it up, and it also struggled to fix
00:04:05errors over and over again.
00:04:06Not to mention, if we take a look at the files produced by Qwen, we have a package.json here,
00:04:10a package-lock.json and a node_modules directory, which were not used at all because the main
00:04:15project is inside the solar system directory, which has its own package.json and its own
00:04:20node_modules directory.
00:04:21So for test number two, Claude also wins.
00:04:23For the final test, I got these models to modify an existing code base to take a screenshot
00:04:28of a tweet when the user posts the URL inside the app.
00:04:32We'll start off with Claude, which produced the screenshot page over here,
00:04:35giving me the option to change the background and padding.
00:04:38Now, the first time I ran this, I did get an error, which I asked Claude to fix.
00:04:42I'm going to copy the URL for this tweet by Jason, paste it in here and click capture.
00:04:47And after a few seconds, we get the image down here with the option to download it.
00:04:51And here is the result from Qwen with a screenshot page over here.
00:04:54Again, I'm going to copy this tweet, paste it here.
00:04:56It says extract video instead of extract screenshot and it starts to capture it, which looks promising.
00:05:01But after a while, we hit a 60 second timeout, which is similar to the error we experienced
00:05:06with Sonnet.
00:05:07But I did ask Qwen to fix it, and while it did extend the timeout, it didn't fix the issue
00:05:11that caused it in the first place.
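Extending the limit only hides the symptom: if the capture never resolves, a longer timeout just fails later. A hedged sketch (my own code, not either model's) of making the deadline explicit so the real question, why the capture hangs, stays visible:

```javascript
// Hypothetical wrapper: race a promise against a deadline. If the
// wrapped operation (e.g. a screenshot capture) never settles, the
// caller gets a descriptive error instead of hanging indefinitely.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Clear the timer either way so the process can exit cleanly.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Used as `await withTimeout(captureTweet(url), 60_000, 'tweet capture')`, the timeout becomes a diagnostic boundary rather than a number to keep bumping up.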
00:05:13So it looks like Sonnet 4.5 wins all three tests.
00:05:17So even though on paper Qwen 3.5 35B should outperform Sonnet 4.5, in real-world testing
00:05:24that doesn't seem to be the case.
00:05:26And don't get me wrong, it's really impressive that you can run a 35 billion or even 27 billion
00:05:31parameter model locally on a modern MacBook.
00:05:34But regardless of what people on Twitter are saying about it, there's no way it can outproduce
00:05:38Sonnet 4.5 on coding tasks, as you can see from the tests I ran earlier.
00:05:42So why do the benchmarks make it look so good?
00:05:45Well, there is a huge chance that Qwen 3.5 was post-trained on specific benchmark questions
00:05:51like SWE-bench Verified so that it performs well on those questions.
00:05:55But a model like Sonnet 4.5 would have been post-trained on a much broader and more robust
00:06:01dataset, making it handle more nuanced tasks.
00:06:03But not to mention, the Qwen model I tested has 35 billion parameters but only activates
00:06:08around 3 billion during inference.
00:06:09Whereas even though Anthropic doesn't publish their numbers, estimates suggest Sonnet
00:06:143 may have had around 70 billion parameters, and there's no doubt Sonnet 4.5 has
00:06:18much more.
00:06:19So it's not really fair to compare these models on benchmarks alone.
00:06:23It's always important to do your own research and run your own evals.
00:06:26I mean, there is a reason why Qwen 3.5 wasn't included on the model list for OpenCode Go.
00:06:31While we're on the topic of Qwen, their TTS model was recently released, and Joss has
00:06:35a great video covering it for voice cloning, emotions in voice and so much more, which you
00:06:39can check out here.

Key Takeaway

Despite impressive benchmark claims and local execution capabilities, Qwen 3.5 35B fails to match the real-world coding reliability and sophisticated logic of Anthropic's Sonnet 4.5.

Highlights

Alibaba recently released the Qwen 3.5 Medium Series (35B), which claims to outperform Anthropic's Sonnet 4.5 in benchmarks.

Sonnet 4.5 consistently outperformed Qwen 3.5 in real-world coding tasks, including React development and Three.js integration.

Qwen 3.5 35B can run locally on modern MacBooks, whereas Sonnet 4.5 requires an internet connection and paid API access.

The creator suggests Qwen's high benchmark scores may be due to post-training specifically on benchmark datasets like SWE-bench.

Sonnet 4.5's superior performance in complex, nuanced tasks is attributed to a likely larger parameter count and more robust training data.

Initial 'wins' for Qwen in simple tasks were debunked once external AI 'skills' and tools were disabled for a fair comparison.

Timeline

Introduction and Model Capabilities

The video introduces the new Qwen 3.5 Medium Series from Alibaba, which features a 35B parameter model designed to run on consumer hardware like a MacBook Pro. These models claim to beat Anthropic's Sonnet 4.5 in benchmarks, a claim the narrator aims to verify through hands-on testing. While Sonnet 4.5 is praised for its premium integration with the Anthropic suite, it is noted as being expensive and requiring an online connection. Qwen 3.5 35B represents a shift toward powerful local LLMs that could potentially democratize high-level coding assistance. The creator sets the stage for a three-part testing gauntlet covering easy, medium, and hard coding challenges.

Test 1: Simple To-Do List in React

The first test involves building a basic to-do list application using React and Vite. Sonnet 4.5 produced a functional, aesthetically pleasing app with local storage persistence, though its code structure was slightly cluttered. Qwen 3.5 initially appeared to win by including advanced features like categories and severity levels with a cleaner component architecture. However, the narrator discovered that Qwen had an unfair advantage through enabled 'superpower skills' in the OpenCode environment. When re-tested without these aids, Sonnet 4.5 proved to be the more reliable and consistent model for the task.

Test 2: Interactive Solar System with Three.js

The medium-difficulty task required the models to create a 3D interactive solar system using Three.js. Sonnet 4.5 successfully generated a functional scene with smooth rotation, panning, and interactive planetary data in a single attempt. In contrast, Qwen 3.5 failed significantly, producing a blank page and numerous console errors that it could not resolve even after multiple prompts. The model also displayed erratic behavior, such as 'falling asleep' and creating redundant, messy file structures with incorrect directory nesting. This section highlights a significant gap in the models' ability to handle complex libraries and multi-step logic.

Test 3: Feature Modification and Bug Fixing

For the final hard test, both models were tasked with modifying an existing codebase to add a tweet-to-screenshot capture feature. Sonnet 4.5 initially hit a timeout error but was able to fix its own bug and successfully deliver the image download functionality. Qwen 3.5 attempted the task but failed to overcome a 60-second timeout issue, even after being explicitly asked to extend the limit. It struggled to identify the root cause of the execution failure, whereas Sonnet demonstrated better self-correction and debugging capabilities. This confirms that Sonnet 4.5 remains the superior choice for enterprise-level or complex development workflows.

Final Analysis and Performance Theories

The creator concludes that while Qwen 3.5 is impressive for a 35B parameter local model, it cannot yet compete with the likes of Sonnet 4.5. He theorizes that Qwen's high benchmark scores are likely the result of intensive training on specific evaluation datasets rather than general reasoning improvements. The discussion points out that Sonnet likely utilizes a much higher parameter count and more diverse training data, giving it a 'reasoning' edge that benchmarks miss. This section emphasizes the importance of performing real-world evaluations over relying solely on marketing numbers. Finally, the video mentions Qwen's separate successes in the text-to-speech (TTS) space as a secondary area of interest.
