I Tested GLM 5.2 vs Opus 4.8 vs GPT 5.5

Chase AI

Transcript

00:00:00GLM 5.2 just came out this week, and it is the strongest open source model we have ever
00:00:04seen. And some benchmarks, like you see here, even show this model outperforming the giants
00:00:10like Anthropic's Opus 4.8 and OpenAI's GPT 5.5. But are these benchmarks legit? How does this model
00:00:18compare head to head with Opus 4.8 and GPT 5.5? Well, that's exactly what we're going to answer
00:00:25in today's video, as I go through multiple tests with these three big models and see
00:00:31how it actually performs in the real world. On top of that, we'll do a deep dive on one
00:00:35benchmark in particular that I think is rather important, as well as break down what we actually
00:00:40mean by GLM 5.2 being better in some instances than Opus and GPT 5.5. Are we talking about
00:00:47it's more efficient, it costs less, or it actually does better at all those things at the same
00:00:51time? So without further ado, let's just hop into it. Now, before we jump into the head
00:00:56to head test, let's first look at some of the benchmarks that already exist comparing these
00:00:59three models. The one I want to really pay attention to is DeepSuite. Now, DeepSuite is
00:01:04a relatively new benchmark, and it's meant to be an improvement upon things like Terminal
00:01:08Bench and Terminal Bench Pro. Now, I'm not going to go ultra deep into this benchmark, you
00:01:12can check out their website or their GitHub repo, which explains it in more detail. But it focuses
00:01:17on long-running agentic tasks, specifically 113 tasks across TypeScript, Go, Python, JavaScript,
00:01:23and Rust with isolated environments and program-based verifiers. And here on this graph, we can see
00:01:29the score, the percentage it gets correct on the left-hand side, as well as the average cost
00:01:34per task. Now, we want to be up and to the right. The most efficient area is over here in the top
00:01:39right. That's where we get the highest score at the lowest cost. And we can see here, GLM 5.2
00:01:44max is giving us a 44% at $3.92 per task. If we compare that to Opus 4.8 and GPT 5.5, we can see
00:01:55they do a lot better. At max, Opus 4.8 is doing 59%, and 5.5 is doing 67% at extra high. Obviously,
00:02:04at extra high and max, we have a pretty steep cost. For GPT 5.5, it's $7.23. $13 for Opus,
00:02:12and at GLM, it is $3.92. So much cheaper. However, when we look at different effort levels
00:02:19at 5.5 and at Opus, if we are at medium, for example, with Opus 4.8, we are going to score
00:02:25higher than GLM 5.2, and we're going to be less expensive. So 49% at $3.44 versus 44% at $3.92. And that's
00:02:36significant at 5.5 with a 54% at $2.75 versus 44% at $3.92. So right off the bat, on this benchmark,
00:02:47if we take it at face value, 4.8 and 5.5 are a step above GLM 5.2. And that's not surprising. These
00:02:55are the best of the best frontier models. They are not open source. And if we like really put the pedal to
00:03:01the metal, they're going to kind of blow GLM 5.2 out of the water on these like long horizon tasks,
00:03:07kind of expected. What you might not have expected is the fact that they can do better for cheaper,
00:03:11which is kind of an issue. And I just want to get that out there because I know there is a lot of
00:03:16talking and a lot of hype right now about GLM 5.2 and the fact that it's open source. And, you know,
00:03:21that immediately kind of implies like, oh, it's super, super cheap. And we can do really good things.
00:03:25Well, I mean, by the numbers, it's good, but it's not 4.8 or 5.5 based on this benchmark. And remember,
00:03:33these 4.8 and 5.5 numbers are based on API costs. If I'm on the Max plan, it's like 10x cheaper than
00:03:40this. Same thing if I'm just on like OpenAI's, you know, $100 a month plan or $200 a month plan. So
00:03:46that's another thing to take into account. So just kind of want to pump the brakes on like any of this
00:03:50hype saying like GLM is way cheaper because it's kind of not. And even though it's open source,
00:03:56GLM 5.2, the open source model that's getting these numbers, this is not open source like you
00:04:01can just download this on your computer. It's open source in that, like, you can see the code,
00:04:05you can see the weights. It's not open source in the sense of like, oh, no, it's just, I can go get
00:04:09it on Ollama, I can run it on my personal PC. No, you can't. No, you can't. This is like almost a
00:04:14trillion parameters. This requires a ton of hardware to run. So don't get confused because I know
00:04:20there's a segment of the population that does, but this is just to set the stage. And again,
00:04:24this is on DeepSuite stuff. These are like very intense sort of tasks that are being given. And
00:04:30today we're going to do a few different tests that are like a little bit lower level and that are
00:04:35probably more of a reflection of what you, the average user is running. So something to keep in
00:04:39mind. And just so we're all on the same page, this is what we're looking at in terms of cost
00:04:44per token. Remember the reason it was cheaper for Opus 4.8 and 5.5 is because they just used way fewer
00:04:50tokens to do what it needed to do. It was just ultimately more efficient, but on a per token basis.
00:04:55And remember for input and output, this is per million tokens, GLM 5.2, $1.40 for input,
00:05:01$4.40 for output. And Opus 4.8 is 5.7 times more expensive. And 5.5 from GPT is 6.8 times more
00:05:10expensive. So on a per token basis, much cheaper. But remember, we care about outcomes for a task,
00:05:16not necessarily a one for one token comparison. And now before we jump into the actual tests,
00:05:21a quick word from today's sponsor, me. So I just released my Claude Code Masterclass inside of
00:05:26Chase AI Plus and it's the number one way to go from zero to AI dev, especially if you don't come
00:05:30from a technical background. I update this every single week and it also includes masterclasses for Codex
00:05:35and for creating your own agentic OS. So if this is something you want to learn more about and you're
00:05:40not sure where to start, Chase AI Plus is the place for you. There's a link to it in the pinned comments.
00:05:46So here's how we're going to run this test. We are going to give every single model the same
00:05:49prompt in plan mode. It's going to give us the plan. We may or may not do some back and forth,
00:05:53depending on what we think of the plan it comes up with. And after that, we'll let it execute.
00:05:58After it executes, I will apply my extremely subjective grading criteria to the end result and let you know
00:06:03which one I like best. If you don't like my grading criteria or what I decide is best, make sure to
00:06:08leave a comment. I will also make sure to delete your comment. Now, over here on the left, we have
00:06:14GPT 5.5 inside of Codex on extra high. We have OpenCode in the middle running GLM 5.2 on extra high,
00:06:21which is being routed through OpenRouter. And over here on the right, we have Claude Code running Opus 4.8
00:06:26on high. Now, why did I choose these particular effort settings? Because that's how most people
00:06:32use these in real life. And chances are you're either on the Max plan or you're on some sort of
00:06:37OpenAI plan and you probably aren't running it on medium. Let's be honest. So I think this is a
00:06:42better reflection of how your average user is actually using these models day to day.
00:06:47So for our first prompt, we're going to have it build a playable 3D racing game that runs in the
00:06:51browser. And importantly, we're keeping this prompt kind of vague. I'm saying you have full freedom to
00:06:56go out on the web and pick whatever stack and library you think is best to execute this. And so
00:07:02let's go ahead and run it and see what happens. So we have all three models running in plan mode.
00:07:08And again, the thought behind making the prompt kind of vague is that we want to see as much
00:07:12divergence from these models as possible. If I gave it the exact roadmap, how to do every single thing,
00:07:18well, then we really don't get to see how these models think and how they approach more sort of
00:07:23like messy problems. So after 13 minutes, Opus 4.8 was the first one to finish creating the racing
00:07:29game. So let's take a look at what it made. So here we are kind of low poly. It does have
00:07:37some sound going on. Moves pretty smooth. Looks like we have the ability to like drift on here as well.
00:07:44Okay, the grass actually kind of messes with how the physics work. Overall, like pretty smooth, but you
00:07:54know, kind of relatively boring, right? Like this is a pretty basic racetrack. Nothing crazy didn't add
00:07:59any sort of like AI or anything like that. So I'm interested to see how the other models do in terms of
00:08:04complexity and what I'll probably do after this first test if these all are kind of like the same sort of
00:08:09bland vision, we will probably kind of give it another prompt that kind of ups the ante. Next up
00:08:13is GLM 5.2. So it took about five minutes longer than Claude Code. For reference, GPT 5.5 is still
00:08:20working, which I'm not too surprised. It tends to be a little bit slower. In terms of token comparison,
00:08:26Claude Code used about 100,000 tokens to create that. And GLM 5.2 took over a million. And we can take a look
00:08:33inside of OpenRouter for this run, where the total spend was $1.21. And total token volume was 1.35
00:08:41million to create this game. So right away, interesting kind of track we got going on.
00:08:48The controls are pretty jumpy compared to what we had with Claude Code. Like I'm moving
00:08:53very fast relative to the track itself. Very fast. Like I'm screaming through this. And we're also like
00:09:00sort of just like there's no differentiation really between the track and like the field itself. And in
00:09:09certain instances, I was able to almost like you saw there, like go through the track, but not really.
00:09:15So also the car itself is a little less detailed than what we saw inside of Claude Code. I mean,
00:09:23so there is a track, it does have a timer. In terms of actual like gameplay, a little janky for what it
00:09:30is, not nearly as smooth. And also again, kind of with like the low poly situation like we saw with
00:09:36Opus. And so I'd love to see what it does if we tell it to like really create something that looks
00:09:40better. And also this track itself doesn't actually make a whole lot of sense. So now we're looking at
00:09:44what GPT 5.5 created. It calls it the Foundry Circuit: the night shift time trial, three laps
00:09:50through the steelwork. So something different, I guess, than the generic track we've seen in the
00:09:54last two. So let's go ahead and start this. And let's go. Well, I don't actually know where I'm
00:10:04supposed to go. Oh, I guess this is the track. Wheels look kind of interesting. They're kind of
00:10:10spinning the wrong way. So that's something. Okay, it has like very annoying noises, actually.
00:10:21And I kind of can't get over the wheels going horizontal, or however you'd even describe this.
00:10:28The track itself is fine, you can kind of move. Yeah, you can go past the track and it slows you down. But it's not
00:10:35like clear that this is like a paved track, like we saw with what Opus built. And like the rest is,
00:10:41you know, say the, you know, field. So kind of strange graphics, honestly. Also, when you consider
00:10:48the fact that it took like twice as long as Opus, it's kind of weird. Yeah, honestly, kind of strange. Again,
00:10:55like why, why did it do this with the wheels? I have no idea. Again, went for the low poly thing.
00:11:00And it's just like very dark, for like seemingly no reason. So I mean, like, I almost, I feel like
00:11:06this is more functional than what we got with GLM 5.2, but like, not that much better. And you also
00:11:12consider the fact that this was on extra high on 5.5. Now in terms of token usage for 5.5,
00:11:17it came out to roughly what we saw with Claude Code. It used 7% of its five hour window. So almost
00:11:22nothing. Now, overall ranking, I would have put Opus 4.8 clearly ahead of GLM 5.2 and 5.5. I thought
00:11:28the latter two were kind of janky, but we're actually going to give them another shot because
00:11:32we're going to tell them to take another look at the code, do another pass. And we also want them
00:11:36to do a lot better in terms of the graphics. I don't want the low poly stuff. I want this to look
00:11:40like a triple A game or as close to it as possible. So let's see what happens when we give them
00:11:46attempt number two. So Opus and GLM finished their second pass and 5.5 is finishing up there. So
00:11:50let's take a look at Opus 4.8 first. So right away, we see a car that is way better. Like this is a huge
00:11:58improvement in regards to the car than what we saw before. We also see a lot different lighting.
00:12:04Like you can see the sun reflected on the ground itself and everything looks way more smooth. I mean,
00:12:10the trees themselves are kind of like low polygon type deals, but the lighting and especially the car
00:12:15are a major step forward. And it still keeps sort of that same smooth gameplay. I mean, besides the
00:12:20fact we have trees in the road, but the trees themselves are also shadowed. And for one additional
00:12:26pass that took 10 minutes and about 50,000 tokens, not bad. Now we'll look at GLM. And at this point,
00:12:32it took about another 1.2 or so million tokens to make this update, putting our total spend at $1.83.
00:12:38So let's start it up. And it looks like it tried to add some sort of different lighting. The car looks
00:12:46a little bit better, but the lighting itself is kind of strange. Like it's just very glary. The track
00:12:52itself hasn't changed a whole lot. You know, it's still kind of just like grass everywhere. And the
00:12:57controls are still very jumpy, right? Like I'm going very fast relative to the track. Same sort of issue
00:13:04that I had before where like some of the track I can pass through some of it, I can't. So I mean,
00:13:10the graphics for the car look better, but I would argue the lighting and the glare is so distracting.
00:13:15It's probably kind of a downgrade to what we had prior. And here is the second pass with 5.5. Now
00:13:21the car looks a little bit better, but looking at everything else, this is kind of the same. Well,
00:13:29the wheels are better. We fixed the wheel issue. They're actually turning the way wheels should,
00:13:34but still has annoying noises. And there's no real differentiation again, between like the path
00:13:42and like the grass. So it kind of feels like sort of the exact same thing it did the first time with a
00:13:49slightly better car. But you know, when we told it go for like a triple A aesthetic, I wouldn't say it
00:13:55hit the mark. And again, I feel like, big picture, when we look at these three, GLM and 5.5 are definitely a step
00:14:02below Opus. Now for our next test, we're going to have it build us a website. And the prompt we're going
00:14:07to be using is this. We want it to build a fake landing page for a product, which is AI powered
00:14:12smart glasses. Think something like Meta Ray-Bans. Again, we're giving these models full freedom in
00:14:16terms of the stack and design. We're telling it to pick whatever it thinks is best, install what it
00:14:20needs and look up the best practices for creating landing pages. We're telling it, Hey, go ahead and find
00:14:25images and product shots. And don't just rely on creating your own sort of HTML stuff. And importantly,
00:14:31we're saying, make it look like an award site. We don't want it to look like AI slop. We want real
00:14:35visual hierarchy, intentional typography, and motion where it makes sense. So landing page for smart
00:14:42glasses, we want it to be sort of award style. So let's see what they come up with. So all three
00:14:46of them finished up for reference, GLM used about a million tokens to execute this while Opus and 5.5
00:14:53used about a hundred thousand, give or take. So first up we have what Opus built us: very dark background.
00:14:58It has sort of these glasses it created, and the text is sort of cut off right here, which is
00:15:04unfortunate. As we scroll down, this is also kind of oddly placed because we can see the scroll text
00:15:12kind of over the top of it. But as I mouse over, you can see it kind of like move around and it changes
00:15:18color, which is kind of cool. As I scroll down, we have some scrolling sort of like loading animations
00:15:24for everything. But all in all, it looks fine. For the glasses themselves, it used like HTML.
00:15:31So it's like, what are you really getting out of this? It didn't even like find some sort of glasses
00:15:35to use. And it has, you know, hey, here's how you can reserve it and here's how you can buy it. So
00:15:41it's fine. Again, we didn't give it a ton of direction, but we told it to go for like an awards
00:15:45type look. I would not consider it on that sort of level. Now let's take a look at what GLM built us.
00:15:51And I don't actually know what's going on here at all. In fact, this is kind of like barely loaded.
00:15:59It shows us some glasses, but like this website's kind of like a disaster. It's like it didn't even
00:16:04really finish this. It almost like just threw it all together. Yeah. Yeah, the prompt wasn't super
00:16:13detailed, but it should be able to do more than this based on what I give it. This is like actually
00:16:19terrible. I have no idea what it actually was trying to accomplish here. And lastly, we have GPT 5.5. So
00:16:25this is a little bit interesting. I think it looks kind of cool, although the glasses
00:16:30somewhat overlap the text here. And we have a lot of dead space, which you could argue that's
00:16:34something of a design choice. And we have the banner that actually moves, you'll remember the
00:16:39Opus version did have a banner, but it wasn't moving. And then as we scroll down, you'll also notice the
00:16:44cursor is kind of like multicolored. And as we scroll down, it looks like it created some HTML
00:16:50type assets. I mean, strange, right? We did tell it, hey, you can go find what you need to find online
00:16:55if you want to. But overall, probably the best out of the three. But, you know, I wouldn't say I was in
00:17:04love with any of these. It kind of shows you how like strong of a hand you need to take when doing any
00:17:09sort of like visual design or like UI type things like even these most advanced models struggle like
00:17:14I actually have no idea what the heck is going on. Like this is this is a mess. So overall, Opus was
00:17:21okay. 5.5 was the best of the bunch and GLM was like actually a complete failure. And just like we did with
00:17:26the gaming version, we're going to give them a second pass at this and see if they can clean up what went
00:17:30wrong. And on top of that, we're going to ask them to integrate similar to, again, the game we had them
00:17:36create some like Three.js elements, like we really want to see how it can sort of push its capabilities with
00:17:42sort of like motion and like graphics and that sort of thing. And that new prompt looks like this: take
00:17:46the smart glasses landing page you've just built and rebuild it as an immersive 3D experience using
00:17:51Three.js. So we want an actual interactive 3D scene. And again, we're giving it full freedom to
00:17:56execute it as it sees fit. And so here's what we got with Opus 4.8. You can see now that it added
00:18:02some Three.js, these glasses sort of move. But beyond that, we have some of the original issues,
00:18:08right, the text being cut off, it being overwritten right here. And the rest of this kind of just being
00:18:13like, man, like this is pretty like, obvious that AI created this. Oh, and of note, like token costs
00:18:21were pretty much equal on the second run across the board to the first run for all these. Next,
00:18:27we have GLM 5.2. And this time it actually created a website that makes sense. We have these glasses,
00:18:32although the glasses it made are kind of like odd, like you only have like, you know,
00:18:36no glasses would actually look like that, and the text is also cut off here. But we have a banner
00:18:42that does scroll when I scroll over the top of it, it does stop. And I would say overall, in terms of
00:18:48like how it laid out the website, I would probably give it the edge over Opus. Now, I don't think
00:18:55either of them are particularly good. And we kind of gave them free rein to do whatever they want. But
00:18:59I would put this over kind of this setup. Although in terms of like the hero section itself,
00:19:05I do like Opus 4.8 better. Now, GPT 5.5, I think is the winner here. I think this just looks
00:19:10better overall from a subjective design standpoint. And I think the Three.js sort of like motion graphics
00:19:18it added here are pretty cool. I think it makes sense in context of what it created. Like we have
00:19:22all this white space up top and the glasses kind of, you know, are able to live in there. And as for the
00:19:27rest of the website, I think it looks fine. Again, it still looks very like, quote unquote,
00:19:32AI slop in the sense that AI definitely created this, but it doesn't look bad. And like from top
00:19:37to bottom, I do prefer what 5.5 gave us over all the others. And so when we take a look at this whole
00:19:42thing, holistically bringing in these more sophisticated benchmarks, like DeepSuite alongside
00:19:48what we just did today, I think this is kind of what we expected. I don't think GLM did extremely poorly
00:19:56in any sense of the word, but it definitely felt like it was a step below GPT 5.5 or 4.8 or in
00:20:03scenarios where, you know, in the first section where Opus was better than all of them. And in
00:20:07the second section where GPT was better than all of them, GLM was always near the bottom. It wasn't
00:20:12grossly worse than any of them, but it certainly wasn't better. And it also used infinitely more tokens.
00:20:17And so when we take a look at something like this, the DeepSuite score, where it's like,
00:20:21hey, GLM is kind of at the bottom and actually is less efficient than 5.5 and 4.8, both in terms of
00:20:27cost and how well it does. It kind of makes sense. I think this is kind of what we see. And so big
00:20:35picture, is GLM a great open source model? Definitely. But does it run into some issues that open source
00:20:41models have in general, namely, they aren't as powerful? Yes. And furthermore, if you are someone who is
00:20:47open source maxing, understand this is not something you would run on your PC, right? This requires a
00:20:52ton of hardware to use. And I think what gets lost in the conversation is what we talked about at the
00:20:57beginning, which is like, okay, the costs are kind of already a problem for GLM 5.2. Yet this doesn't
00:21:05even take into account the huge subsidization you get on either the Anthropic Max plan or the OpenAI
00:21:12Max plan. So you put that in mind and like, okay, like there kind of isn't a debate.
00:21:16It really isn't a debate. So would I suggest using GLM 5.2 for your average person? No,
00:21:24not really. I think maybe if you're doing lower level tasks and you're someone who's comparing
00:21:29it purely on API prices, maybe, maybe. But it's, you know, I think it's kind of hard to argue that
00:21:38because then what are we doing when the next, when, you know, Sonnet 5 comes out next week? Like,
00:21:42are you just going to jump from there to there? Like there's something to be said with just like
00:21:46sticking with the model, especially when we're talking more like enterprise team level stuff,
00:21:50where the API costs really start to add up. Because again, for the average single user who's going to
00:21:55be using one of the subsidized plans and isn't paying straight up API costs, I don't see an argument for
00:22:01GLM 5.2. So that's where I'm going to leave you guys for today. Hopefully I shed some light on this
00:22:05whole GLM debate and all the hype that you see coming out around it. As always, let me know what you
00:22:09thought in the comments. Make sure to check out Chase AI Plus if you want to get your hands on the
00:22:13Claude Code Masterclass, and I'll see you around.

Key Takeaway

While GLM 5.2 is a capable open-source model, it consistently underperforms against frontier models like Opus 4.8 and GPT 5.5 in complex agentic tasks and actual token efficiency.

Highlights

  • GLM 5.2 scores 44% accuracy on DeepSuite agentic tasks compared to 59% for Opus 4.8 and 67% for GPT 5.5.

  • Cost per task on DeepSuite is $3.92 for GLM 5.2, $13.00 for Opus 4.8 at max settings, and $7.23 for GPT 5.5 at extra high settings.

  • GLM 5.2 consumes over 1 million tokens for tasks where Opus 4.8 and GPT 5.5 require approximately 100,000 tokens.

  • Despite lower API prices per million tokens, GLM 5.2 is often less efficient in total cost per task due to higher token consumption.

  • GLM 5.2 requires significant enterprise-grade hardware to run, contradicting the perception that it is a lightweight, local-runnable model.

Timeline

Benchmarking GLM 5.2 vs Frontier Models

  • DeepSuite results show GLM 5.2 achieving 44% accuracy on long-horizon agentic tasks.
  • Opus 4.8 and GPT 5.5 achieve higher accuracy scores of 59% and 67% respectively on the same benchmark.
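  • Per-token pricing is significantly lower for GLM 5.2, but at reduced effort settings Opus 4.8 (49% at $3.44 per task on medium) and GPT 5.5 (54% at $2.75) both score higher and cost less than GLM 5.2 at max (44% at $3.92), because they use far fewer tokens per task.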

The DeepSuite benchmark evaluates models across 113 tasks in languages including TypeScript, Go, Python, JavaScript, and Rust. Although GLM 5.2 is marketed as a cost-effective alternative, the data shows that its inefficiency in token usage offsets its lower base price per million tokens. Furthermore, users often benefit from heavily subsidized plans for Opus and GPT, making the cost argument against these frontier models less compelling.
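The score-versus-cost comparison above can be checked mechanically. The following is a minimal Python sketch, not anything shown in the video: it takes the five score and cost pairs quoted in the transcript and asks which configurations are dominated, meaning another configuration scores at least as high for no more money. The `Run` class, the function names, and the "lower effort" label for the GPT 5.5 point at 54% and $2.75 are assumptions made for illustration; the transcript does not state that effort level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    score_pct: float  # percent of benchmark tasks solved
    cost_usd: float   # average cost per task in dollars


# Figures as quoted in the transcript. The effort level behind the
# 54% / $2.75 GPT 5.5 point is not stated, so its label is a placeholder.
RUNS = [
    Run("GLM 5.2 (max)", 44, 3.92),
    Run("Opus 4.8 (max)", 59, 13.00),
    Run("Opus 4.8 (medium)", 49, 3.44),
    Run("GPT 5.5 (extra high)", 67, 7.23),
    Run("GPT 5.5 (lower effort)", 54, 2.75),
]


def dominates(a: Run, b: Run) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.score_pct >= b.score_pct and a.cost_usd <= b.cost_usd
    strictly_better = a.score_pct > b.score_pct or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_front(runs: list[Run]) -> list[Run]:
    """Keep only the runs that no other run dominates."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


def dominated_by(target: Run, runs: list[Run]) -> list[str]:
    """Names of every run that dominates the target."""
    return [o.name for o in runs if dominates(o, target)]


if __name__ == "__main__":
    print("Pareto front:", [r.name for r in pareto_front(RUNS)])
    print("GLM 5.2 (max) is dominated by:", dominated_by(RUNS[0], RUNS))
```

On these five points only the two GPT 5.5 configurations are undominated, and GLM 5.2 at max is dominated by both reduced-effort frontier configurations. This covers API list prices on one benchmark only; it says nothing about subscription plans.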

Agentic Performance in 3D Game Development

  • Opus 4.8 consistently delivers smoother, more functional 3D game environments compared to GLM 5.2 and GPT 5.5.
  • GLM 5.2 and GPT 5.5 both struggle with physics, graphical fidelity, and inconsistent control schemes during iterative testing.
  • Token usage for GLM 5.2 exceeds 1 million tokens for a single game build, while competitive models achieve better results with approximately 100,000 tokens.

Models were tested on their ability to build a 3D racing game in a browser environment using a vague prompt. Opus 4.8 finished first and provided the most playable results. A second pass asking for better graphics gave Opus 4.8 a clearly improved car and lighting, gave GPT 5.5 a slightly better car and correctly spinning wheels, and left GLM 5.2 with glare-heavy lighting that the host called a likely downgrade; GLM 5.2 also remained the most token-intensive.
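The token figures in this section explain why a lower per-token price does not guarantee a lower bill. Below is a rough Python sketch of that arithmetic. It assumes, as a simplification the video does not confirm, that the quoted 5.7x and 6.8x price multipliers apply uniformly to input and output tokens and that all three models have a similar input/output mix. The function names are hypothetical.

```python
def relative_task_cost(token_ratio: float, price_ratio: float) -> float:
    """Cost of the cheaper-per-token model relative to the pricier one.

    token_ratio: tokens used by the cheaper model / tokens used by the pricier one.
    price_ratio: pricier model's per-token price / cheaper model's per-token price.
    A result above 1.0 means the "cheaper" model costs more for the task.
    """
    if token_ratio <= 0 or price_ratio <= 0:
        raise ValueError("ratios must be positive")
    return token_ratio / price_ratio


def effective_price_per_million(spend_usd: float, tokens: int) -> float:
    """Blended dollars per million tokens for a whole run."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return spend_usd / tokens * 1_000_000


# Roughly 1,000,000 tokens for GLM 5.2 versus roughly 100,000 for the others,
# as quoted for these tests (assumed to be comparable counts).
TOKEN_RATIO = 1_000_000 / 100_000

vs_opus = relative_task_cost(TOKEN_RATIO, 5.7)  # about 1.75
vs_gpt = relative_task_cost(TOKEN_RATIO, 6.8)   # about 1.47

# Blended rate for the first GLM 5.2 game build: $1.21 over 1.35M tokens.
glm_blended = effective_price_per_million(1.21, 1_350_000)  # about 0.90

if __name__ == "__main__":
    print(f"GLM 5.2 vs Opus 4.8: {vs_opus:.2f}x the cost per task")
    print(f"GLM 5.2 vs GPT 5.5: {vs_gpt:.2f}x the cost per task")
    print(f"GLM 5.2 blended rate: ${glm_blended:.2f} per million tokens")
```

Under those assumptions a tenfold token gap outweighs a price advantage of about six times, so GLM 5.2 would come out more expensive per task. The video does not break the token counts into input and output, so treat these ratios as an order-of-magnitude check, not a measured cost.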

Website Landing Page Design Capabilities

  • GPT 5.5 provides the most visually cohesive landing page design compared to GLM 5.2 and Opus 4.8.
  • GLM 5.2 fails to produce a functional or well-laid-out website in the initial test, requiring a second pass to achieve basic utility.
  • None of the models fully capture high-end 'award-style' design standards, as they all struggle with layout, typography, and visual hierarchy.

The models were tasked with building an award-style landing page for AI-powered smart glasses, then with rebuilding it as an immersive 3D experience using Three.js. GLM 5.2 initially failed to generate a usable site and only produced a coherent layout on the second pass, while GPT 5.5 was judged the best of the three in both rounds. The host concludes that GLM 5.2 remains a tier below the leading frontier models in both design execution and efficiency.
