I Tested GLM 5.2 vs Opus 4.8 vs GPT 5.5
Chase AI
Transcript
00:00:00GLM 5.2 just came out this week, and it is the strongest open source model we have ever
00:00:04seen. And some benchmarks, like the one you see here, even show this model outperforming the giants
00:00:10like Anthropic's Opus 4.8 and OpenAI's GPT 5.5. But are these benchmarks legit? How does this model
00:00:18compare head to head with Opus 4.8 and GPT 5.5? Well, that's exactly what we're going to answer
00:00:25in today's video, as I go through multiple tests with these three big models and see
00:00:31how it actually performs in the real world. On top of that, we'll do a deep dive on one
00:00:35benchmark in particular that I think is rather important, as well as break down what we actually
00:00:40mean by GLM 5.2 being better in some instances than Opus and GPT 5.5. Are we saying
00:00:47it's more efficient, that it costs less, or that it actually does better at all of those things at the same
00:00:51time? So without further ado, let's just hop into it. Now, before we jump into the head
00:00:56to head test, let's first look at some of the benchmarks that already exist comparing these
00:00:59three models. The one I want to really pay attention to is DeepSuite. Now, DeepSuite is
00:01:04a relatively new benchmark, and it's meant to be an improvement upon things like Terminal
00:01:08Bench and Terminal Bench Pro. Now, I'm not going to go ultra deep into this benchmark; you
00:01:12can check out their website or their GitHub repo, which explains it in more detail. But it focuses
00:01:17on long-running agentic tasks, specifically 113 tasks across TypeScript, Go, Python, JavaScript,
00:01:23and Rust with isolated environments and program-based verifiers. And here on this graph, we can see
00:01:29the score, the percentage it gets correct on the left-hand side, as well as the average cost
00:01:34per task. Now, we want to be up and to the right. The most efficient area is over here in the top
00:01:39right. That's where we get the highest score at the lowest cost. And we can see here, GLM 5.2
00:01:44max is giving us a 44% at $3.92 per task. If we compare that to Opus 4.8 and GPT 5.5, we can see
00:01:55they do a lot better. At max, Opus 4.8 is doing 59%, and 5.5 is doing 67% at extra high. Obviously,
00:02:04at extra high and max, we have a pretty steep cost. For GPT 5.5, it's $7.23. $13 for Opus,
00:02:12and at GLM, it is $3.92. So much cheaper. However, when we look at different effort levels
00:02:19at 5.5 and at Opus, if we are at medium, for example, with Opus 4.8, we are going to score
00:02:25higher than GLM 5.2, and we're going to be less expensive. So 49% at $3.44 versus 44% at $3.92. And the gap is
00:02:36significant with GPT 5.5 as well, with 54% at $2.75 versus 44% at $3.92. So right off the bat, on this benchmark,
00:02:47if we take it at face value, 4.8 and 5.5 are a step above GLM 5.2. And that's not surprising. These
00:02:55are the best of the best frontier models. They are not open source. And if we like really put the pedal to
00:03:01the metal, they're going to kind of blow GLM 5.2 out of the water on these like long horizon tasks,
00:03:07which is kind of expected. What you might not have expected is the fact that they can do better for cheaper,
00:03:11which is kind of an issue. And I just want to get that out there because I know there is a lot of
00:03:16talking and a lot of hype right now about GLM 5.2 and the fact that it's open source. And, you know,
00:03:21that immediately kind of implies like, oh, it's super, super cheap. And we can do really good things.
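For readers following along in text, the comparison being made here amounts to a dominance check: one configuration is strictly the better pick if it scores at least as high and costs no more per task. The sketch below is a minimal Python illustration using only the figures quoted in the video. The `Run` class and the function names are hypothetical, and the Opus max cost is the rounded "$13" as spoken, so treat it as approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    score_pct: float  # percent of benchmark tasks passed, as quoted
    cost_usd: float   # average API cost per task, as quoted


# Figures as spoken in the video; nothing here is measured independently.
RUNS = [
    Run("GLM 5.2 (max)", 44, 3.92),
    Run("Opus 4.8 (max)", 59, 13.00),  # "$13" as spoken, approximate
    Run("Opus 4.8 (medium)", 49, 3.44),
    Run("GPT 5.5 (extra high)", 67, 7.23),
    Run("GPT 5.5 (medium)", 54, 2.75),
]


def dominates(a: Run, b: Run) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.score_pct >= b.score_pct and a.cost_usd <= b.cost_usd
    strictly_better = a.score_pct > b.score_pct or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def dominated_by(target: Run, runs: list[Run]) -> list[str]:
    """Names of every run that beats target on score and cost together."""
    return [r.name for r in runs if dominates(r, target)]


def pareto_front(runs: list[Run]) -> list[Run]:
    """Runs that no other run dominates."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


glm = RUNS[0]
print("Dominate GLM 5.2 (max):", dominated_by(glm, RUNS))
print("Pareto front:", [r.name for r in pareto_front(RUNS)])
```

On these five quoted points, both medium-effort configurations dominate GLM 5.2 at max, which is the point being made: on this benchmark the closed models can score higher and cost less per task at the same time.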
00:03:25Well, I mean, by the numbers, it's good, but it's not 4.8 or 5.5 based on this benchmark. And remember,
00:03:33these 4.8 and 5.5 numbers are based on API costs. If I'm on the Max plan, it's like 10x cheaper than
00:03:40this. Same thing if I'm just on like OpenAI's, you know, $100 a month plan or $200 a month plan. So
00:03:46that's another thing to take into account. So just kind of want to pump the brakes on like any of this
00:03:50hype saying like GLM is way cheaper because it's kind of not. And even though it's open source,
00:03:56GLM 5.2, the open source model that's getting these numbers, is not open source in the sense that you
00:04:01can just download this onto your computer. It's open source in that, like, you can see the code,
00:04:05you can see the weights. It's not open source in the sense of like, oh, I can just go get
00:04:09it on Ollama, I can run it on my personal PC. No, you can't. No, you can't. This is like almost a
00:04:14trillion parameters. This requires a ton of hardware to run. So don't get confused because I know
00:04:20there's a segment of the population that does, but this is just to set the stage. And again,
00:04:24this is on DeepSuite stuff. These are like very intense sorts of tasks that are being given. And
00:04:30today we're going to do a few different tests that are like a little bit lower level and that are
00:04:35probably more of a reflection of what you, the average user is running. So something to keep in
00:04:39mind. And just so we're all on the same page, this is what we're looking at in terms of cost
00:04:44per token. Remember, the reason it was cheaper for Opus 4.8 and 5.5 is because they just used way fewer
00:04:50tokens to do what they needed to do. They were just ultimately more efficient. But on a per token basis,
00:04:55and remember for input and output this is per million tokens, GLM 5.2 is $1.40 for input and
00:05:01$4.40 for output. And Opus 4.8 is 5.7 times more expensive. And 5.5 from GPT is 6.8 times more
00:05:10expensive. So on a per token basis, much cheaper. But remember, we care about outcomes for a task,
00:05:16not necessarily a one for one token comparison. And now before we jump into the actual tests,
00:05:21a quick word from today's sponsor, me. So I just released my Claude Code Masterclass inside of
00:05:26Chase AI Plus and it's the number one way to go from zero to AI dev, especially if you don't come
00:05:30from a technical background. I update this every single week and it also includes masterclasses for Codex
00:05:35and for creating your own agentic OS. So if this is something you want to learn more about and you're
00:05:40not sure where to start, Chase AI Plus is the place for you. There's a link to it in the pinned comments.
00:05:46So here's how we're going to run this test. We are going to give every single model the same
00:05:49prompt in plan mode. It's going to give us the plan. We may or may not do some back and forth,
00:05:53depending on what we think of the plan it comes up with. And after that, we'll let it execute.
00:05:58After it executes, I will apply my extremely subjective grading criteria to the end result and let you know
00:06:03which one I like best. If you don't like my grading criteria or what I decide is best, make sure to
00:06:08leave a comment. I will also make sure to delete your comment. Now, over here on the left, we have
00:06:14GPT 5.5 inside of Codex on extra high. We have OpenCode in the middle running GLM 5.2 on extra high,
00:06:21routed through OpenRouter. And over here on the right, we have Claude Code running Opus 4.8
00:06:26on high. Now, why did I choose these particular effort settings? Because that's how most people
00:06:32use these in real life. And chances are you're either on the Max plan or you're on some sort of
00:06:37OpenAI plan and you probably aren't running it on medium. Let's be honest. So I think this is a
00:06:42better reflection of how your average user is actually using these models day to day.
00:06:47So for our first prompt, we're going to have it build a playable 3D racing game that runs in the
00:06:51browser. And importantly, we're keeping this prompt kind of vague. I'm saying you have full freedom to
00:06:56go out on the web and pick whatever stack and library you think is best to execute this. And so
00:07:02let's go ahead and run it and see what happens. So we have all three models running in plan mode.
00:07:08And again, the thought behind making the prompt kind of vague is that we want to see as much
00:07:12divergence from these models as possible. If I gave it the exact roadmap, how to do every single thing,
00:07:18well, then we really don't get to see how these models think and how they approach more sort of
00:07:23like messy problems. So after 13 minutes, Opus 4.8 was the first one to finish creating the racing
00:07:29game. So let's take a look at what it made. So here we are kind of low poly. It does have
00:07:37some sound going on. Moves pretty smooth. Looks like we have the ability to like drift on here as well.
00:07:44Okay, the grass actually kind of messes with how the physics work. Overall, like pretty smooth, but you
00:07:54know, kind of relatively boring, right? Like this is a pretty basic racetrack. Nothing crazy. It didn't add
00:07:59any sort of like AI or anything like that. So I'm interested to see how the other models do in terms of
00:08:04complexity. And what I'll probably do after this first test, if these are all kind of the same sort of
00:08:09bland vision, is give them another prompt that kind of ups the ante. Next up
00:08:13is GLM 5.2. So it took about five minutes longer than Claude Code. For reference, GPT 5.5 is still
00:08:20working, which I'm not too surprised by. It tends to be a little bit slower. In terms of token comparison,
00:08:26Claude Code used about 100,000 tokens to create that. And GLM 5.2 took over a million. And we can take a look
00:08:33inside of OpenRouter for this run, where the total spend was $1.21. And total token volume was 1.35
00:08:41million to create this game. So right away, interesting kind of track we got going on.
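The run totals just quoted allow a quick sanity check on the "cheaper per token, not necessarily cheaper per task" point. The sketch below is a rough Python calculation, not a billing reconstruction. It assumes the 5.7x multiplier quoted earlier applies equally to input and output prices, and it takes the roughly 100,000-token figure for the Opus run at face value. The input/output split for either run is not given in the video, so only lower and upper bounds are computed. All function and constant names are hypothetical.

```python
# Per-million-token list prices for GLM 5.2, as quoted in the video.
GLM_INPUT_PER_M = 1.40
GLM_OUTPUT_PER_M = 4.40

# Assumption: the quoted "5.7 times more expensive" applies to both prices.
OPUS_MULTIPLIER = 5.7


def blended_rate(spend_usd: float, tokens: int) -> float:
    """Effective USD per million tokens for a whole run."""
    return spend_usd / (tokens / 1_000_000)


def cost_bounds(tokens: int, input_per_m: float, output_per_m: float) -> tuple[float, float]:
    """Lower/upper cost if every token were billed as input or as output."""
    millions = tokens / 1_000_000
    return millions * input_per_m, millions * output_per_m


# GLM 5.2 first pass: $1.21 for 1.35 million tokens (OpenRouter totals quoted above).
glm_rate = blended_rate(1.21, 1_350_000)

# Opus 4.8 first pass: about 100,000 tokens, split unknown.
opus_low, opus_high = cost_bounds(
    100_000,
    GLM_INPUT_PER_M * OPUS_MULTIPLIER,
    GLM_OUTPUT_PER_M * OPUS_MULTIPLIER,
)

print(f"GLM blended rate: ${glm_rate:.2f} per million tokens")
print(f"Opus run at list-price bounds: ${opus_low:.2f} to ${opus_high:.2f}")
```

Under those assumptions the Opus run lands somewhere between roughly $0.80 and $2.51 at API rates, against $1.21 for GLM, so the two runs end up in the same ballpark despite the large gap in per-token price.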
00:08:48The controls are pretty jumpy compared to what we had with Claude Code. Like I'm moving
00:08:53very fast relative to the track itself. Very fast. Like I'm screaming through this. And there's also
00:09:00no real differentiation between the track and the field itself. And in
00:09:09certain instances, like you saw there, I was almost able to go through the track, but not really.
00:09:15So also the car itself is a little less detailed than what we saw inside of Claude Code. I mean,
00:09:23so there is a track, it does have a timer. In terms of actual like gameplay, a little janky for what it
00:09:30is, not nearly as smooth. And also again, kind of with like the low poly situation like we saw with
00:09:36Opus. And so I'd love to see what it does if we tell it to like really create something that looks
00:09:40better. And also this track itself doesn't actually make a whole lot of sense. So now we're looking at
00:09:44what GPT 5.5 created. It calls it the Foundry Circuit, the night shift time trial, three laps
00:09:50through the steelwork. So something different, I guess, than the generic track we've seen in the
00:09:54last two. So let's go ahead and start this. And let's go. Well, I don't actually know where I'm
00:10:04supposed to go. Oh, I guess this is the track. Wheels look kind of interesting. They're kind of
00:10:10spinning the wrong way. So that's something. Okay, it has like very annoying noises, actually.
00:10:21And I kind of can't get over the wheels going horizontal, or however you'd even describe this.
00:10:28The track itself is fine, you can kind of move. Yeah, you can go past the track and it slows you down. But it's not
00:10:35clear that this is like a paved track, like we saw with what Opus built, and that the rest is,
00:10:41you know, say, the field. So kind of strange graphics, honestly. Also, when you consider
00:10:48the fact that it took like twice as long as Opus, it's kind of weird. Yeah, honestly, kind of strange. Again,
00:10:55like why, why did it do this with the wheels? I have no idea. Again, went for the low poly thing.
00:11:00And it's just like very dark, for like seemingly no reason. So I mean, like, I almost, I feel like
00:11:06this is more functional than what we got with GLM 5.2, but like, not that much better. And you also
00:11:12consider the fact that this was on extra high on 5.5. Now in terms of token usage for 5.5,
00:11:17it came out to roughly what we saw with Claude Code. It used 7% of its five hour window. So almost
00:11:22nothing. Now, overall ranking, I would have put Opus 4.8 clearly ahead of GLM 5.2 and 5.5. I thought
00:11:28the latter two were kind of janky, but we're actually going to give them another shot because
00:11:32we're going to tell them to take another look at the code, do another pass. And we also want them
00:11:36to do a lot better in terms of the graphics. I don't want the low poly stuff. I want this to look
00:11:40like a triple A game or as close to it as possible. So let's see what happens when we give them
00:11:46attempt number two. So Opus and GLM finished their second pass and 5.5 is finishing up there. So
00:11:50let's take a look at Opus 4.8 first. So right away, we see a car that is way better. Like this is a huge
00:11:58improvement in regards to the car than what we saw before. We also see a lot different lighting.
00:12:04Like you can see the sun reflected on the ground itself and everything looks way more smooth. I mean,
00:12:10the trees themselves are kind of like low polygon type deals, but the lighting and especially the car
00:12:15are a major step forward. And it still keeps sort of that same smooth gameplay. I mean, besides the
00:12:20fact we have trees in the road, but the trees themselves are also shadowed. And for one additional
00:12:26pass that took 10 minutes and about 50,000 tokens, not bad. Now we'll look at GLM. And at this point,
00:12:32it took about another 1.2 or so million tokens to make this update, putting our total spend at $1.83.
00:12:38So let's start it up. And it looks like it tried to add some sort of different lighting. The car looks
00:12:46a little bit better, but the lighting itself is kind of strange. Like it's just very glary. The track
00:12:52itself hasn't changed a whole lot. You know, it's still kind of just like grass everywhere. And the
00:12:57controls are still very jumpy, right? Like I'm going very fast relative to the track. Same sort of issue
00:13:04that I had before where like some of the track I can pass through some of it, I can't. So I mean,
00:13:10the graphics for the car look better, but I would argue the lighting and the glare is so distracting.
00:13:15It's probably kind of a downgrade from what we had prior. And here is the second pass with 5.5. Now
00:13:21the car looks a little bit better, but looking at everything else, this is kind of the same. Well,
00:13:29the wheels are better. We fixed the wheel issue. They're actually turning the way wheels should,
00:13:34but still has annoying noises. And there's no real differentiation again, between like the path
00:13:42and like the grass. So it kind of feels like sort of the exact same thing it did the first time with a
00:13:49slightly better car. But you know, when we told it go for like a triple A aesthetic, I wouldn't say it
00:13:55hit the mark. And again, I feel like, big picture, when we look at these three, GLM and 5.5 are definitely a step
00:14:02below Opus. Now for our next test, we're going to have it build us a website. And the prompt we're going
00:14:07to be using is this. We want it to build a fake landing page for a product, which is AI powered
00:14:12smart glasses. Think something like Meta Ray-Bans. Again, we're giving these models full freedom in
00:14:16terms of the stack and design. We're telling it to pick whatever it thinks is best, install what it
00:14:20needs and look up the best practices for creating landing pages. We're telling it, hey, go ahead and find
00:14:25images and product shots. And don't just rely on creating your own sort of HTML stuff. And importantly,
00:14:31we're saying, make it look like an award site. We don't want it to look like AI slop. We want real
00:14:35visual hierarchy, intentional typography, and motion where it makes sense. So landing page for smart
00:14:42glasses, we want it to be sort of award style. So let's see what they come up with. So all three
00:14:46of them finished up for reference, GLM used about a million tokens to execute this while Opus and 5.5
00:14:53used about a hundred thousand, give or take. So first up we have what Opus built us. Very dark background.
00:14:58It has sort of these glasses it created, and the text is sort of cut off right here, which is
00:15:04unfortunate. As we scroll down, this is also kind of oddly placed because we can see the scroll text
00:15:12kind of over the top of it. But as I mouse over, you can see it kind of move around and it changes
00:15:18color, which is kind of cool. As I scroll down, we have some scrolling sort of like loading animations
00:15:24for everything. But all in all, it looks fine. For the glasses themselves, it used like HTML.
00:15:31So it's like, what are you really getting out of this? It didn't even like find some sort of glasses
00:15:35to use. And it has, you know, hey, here's how you can reserve it and here's how you can buy it. So
00:15:41it's fine. Again, we didn't give it a ton of direction, but we told it to go for like an awards
00:15:45type look. I would not consider it on that sort of level. Now let's take a look at what GLM built us.
00:15:51And I don't actually know what's going on here at all. In fact, this is kind of like barely loaded.
00:15:59It shows us some glasses, but like this website's kind of like a disaster. It's like it didn't even
00:16:04really finish this. It almost like just threw it all together. Yeah. Yeah, the prompt wasn't super
00:16:13detailed, but it should be able to do more than this based on what I gave it. This is like actually
00:16:19terrible. I have no idea what it actually was trying to accomplish here. And lastly, we have GPT 5.5. So
00:16:25this is a little bit interesting. I think it looks kind of cool, although the glasses
00:16:30somewhat overlap the text here. And we have a lot of dead space, which you could argue that's
00:16:34something of a design choice. And we have the banner that actually moves, you'll remember the
00:16:39Opus version did have a banner, but it wasn't moving. And then as we scroll down, you'll also notice the
00:16:44cursor is kind of like multicolored. And as we scroll down, it looks like it created some HTML
00:16:50type assets. I mean, strange, right? We did tell it, hey, you can go find what you need to find online
00:16:55if you want to. But overall, probably the best out of the three. But, you know, I wouldn't say I was in
00:17:04love with any of these. It kind of shows you how strong of a hand you need to take when doing any
00:17:09sort of visual design or UI type things. Even these most advanced models struggle. Like
00:17:14I actually have no idea what the heck is going on. Like this is a mess. So overall, Opus was
00:17:21okay. 5.5 was the best of the bunch and GLM was like actually a complete failure. And just like we did with
00:17:26the gaming version, we're going to give them a second pass at this and see if they can clean up what went
00:17:30wrong. And on top of that, similar to the game we had them create, we're going to ask them to
00:17:36integrate some three.js elements. Like we really want to see how they can push their capabilities with
00:17:42sort of like motion and graphics and that sort of thing. And that new prompt looks like this: take
00:17:46the smart glasses landing page you've just built and rebuild it as an immersive 3D experience using
00:17:51three.js. So we want an actual interactive 3D scene. And again, we're giving it full freedom to
00:17:56execute it as it sees fit. And so here's what we got with Opus 4.8. You can see now that it added
00:18:02some three.js, and these glasses sort of move. But beyond that, we have some of the original issues,
00:18:08right, the text being cut off, it being overwritten right here. And the rest of this kind of just being
00:18:13like, man, it's pretty obvious that AI created this. Oh, and of note, token costs
00:18:21were pretty much equal on the second run across the board to the first run for all these. Next,
00:18:27we have GLM 5.2. And this time it actually created a website that makes sense. We have these glasses,
00:18:32although the glasses it made are kind of odd. Like, you know,
00:18:36no glasses would actually look like that, and the text is also cut off here. But we have a banner
00:18:42that does scroll when I scroll over the top of it, it does stop. And I would say overall, in terms of
00:18:48like how it laid out the website, I would probably give it the edge over Opus. Now, I don't think
00:18:55either of them are particularly good. And we kind of gave them free range to do whatever they want. But
00:18:59I would put this over kind of this setup. Although in terms of like the hero section itself,
00:19:05I do like Opus 4.8 better. Now, GPT 5.5, I think is the winner here. I think this just looks
00:19:10better overall from a subjective design standpoint. And I think the three.js sort of like motion graphics
00:19:18it added here are pretty cool. I think it makes sense in context of what it created. Like we have
00:19:22all this white space up top and the glasses kind of, you know, are able to live in there. And as for the
00:19:27rest of the website, I think it looks fine. Again, it still looks very like, quote unquote,
00:19:32AI slop in the sense that AI definitely created this, but it doesn't look bad. And like from top
00:19:37to bottom, I do prefer what 5.5 gave us over all the others. And so when we take a look at this whole
00:19:42thing holistically, bringing in these more sophisticated benchmarks like DeepSuite alongside
00:19:48what we just did today, I think this is kind of what we expected. I don't think GLM did extremely poorly
00:19:56in any sense of the word, but it definitely felt like it was a step below GPT 5.5 and Opus 4.8. In
00:20:03the first section, where Opus was better than all of them, and in
00:20:07the second section, where GPT was better than all of them, GLM was always near the bottom. It wasn't
00:20:12grossly worse than any of them, but it certainly wasn't better. And it also used way more tokens, roughly ten times as many in these runs.
00:20:17And so when we take a look at something like this, the DeepSuite score, where it's like,
00:20:21hey, GLM is kind of at the bottom and actually is less efficient than 5.5 and 4.8, both in terms of
00:20:27cost and how well it does. It kind of makes sense. I think this is kind of what we see. And so big
00:20:35picture, is GLM a great open source model? Definitely. But does it run into some issues that open source
00:20:41models have in general, namely, they aren't as powerful? Yes. And furthermore, if you are someone who is
00:20:47open source maxing, understand this is not something you would run on your PC, right? This requires a
00:20:52ton of hardware to use. And I think what gets lost in the conversation is what we talked about at the
00:20:57beginning, which is like, okay, the costs are kind of already a problem for GLM 5.2. Yet this doesn't
00:21:05even take into account the huge subsidization you get on either the Anthropic Max plan or the OpenAI
00:21:12Max plan. So you keep that in mind and it's like, okay, there kind of isn't a debate.
00:21:16It really isn't a debate. So would I suggest using GLM 5.2 for your average person? No,
00:21:24not really. I think maybe if you're doing lower level tasks and you're someone who's comparing
00:21:29it purely on API prices, maybe, maybe. But it's, you know, I think it's kind of hard to argue that
00:21:38because then what are we doing when the next, when, you know, Sonnet 5 comes out next week? Like,
00:21:42are you just going to jump from there to there? Like there's something to be said with just like
00:21:46sticking with the model, especially when we're talking more like enterprise team level stuff,
00:21:50where the API costs really start to add up. Because again, for the average single user who's going to
00:21:55be using one of the subsidized plans and isn't paying straight up API costs, I don't see an argument for
00:22:01GLM 5.2. So that's where I'm going to leave you guys for today. Hopefully I shed some light on this
00:22:05whole GLM debate and all the hype that you see coming out around it. As always, let me know what you
00:22:09thought in the comments. Make sure to check out Chase AI Plus if you want to get your hands on
00:22:13the Claude Code Masterclass, and I'll see you around.