00:00:00Anthropic just released Claude Opus 4.6, and it achieves the highest score on Terminal Bench 2.0 of any model
00:00:06Sorry to interrupt your programming here
00:00:10But it turns out GPT 5.3 Codex just came out, and that actually beats Opus 4.6 on Terminal Bench by over 10%
00:00:16So it seems like Anthropic's reign was genuinely only a few minutes. The competition between these two is really heating up
00:00:23So I'm super curious to see what's new in these models and find out which one feels the best to use, as lately for me
00:00:29it's actually GPT 5.2 that's felt better.
00:00:31So I'm curious to see if Claude can claw back some of its advantage or if OpenAI were ready with GPT 5.3 Codex
00:00:37First up, a quick TL;DR on what's new in these models. As we all know, they're gonna be better than their last versions on the benchmarks,
00:00:48which I'll show at the end, but has anything else actually changed about the models?
00:00:52Well, for Opus,
00:00:53they're actually claiming it can plan more carefully, sustain agentic tasks for longer, and operate more reliably in larger codebases, with better
00:01:00code review and debugging skills to catch its own mistakes.
00:01:02Now, these are actually a few of the things that I found Opus was weakest at compared to GPT 5.2. In my experience,
00:01:08Opus typically got started coding faster and usually just made a few more mistakes,
00:01:12whereas GPT 5.2 actually took a little longer to get coding and understood the context of the repo.
00:01:17So hopefully these changes do improve Opus here, and it's also probably going to be improved by its new 1 million token
00:01:23context window,
00:01:24although it is mentioned that this is in beta and, similar to other providers,
00:01:27it will also cost you extra, with prompts exceeding 200,000 tokens costing you $10 per million input tokens and
00:01:33$37.50 per million output tokens. Moving on to Codex 5.3,
00:01:38OpenAI are stating that this model advances the frontier coding performance of GPT 5.2 Codex and the reasoning and professional knowledge
00:01:45capabilities of GPT 5.2 together in one model, which is also 25% faster.
00:01:51This should enable it to take on long-running tasks that involve research, tool use, and complex execution.
00:01:57So it really seems that they've pushed this model to be a bit of an all-rounder now, with GPT 5.2 knowledge and improved coding capabilities.
00:02:03But all of that is just marketing speak.
00:02:05So let's put these models through some real-world tests. The first one I tried was updating a Convex agent package to support the AI
00:02:11SDK v6. I've been really liking Convex as my database lately, and this package essentially just helps link the AI SDK with the database
00:02:19so you get really good performance, but the problem is it wasn't upgraded to the latest version.
00:02:23You can see here in Vercel's documentation that the migration from v5 to v6 is not an easy migration to make.
00:02:28They made a lot of breaking changes and changed a lot of types.
00:02:32So what I did was make a basic chat app in Convex that actually worked using the agent package.
00:02:36But then I upgraded the packages to v6 and I got a load of build and type errors.
00:02:40I simply asked the models to fix them. You can see the prompt I used here in Codex.
00:02:44I said I'm building a chat app with Convex and I had a working version,
00:02:46but then I upgraded to v6 and I need to fix the type and build errors.
00:02:50I passed it the migration guide so it can use that as context if it wants, and I said I want all of the tests
00:02:55passing and to avoid TypeScript hacks like "as any" where possible, as I often see a lot of the models do this.
00:02:59So I specifically wanted to say please don't, as there are quite a lot of complex types in this AI
00:03:03SDK. Now, since we're already in Codex, we can see how 5.3 Codex performed. It started off by
00:03:09understanding the repo. You can see it saw it was a monorepo with that packages/agent that we had. Then it identified a few
00:03:15root causes and some packages that needed to be upgraded, and listed out exactly how it was going to work through this task. After that,
00:03:22it just got started coding, made a few changes, ran a build every so often, and just worked on
00:03:27fixing all of those type errors. Overall, it actually ran for about 40 minutes completely uninterrupted,
00:03:32which I was super impressed with. You can see it actually added
00:03:35545 lines of code and removed 111. Over in Claude Code,
00:03:39I gave it a copy of the exact same project and used the exact same prompt, and again it worked through the task for around 40
00:03:44minutes. It did have a few build errors when I actually tried to start it,
00:03:48so I did have to send one more prompt to actually get Opus to give me a working version of the code.
00:03:53But again, it was a pretty similar experience to what we saw in Codex.
00:03:56But the one thing I must say is I do really like the Codex UI. I prefer it to a terminal UI. I'm sorry.
00:04:02Anyways, I can confirm that after one prompt with Codex 5.3 and two prompts with Opus 4.6,
00:04:06they both managed to upgrade the agent package to the new version of the AI SDK with no type errors,
00:04:11no build errors, and all of the tests passing, but they did handle it in different ways. Now here
00:04:16I have the changes Codex made on the left and the changes Opus made on the right.
00:04:19You can actually see Opus made a few more changes to the project compared to Codex.
00:04:23They actually handled a few of the features a little bit differently.
00:04:25One of the things that Codex did really well is actually have this tool approval request logic here.
00:04:30This was something that was new in the AI SDK v6. I can't seem to find any mention of this in the Opus version.
00:04:35It seems like it just passed over it and didn't actually add it into the code.
00:04:40But one thing that I think Codex did really poorly was, if we go over to the UI messages, it actually added in its
00:04:46own function entirely for converting a UI message to a model message. Now,
00:04:50if you don't know, the AI SDK actually just has a function to do this for you, and it should definitely use that instead.
00:04:57You can actually see side by side here that Opus did this correctly.
00:05:00It just used the convertToModelMessages function that comes from the AI SDK.
00:05:04And what this means is in the future if they do upgrade this package
00:05:07I'm not going to have to worry about making any changes to my own version here as I should just be using the one that comes
00:05:13from the package
00:05:14So this is a little bit of an annoying thing and a bit of a red flag to me when I was looking through this code.
00:05:19But to get a second opinion on my code review,
00:05:20I actually passed the diff back into Codex 5.3 and asked it to do a review along with me, and you can see it listed out the
00:05:26advantages and disadvantages of each approach here.
00:05:29But down at the bottom it gave me a conclusion, and Codex 5.3 actually prefers the Opus chat version, saying it has the better migration
00:05:36architecture. If it had to pick one as the better base to ship safely,
00:05:39it would choose Opus chat, then pull over Codex chat's approval and denial handling.
00:05:43So for that extra function we saw for the tool approval request,
00:05:46it says simply take that from the Codex version and add it to the Opus version, and we have a better migration.
00:05:51So it's at least nice to see that Codex 5.3 isn't biased there and it didn't choose itself.
00:05:55But I must admit the way that both of these handled the migration was pretty similar, and I could probably prompt them to go in the right
00:06:01direction. But one test isn't enough.
00:06:03So the next test is a little less serious: I asked both of them if they could create me a Club Penguin clone
00:06:08using Three.js. Now, I'm not going to tell you which is which, but this is the first game that we got out.
00:06:13You can see I have a create your penguin here and we're actually seeing the avatar up here change
00:06:17I can add on some caps here. So I've got a party hat, a propeller, a crown.
00:06:21I'm gonna choose the propeller cap here and click play. If you actually know anything about Club Penguin,
00:06:26I'd say this has done an okay job of mimicking the town center that we have, although the pizza isn't over here.
00:06:32There's normally a disco center here, and you can't actually go into any of these buildings.
00:06:35You can see none of these are solid yet.
00:06:37But what it did quite well is if we go to the map we can go to different zones here
00:06:41So we have the ski village. If I click and move around here,
00:06:44I do think my penguin looks okay for something in Three.js where I gave it no assets or anything like that.
00:06:49It's done this all from its own training, and we can actually go in and play the sled racing game here,
00:06:54which was my absolute favorite in Club Penguin. There are definitely a few things missing,
00:06:59I must admit, but it's a pretty good first pass, and it did all of this in a single prompt.
00:07:04I can even confirm that this version does have an attempt at the Cart Surfer game here,
00:07:07which was my favorite on Club Penguin, but this one seems a little broken.
00:07:11You can sort of just go from side to side, and now I think I'm under the map. It's also really dark. Now,
00:07:15this is what the other model gave me, and I want you to put in the comments
00:07:18which model you think did a better job and if you can work out which model made each version.
00:07:22I'll tell you at the end of this test. You can see in this one
00:07:25we have the same color selectors that were in the prompt.
00:07:27We also have the hat and accessory here. So I'll choose crown this time and we click start exploring.
00:07:31The penguin's a little chunkier in this version. I must say it's funnier looking, but again, I gave this no assets.
00:07:36This is just from scratch in Three.js.
00:07:38It has the same problem where you can sort of walk through the buildings.
00:07:41But we do have the map and we have all of the different zones here
00:07:44So if I go over to the ski village
00:07:46I should be able to play the game so I can play sled racing here and to be honest
00:07:50This is pretty similar to the other version of the sled racing game that we had
00:07:53You can see we have some of the trees coming up in the distance here
00:07:56We have three lives and the life counter does actually work
00:07:58But it doesn't seem like we can jump in this version
00:08:01This model, though, did also give me a version of the Cart Surfer game,
00:08:04but again, this one is a little weird,
00:08:06although I guess it's more functional because you can actually see things in this version and you can jump. But
00:08:11I'm not sure where I'm actually surfing. There is no sort of rail, and overall, yeah, it's not the Cart Surfer game
00:08:17I remember from Club Penguin. Overall though,
00:08:19I'm always impressed with what these models can do in a single prompt, especially with Three.js. If you're wondering which model did which, the
00:08:25first one was Opus 4.6 and the second one was Codex 5.3, and I actually think I prefer the first one.
00:08:30So I think Opus 4.6 wins on my Club Penguin test. Now, the final test
00:08:34I ran on these models was to see how good they are at UI design, as these models are getting pretty good at this.
00:08:38So I gave both of them a prompt to build me a landing page for an AI-only social media site,
00:08:42similar to Moltbook. The page should be snarky, emphasize that it's the future and for AI only, and do this all in a single
00:08:49HTML file. This is the result I got back from both of the prompts, and I must admit I am very impressed with Codex here.
00:08:55We have Codex 5.3 on the left and Opus 4.6 on the right, and I just really like the way that Codex
00:09:005.3 went with this site.
00:09:01It's gone for a neo-brutalism design, and it's just a little more fun than some of the other vibe-coded sites.
00:09:06I think Opus 4.6 here, while being a good design, just looks like a typical vibe-coded app. It's done it very well,
00:09:13I must admit, but again,
00:09:14it's got these purple gradients, and everything about this just screams that it was vibe coded, whereas I think the Codex
00:09:205.3 version looks like someone has had a bit more manual input, maybe prompted it to go in that direction,
00:09:25even though I gave them the exact same prompt.
00:09:27The only thing that I think Opus 4.6 did a little bit better is that the page is actually a little more functional.
00:09:32You can see we have this sort of trending tab down here. We have rules and top models of the week.
00:09:36We have popular subreddits and also a popular feed, whereas the Codex 5.3 one is a little more bare,
00:09:41and we sort of just have this trending tab down here and that is it.
00:09:44So I'm definitely curious to see how these score on Design Arena. As they just came out,
00:09:47they're not ranked yet, but at the moment GLM 4.7 is currently the leader,
00:09:51so I want to see if 5.3 Codex or Opus 4.6 can take that crown. Overall,
00:09:55both of the models are pretty capable, and it's quite hard to tell which one is going to be the best.
00:09:59I think personally I'll probably lean towards 5.3 Codex,
00:10:03but purely because I like the Codex app and just the overall experience I've had with prompting OpenAI models. If we want to compare
00:10:09them on the benchmarks though, as I mentioned in the intro, Codex has a massive advantage on Terminal Bench 2.0,
00:10:15which is actually a pretty incredible leap, and that's basically the only benchmark that we can currently compare, as I don't think
00:10:21Anthropic were ready for OpenAI to release this model yet, and annoyingly they don't use the same benchmarks in their blog posts.
00:10:28I did check Artificial Analysis, and so far they've only benchmarked Opus 4.6 for coding, and only the non-reasoning version as well.
00:10:35But I guess it's pretty impressive that the non-reasoning version of 4.6 actually performs as well as the reasoning version of 4.5
00:10:42Opus. My personal feeling at the moment is that the jump from Opus 4.5 to 4.6 is a little more marginal than from 5.2 Codex to 5.3,
00:10:49but I'm gonna have to use both of them and see how they feel in the real world.
00:10:53There are a final few extras in both of these releases,
00:10:55and one of the coolest ones is that both of the models apparently have improved cybersecurity capabilities, with OpenAI saying that GPT
00:11:015.3 Codex is the first model they classify as high capability for cybersecurity-related tasks and the first they've actually directly trained to identify
00:11:09software vulnerabilities. Anthropic basically says the same in this long blog post. Now, one feature of Codex that I'm really expecting to like
00:11:16is that it can actually be steered while it's working. They say instead of waiting for a final output,
00:11:21you can actually interact in real time, asking questions, discussing approaches, and steering it toward a solution.
00:11:27And I just think this approach is a little bit better, as I'm always debating whether I should let the model finish first or if I
00:11:32should interrupt it and stop what it's doing when I want it to make changes.
00:11:35And I just think, especially when we now have tasks that can run for significant lengths of time,
00:11:40this is going to be a much nicer user experience. We can actually talk to it while it's working.
00:11:44Finally, we have a few new features for Claude as well. The first one is in Claude Code:
00:11:48you can now use agent teams to work on tasks together, aka sub-agents. Richard actually made a video on this earlier this week,
00:11:55so check that out if you're interested in learning more. There are also some cool API features: Claude now has a compaction feature
00:12:01built into the API, so you can actually use that to summarize its context and perform longer-running tasks.
00:12:06And there's also a new adaptive thinking mode,
00:12:08so essentially you just let the model pick up on contextual clues to see how much it should actually use its extended thinking.
00:12:13There we go. Coding models have come a seriously long way.
00:12:16If you didn't know, it's actually not even been a year since Claude Code was released.
00:12:20Let me know what you think of all of these models in the comments. While you're there, subscribe, and as always, see you in the next one.
00:12:31(upbeat music)