OpenAI is Winning... (Opus 4.6 + Codex 5.3)

Better Stack
Computing/Software · Business News · Video & Computer Games · Internet Technology

Transcript

00:00:00Anthropic just released Claude Opus 4.6 and it achieves the highest score on Terminal Bench 2.0 out of any model
00:00:06Sorry to interrupt your programming here
00:00:10But it turns out GPT 5.3 Codex just came out and that actually beats Opus 4.6 on Terminal Bench by over 10%
00:00:16So it seems like Anthropic's reign was genuinely only a few minutes. The competition between these two is really heating up
00:00:23So I'm super curious to see what's new in these models and find out which one feels the best to use, as lately for me
00:00:29it's actually GPT 5.2 that's felt better
00:00:31So I'm curious to see if Claude can claw back some of their advantage or if OpenAI were ready with GPT 5.3 Codex
00:00:37First up, a quick TL;DR on what's new in these models. As we all know, they're going to be better than their last versions on the benchmarks
00:00:48Which I'll show at the end, but has anything else actually changed about the models?
00:00:52Well for Opus
00:00:53They're actually claiming they can plan more carefully and sustain agentic tasks for longer and can operate more reliably in larger code bases with better
00:01:00Code review and debugging skills to catch its own mistakes
00:01:02Now these are actually a few of the things that I found Opus was weakest at compared to GPT 5.2. In my experience
00:01:08It typically got started coding faster and usually just made a few more mistakes
00:01:12Whereas GPT 5.2 actually took a little longer to get coding and understood the context of the repo
00:01:17So hopefully these changes do improve Opus here, and it's also probably going to be helped by its new 1-million-token
00:01:23context window
00:01:24Although it is mentioned this is in beta and similar to other providers
00:01:27It will also cost you extra with prompts exceeding 200,000 tokens costing you $10 for a million input tokens and
00:01:33$37.50 for a million output tokens. Moving on to Codex 5.3
00:01:38OpenAI are stating that this model advances the frontier coding performance of GPT 5.2 Codex and the reasoning and professional knowledge
00:01:45Capabilities of GPT 5.2 together in one model, which is also 25% faster
00:01:51This should enable it to take on long-running tasks that involve research tool use and complex execution
00:01:57So it really seems that they push this model to be a bit of an all-rounder now with GPT 5.2 knowledge and improved coding capabilities
00:02:03But all of that is just marketing speak
00:02:05So let's put these models through some real-world tests, and the first one I tried was updating a Convex agent package to support the AI
00:02:11SDK v6. I've been really liking Convex as my database lately and this package essentially just helps link the AI SDK with the database
00:02:19So you get really good performance, but the problem is it wasn't upgraded to the latest version
00:02:23You can see here in Vercel's documentation that the migration from v5 to v6 is not an easy migration to make
00:02:28They made a lot of breaking changes and changed a lot of types
00:02:32So what I did was make a basic chat app in Convex that actually worked using the agent package
00:02:36But then I upgraded the packages to v6 and I got a load of build and type errors
00:02:40I simply asked the models to fix them. You can see the prompt I used here in Codex
00:02:44I said I'm building a chat app with convex and I had a working version
00:02:46But then I upgraded to v6 and I need to fix the type and build errors
00:02:50I passed in the migration guide so it can use that as context if it wants, and I said I want all of the tests
00:02:55passing. Avoid TypeScript hacks like "as any" where possible, as I often see a lot of the models do this
00:02:59So I specifically wanted to say please don't, as there are quite a lot of complex types in this AI
00:03:03SDK. Now since we're already in Codex, we can see how 5.3 Codex performed. It started off by
00:03:09understanding the repo. You can see it saw it was a monorepo with that packages/agent folder we had, then it identified a few
00:03:15root causes and some packages that needed to be upgraded and listed out exactly how it was going to work through this task, and after that
00:03:22it just got started coding, made a few changes, would run a build every so often and just worked on
00:03:27fixing all of those type errors, and overall it actually ran for about 40 minutes completely uninterrupted
00:03:32which I was super impressed with. You can see it actually added
00:03:35545 lines of code and removed 111. Over in Claude Code
00:03:39I gave it a copy of the exact same project and used the exact same prompt, and again this worked through the task for around 40
00:03:44Minutes and it did have a few build errors when I actually tried to start it
00:03:48So I did have to send one more prompt to actually get Opus to give me a working version of the code
00:03:53But again, it was a pretty similar experience to what we saw in Codex
00:03:56But the one thing I must say, I do really like the Codex UI. I prefer it to a terminal UI. I'm sorry
00:04:02Anyways, I can confirm after one prompt with Codex 5.3 and two prompts with Opus 4.6
00:04:06they both managed to upgrade the agent package to the new version of the AI SDK with no type errors
00:04:11No build errors and all of the tests passing but they did handle it in different ways now here
00:04:16I have Codex on the left and the changes Opus made on the right
00:04:19You can actually see Opus made a few more changes to the project compared to Codex
00:04:23They actually handled a few of the features a little bit differently
00:04:25One of the things that Codex did really well is it actually has this tool approval request logic here
00:04:30This was something that was new in the AI SDK v6. I can't seem to find any mention of this in Opus's version
00:04:35It seems like it sort of just passed it over and didn't actually add it into the code
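The tool approval flow being described can be sketched conceptually. To be clear, this is not the AI SDK v6 API itself — the names here (`runToolCall`, `requestApproval`) are hypothetical — it just illustrates the idea of gating a model's tool call behind a user decision and reporting a denial back instead of silently dropping the call:

```typescript
// Conceptual sketch of approval-gated tool execution (hypothetical names,
// not the real AI SDK v6 surface).
type ToolCall = { toolName: string; args: unknown };

type ApprovalDecision = "approved" | "denied";

async function runToolCall(
  call: ToolCall,
  // The UI supplies this callback, e.g. by showing an approve/deny dialog.
  requestApproval: (call: ToolCall) => Promise<ApprovalDecision>,
  execute: (call: ToolCall) => Promise<string>,
): Promise<string> {
  const decision = await requestApproval(call);
  if (decision === "denied") {
    // Surface the denial back to the model instead of silently dropping it.
    return `Tool call ${call.toolName} was denied by the user.`;
  }
  return execute(call);
}
```

The point of the pattern is that both branches produce something the model can see, which is presumably why its absence in the other migration stood out.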
00:04:40but one thing that I think Codex did really poorly: if we go over to the UI messages, it actually added in its
00:04:46completely own function for converting a UI message to a model message. Now
00:04:50if you don't know, the AI SDK actually just has a function to do this for you and it should definitely use that instead
00:05:00You can actually see side by side here that Opus did this correctly
00:05:04It just used the convertToModelMessages function that comes from the AI SDK
00:05:04And what this means is in the future if they do upgrade this package
00:05:07I'm not going to have to worry about making any changes to my own version here as I should just be using the one that comes
00:05:13from the package
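To make the complaint concrete: the AI SDK's `convertToModelMessages` takes the part-based UI messages a chat client stores and flattens them into the message shape a model expects. A deliberately simplified reimplementation of the idea (illustration only — the real helper also handles tool calls, files, and other part types) might look like:

```typescript
// Simplified illustration of UI-message → model-message conversion.
// The real AI SDK exports convertToModelMessages from the "ai" package;
// this toy version only handles text parts.
type UIMessagePart = { type: "text"; text: string };
type UIMessage = { role: "user" | "assistant"; parts: UIMessagePart[] };
type ModelMessage = { role: "user" | "assistant"; content: string };

function toModelMessages(messages: UIMessage[]): ModelMessage[] {
  return messages.map((m) => ({
    role: m.role,
    // Concatenate the text parts into a single content string.
    content: m.parts
      .filter((p) => p.type === "text")
      .map((p) => p.text)
      .join(""),
  }));
}
```

In the actual project you would just import the helper from the package itself, so future SDK releases keep the conversion logic in sync for free — which is exactly why hand-rolling it is a red flag.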
00:05:14So this is a little bit of an annoying thing and a bit of a red flag to me when I was looking through this code
00:05:19But to get a second opinion on my code review
00:05:20I actually passed the diff back into Codex 5.3 and asked it to do a review along with me, and you can see it listed out the
00:05:26Advantages and disadvantages of each approach here
00:05:29But down at the bottom it gave me a conclusion, and Codex 5.3 actually prefers the Opus chat version, which has a better migration
00:05:36architecture. It said if it had to pick one as the better base to ship safely
00:05:39it would choose Opus's chat, then pull over Codex's approval and denial handling
00:05:43So that extra function we saw for the tool approval request
00:05:46It says simply just take that from the Codex version and add it to the Opus version and we have a better migration
00:05:51So it's at least nice to see that Codex 5.3 isn't biased there and it didn't choose itself
00:05:55But I must admit the way that both of these handled the migration was pretty similar and I could probably prompt them to go in the right
00:06:01Direction, but one test isn't enough
00:06:03So for the next test, it's a little less serious, but I asked both of them if they could create me a Club Penguin clone
00:06:08using Three.js. Now, I'm not going to tell you which is which, but this is the first game that we got out
00:06:13You can see I have a create your penguin here and we're actually seeing the avatar up here change
00:06:17I can add on some caps here. So I've got a party hat a propeller a crown
00:06:21I'm gonna choose the propeller cap here and click play and if you actually know anything about Club Penguin
00:06:26I'd say this has done an okay job sort of mimicking the town center that we have, although the Pizza Parlor isn't over here
00:06:32There's normally a disco center here and you can't actually go into any of these buildings
00:06:35You can see none of these are solid yet
00:06:37But what it did quite well is if we go to the map we can go to different zones here
00:06:41So we have the ski village if I click and move around here
00:06:44I do think my penguin looks okay for something in Three.js where I gave it no assets or anything like that
00:06:49It's done this all sort of from its own training and we can actually go in and play the sled racing game here
00:06:54Which was my absolute favorite in Club Penguin and there's definitely a few things missing
00:06:59I must admit but uh, it's a pretty good first pass it did all of this in a single prompt
00:07:04I can even confirm that this version does have an attempt at the cart surfer game here
00:07:07Which was my favorite on Club Penguin, but this one seems a little broken
00:07:11You can sort of just go from side to side and now I think I'm under the map. It's also really dark now
00:07:15This is what the other model gave me and I want you to put in the comments
00:07:18Which model you think did a better job and if you can work out which model made each version?
00:07:22I'll tell you at the end of this test. You see, in this one
00:07:25we have the same color selectors that were in the prompt
00:07:27We also have the hat and accessory here. So I'll choose crown this time and we click start exploring
00:07:31The penguin's a little chunkier in this version, I must say. It's funnier looking, but again, I gave this no assets
00:07:36This is just from scratch in Three.js
00:07:38It has the same problem where you can sort of walk through your buildings
00:07:41But we do have the map and we have all of the different zones here
00:07:44So if I go over to the ski village
00:07:46I should be able to play the game so I can play sled racing here and to be honest
00:07:50This is pretty similar to the other version of the sled racing game that we had
00:07:53You can see we have some of the trees coming up in the distance here
00:07:56We have three lives and the life counter does actually work
00:07:58But it doesn't seem like we can jump in this version
00:08:01This model though did also give me a version of the cart surfer game
00:08:04But again, this one is a little weird
00:08:06Although I guess it's more functional because you can actually see things in this version and you can jump but uh
00:08:11I'm not sure where I'm actually surfing. There is no sort of rail, and overall, yeah, it's not the cart surfer game
00:08:17I remember from Club Penguin overall though
00:08:19I'm always impressed with what these models can do in a single prompt, especially with Three.js, and if you're wondering which model did which, the
00:08:25first one was Opus 4.6 and the second one was Codex 5.3, and I actually think I prefer the first one
00:08:30So I think Opus 4.6 wins on my Club Penguin test. Now, the final test
00:08:34I ran on these models was to see how good they are at UI design. These models are getting pretty good at this
00:08:38So I gave both of them a prompt to build me a landing page for an AI-only social media site
00:08:42similar to Moltbook, and the page should be snarky and emphasize that the future is for AI only, and do this all in a single
00:08:49HTML file. This is the result I got back from both the prompts, and I must admit I am very impressed with Codex here
00:08:55We have codex 5.3 on the left and opus 4.6 on the right and I just really like the way that codex
00:09:005.3 went with this site
00:09:01It's gone for a neo brutalism design and it's just a little more fun than some of the other vibe coded sites
00:09:06I think opus 4.6 here while being a good design just looks like a typical vibe coded app. It's done it very well
00:09:13I must admit but again
00:09:14It's got these purple gradients, and everything about this just screams that it was vibe coded, whereas I think the Codex
00:09:205.3 version looks like someone has had a bit more manual input maybe prompted it to go in that direction
00:09:25Even though I gave them the exact same prompt
00:09:27The only thing that I think opus 4.6 did a little bit better is the page is actually a little more functional
00:09:32You can see we have this sort of trending tab down here. We have rules top models of the week
00:09:36We have popular subreddits and also a popular feed whereas the codex 5.3 one is a little more bare
00:09:41And we sort of just have this trending tab down here and that is it
00:09:44So I'm definitely curious to see how these score on Design Arena. As they just came out
00:09:47they're not ranked yet, but at the moment GLM 4.7 is currently the leader
00:09:51So I want to see if 5.3 Codex or Opus 4.6 can take that crown. Overall
00:09:55Both of the models are pretty capable and it's quite hard to tell which one is going to be the best
00:09:59I think personally I'll probably lean towards 5.3 Codex
00:10:03But purely because I like the Codex app and just the overall experience I've had with prompting OpenAI models. If we want to compare
00:10:09them on the benchmarks though, as I mentioned in the intro, Codex has a massive advantage on Terminal Bench 2.0
00:10:15which is actually a pretty incredible leap, and that's basically the only benchmark that we can currently compare them on, as I don't think
00:10:21Anthropic were ready for OpenAI to release this model yet, and annoyingly they don't use the same benchmarks in their blog posts
00:10:28I did check Artificial Analysis, and so far they've only benchmarked Opus 4.6 for coding, but only the non-reasoning version as well
00:10:35But I guess it's pretty impressive that the non-reasoning version of 4.6 actually performs as well as the reasoning version of 4.5
00:10:42Opus. My personal feeling at the moment is that the Opus 4.5 to 4.6 jump is a little more marginal than 5.2 Codex to 5.3
00:10:49But I'm gonna have to use both of them and see how they feel in the real world
00:10:53There's a final few extras in both of these releases
00:10:55And one of the coolest ones is that both of the models apparently have improved cybersecurity capabilities, with OpenAI saying that GPT
00:11:015.3 Codex is the first model they classify as high-capability for cybersecurity-related tasks and the first they've actually directly trained to identify
00:11:09software vulnerabilities, and Anthropic basically says the same in this long blog post. Now, one feature of Codex that I'm really expecting to like
00:11:16Is it can actually be steered while it's working they say instead of waiting for a final output
00:11:21You can actually interact in real time asking questions and discussing approaches and steering it toward a solution
00:11:27And I just think this approach is a little bit better as I'm always debating whether I should let the model finish first or if I
00:11:32should interrupt it and stop what it's doing when I want it to make changes
00:11:35And I just think especially when we now have tasks that can run for significant lengths of time
00:11:40This is going to be a much nicer user experience. We can actually talk to it while it's working
00:11:44Finally, we have a few new features for Claude as well. The first one is in Claude Code
00:11:48You can now use agent teams to work on tasks together, aka sub-agents. Richard actually made a video on this earlier this week
00:11:55So check that out if you're interested in learning more, and there were also some cool API features, like Claude now has a compaction feature
00:12:01built into the API, so you can actually use that to summarize its context and perform longer-running tasks
00:12:06And there's also a new adaptive thinking mode
00:12:08So essentially you just let the model pick up on contextual clues to decide how much it should actually use its extended thinking
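Conceptually, compaction works by folding older turns into a summary once the conversation grows past a token budget, so a long-running task can keep going. The sketch below is my own illustration of the idea, not Anthropic's API — `summarize` is a hypothetical stand-in for a model-generated summary, and the real feature is handled for you by the API:

```typescript
// Conceptual sketch of context compaction for long-running agent tasks.
type Message = { role: string; content: string };

// Very rough token estimate (~4 characters per token).
const roughTokens = (text: string) => Math.ceil(text.length / 4);

function compact(
  history: Message[],
  budget: number,
  summarize: (older: Message[]) => string, // hypothetical summarizer
  keepRecent = 4, // always keep the newest few messages verbatim
): Message[] {
  const total = history.reduce((n, m) => n + roughTokens(m.content), 0);
  if (total <= budget || history.length <= keepRecent) return history;
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  // Replace everything old with a single summary message.
  return [
    { role: "user", content: `Summary of earlier context: ${summarize(older)}` },
    ...recent,
  ];
}
```

The trade-off is the usual one: you lose verbatim detail from early turns in exchange for room to keep working, which is why keeping the most recent messages intact matters.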
00:12:13There we go, coding models have come a seriously long way
00:12:16If you didn't know it's actually not even been a year since Claude code was released
00:12:20Let me know what you think of all of these models in the comments while you're there subscribe and as always see you in the next one
00:12:31(upbeat music)

Key Takeaway

The competition between OpenAI and Anthropic has reached a fever pitch with the near-simultaneous release of GPT 5.3 Codex and Claude Opus 4.6, both showing massive leaps in autonomous coding, long-context reasoning, and cybersecurity.

Highlights

OpenAI's GPT 5.3 Codex has surpassed Anthropic's Claude Opus 4.6 on Terminal Bench 2.0 by over 10%.

Claude Opus 4.6 introduces a 1-million-token context window (in beta) and improved agentic planning capabilities.

GPT 5.3 Codex is 25% faster than its predecessor and integrates reasoning with professional knowledge for complex execution.

Real-world coding tests showed Opus 4.6 followed best practices better for library functions, while Codex 5.3 excelled at new API features like tool approval.

Both models now feature high-tier cybersecurity capabilities, with Codex being directly trained to identify software vulnerabilities.

OpenAI introduced a 'steering' feature for Codex 5.3, allowing users to interact and guide the model in real-time while it works.

Timeline

The Benchmark Battle: Opus 4.6 vs. Codex 5.3

The video opens with the rapid-fire release of two major AI models, Anthropic's Claude Opus 4.6 and OpenAI's GPT 5.3 Codex. While Opus 4.6 initially set a high bar on Terminal Bench 2.0, the speaker notes that Codex 5.3 quickly overtook it by over 10%. This section highlights the intense pace of the industry, where a model's 'reign' at the top can last only minutes. The speaker expresses a personal preference for the feel of GPT 5.2 and sets the stage for a head-to-head comparison. This context is crucial for understanding the current state of 'state-of-the-art' AI benchmarks.

Model Specs and New Features Breakdown

The speaker provides a TL;DR of the technical improvements claimed by both AI labs. Anthropic focuses on reliable agentic tasks and a massive 1-million-token context window, though it comes with a tiered pricing structure for prompts over 200,000 tokens. OpenAI positions Codex 5.3 as an all-rounder that is 25% faster and combines the reasoning of GPT 5.2 with enhanced coding logic. This section matters because it moves past 'marketing speak' to outline the specific utility of long-running research and tool use tasks. The mention of specific pricing, such as $37.50 for a million output tokens, provides concrete data for developers.

Real-World Test: Upgrading the AI SDK

The first major test involves migrating a Convex agent package to the AI SDK v6, a task involving many breaking changes and complex type errors. Both models ran for approximately 40 minutes autonomously, with Codex 5.3 successfully completing the task in one prompt while Opus 4.6 required two. The speaker observes that Codex identified the mono-repo structure and root causes effectively before starting its coding run. Ultimately, both models produced working code with all tests passing, demonstrating high reliability for complex refactoring. This segment proves that modern models can handle significant, uninterrupted engineering tasks.

Code Review and Comparative Logic

A side-by-side code review reveals significant differences in how the two models approached the migration. While Codex correctly implemented a new 'tool approval' logic feature, it failed by creating a redundant custom function instead of using an existing SDK utility. In contrast, Opus 4.6 correctly utilized the built-in 'convert to model messages' function, showing better adherence to library standards. Interestingly, when the speaker asked Codex to review Opus's code, the model admitted that the Opus version had better architecture. This demonstrates the importance of manual code review even when AI produces 'working' code.

Creative Coding: The Club Penguin Challenge

To test creativity and asset-free development, the speaker prompts both models to build a Club Penguin clone using Three.js. Both versions included functional avatar customization, map navigation, and minigames like sled racing and cart surfing. The speaker notes that the Opus 4.6 version felt more faithful to the original game's layout, while the Codex version was a bit 'chunkier' and funny-looking. Despite minor bugs like falling through the map or missing assets, the ability to build a 3D environment in a single prompt is highlighted as impressive. This test serves as a proxy for how these models handle spatial reasoning and game logic.

UI Design Arena and Social Media Landing Pages

The final qualitative test focuses on UI design, where the models are asked to build a 'snarky' landing page for an AI-only social media site. Codex 5.3 opted for a unique 'neo-brutalism' design, while Opus 4.6 produced a more standard, 'vibe-coded' modern layout with purple gradients. While Opus provided more functional page elements like trending tabs and rules, the speaker preferred the distinct aesthetic of Codex's design. The section references the 'Design Arena' rankings, where GLM 4.7 is the current leader, awaiting the updated scores for these new releases. This highlights the growing role of LLMs in front-end design and aesthetic decision-making.

Cybersecurity, Real-Time Steering, and Final Verdict

The video concludes with a look at advanced features like high-capability cybersecurity training and real-time model steering. The speaker highlights OpenAI's new ability to discuss and guide the model while it is mid-task as a major user experience improvement. Anthropic's 'adaptive thinking mode' and 'sub-agent' teams are also mentioned as powerful new tools for complex workflows. While the benchmark data heavily favors Codex 5.3, the speaker admits that real-world 'feel' is still marginal and requires more testing. The closing remarks emphasize that these models have evolved significantly in less than a year since the release of Claude Code.
