OpenAI is Winning... (Opus 4.6 + Codex 5.3)

Transcript

00:00:00Anthropic just released Claude Opus 4.6, and it achieves the highest score on Terminal Bench 2.0 of any model

00:00:06Sorry to interrupt your programming here

00:00:10But it turns out GPT 5.3 Codex just came out, and that actually beats Opus 4.6 on Terminal Bench by over 10%

00:00:16So it seems like Anthropic's reign was genuinely only a few minutes. The competition between these two is really heating up

00:00:23So I'm super curious to see what's new in these models and find out which one feels the best to use, as lately for me

00:00:29It's actually GPT 5.2 that's felt better

00:00:31So I'm curious to see if Claude can claw back some of their advantage or if OpenAI were ready with GPT 5.3 Codex

00:00:37First up, a quick TL;DR on what's new in these models. As we all know, they're gonna be better than their last versions on the benchmarks,

00:00:48Which I'll show at the end, but has anything else actually changed about the models?

00:00:52Well for Opus

00:00:53They're actually claiming they can plan more carefully and sustain agentic tasks for longer and can operate more reliably in larger code bases with better

00:01:00Code review and debugging skills to catch its own mistakes

00:01:02Now these are actually a few of the things that I found Opus was weakest at compared to GPT 5.2. In my experience

00:01:08It typically got started coding faster and usually just made a few more mistakes

00:01:12Whereas GPT 5.2 actually took a little longer to get coding and understood the context of the repo

00:01:17So hopefully these changes do improve Opus here, and it's also probably going to be improved by its new 1 million token

00:01:23context window

00:01:24Although it is mentioned this is in beta and similar to other providers

00:01:27It will also cost you extra with prompts exceeding 200,000 tokens costing you $10 for a million input tokens and

00:01:33$37.50 for a million output tokens. Moving on to Codex 5.3

00:01:38OpenAI are stating that this model advances the frontier coding performance of GPT 5.2 Codex and the reasoning and professional knowledge

00:01:45Capabilities of GPT 5.2 together in one model, which is also 25% faster

00:01:51This should enable it to take on long-running tasks that involve research, tool use, and complex execution

00:01:57So it really seems that they push this model to be a bit of an all-rounder now with GPT 5.2 knowledge and improved coding capabilities

00:02:03But all of that is just marketing speak

00:02:05So let's put these models through some real-world tests. The first one I was trying was updating a Convex agent package to support the AI

00:02:11SDK v6. I've been really liking Convex as my database lately, and this package essentially just helps link the AI SDK with the database

00:02:19So you get really good performance, but the problem is it wasn't upgraded to the latest version

00:02:23You can see here on Vercel's documentation that the migration from v5 to v6 is not an easy migration to make

00:02:28They made a lot of breaking changes and changed a lot of types

00:02:32So what I did was make a basic chat app in Convex that actually worked using the agent package

00:02:36But then I upgraded the packages to v6 and I got a load of build and type errors

00:02:40I simply asked the models to fix them. You can see the prompt I used here in Codex

00:02:44I said I'm building a chat app with Convex and I had a working version

00:02:46But then I upgraded to v6 and I need to fix the type and build errors

00:02:50I passed it in the migration guide so it can use that as its context if it wants, and I said I want all of the tests

00:02:55Passing. Avoid TypeScript hacks like 'as any' where possible, as I often see a lot of the models do this

00:02:59So I specifically wanted to say please don't, as there's quite a lot of complex types in this AI

00:03:03SDK. Now, since we're already on Codex, we can see how 5.3 Codex performed. It started off by

00:03:09Understanding the repo. You can see it saw it was a monorepo with that packages/agent that we had. Then it identified a few

00:03:15root causes and some packages that needed to be upgraded, and listed out exactly how it was going to work through this task. After that,

00:03:22It just got started coding, made a few changes, would run a build every so often, and just worked on

00:03:27Fixing all of those type errors. Overall, it actually ran for about 40 minutes completely uninterrupted,

00:03:32Which I was super impressed with. You can see it actually added

00:03:35545 lines of code and removed 111. Over in Claude Code,

00:03:39I gave it a copy of the exact same project and used the exact same prompt, and again this worked through the task for around 40

00:03:44Minutes, and it did have a few build errors when I actually tried to start it

00:03:48So I did have to send one more prompt to actually get Opus to give me a working version of the code

00:03:53But again, it was a pretty similar experience to what we saw in Codex

00:03:56But the one thing I must say: I do really like the Codex UI. I prefer it to a terminal UI. I'm sorry

00:04:02Anyways, I can confirm after one prompt with Codex 5.3 and two prompts with Opus 4.6

00:04:06They both managed to upgrade the agent package to the new version of the AI SDK with no type errors,

00:04:11No build errors, and all of the tests passing, but they did handle it in different ways. Now here

00:04:16I have Codex on the left and the changes Opus made on the right

00:04:19You can actually see Opus made a few more changes to the project compared to Codex

00:04:23They actually handled a few of the features a little bit differently

00:04:25One of the things that Codex did really well is actually have this tool approval request logic here

00:04:30This was something that was new in the AI SDK v6. I can't seem to find any mention of this in Opus

00:04:35It seems like it sort of just passed it over and didn't actually sort of add it into the code

00:04:40But one thing that I think Codex did really poorly was, if we go over to the UI messages, it actually added in its

00:04:46Completely own function for converting a UI message to a model message. Now,

00:04:50If you don't know, the AI SDK actually just has a function to do this for you, and it should definitely use that instead

00:04:57You can actually see side-by-side here that Opus did this correctly

00:05:00It just used the convertToModelMessages function that comes from the AI SDK

00:05:04And what this means is in the future if they do upgrade this package

00:05:07I'm not going to have to worry about making any changes to my own version here as I should just be using the one that comes

00:05:13from the package

00:05:14So this is a little bit of an annoying thing and a bit of a red flag to me when I was looking through this code

00:05:19But to get a second opinion on my code review

00:05:20I actually passed the diff back into Codex 5.3 and asked it to do a review along with me, and you see it listed out the

00:05:26Advantages and disadvantages of each approach here

00:05:29But down at the bottom it gave me a conclusion, and Codex 5.3 actually prefers the Opus chat version, which has a better migration

00:05:36Architecture. If it had to pick one as the better base to ship safely,

00:05:39It would choose Opus chat, then pull over Codex chat's approval and denial handling

00:05:43So that extra function we saw for the tool approval request

00:05:46It says simply just take that from the Codex version and add it to the Opus version, and we have a better migration

00:05:51So it's at least nice to see that Codex 5.3 isn't biased there and it didn't choose itself

00:05:55But I must admit the way that both of these handled the migration was pretty similar and I could probably prompt them to go in the right

00:06:01Direction, but one test isn't enough

00:06:03So for the next test, it's a little less serious, but I asked both of them if they could create me a Club Penguin clone

00:06:08Using Three.js. Now, I'm not going to tell you which is which, but this is the first game that we got out

00:06:13You can see I have a create your penguin here and we're actually seeing the avatar up here change

00:06:17I can add on some caps here. So I've got a party hat a propeller a crown

00:06:21I'm gonna choose the propeller cap here and click play and if you actually know anything about Club Penguin

00:06:26I'd say this has done an okay job sort of mimicking the town center that we have, although the pizza isn't over here

00:06:32There's normally a disco center here and you can't actually go into any of these buildings

00:06:35You can see none of these are solid yet

00:06:37But what it did quite well is if we go to the map we can go to different zones here

00:06:41So we have the ski village if I click and move around here

00:06:44I do think my penguin looks okay for something in Three.js where I gave it no assets or anything like that

00:06:49It's done this all sort of from its own training and we can actually go in and play the sled racing game here

00:06:54Which was my absolute favorite in Club Penguin and there's definitely a few things missing

00:06:59I must admit, but, uh, it's a pretty good first pass. It did all of this in a single prompt

00:07:04I can even confirm that this version does have an attempt at the cart surfer game here

00:07:07Which was my favorite on Club Penguin, but this one seems a little broken

00:07:11You can sort of just go from side to side, and now I think I'm under the map. It's also really dark. Now,

00:07:15This is what the other model gave me, and I want you to put in the comments

00:07:18Which model you think did a better job and if you can work out which model made each version.

00:07:22I'll tell you at the end of this test. You see in this one

00:07:25We have the same color selectors that were in the prompt

00:07:27We also have the hat and accessory here. So I'll choose crown this time and we click start exploring

00:07:31The penguin's a little chunkier in this version. I must say it's funnier looking, but again, I gave this no assets

00:07:36This is just from scratch in Three.js

00:07:38It has the same problem where you can sort of walk through your buildings

00:07:41But we do have the map and we have all of the different zones here

00:07:44So if I go over to the ski village

00:07:46I should be able to play the game so I can play sled racing here and to be honest

00:07:50This is pretty similar to the other version of the sled racing game that we had

00:07:53You can see we have some of the trees coming up in the distance here

00:07:56We have three lives and the life counter does actually work

00:07:58But it doesn't seem like we can jump in this version

00:08:01This model though did also give me a version of the cart surfer game

00:08:04But again, this one is a little weird

00:08:06Although I guess it's more functional because you can actually see things in this version and you can jump, but, uh,

00:08:11I'm not sure where I'm actually surfing. There is no sort of rail, and overall, yeah, it's not the Cart Surfer game

00:08:17I remember from Club Penguin. Overall though,

00:08:19I'm always impressed with what these models can do in a single prompt, especially with Three.js. And if you're wondering which model did which, the

00:08:25First one was Opus 4.6 and the second one was Codex 5.3, and I actually think I prefer the first one

00:08:30So I think Opus 4.6 wins on my Club Penguin test. Now, the final test

00:08:34I ran on these models was to see how good they are at UI design. These models are getting pretty good at this

00:08:38So I gave both of them a prompt to build me a landing page for an AI-only social media site,

00:08:42So similar to Moltbook, and the page should be snarky and emphasize that it's the future and for AI only, and do this all in a single

00:08:49HTML file. This is the result I got back from both the prompts then, and I must admit I am very impressed with Codex here

00:08:55We have Codex 5.3 on the left and Opus 4.6 on the right, and I just really like the way that Codex

00:09:005.3 went with this site

00:09:01It's gone for a neo-brutalism design and it's just a little more fun than some of the other vibe-coded sites

00:09:06I think Opus 4.6 here, while being a good design, just looks like a typical vibe-coded app. It's done it very well,

00:09:13I must admit, but again,

00:09:14It's got these purple gradients and everything about this just screams that it was vibe-coded, whereas I think the Codex

00:09:205.3 version looks like someone has had a bit more manual input, maybe prompted it to go in that direction,

00:09:25Even though I gave them the exact same prompt

00:09:27The only thing that I think Opus 4.6 did a little bit better is the page is actually a little more functional

00:09:32You can see we have this sort of trending tab down here. We have rules, top models of the week,

00:09:36We have popular subreddits and also a popular feed, whereas the Codex 5.3 one is a little more bare

00:09:41And we sort of just have this trending tab down here and that is it

00:09:44So I'm definitely curious to see how these score on Design Arena. They just came out,

00:09:47So they're not ranked yet, but at the moment GLM 4.7 is currently the leader

00:09:51So I want to see if 5.3 Codex or Opus 4.6 can take that crown. Overall,

00:09:55Both of the models are pretty capable and it's quite hard to tell which one is going to be the best

00:09:59I think personally I'll probably lean towards 5.3 Codex

00:10:03But purely because I like the Codex app and just the overall experience I've had with prompting OpenAI models. If we want to compare

00:10:09Them on the benchmarks though, as I mentioned in the intro, Codex has a massive advantage on Terminal Bench 2.0,

00:10:15Which is actually a pretty incredible leap, and that's basically the only benchmark that we can currently compare, as I don't think

00:10:21Anthropic were ready for OpenAI to release this model yet, and annoyingly they don't use the same benchmarks in their blog posts

00:10:28I did check Artificial Analysis, and so far they've only benchmarked Opus 4.6 for coding, but only the non-reasoning version as well

00:10:35But I guess it's pretty impressive that the non-reasoning version of 4.6 actually performs as well as the reasoning version of 4.5

00:10:42Opus. My personal feeling at the moment is that Opus 4.5 to 4.6 is a little more marginal than 5.2 Codex to 5.3,

00:10:49But I'm gonna have to use both of them and see how they feel in the real world

00:10:53There are a final few extras in both of these releases

00:10:55And one of the coolest ones is that both of the models apparently have improved cybersecurity capabilities, with OpenAI saying that GPT

00:11:015.3 Codex is the first model they classify as high capability for cybersecurity-related tasks and the first they've actually directly trained to identify

00:11:09Software vulnerabilities, and Anthropic basically says the same in this long blog post. Now, one feature of Codex that I'm really expecting to like

00:11:16Is it can actually be steered while it's working. They say instead of waiting for a final output,

00:11:21You can actually interact in real time, asking questions, discussing approaches, and steering it toward a solution

00:11:27And I just think this approach is a little bit better as I'm always debating whether I should let the model finish first or if I

00:11:32Should interrupt it, stop what it's doing, when I want it to make changes

00:11:35And I just think especially when we now have tasks that can run for significant lengths of time

00:11:40This is going to be a much nicer user experience. We can actually talk to it while it's working

00:11:44Finally, we have a few new features for Claude as well. The first one is in Claude Code:

00:11:48You can now use agent teams to work on tasks together, aka sub-agents. Richard actually made a video on this earlier this week

00:11:55So check that out if you're interested in learning more. There were also some cool API features, like Claude now has a compaction feature

00:12:01Built into the API, so you can actually use that to summarize its context and perform longer-running tasks

00:12:06And there's also a new adaptive thinking mode

00:12:08So essentially you just let the model pick up on contextual clues to see how much it should actually use its extended thinking

00:12:13There we go. Coding models have come a seriously long way

00:12:16If you didn't know, it's actually not even been a year since Claude Code was released

00:12:20Let me know what you think of all of these models in the comments. While you're there, subscribe, and as always, see you in the next one

00:12:31(upbeat music)

Key Takeaway

The competition between OpenAI and Anthropic has reached a fever pitch with the near-simultaneous release of GPT 5.3 Codex and Claude Opus 4.6, both showing massive leaps in autonomous coding, long-context reasoning, and cybersecurity.

Highlights

OpenAI's GPT 5.3 Codex has surpassed Anthropic's Claude Opus 4.6 on Terminal Bench 2.0 by over 10%.
Claude Opus 4.6 introduces a 1-million-token context window (in beta) and improved agentic planning capabilities.
GPT 5.3 Codex is 25% faster than its predecessor and integrates reasoning with professional knowledge for complex execution.
Real-world coding tests showed Opus 4.6 followed best practices better for library functions, while Codex 5.3 excelled at new API features like tool approval.
Both models now feature high-tier cybersecurity capabilities, with Codex being directly trained to identify software vulnerabilities.
OpenAI introduced a 'steering' feature for Codex 5.3, allowing users to interact and guide the model in real-time while it works.

Timeline

The Benchmark Battle: Opus 4.6 vs. Codex 5.3

The video opens with the back-to-back release of two major AI models, Anthropic's Claude Opus 4.6 and OpenAI's GPT 5.3 Codex. Opus 4.6 initially posted the highest score on Terminal Bench 2.0, but the speaker notes that Codex 5.3 overtook it by more than 10%, so Anthropic's 'reign' at the top lasted only minutes. The speaker says GPT 5.2 has recently felt better to use than Opus and sets up a head-to-head comparison of the two new models.

Model Specs and New Features Breakdown

The speaker provides a TL;DR of the improvements claimed by both AI labs. Anthropic focuses on more careful planning, longer-running agentic tasks, and a 1-million-token context window (in beta), with higher pricing for prompts over 200,000 tokens: $10 per million input tokens and $37.50 per million output tokens. OpenAI positions Codex 5.3 as an all-rounder that is 25% faster and combines the coding performance of GPT 5.2 Codex with the reasoning and professional knowledge of GPT 5.2. The speaker treats these claims as 'marketing speak' and moves on to hands-on tests to check them.
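The long-context pricing quoted in the transcript ($10 per million input tokens and $37.50 per million output tokens for prompts over 200,000 tokens) can be turned into a quick cost estimate. The sketch below is illustrative only: the function name and the example token counts are made up, and it assumes the quoted rates apply to every token in a request that crosses the threshold, which the video does not spell out.

```typescript
// Long-context rates as quoted in the video for prompts over 200,000 tokens.
// Rates for shorter prompts are not given in the source, so they are not modeled.
const LONG_CONTEXT_THRESHOLD_TOKENS = 200_000;
const INPUT_USD_PER_MILLION = 10.0;
const OUTPUT_USD_PER_MILLION = 37.5;

// Hypothetical helper: estimates the cost of one long-context request.
// Assumption: the quoted rates apply to all tokens in the request.
function longContextCostUsd(inputTokens: number, outputTokens: number): number {
  if (inputTokens <= LONG_CONTEXT_THRESHOLD_TOKENS) {
    throw new RangeError("prompt does not exceed the long-context threshold");
  }
  const inputCost = (inputTokens / 1_000_000) * INPUT_USD_PER_MILLION;
  const outputCost = (outputTokens / 1_000_000) * OUTPUT_USD_PER_MILLION;
  return inputCost + outputCost;
}

// Made-up request: 500,000 input tokens and 20,000 output tokens.
// 0.5 * $10 + 0.02 * $37.50 = $5.00 + $0.75
const exampleCostUsd = longContextCostUsd(500_000, 20_000);
console.log(exampleCostUsd.toFixed(2)); // prints "5.75"
```

At these rates the input side dominates for large prompts, so a 40-minute agentic run that repeatedly re-sends a large context could add up quickly.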

Real-World Test: Upgrading the AI SDK

The first major test involves migrating a Convex agent package to the AI SDK v6, a task involving many breaking changes and complex type errors. Both models ran for approximately 40 minutes autonomously, with Codex 5.3 completing the task in one prompt while Opus 4.6 required two. The speaker observes that Codex identified the monorepo structure and root causes before starting its coding run. Ultimately, both models produced working code with no type errors, no build errors, and all tests passing. This segment suggests that current models can handle significant, uninterrupted engineering tasks.
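The prompt in the transcript asked the models to avoid TypeScript hacks such as 'as any' while fixing the type errors. A common alternative is a type guard that narrows an unknown value safely. The sketch below uses made-up message-part shapes for illustration; they are not the AI SDK's real types, and the function names are hypothetical.

```typescript
// Hypothetical message-part shapes, for illustration only (not the AI SDK's types).
type TextPart = { type: "text"; text: string };
type ToolCallPart = { type: "tool-call"; toolName: string; input: unknown };
type MessagePart = TextPart | ToolCallPart;

// Type guard: checks the shape at runtime and narrows the static type,
// instead of silencing the compiler with `as any`.
function isTextPart(value: unknown): value is TextPart {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return candidate.type === "text" && typeof candidate.text === "string";
}

// Concatenates the text of every text part and ignores everything else.
function collectText(parts: readonly unknown[]): string {
  return parts
    .filter(isTextPart)
    .map((part) => part.text)
    .join("");
}

const exampleParts: MessagePart[] = [
  { type: "text", text: "Hello, " },
  { type: "tool-call", toolName: "search", input: { query: "penguins" } },
  { type: "text", text: "world" },
];
console.log(collectText(exampleParts)); // prints "Hello, world"
```

The guard keeps the compiler's checks intact after a breaking type change, whereas an 'as any' cast would hide exactly the kind of mismatch the migration is supposed to surface.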

Code Review and Comparative Logic

A side-by-side code review reveals differences in how the two models approached the migration. Codex implemented the new 'tool approval' request logic introduced in AI SDK v6, which the speaker could not find in the Opus version, but it also wrote its own function for converting UI messages to model messages instead of using the SDK's existing utility. Opus 4.6 used the built-in convertToModelMessages function, which means future SDK upgrades will not require maintaining a custom copy. When the speaker asked Codex 5.3 to review both diffs, it preferred the Opus version's migration architecture and recommended pulling over the Codex version's approval and denial handling. This shows the value of manual code review even when AI produces 'working' code.
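To make the 'tool approval' idea concrete, here is a minimal sketch of approval and denial handling: a tool call runs only after an explicit approval, and a denial is recorded as a result rather than dropped. All type and function names here are hypothetical and chosen for illustration; this is not the AI SDK v6 API and not the code either model produced.

```typescript
// Hypothetical shapes for illustration; not the AI SDK v6 API.
type ToolCall = { id: string; toolName: string; input: unknown };
type ApprovalDecision = { toolCallId: string; approved: boolean; reason?: string };
type ToolResult =
  | { toolCallId: string; status: "executed"; output: string }
  | { toolCallId: string; status: "denied"; reason: string };

// Runs the tool only when the matching decision approves it.
// A denial still yields a result, so the conversation can record what happened.
function resolveToolCall(
  call: ToolCall,
  decision: ApprovalDecision,
  execute: (call: ToolCall) => string,
): ToolResult {
  if (decision.toolCallId !== call.id) {
    throw new Error("approval decision does not match tool call");
  }
  if (!decision.approved) {
    return {
      toolCallId: call.id,
      status: "denied",
      reason: decision.reason ?? "denied by user",
    };
  }
  return { toolCallId: call.id, status: "executed", output: execute(call) };
}

// Made-up example: the user denies a file deletion.
const deleteCall: ToolCall = { id: "call-1", toolName: "deleteFile", input: { path: "notes.txt" } };
const denied = resolveToolCall(
  deleteCall,
  { toolCallId: "call-1", approved: false, reason: "too risky" },
  () => "deleted",
);
console.log(denied.status); // prints "denied"
```

Returning a result for the denied case is a design choice in this sketch: it gives the model something to react to instead of leaving the tool call unanswered.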

Creative Coding: The Club Penguin Challenge

To test creativity and asset-free development, the speaker prompts both models to build a Club Penguin clone using Three.js. Both versions included avatar customization, map navigation between zones, and minigames like sled racing and Cart Surfer. The speaker preferred the first version, later revealed to be from Opus 4.6, and described the penguin in the Codex 5.3 version as 'chunkier' and funnier looking. Both versions had bugs, such as buildings the player can walk through and a Cart Surfer minigame that was dark and dropped the player under the map in the Opus version, but the ability to build a 3D environment from a single prompt with no assets is highlighted as impressive.

UI Design Arena and Social Media Landing Pages

The final qualitative test focuses on UI design, where the models are asked to build a 'snarky' landing page for an AI-only social media site in a single HTML file. Codex 5.3 opted for a 'neo-brutalism' design, while Opus 4.6 produced a more typical 'vibe-coded' layout with purple gradients. Opus provided more functional page elements, such as rules, top models of the week, and a popular feed, but the speaker preferred the distinct aesthetic of Codex's design. The section references the Design Arena rankings, where GLM 4.7 is the current leader and the two new releases are not yet ranked.

Cybersecurity, Real-Time Steering, and Final Verdict

The video concludes with a look at extra features in both releases, including improved cybersecurity capabilities and real-time model steering. The speaker highlights OpenAI's new ability to discuss and guide the model while it is mid-task as a major user experience improvement. Anthropic's agent teams in Claude Code, the API's compaction feature, and the new adaptive thinking mode are also mentioned. On Terminal Bench 2.0 the data favors Codex 5.3, and the speaker's impression is that the step from Opus 4.5 to 4.6 is more marginal than the step from 5.2 Codex to 5.3, though more real-world use is needed. The closing remarks note that it has been less than a year since Claude Code was released.
