It’s Broken… The Claude Code Vs Codex Debate Is Finally Over
AAI LABS
Transcript
00:00:00For a long time, everyone's go-to model for coding was Claude.
00:00:03Not only because it performed well, but because there weren't other options on the same tier.
00:00:07Then GPT models stepped up and closed the gap, especially with the release of GPT 5.5, which
00:00:12brought it down to almost none.
00:00:14To compare the two we needed to put them in the environments designed best for them, which
00:00:18means their own CLIs.
00:00:19So we're putting Opus 4.7 and GPT 5.5 to the test, to see how they perform against each
00:00:25other.
00:00:26We'll test them across 9 categories to find out which one actually comes out on top and
00:00:29by the end, you'll know which one earns the spot in your workflows.
00:00:33Usability is where Claude Code starts breaking down for us.
00:00:36We've been using it for most of our tasks, coding and non-coding, but it was only good
00:00:40until the 2.1.0 update.
00:00:43After that, things started going downhill for Claude Code.
00:00:46The UI is the most frustrating part because it has the biggest impact on the experience.
00:00:50The terminal glitches, rendering breaks, and a lot of what used to feel polished now feels
00:00:55off.
00:00:56It used to be one of the best TUIs, but only until it started being vibe coded.
00:00:59It now feels more broken, with multiple bugs like rendering issues and cache leaks,
00:01:03and we weren't the only ones complaining about them.
00:01:05The bigger problem is that they removed the dangerously skip permissions mode and replaced
00:01:09it with auto mode by default.
00:01:11We used to run bypass permission mode for most of our tasks, with hooks set up for whichever
00:01:15files we didn't want Claude to touch.
00:01:17Now it asks for permission even in that mode. We gave Claude a prompt to create a skill,
00:01:22switched to another Claude session to do something else, and only later found that the skill creation
00:01:27had been blocked by a permission prompt for writing to the .claude folder the entire time.
00:01:32We came back expecting the skills to be created, and it was just sitting there waiting.
00:01:36Codex handles this better because its YOLO mode doesn't ask for any permissions the way
00:01:40Claude Code's auto mode does.
00:01:42The CLI is built in Rust, so the UI is much smoother than Claude Code's React-based setup,
00:01:47and even after a long session, nothing breaks.
00:01:49Personality configuration is another spot where Codex pulls ahead.
00:01:53We can set the personality to a more direct and concise language.
00:01:56This matters because GPT 5.5 is significantly more sycophantic and agreeable with every prompt
00:02:02than Opus 4.7 is.
00:02:04Changing the personality in Codex overrides that default behavior in the model.
00:02:08To make Opus 4.7 direct, we have to rely on instructions in CLAUDE.md, while Codex does
00:02:14that with just a setting change.
00:02:16Pre-installed skills are another difference.
00:02:18Codex ships with many that Claude Code doesn't have, including the agent browser skill.
00:02:22That matters for anyone building apps, because in Codex we don't need to explicitly connect
00:02:26MCPs for browser verification.
00:02:29It does that automatically after implementing any feature.
00:02:31It also has a built-in skill creator, so when we want a new skill, it generates a complete
00:02:35one with the right structure and reference files.
00:02:38In Claude, we'd need to install the skill creator separately to get a properly structured
00:02:42skill.
00:02:43Otherwise, it just writes an MD file.
00:02:45Now there are still two things Claude Code does better.
00:02:47Codex doesn't offer rewinding, which is the feature we use the most, so not having it is
00:02:51a real downside.
00:02:52Claude Code also lets us view its thinking by expanding it with Ctrl+O, which Codex doesn't
00:02:57do well.
00:02:58Viewing the reasoning is helpful because we can correct the approach mid-task instead of
00:03:02waiting for the implementation to finish and then redoing it.
00:03:05So looking at how Claude Code's user experience degrades with each new update, Codex gets a
00:03:10point for usability.
00:03:11On cost, Claude Code is the more expensive tool by a wide margin.
00:03:15Not in terms of actual prices, but in how much usage we get for the same price.
00:03:19Claude Code is not available on the free tier at all and is only available starting from
00:03:23the Pro and Max plans.
00:03:24The two tools' paid plans have nearly identical pricing.
00:03:26The Pro plan is basically unusable for any application of reasonable scale because it hits its
00:03:30limits on just a few tasks.
00:03:32We can't even properly use Opus 4.7 for any meaningful task on Pro.
00:03:36The limits run out very quickly even on the Max plan that we use.
00:03:39Codex is in a better position from the start.
00:03:41It's available even on the free plan with limited usage.
00:03:44Both use a similar 5-hour window mechanism, so to see which one gets more work done we
00:03:49ran them on tasks of the same scale.
00:03:51Claude Code already has a context command that visualizes how many tokens a session has used,
00:03:56but Codex doesn't have a built-in equivalent, so we had to find a workaround for the comparison.
00:04:00Both tools store their sessions as JSON files, just organized differently.
00:04:04So we built a small tool that reads them and counts the tokens used in each session.
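A minimal sketch of what a session token counter like the one described could look like. The video does not show the actual tool, so everything here is an assumption: the file layout, the `usage` key, and the `input_tokens` / `output_tokens` field names are hypothetical and would need to be adjusted to whatever each CLI actually writes.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Iterator


def load_records(path: Path) -> Iterator[Any]:
    """Yield JSON values from a .jsonl file (one per line) or a plain .json file."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
    else:
        data = json.loads(text)
        yield from (data if isinstance(data, list) else [data])


def find_usage(node: Any) -> Iterator[dict]:
    """Recursively yield every dict stored under a 'usage' key (assumed name)."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "usage" and isinstance(value, dict):
                yield value
            else:
                yield from find_usage(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_usage(item)


def sum_usage(records: Iterable[Any]) -> int:
    """Add up the assumed input/output token fields across all records."""
    total = 0
    for record in records:
        for usage in find_usage(record):
            total += int(usage.get("input_tokens") or 0)
            total += int(usage.get("output_tokens") or 0)
    return total


def count_session_tokens(path: Path) -> int:
    """Count tokens recorded in a single session file."""
    return sum_usage(load_records(path))
```

Searching for the usage block recursively keeps the counter independent of how each tool nests its records, which is the "organized differently" part of the problem.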
00:04:08On the same app and a similar level of debugging, Opus 4.7 burned through 173,000 tokens while
00:04:15GPT 5.5 used only 82,000.
00:04:18This is because GPT 5.5 gets work done in fewer tokens and far fewer retries.
00:04:23So Codex lasted significantly longer and turned out to be far more cost-efficient for the same work.
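To put the two numbers side by side, here is the arithmetic as a short snippet. It uses only the token counts quoted above and says nothing about how either plan actually meters usage.

```python
opus_tokens = 173_000  # Opus 4.7, as reported above
gpt_tokens = 82_000    # GPT 5.5, as reported above

ratio = opus_tokens / gpt_tokens        # how many times more tokens Opus used
savings = 1 - gpt_tokens / opus_tokens  # fraction of tokens GPT 5.5 saved

print(f"Opus 4.7 used {ratio:.2f}x the tokens; GPT 5.5 used {savings:.1%} fewer.")
# prints: Opus 4.7 used 2.11x the tokens; GPT 5.5 used 52.6% fewer.
```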
00:04:28But before we move forward, let's have a word from our sponsor, Stream.
00:04:32You're building an app and your users need to talk, stream, and connect.
00:04:35You try handling that yourself, and 3 months later you're still debugging instead of shipping.
00:04:39Stream skips all of that.
00:04:40Stream gives you everything out of the box from in-app chat and video calling to activity
00:04:44feeds and AI moderation so you're shipping features, not building infrastructure from scratch.
00:04:49We're talking WhatsApp-style messaging, Zoom-style video calls, and Instagram-style feeds all built in.
00:04:55What really stands out is Stream's new launch, Vision Agents.
00:04:58You can build intelligent AI agents that see, hear, and act on live video and audio, all
00:05:02in Python with just a few lines of code.
00:05:05Everything runs on a global edge network for low latency everywhere.
00:05:08From startups to scaling apps, leading platforms across social, fitness, and community rely
00:05:13on Stream to power over a billion end users.
00:05:16If you're a developer building the next big app, Stream scales with you from day one.
00:05:20Start for free at getstream.io, links in the pinned comment.
00:05:24The real test for the two models is how they build products.
00:05:27As we said before, GPT 5.5 is faster and consumes fewer tokens, so it ships working apps quicker.
00:05:33Opus 4.7 spends more tokens on thinking, plans deeper, and iterates on all aspects of the
00:05:38app at the same time.
00:05:40Planning was the first thing we wanted to test.
00:05:42We've been using Claude Code's planning mode for a long time.
00:05:45It covers most things, has some flaws, but is still quite usable.
00:05:48So we wanted to see how GPT 5.5 performs at planning, because OpenAI claims it does better
00:05:53at planning tasks and executing them.
00:05:55We enabled plan mode in Codex, opened it in a folder that already contained a backend for an app,
00:06:00an API built using FastAPI, and asked it to build the frontend for it.
00:06:04It explored the project thoroughly and asked a few questions, but the questions were fairly
00:06:08simple.
00:06:09It could have gone deeper into how we wanted the frontend to look, because for frontend
00:06:13work, that matters.
00:06:14The plan it produced was very simple.
00:06:16It included a summary of the main flow, the key changes, the pages to add, and how to test
00:06:20them.
00:06:21The one thing it did well was clearly separating its assumptions, so we knew exactly what it
00:06:25was taking for granted.
00:06:26We told it to proceed and it finished in about 8 minutes.
00:06:28The same task on Claude Code took 24 minutes.
00:06:31But Opus 4.7's plan was much more in-depth, considered more aspects of the application,
00:06:36and even pulled in shadcn/ui to improve the user experience.
00:06:39So Opus 4.7 does better in terms of planning.
00:06:42Next, we wanted to test both on a greenfield app.
00:06:45We gave them the same prompt: create a monorepo with a Python Flask backend and
00:06:50a Next.js frontend, along with the full pipeline and key requirements for how the app should
00:06:55work.
00:06:56Claude Code switched into planning mode by itself because of its harness design.
00:06:59Codex did not switch into planning mode and instead started implementing directly.
00:07:04It finished much faster than Claude Code, which took around 16 minutes because of the planning
00:07:08step.
00:07:09GPT 5.5's version of the app had a much simpler UI and mainly focused on making sure the app
00:07:14worked.
00:07:15It didn't work properly at the start, so we debugged it iteratively.
00:07:17One thing we noticed was that the interview prompts were hardcoded because we hadn't provided
00:07:22any API key.
00:07:23The prompt specified using the Gemini API as a backend, but since no key was available,
00:07:27it implemented a fallback so the app wouldn't crash completely.
00:07:30Codex had actually used local follow-up questions without any explicit prompting.
00:07:35We like this because fallback mechanisms like these are useful in production since they prevent
00:07:39crashes.
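The fallback behavior described here can be sketched roughly as follows. This is not the code Codex generated; the function names, the canned questions, and the `GEMINI_API_KEY` environment variable name are all hypothetical, and `call_model` stands in for whatever client would actually call the Gemini API.

```python
import os
from typing import Callable, Optional

# Canned follow-ups used when no model is reachable (illustrative content).
FALLBACK_QUESTIONS = [
    "Can you walk me through a concrete example?",
    "What was the hardest part of that, and why?",
    "What would you do differently next time?",
]


def local_question(answer: str) -> str:
    """Pick a canned follow-up deterministically from the answer's length."""
    return FALLBACK_QUESTIONS[len(answer) % len(FALLBACK_QUESTIONS)]


def get_follow_up(
    answer: str,
    call_model: Optional[Callable[[str], str]] = None,
    api_key: Optional[str] = None,
) -> str:
    """Use the model when a key and client exist, otherwise fall back locally."""
    key = api_key if api_key is not None else os.environ.get("GEMINI_API_KEY")
    if not key or call_model is None:
        return local_question(answer)
    try:
        return call_model(answer)
    except Exception:
        # A failed API call degrades to the local path instead of crashing.
        return local_question(answer)
```

The point of the design is that a missing key and a failing request both land on the same local path, so the interview flow keeps running either way.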
00:07:40After a few iterations and adding the API key, the app's flow worked properly even though
00:07:44the UI was still simple.
00:07:46So GPT 5.5 looked at the edge cases and implemented mechanisms to fill in the gaps.
00:07:51Opus 4.7, on the other hand, asked us to give it the API key before it started implementation
00:07:57and built the entire app around that.
00:07:59So Opus 4.7, unlike GPT 5.5, didn't prepare for fallbacks and just needed everything available
00:08:05up front.
00:08:06Due to this, when the API wasn't actually there, the app had no fallback and just gave an error.
00:08:10Claude Code does focus on user experience and functionality together, so its implementation
00:08:15looked more realistic.
00:08:16This is Opus 4.7's UI strength showing up, which we covered in our previous video where
00:08:21we said Opus 4.7 is way better at handling the UI, but its implementation also had issues.
00:08:26When we asked it to debug, it didn't directly inspect the implementation like Codex did.
00:08:31Instead, it started asking us questions about what might be causing the problem and relied
00:08:35on our testing.
00:08:36It added debug points like indicators in the UI and console logs and asked us to check states
00:08:41and report back.
00:08:42After a back and forth, it eventually fixed the issue and the interview feature worked.
00:08:46We preferred how Codex used the agent browser to debug on its own.
00:08:49So in terms of autonomous working, Codex's implementation was better, and in terms of
00:08:53user experience, Claude Code did a way better job.
00:08:56We also wanted to test how both handled the init command.
00:08:59Claude Code's init runs without expanding the prompt inline.
00:09:02It creates a simple CLAUDE.md file that's around 90 lines and includes architecture, app flow,
00:09:08front-end and back-end structure, and all required commands to run the app.
00:09:12A lot of that information is redundant and doesn't really benefit the agent, which is
00:09:15why it isn't always necessary to keep all of it.
00:09:18Codex's setup was more refined.
00:09:20It included commit guidelines, pull request guidelines, and security instructions properly
00:09:24while keeping the project structure section brief instead of overloading it with detail.
00:09:28Neither was perfect, but Codex handled AGENTS.md better.
00:09:32Now we also wanted to test how both perform on code review.
00:09:35We gave the same prompt for a reliability review to both Codex and Claude Code, asking them
00:09:40to document the review in separate files while working on the same codebase.
00:09:44Once both had generated their reports, we opened a new session and asked Claude to output the
00:09:48diff between the two files, comparing the findings.
00:09:51Claude's review was much more detailed.
00:09:53It organized every finding by priority and included the affected components and the exact code snippets
00:09:57behind the issues.
00:09:59Codex's report mentioned line numbers but did not include the actual code snippets.
00:10:03Both reports were thorough, sharing several findings while each caught a few the other
00:10:07missed.
00:10:08Claude Code also reported security issues like a leaked API key and a vulnerability.
00:10:12The task was a reliability review though and those issues were outside the scope.
00:10:17Claude Code reported every extra problem it ran into along the way while Codex stayed strictly
00:10:21on reliability.
00:10:22So Codex's report was more aligned with the original request while Claude Code's was broader
00:10:27but less focused on the specific task.
00:10:29If we had to describe both in terms of building, GPT 5.5 feels more like a backend engineer
00:10:34focused on getting the application's functionality delivered correctly first while Opus 4.7 feels
00:10:40more like a full stack engineer trying to balance both functionality and user experience.
00:10:45On context management, Codex performed much better than Claude Code.
00:10:48Claude Code has in-session context editing which removes tool calls and reasoning steps
00:10:53that no longer matter from the conversation.
00:10:55It clears redundant information from the session to avoid bloat.
00:10:58The compaction isn't perfect but at least it doesn't keep unnecessary parts in the context
00:11:02while compacting.
00:11:03Codex doesn't edit its context.
00:11:05It compacts the entire conversation just as it took place.
00:11:08The one thing it does better is preserving the last 20,000 tokens in memory and not compacting
00:11:13that portion at all.
00:11:14That helps prevent performance degradation in Codex after compaction so that the conversation
00:11:18can flow smoothly from the next prompt onward.
00:11:21We tested its performance and Codex performed better after compaction than Claude Code did.
00:11:25So even though Claude Code follows a more detailed multi-step compaction process, Codex's preserved
00:11:30tail keeps the agent more useful in practice.
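A rough sketch of the "preserved tail" idea, assuming a conversation is a list of message strings. The 20,000-token budget comes from the description above; the four-characters-per-token estimate and the placeholder `summarize` function are assumptions for illustration, not how Codex actually implements compaction.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly four characters per token (assumption)."""
    return max(1, len(text) // 4)


def default_summarize(head: List[str]) -> str:
    """Placeholder summary; a real agent would ask the model to summarize."""
    return f"[summary of {len(head)} earlier messages]"


def compact(
    messages: List[str],
    tail_budget: int = 20_000,
    summarize: Callable[[List[str]], str] = default_summarize,
) -> List[str]:
    """Summarize older messages but keep the most recent ones verbatim."""
    used = 0
    split = len(messages)
    # Walk backwards, keeping messages until the tail budget is exhausted.
    for index in range(len(messages) - 1, -1, -1):
        cost = estimate_tokens(messages[index])
        if used + cost > tail_budget:
            break
        used += cost
        split = index
    head, tail = messages[:split], messages[split:]
    if not head:
        return list(messages)  # everything fits, nothing to compact
    return [summarize(head)] + tail
```

Because the newest messages survive untouched, the next prompt still sees the exact wording of the most recent work, which is the effect credited above for the smoother behavior after compaction.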
00:11:33Memory works differently between the two.
00:11:35Claude Code's harness is mostly stateless across sessions, meaning each session starts
00:11:39without any context from the previous one.
00:11:41It now has a memory feature that can store persistent preferences or instructions.
00:11:46So if we tell it to avoid doing something a certain way, it stores that and applies it
00:11:50again later within the same project.
00:11:52That helps when working repeatedly in a single project.
00:11:54But the memory is project scoped, so switching projects loses that stored behavior.
00:11:58Codex takes the opposite route.
00:12:00It consolidates information from multiple sessions over time and builds a global memory across
00:12:05interactions so it can retain patterns beyond a single project.
00:12:08That can help consistency across different tasks.
00:12:11So in short, Claude Code keeps memory more contained within a project while Codex takes
00:12:15a more cross-session, cross-project approach which changes how each of them adapts over
00:12:19time.
00:12:20Since Claude Code has been around for longer and is being developed constantly to improve
00:12:24developer experience, it has more to offer compared to Codex.
00:12:27Claude Code has a hook system which lets us run our own scripts at specific points in the
00:12:32agent's lifecycle, like before or after a tool runs, among other points, for things
00:12:36like blocking unsafe commands, running formatters, and more.
00:12:39We can also run subagents in dedicated worktrees so their changes don't interfere with
00:12:43each other.
00:12:44We can control the effort level for the models, and we can even use keywords like "ultrathink"
00:12:48to push reasoning to its maximum on a specific task.
00:12:51None of that has an equivalent in Codex right now.
00:12:54The ecosystem is the other clear win for Claude Code.
00:12:56We can run sessions through the Claude desktop app and delegate tasks from the mobile app.
00:13:01Across Claude Code, the desktop app, web app, and browser extensions, the surface is much
00:13:06wider than Codex, which mainly consists of a web app and a desktop app that was only recently
00:13:11released and didn't feel as strong at the time we tested it.
00:13:14Sessions also move between environments more easily on Claude Code, which makes it more
00:13:18convenient to work across different interfaces.
00:13:20Codex also has many interesting features.
00:13:22In the cloud, it has an attempt flag that runs the same task n times.
00:13:26It produces several implementations and selects the best one.
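The idea of running a task several times and keeping the best result can be sketched generically as best-of-n selection. How Codex scores and picks an attempt is not described here, so `run_attempt` and `score` are hypothetical stand-ins, for example "generate an implementation" and "count passing tests".

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def best_of_n(run_attempt: Callable[[int], T], score: Callable[[T], float], n: int) -> T:
    """Run the same task n times and return the highest-scoring attempt."""
    if n < 1:
        raise ValueError("n must be at least 1")
    attempts: List[T] = [run_attempt(i) for i in range(n)]
    return max(attempts, key=score)
```

For instance, `best_of_n(lambda i: i * i, lambda x: -abs(x - 4), 4)` runs four attempts producing 0, 1, 4, and 9 and returns 4, the one closest to the target.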
00:13:29Claude Code can do something similar but only through configurations and instructions, not
00:13:33as a flag.
00:13:34The other Codex-only feature, which sets it apart from the rest, is its integration with
00:13:38OpenAI's image models.
00:13:39It can use them directly in the CLI to generate images for the websites it's working on.
00:13:44Claude relies mostly on SVG-based generation for visuals, which doesn't even compete on
00:13:49quality because it doesn't have any image model yet.
00:13:52If we're building a UI that needs real imagery, Codex is the only one of the two that does
00:13:56it, without even being explicitly told to.
00:13:58Also, if you are enjoying our content, consider pressing the hype button because it helps us
00:14:03create more content like this and reach out to more people.
00:14:06Both use subagents, even though the concept was introduced by Claude first.
00:14:10Since it came first in Claude Code, its integration is more mature because it has been agent-centric
00:14:15and focused on the coding experience for way longer than OpenAI.
00:14:19It supports agents that can be orchestrated through remote sessions, while Codex mainly
00:14:23supports multi-agent workflows inside the terminal environment.
00:14:27The biggest difference is how each invokes subagents.
00:14:29Claude Code can spawn agents without explicit invocation, while Codex only creates an agent
00:14:35if we explicitly ask for one in the prompt.
00:14:37When Codex spawns agents, it names them and passes them a proper prompt as well.
00:14:41In coding performance, the two are fairly similar, but the design choices behind them are different.
00:14:46Claude Code's subagents use an explicit allow list, meaning the parent agent defines exactly
00:14:51which tools the subagent can access, while Codex subagents inherit tool access from the
00:14:55parent by default.
00:14:57Claude Code also gives every subagent a completely fresh context window.
00:15:01A subagent doesn't have access to the conversation history and only sees the prompt from the parent,
00:15:06plus the system prompt and any global rules, because Claude focuses on context isolation.
00:15:10Codex CLI does the opposite.
00:15:12It forks the full history into the subagent session, with the parent's prompt layered on top.
00:15:17Codex agents retain more context about what's already been discussed, which does help improve
00:15:22their performance.
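The two context strategies described above can be contrasted in a few lines. This is a conceptual sketch only: contexts are modeled as plain lists of strings, and the function names are hypothetical rather than taken from either tool.

```python
from typing import List


def isolated_context(
    system_prompt: str, global_rules: List[str], history: List[str], task_prompt: str
) -> List[str]:
    """Fresh window: the subagent sees rules and its task, never the history."""
    return [system_prompt, *global_rules, task_prompt]


def forked_context(
    system_prompt: str, global_rules: List[str], history: List[str], task_prompt: str
) -> List[str]:
    """Forked window: the full parent history is copied, with the task on top."""
    return [system_prompt, *global_rules, *history, task_prompt]


history = ["user: we use Postgres", "assistant: noted, schema is in db/schema.sql"]
isolated = isolated_context("system", ["rule: be concise"], history, "research indexes")
forked = forked_context("system", ["rule: be concise"], history, "research indexes")

print(len(isolated), len(forked))  # prints: 3 5
```

The isolated version is cheaper and cannot be confused by stale history, while the forked version already knows facts like the database choice without the parent having to restate them in the prompt.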
00:15:23In practice, Claude Code's strict isolation hurt our research subagents.
00:15:27When we used them, the results weren't good enough, because they only saw the immediate
00:15:30prompt and didn't have any prior context.
00:15:33Codex agents get the whole history, can iterate more effectively, and perform better on tasks
00:15:38where continuity matters.
00:15:39That brings us to the end of this video.
00:15:41If you'd like to support the channel and help us keep making videos like this, you can do
00:15:45so by using the super thanks button below.
00:15:48As always, thank you for watching and I'll see you in the next one.