It’s Broken… The Claude Code Vs Codex Debate Is Finally Over

AAI LABS
Computing/SoftwareSmall Business/StartupsInternet Technology

Transcript

00:00:00For a long time, everyone's go-to model for coding was clogged.
00:00:03Not only because it performed well, but because there weren't other options on the same tier.
00:00:07Then GPT models stepped up and closed the gap, especially with the release of GPT 5.5, which
00:00:12brought it down to almost none.
00:00:14To compare the two we needed to put them in the environments designed best for them, which
00:00:18means their own CLIs.
00:00:19So we're putting Opus 4.7 and GPT 5.5 to the test, to see how they perform against each
00:00:25other.
00:00:26We'll test them across 9 categories to find out which one actually comes out on top and
00:00:29by the end, you'll know which one earns the spot in your workflows.
00:00:33Usability is where Clawed Code starts breaking down for us.
00:00:36We've been using it for most of our tasks, coding and non-coding, but it was only good
00:00:40until the 2.1.0 update.
00:00:43After that, things started going downhill for Clawed Code.
00:00:46The UI is the most frustrating part because it has the biggest impact on the experience.
00:00:50The terminal glitches, rendering breaks, and a lot of what used to feel polished now feels
00:00:55off.
00:00:56It used to be one of the best TUIs, but only until it started being vibe coded.
00:00:59It now feels more broken with multiple bugs like rendering issues, cachey leaks, about
00:01:03which not only us were complaining.
00:01:05The bigger problem is that they removed the dangerously skipped permissions mode and replaced
00:01:09it with auto mode by default.
00:01:11We used to run bypass permission mode for most of our tasks, with hooks set up for whichever
00:01:15files we didn't want Clawed to touch.
00:01:17Now it asks for permissions on even that mode, when we gave Clawed a prompt to create a skills,
00:01:22shifted to another Clawed session to do something else, and only later found that the skill creation
00:01:27was blocked by permission prompt for writing to the .clawed folder the entire time.
00:01:32We came back expecting the skills to be created, and it was just sitting there waiting.
00:01:36Codex handles this better because its YOLO mode doesn't ask for any permissions the way
00:01:40Clawed Code's auto mode does.
00:01:42The CLI is built on Rust, so the UI is much smoother than Clawed Code's React based setup,
00:01:47and even after a long session, nothing breaks.
00:01:49Personality configuration is another spot where Codex pulls ahead.
00:01:53We can set the personality to a more direct and concise language.
00:01:56This is because GPT 5.5 is significantly more sycophantic and is agreeable with every prompt
00:02:02than Opus 4.7 is.
00:02:04This is why changing the personality in Codex prevents that default behavior in the model.
00:02:08To make Opus 4.7 direct, we have to rely on instructions in Clawed.md, while Codex does
00:02:14that with just a setting change.
00:02:16Pre-installed skills are another difference.
00:02:18Codex ships with many that Clawed Code doesn't have, including the agent browser skill.
00:02:22That matters for anyone building apps, because in Codex we don't need to explicitly connect
00:02:26MCPs for browser verification.
00:02:29It does that automatically after implementing any feature.
00:02:31It also has a built-in skill creator, so when we want a new skill, it generates a complete
00:02:35one with the right structure and reference files.
00:02:38In Clawed, we'd need to install the skill creator separately to get a properly structured
00:02:42skill.
00:02:43Otherwise, it just writes an MD file.
00:02:45Now there are still two things Clawed Code does better.
00:02:47Codex doesn't offer rewinding, which is a feature we use the most, so not having it is
00:02:51a real downside.
00:02:52Clawed Code also lets us view its thinking by expanding it with Ctrl+O, which Codex doesn't
00:02:57do well.
00:02:58Viewing the reasoning is helpful because we can correct the approach mid-task instead of
00:03:02waiting for the implementation to finish and then redoing it.
00:03:05So looking at how Clawed Code's user experience degrades with each new update, Codex gets a
00:03:10point for usability.
00:03:11On cost, Clawed Code is the more expensive tool by a wide margin.
00:03:15Not in terms of actual prices, but by usability per same price.
00:03:19Clawed Code is not available on the free tier at all and is only available starting from
00:03:23the Pro and Max plans.
00:03:24The plans have nearly identical pricing.
00:03:26The Pro plan is basically unusable for any good scale application because it hits its
00:03:30limits on just a few tasks.
00:03:32We can't even properly use Opus 4.7 for any meaningful task on Pro.
00:03:36The limits run out very quickly even on the Max plan that we use.
00:03:39Codex is in a better position from the start.
00:03:41It's available even on the free plan with limited usage.
00:03:44Both use a similar 5-hour window mechanism, so to see which one gets more work done we
00:03:49ran them on tasks of the same scale.
00:03:51Clawed Code already has a context command that visualizes how many tokens a session has used,
00:03:56but Codex doesn't have a built-in equivalent, so we had to find a workaround for the comparison.
00:04:00Both tools store their sessions as JSON files, just organized differently.
00:04:04So we built a small tool that reads them and counts the tokens used in each session.
00:04:08On the same app and a similar level of debugging, Opus 4.7 burned through 173,000 tokens while
00:04:15GPT 5.5 used only 82,000.
00:04:18This is because GPT 5.5 gets work done in fewer tokens and far fewer retries.
00:04:23So Codex lasted significantly longer and turned out to be far more cost-efficient for the same work.
00:04:28But before we move forwards, let's have a word by our sponsor, Stream.
00:04:32You're building an app and your users need to talk, stream, and connect.
00:04:35You try handling that yourself 3 months later, you're still debugging instead of shipping.
00:04:39Stream skips all of that.
00:04:40Stream gives you everything out of the box from in-app chat and video calling to activity
00:04:44feeds and AI moderation so you're shipping features, not building infrastructure from scratch.
00:04:49We're talking WhatsApp-style messaging, Zoom-style video calls, and Instagram-style feeds all built in.
00:04:55What really stands out is Stream's new launch, Vision Agents.
00:04:58You can build intelligent AI agents that see, hear, and act on live video and audio, all
00:05:02in Python with just a few lines of code.
00:05:05Everything runs on a global edge network for low latency everywhere.
00:05:08From startups to scaling apps, leading platforms across social, fitness, and community rely
00:05:13on Stream to power over a billion end users.
00:05:16If you're a developer building the next big app, Stream scales with you from day one.
00:05:20Start for free at getstream.io, links in the pinned comment.
00:05:24The real test for the two models is on how they build products.
00:05:27As we said before, GPT 5.5 is faster and consumes fewer tokens, so it ships working apps quicker.
00:05:33Opus 4.7 spends more tokens on thinking, plans deeper, and iterates on all aspects of the
00:05:38app at the same time.
00:05:40Planning was the first thing we wanted to test.
00:05:42We've been using Clod Code's planning mode for a long time.
00:05:45It covers most things, has some flaws, but is still quite usable.
00:05:48So we wanted to see how GPT 5.5 performs at planning, because OpenAI claims it does better
00:05:53at planning tasks and executing them.
00:05:55We enabled plan mode and opened it in a folder that already contained a backend for an app
00:06:00an API built using FastAPI and asked it to build the frontend for it.
00:06:04It explored the project thoroughly and asked a few questions, but the questions were fairly
00:06:08simple.
00:06:09It could have gone deeper into how we wanted the frontend to look, because for frontend
00:06:13work, that matters.
00:06:14The plan it produced was very simple.
00:06:16It included a summary of the main flow, the key changes, the pages to add, and how to test
00:06:20them.
00:06:21The one thing it did well was clearly separating its assumptions, so we knew exactly what it
00:06:25was taking for granted.
00:06:26We told it to proceed and it finished in about 8 minutes.
00:06:28The same task on Claude Code took 24 minutes.
00:06:31But Opus 4.7's plan was much more in-depth, considered more aspects of the application,
00:06:36and even pulled in ShadC and UI to improve the user experience.
00:06:39So Opus 4.7 does better in terms of planning.
00:06:42Next, we wanted to test both on a Greenfield app.
00:06:45We gave them the same prompt that is to create a mono repo with a Python Flask backend and
00:06:50a Next.js frontend, along with the full pipeline and key requirements for how the app should
00:06:55work.
00:06:56It switched into planning mode by itself because of its harness design.
00:06:59Codex did not switch into planning mode and instead started implementing directly.
00:07:04It finished much faster than Claude Code, which took around 16 minutes because of the planning
00:07:08step.
00:07:09GPT 5.5's version of the app had a much simpler UI and mainly focused on making sure the app
00:07:14worked.
00:07:15It didn't work properly at the start, so we debugged it iteratively.
00:07:17One thing we noticed was that the interview prompts were hardcoded because we hadn't provided
00:07:22any API key.
00:07:23The prompt specified using the Gemini API as a backend, but since no key was available,
00:07:27it implemented a fallback so the app wouldn't crash completely.
00:07:30Codex had actually used local follow-up questions without any explicit prompting.
00:07:35We like this because fallback mechanisms like these are useful in production since they prevent
00:07:39crashes.
00:07:40After a few iterations and adding the API key, the app's flow worked properly even though
00:07:44the UI was still simple.
00:07:46So GPT 5.5 looked at the edge cases and implemented mechanisms to fill in the gaps.
00:07:51Opus 4.7, on the other hand, asked us to give it the API key before it started implementation
00:07:57and built the entire app around that.
00:07:59So Opus 4.7, unlike GPT 5.5, didn't prepare for fallbacks and just needed everything available
00:08:05up front.
00:08:06Due to this, when the API wasn't actually there, the app had no fallback and just gave an error.
00:08:10Claude Code does focus on user experience and functionality together, so its implementation
00:08:15looked more realistic.
00:08:16This is Opus 4.7's UI strength showing up, which we covered in our previous video where
00:08:21we said Opus 4.7 is way better at handling the UI, but its implementation also had issues.
00:08:26When we asked it to debug, it didn't directly inspect the implementation like Codex did.
00:08:31Instead, it started asking us questions about what might be causing the problem and relied
00:08:35on our testing.
00:08:36It added debug points like indicators in the UI and console logs and asked us to check states
00:08:41and report back.
00:08:42After a back and forth, it eventually fixed the issue and the interview feature worked.
00:08:46We preferred how Codex used the agent browser to debug on its own.
00:08:49So in terms of autonomous working, Codex's implementation was better, and in terms of
00:08:53user experience, Claude Code did a way better job.
00:08:56We also wanted to test how both handled the init command.
00:08:59Claude Code's init runs without expanding the prompt inline.
00:09:02It creates a simple Claude.md file that's around 90 lines and includes architecture, app flow,
00:09:08front-end and back-end structure, and all required commands to run the app.
00:09:12A lot of that information is redundant and doesn't really benefit the agent, which is
00:09:15why it isn't always necessary to keep all of it.
00:09:18Codex's setup was more refined.
00:09:20It included commit guidelines, pull request guidelines, and security instructions properly
00:09:24while keeping the project structure section brief instead of overloading it with detail.
00:09:28Neither was perfect, but Codex handled agents.md better.
00:09:32Now we also wanted to test how both perform on code review.
00:09:35We gave the same prompt for a reliability review to both Codex and Claude Code, asking them
00:09:40to document the review in separate files while working on the same codebase.
00:09:44Once both had generated their reports, we opened a new session and asked Claude to output the
00:09:48diff between the two files, comparing the findings.
00:09:51Claude's review was much more detailed.
00:09:53It organized every finding by priority and included components, the exact code snippets
00:09:57behind the issues.
00:09:59Codex's report mentioned line numbers but did not include the actual code snippets.
00:10:03Both reports were thorough, sharing several findings while each caught a few the other
00:10:07missed.
00:10:08Claude Code also reported security issues like a leaked API key and a vulnerability.
00:10:12The task was a reliability review though and those issues were outside the scope.
00:10:17Claude Code reported every extra problem it ran into along the way while Codex stayed strictly
00:10:21on reliability.
00:10:22So Codex's report was more aligned with the original request while Claude Code's was broader
00:10:27but less focused on the specific task.
00:10:29If we had to describe both in terms of building, GPT 5.5 feels more like a backend engineer
00:10:34focused on getting the application's functionality delivered correctly first while Opus 4.7 feels
00:10:40more like a full stack engineer trying to balance both functionality and user experience.
00:10:45On context management, Codex performed much better than Claude Code.
00:10:48Claude Code has in-session context editing which removes tool calls and reasoning steps
00:10:53that no longer matter from the conversation.
00:10:55It clears redundant information from the session to avoid bloat.
00:10:58The compaction isn't perfect but at least it doesn't keep unnecessary parts in the context
00:11:02while compacting.
00:11:03Codex doesn't edit their context.
00:11:05It compacts the entire conversation just as it took place.
00:11:08The one thing it does better is preserving the last 20,000 tokens in memory and not compacting
00:11:13that portion at all.
00:11:14That helps prevent performance degradation in Codex after compaction so that the conversation
00:11:18can flow smoothly from the next prompt onward.
00:11:21We tested its performance and Codex performed better after compaction than Claude Code did.
00:11:25So even though Claude Code follows a more detailed multi-step compaction process, Codex's preserved
00:11:30tail keeps the agent more useful in practice.
00:11:33Memory works differently between the two.
00:11:35Claude Code's harness is mostly stateless across sessions, meaning each session starts
00:11:39without any context from the previous one.
00:11:41It now has a memory feature that can store persistent preferences or instructions.
00:11:46So if we tell it to avoid doing something a certain way, it stores that and applies it
00:11:50again later within the same project.
00:11:52That helps when working repeatedly in a single project.
00:11:54But the memory is project scoped, so switching projects loses that stored behavior.
00:11:58Codex takes the opposite route.
00:12:00It consolidates information from multiple sessions over time and builds a global memory across
00:12:05interactions so it can retain patterns beyond a single project.
00:12:08That can help consistency across different tasks.
00:12:11So in short, Claude Code keeps memory more contained within a project while Codex takes
00:12:15a more cross-session, cross-project approach which changes how each of them adapts over
00:12:19time.
00:12:20Since Claude Code has been around for longer and is being developed constantly to improve
00:12:24developer experience, it has more to offer compared to Codex.
00:12:27Claude Code has a hook system which lets us run our own scripts at specific points in the
00:12:32agent's lifecycle, like before or after a tool runs, among other points, for things
00:12:36like blocking unsafe commands, running formatters, and more.
00:12:39We can also run sub-agents in a dedicated work tree so their performance doesn't affect
00:12:43each other.
00:12:44We can control the effort level for the models, and we can even use keywords like "ultra-think"
00:12:48to push reasoning to its maximum on a specific task.
00:12:51None of that has an equivalent in Codex right now.
00:12:54The ecosystem is the other clear win for Claude Code.
00:12:56We can run sessions through the Claude desktop app and delegate tasks from the mobile app.
00:13:01Across Claude Code, the desktop app, web app, and browser extensions, the surface is much
00:13:06wider than Codex, which mainly consists of a web app and a desktop app that was only recently
00:13:11released and didn't feel as strong at the time we tested it.
00:13:14Sessions also move between environments more easily on Claude Code, which makes it more
00:13:18convenient to work across different interfaces.
00:13:20Codex also has many interesting features.
00:13:22In the cloud, it has an attempt flag that runs the same task n times.
00:13:26It produces several implementations and selects the best one.
00:13:29Claude Code can do something similar but only through configurations and instructions, not
00:13:33as a flag.
00:13:34The other Codex-only feature, which sets it apart from the rest, is its integration with
00:13:38OpenAI's image models.
00:13:39It can use them directly in the CLI to generate images for the websites it's working on.
00:13:44Claude relies mostly on SVG-based generation for visuals, which doesn't even compete on
00:13:49quality because it doesn't have any image model yet.
00:13:52If we're building a UI that needs real imagery, Codex is the only one of the two that does
00:13:56it, without even being explicitly told to.
00:13:58Also, if you are enjoying our content, consider pressing the hype button because it helps us
00:14:03create more content like this and reach out to more people.
00:14:06Both use subagents, even though the concept was introduced by Claude first.
00:14:10Since it came first in Claude Code, its integration is more mature because it has been agent-centric
00:14:15and focused on the coding experience for way longer than OpenAI.
00:14:19It supports agents that can be orchestrated through remote sessions, while Codex mainly
00:14:23supports multi-agent workflows inside the terminal environment.
00:14:27The biggest difference is how each invokes subagents.
00:14:29Claude Code can spawn agents without explicit invocation, while Codex only creates an agent
00:14:35if we explicitly ask for one in the prompt.
00:14:37When Codex spawns agents, it names them and pass them a proper prompt as well.
00:14:41In coding performance, the two are fairly similar, but the design choices behind them are different.
00:14:46Claude Code's subagents use an explicit allow list, meaning the parent agent defines exactly
00:14:51which tools the subagent can access, while Codex subagents inherit tool access from the
00:14:55parent by default.
00:14:57Claude Code also gives every subagent a completely fresh context window.
00:15:01A subagent doesn't have access to the conversation history and only sees the prompt from the parent,
00:15:06plus the system prompt and any global rules, because Claude focuses on context isolation.
00:15:10Codex CLI does the opposite.
00:15:12It forks the full history into the subagent session, with the parent's prompt layered on top.
00:15:17Codex agents retain more context about what's already been discussed, which does help improve
00:15:22their performance.
00:15:23In practice, Claude Code's strict isolation hurt our research subagents.
00:15:27When we used them, the results weren't good enough, because they only saw the immediate
00:15:30prompt and didn't have any prior context.
00:15:33Codex agents get the whole history, can iterate more effectively, and perform better on tasks
00:15:38where continuity matters.
00:15:39That brings us to the end of this video.
00:15:41If you'd like to support the channel and help us keep making videos like this, you can do
00:15:45so by using the super thanks button below.
00:15:48As always, thank you for watching and I'll see you in the next one.

Key Takeaway

GPT 5.5 in the Codex CLI delivers 2x better cost efficiency and faster autonomous debugging via its integrated agent browser, while Opus 4.7 remains superior for deep architectural planning and high-fidelity UI design.

Highlights

  • GPT 5.5 consumes 82,000 tokens to complete the same debugging task that requires 173,000 tokens from Opus 4.7.

  • Codex preserves the last 20,000 tokens in memory during compaction to prevent performance degradation, while Claude Code edits and removes reasoning steps from sessions.

  • The Rust-based Codex CLI offers a smoother UI and stable long-term sessions compared to the React-based Claude Code CLI which suffers from rendering glitches and cache leaks.

  • Claude Code version 2.1.0 and later replaced the 'dangerously skip permissions' mode with an auto-prompting system that blocks background tasks until manually approved.

  • Codex integrates directly with OpenAI image models to generate website assets within the CLI, whereas Claude Code is limited to SVG-based visual generation.

  • Opus 4.7 requires 24 minutes to plan and execute a frontend build that GPT 5.5 completes in 8 minutes.

Timeline

Usability and CLI Architecture

  • Claude Code reliability declined following the 2.1.0 update due to UI glitches and rendering issues.
  • The removal of bypass permission mode in Claude Code causes background skills to stall on hidden prompts.
  • Codex utilizes a Rust-based CLI for smoother performance and a YOLO mode that avoids repetitive permission requests.
  • Personality settings in Codex allow for direct, non-sycophantic interactions without needing a specific .md instructions file.

The React-based architecture of Claude Code leads to terminal glitches and cache leaks during extended use. In contrast, the Codex CLI maintains stability over long sessions. Codex also features a built-in skill creator and an agent browser skill, whereas Claude Code requires separate installations for structured skill creation. Claude Code retains advantages in reasoning transparency through a toggleable thinking view and a rewind feature for session history.

Cost Efficiency and Token Usage

  • Claude Code is restricted to Pro and Max plans while Codex offers a free tier with limited usage.
  • GPT 5.5 completes identical tasks using approximately 53% fewer tokens than Opus 4.7.
  • Codex maintains longer session viability because GPT 5.5 requires fewer retries to reach a working solution.

Testing on a standardized application debugging task shows that Opus 4.7 burns 173,000 tokens compared to 82,000 for GPT 5.5. Both tools utilize a 5-hour usage window, but the lower token consumption of GPT 5.5 makes it more cost-effective for large-scale applications. Claude Code provides a built-in context visualization command, while Codex requires external JSON parsing to track token usage.

Product Building and Planning Performance

  • Opus 4.7 produces more detailed project plans that include UI libraries like Shadcn UI for better user experience.
  • GPT 5.5 focuses on backend functionality and implements fallback mechanisms when API keys are missing to prevent crashes.
  • Codex utilizes an agent browser to autonomously debug applications while Claude Code relies on manual user feedback and logs.

In a greenfield mono-repo test, GPT 5.5 skips formal planning to start implementation immediately, finishing in half the time of Opus 4.7. However, the Opus 4.7 implementation features a more realistic and polished UI. During code reviews, Claude Code identifies security vulnerabilities and leaked keys but includes unnecessary code snippets, whereas Codex provides a more focused reliability report with specific line numbers.

Context and Memory Management

  • Codex preserves the most recent 20,000 tokens during compaction to ensure conversation flow remains smooth.
  • Claude Code memory is project-scoped and loses stored preferences when switching to a different repository.
  • Codex builds a global memory across all interactions to maintain consistent patterns across multiple projects.

Claude Code attempts to avoid context bloat by removing tool calls and redundant reasoning steps from the history. Codex opts for full conversation compaction but keeps the 'tail' of the conversation uncompressed in active memory. This design choice results in better performance for Codex after the context window fills up compared to the multi-step compaction used by Claude Code.

Ecosystem and Agent Orchestration

  • Claude Code supports a hook system for running custom scripts during tool execution and lifecycle events.
  • Codex sub-agents inherit the full conversation history and tool access from the parent agent for better continuity.
  • Claude Code isolates sub-agents into fresh context windows, which limits their effectiveness on complex research tasks.

The Claude ecosystem is more mature, offering cross-platform session movement between desktop, web, and mobile apps. Codex provides unique CLI features like the 'attempt' flag for generating multiple implementation variants and native image model integration for UI design. While Claude Code allows for 'ultra-think' reasoning modes, Codex's method of forking history into sub-agents proves more effective for tasks requiring deep context.

Community Posts

View all posts