I Tested DeepSeek V4 vs Claude Code vs Codex

Chase AI
Computing/Software · Internet Technology

Transcript

00:00:00In the last 24 hours, we have had huge updates
00:00:02to two of the biggest AI models on the planet.
00:00:04First, we got the release of GPT 5.5,
00:00:07which is boasting certain benchmark scores
00:00:10that beat out Claude's Mythos.
00:00:12Secondly, we got the release of DeepSeek V4,
00:00:15which is an open source, open weight model
00:00:18that has benchmarks that rival these frontier big players.
00:00:22So with all these new models to choose from,
00:00:24what are you, the average user, supposed to do?
00:00:27Well, today I'm gonna help you answer that question
00:00:29as I pit Opus 4.7, GPT 5.5,
00:00:33and DeepSeek V4 against one another,
00:00:36so you can see which one actually makes sense for you.
00:00:39Now, before we kick off this head-to-head-to-head test
00:00:41between GPT 5.5 inside of Codex,
00:00:45DeepSeek V4 inside of OpenCode,
00:00:47and Opus 4.7 inside of Claude Code,
00:00:51let's first take a quick look at the benchmarks,
00:00:53especially these two latest models
00:00:54that dropped in the last 24 hours.
00:00:56Now let's first talk about cost.
00:00:58Now, DeepSeek V4, as you know,
00:01:00is an open source, open weight model,
00:01:01but that does not mean you can run this on your computer
00:01:04because this thing is huge.
00:01:05I'm talking 1.6 trillion parameters.
00:01:08You need some serious hardware to run this.
00:01:10So we still gotta pay for it.
00:01:11We're still gonna have to use the API,
00:01:13but it is infinitely cheaper than the competition,
00:01:15about eight times cheaper.
00:01:18And of the three models,
00:01:19the brand new GPT 5.5 is actually the most expensive,
00:01:22which is kind of surprising because by and large,
00:01:24OpenAI has been cheaper than its Anthropic competition.
00:01:28In terms of what it will cost you
00:01:30per 1 million tokens of output.
00:01:32For GPT 5.5, it's gonna be $30.
00:01:35For Anthropic, it's going to be $25.
00:01:38And for DeepSeek, it's gonna be $3.48.
00:01:41Now, if we're talking about input tokens,
00:01:44which is a smaller part of the whole,
00:01:46GPT 5.5 and Opus 4.7 are the same.
00:01:49It's going to be $5 per 1 million input.
00:01:53And for DeepSeek, it's about like $1.70.
00:01:57So way cheaper on the input and way cheaper on the output.
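The prices quoted above can be turned into a quick cost comparison. This is a minimal sketch: the per-million-token figures are the ones stated in the video and are not checked against any official price list, and the 200,000-input / 50,000-output workload is a made-up example.

```python
# Per-million-token prices (USD) as quoted in the video; treat them as unverified.
PRICES = {
    "GPT 5.5": {"input": 5.00, "output": 30.00},
    "Opus 4.7": {"input": 5.00, "output": 25.00},
    "DeepSeek V4": {"input": 1.70, "output": 3.48},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run at the quoted per-million-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def output_price_ratio(expensive: str, cheap: str) -> float:
    """How many times more one model charges per output token than another."""
    return PRICES[expensive]["output"] / PRICES[cheap]["output"]


if __name__ == "__main__":
    # Hypothetical workload: 200k input tokens and 50k output tokens.
    for name in PRICES:
        print(f"{name}: ${run_cost(name, 200_000, 50_000):.2f}")
    print(f"GPT 5.5 vs DeepSeek V4 (output): {output_price_ratio('GPT 5.5', 'DeepSeek V4'):.1f}x")
    print(f"Opus 4.7 vs DeepSeek V4 (output): {output_price_ratio('Opus 4.7', 'DeepSeek V4'):.1f}x")
```

On these numbers the output-price gap is roughly 7x to 9x, which is where the "about eight times cheaper" figure comes from; the input-price gap is closer to 3x.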
00:02:01That being said, when it comes to 5.5,
00:02:03this is like twice as expensive as 5.4.
00:02:06However, OpenAI claims that it actually uses way less tokens
00:02:10due to its power.
00:02:11So while it's double the price of 5.4,
00:02:14they say in terms of actual token spend and actual cost,
00:02:17for the same task, it ends up only being like 20%
00:02:20more expensive when it's all said and done.
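The claim above implies a specific reduction in token usage. As a rough sketch, assuming cost is simply price times tokens (and ignoring any difference between input and output pricing), a 2x price with a 1.2x total cost works out as follows:

```python
def implied_token_ratio(price_ratio: float, cost_ratio: float) -> float:
    """Since cost = price * tokens, tokens_new / tokens_old = cost_ratio / price_ratio."""
    return cost_ratio / price_ratio


if __name__ == "__main__":
    # Figures from the video: 5.5 is about 2x the price of 5.4, about 20% more per task.
    ratio = implied_token_ratio(price_ratio=2.0, cost_ratio=1.2)
    print(f"5.5 would use about {ratio:.0%} of the tokens 5.4 uses for the same task")
```

In other words, the claim only holds if 5.5 finishes the same task with about 40% fewer tokens.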
00:02:21So just have that in the back of your mind.
00:02:24So we've talked about the cost.
00:02:25Now let's talk about the benchmarks.
00:02:26How good are these models on paper?
00:02:27I know we're all kind of numb to benchmarks in general.
00:02:31We need to take them with a grain of salt,
00:02:32but it's still worth taking a look,
00:02:33especially when we're looking at the numbers
00:02:36that are reported by each player on the same benchmark.
00:02:39So there were three in the coding category
00:02:42that all three reported numbers.
00:02:43That was SWE-bench Verified, SWE-bench Pro
00:02:46and Terminal-Bench 2.0.
00:02:48Now for SWE-bench Verified and SWE-bench Pro,
00:02:50Opus was the winner there.
00:02:52On Terminal-Bench 2.0, GPT was the winner by far at 87.2,
00:02:56which by the way is a higher number
00:02:59than what Anthropic reported for Mythos.
00:03:02Oh, Mythos, sorry.
00:03:03Which is kind of crazy.
00:03:05You know, the super secret model they can't release,
00:03:07apparently does worse on Terminal-Bench 2.0 than GPT 5.5.
00:03:10Now Terminal-Bench 2.0 is the biggest outlier here.
00:03:13Opus 4.7 and V4 Pro are way behind,
00:03:16but take a look at Opus 4.7 versus V4 Pro.
00:03:20It's less than two points while being eight times cheaper.
00:03:23And you see the same sort of story here
00:03:24with SWE-bench Verified and SWE-bench Pro.
00:03:26Yeah, Opus wins.
00:03:28But when we compare the second place with the third place
00:03:31and V4 is always third place,
00:03:33there isn't the huge gap you would expect.
00:03:36I mean, five points isn't nothing, you know,
00:03:38on SWE-bench Verified, 85 to 86.
00:03:41But again, eight times cheaper, open source.
00:03:45You know, there's some actual trade-offs here
00:03:46that we can make if we don't need the most power.
00:03:49Another thing that's interesting to talk about
00:03:51is long context where oddly Opus 4.7 is really bad
00:03:55by the numbers, like significantly worse than 4.6,
00:03:58which kind of blows my mind.
00:04:00And when we're talking about long context
00:04:01where we're trying to retrieve things
00:04:03between 500,000 tokens and 1 million tokens,
00:04:064.7 is actually terrible.
00:04:08And does way worse than DeepSeek and GPT 5.5.
00:04:12Now you can have a whole discussion about
00:04:14why are you even in the 500,000 to 1 million token range?
00:04:17To begin with, how many people are actually operating there
00:04:20because we are hitting context rot no matter what
00:04:22at that place, no matter what model you're using.
00:04:24But it is interesting that for whatever reason,
00:04:26we've seen some regression
00:04:27when it comes to the Anthropic models.
00:04:29But big picture, I think the takeaway is
00:04:325.5 is really strong.
00:04:33It beats Opus 4.7 in certain metrics,
00:04:36loses in certain metrics,
00:04:37but it's an extremely robust model.
00:04:39And on top of that, while V4 Pro is kind of, you know,
00:04:42lagging behind by and large,
00:04:45it's within striking distance while being infinitely cheaper,
00:04:48which again is a great option for your average customer.
00:04:52Because right now it feels like you don't have a lot
00:04:54of options on the open source side that actually can compete.
00:04:56Now let's jump into the actual head to head to head test
00:04:59with all three of these models.
00:05:00And we're using a harness for each of these models.
00:05:02With 5.5, it's going to be Codex.
00:05:04With Opus 4.7, it's going to be Claude Code.
00:05:07And with DeepSeek V4 Pro, I am using OpenCode.
00:05:10And for the first test, what we're going to do is
00:05:11we're going to have them create a flight simulator
00:05:14for us in Three.js that runs in the browser.
00:05:17You can see the prompt right here.
00:05:18I'm saying, I want it to feel good to fly.
00:05:20I want it to have some weight to it.
00:05:21I want some strong visuals and I want it to use whatever
00:05:25structure and tooling it thinks is correct.
00:05:27So it's straightforward enough that they know what to do,
00:05:30yet there's enough leeway so we can see some divergence
00:05:33between the models.
00:05:34And while we are going to look at what they're able
00:05:36to one shot, we are going to go through multiple iterations
00:05:38of this and have follow on prompts.
00:05:40Because as cool as it is to see how well it does on one shot,
00:05:44that isn't how we really work in real life, is it?
00:05:46I want to see how it does when I give it follow on prompts
00:05:49and how quickly it takes to get it to something I like.
00:05:52And when we compare these three models,
00:05:54there's really four things I'm going to look at.
00:05:55It's going to be time.
00:05:57How long does it take to build this?
00:05:58Cost, how many tokens are we using?
00:06:01Quality, how good is it?
00:06:02And then four is sort of vibes.
00:06:04And that sort of relates to quality.
00:06:06It's very subjective.
00:06:06Which one do I actually like more?
00:06:09And also of note, all three models, all three harnesses
00:06:11are also using the exact same skills.
00:06:13So let's begin with DeepSeek and the questions it's asking us.
00:06:16It's asking what sort of flight model we want.
00:06:18Let's go with full sim.
00:06:20It's recommending oceans and islands for the terrain.
00:06:22We'll go with that.
00:06:23Let's see how, and then it's asking camera preference.
00:06:25Let's do both.
00:06:26Let's see if it's able to give us a toggle
00:06:27for both the first person and third person.
00:06:29We'll go with its recommended tooling preference.
00:06:32And we'll just go with a low poly model
00:06:33for the aircraft and visuals itself.
00:06:35Now moving over to Codex, same sort of questions.
00:06:38Although it's only asking us three.
00:06:40Saying what kind of flight should this plan optimize for?
00:06:42Let's go with a hard simulation.
00:06:44Which playable experience matters most for the browser?
00:06:48Let's do island takeoff loop.
00:06:50It is kind of interesting how they all have the same one.
00:06:52And what camera and aircraft presentation?
00:06:54I'm gonna do toggle for this as well.
00:06:56And for Claude Code, we'll do the sim-leaning option
00:06:58for the feel, ocean and islands. For input,
00:07:02we will do keyboard and mouse.
00:07:04And we'll let it go to work.
00:07:05So plan mode, by and large, very similar across all three.
00:07:09Pretty much the same questions of like,
00:07:11what do you want the physics to be?
00:07:12What do you want the terrain to be?
00:07:13What do you want the camera angle to be?
00:07:15So no huge difference there.
00:07:17And let's see what they come back with in terms of a plan.
00:07:19All right, so all three plans are complete.
00:07:20So let's go through each of them pretty quickly
00:07:22and see some of the differences.
00:07:24First one we're looking at here is DeepSeek.
00:07:26And it's pretty bare bones in terms of the plan it lays out.
00:07:29So it gives us the project structure
00:07:31and then talks very quickly about flight physics,
00:07:33environment, camera, and HUD overlay,
00:07:35and really just a few bullet points.
00:07:37On the other hand, when we're looking at 5.5 inside of Codex,
00:07:40it's got a summary, key changes,
00:07:43goes into implementation details, the test plan,
00:07:46as well as the assumptions,
00:07:47and spells all that out for us.
00:07:49And then we have Claude Code's plan, which took the longest.
00:07:50Took it about five minutes, but by far is the most thorough.
00:07:53It's got the context, the stack,
00:07:55the layout, talks about the flight model.
00:07:57It's going into like the actual different moments,
00:08:00talking about stalls, like the stall buzzer.
00:08:02Like it's going very, very detailed.
00:08:03Goes into the controls, the world, the HUD,
00:08:06the actual aircraft we're gonna be using, performance,
00:08:08and just keeps going on and on.
00:08:10So very detailed.
00:08:11So now we're gonna have all three implement their plan,
00:08:14and we'll see what the final result looks like.
00:08:15So GPT 5.5 inside of Codex was the first to finish.
00:08:19So let's see what it looks like.
00:08:20So here's the flight simulator it got us.
00:08:22We have some clouds in the sky.
00:08:26We have what looks like an AOA indicator up there.
00:08:31We have our speed down below,
00:08:34and let's see if we can actually get this thing
00:08:35off the ground.
00:08:36I will note there's no, like, runway.
00:08:38It's just like straight grass.
00:08:39And I said it was gonna be like an island thing.
00:08:42Although when the camera kind of spazzes out,
00:08:45you can see the runway down below there for a second.
00:08:48All right, we're stalling out and we just,
00:08:50we can't even get off the ground, right?
00:08:51So this one's actually just a little,
00:08:54it's actually kind of difficult.
00:08:55So what I'm going to do is I'm going to give it
00:09:00a second prompt asking it to make it a little bit easier
00:09:03to fly, 'cause it has a lot going on here,
00:09:05but this is tough.
00:09:06So I wrote, it is really hard to fly.
00:09:08Can we make this easier to use?
00:09:10AKA a little bit more arcadey.
00:09:12And also the graphics could use some work.
00:09:15So let's see how that does.
00:09:16Now of note, it took 5.5 about seven minutes
00:09:21to create that first pass for us.
00:09:23And it took 63,000 tokens.
00:09:26All right, it said it made it a little bit easier
00:09:28to fly and updated the graphics.
00:09:29So let's see what the second pass looks like.
00:09:32So here's what we got.
00:09:32Graphics definitely look better,
00:09:34but let's see if we can actually get off the runway
00:09:36this time.
00:09:37So, all right, throttle's at a hundred percent,
00:09:4150, 60, 70.
00:09:43What's the rotation speed on a Cessna?
00:09:46All right, 70, 80, 90.
00:09:49We gotta be able to get off the ground now.
00:09:51Okay, wrong way.
00:09:53Let's go, get off the ground, get off the ground.
00:09:56Nope, this is probably gonna stall me out, isn't it?
00:09:58Yeah, stall.
00:09:59Okay, this still needs some work.
00:10:02So let's give Codex one more shot.
00:10:05Let's give 5.5 one more chance
00:10:07to make this actually playable.
00:10:08So I told it I can't even get the aircraft
00:10:10off the ground and enter flight.
00:10:11We definitely need to make it easy to take off
00:10:12and actually fly the thing.
00:10:14Okay, so it says it fixed the takeoff problem.
00:10:16Apparently the brakes started locked on before.
00:10:19I don't know if that's why we weren't able to do it.
00:10:21Oh, it didn't automatically set it to take off.
00:10:24Flaps, yeah, this was,
00:10:25we had this on like super simulator mode.
00:10:29But here is attempt number three at our flight simulator.
00:10:32Let's see how we do.
00:10:34So can we get off the ground?
00:10:36Oh, we're bouncing on the runway
00:10:37this time, that's something.
00:10:38All right, cool, we're off the ground.
00:10:41We're actually moving.
00:10:44Let's see if we can get on one of these rings.
00:10:45I mean, the graphics aren't that bad, you know,
00:10:49for something just generated in less than 10 minutes.
00:10:52It seems to be pretty accurate in terms of, you know,
00:10:56it's giving me like my vertical, you know,
00:10:59feet per minute down at the bottom,
00:11:00my actual altitude, the knots, heading, AGL.
00:11:04So like it's relatively sophisticated
00:11:06in terms of tracking everything.
00:11:08I mean, this little indicator in the front,
00:11:10I mean, looks to be like an angle of attack, you know,
00:11:13indicator, which is kind of cool.
00:11:14So it has some good stuff going on.
00:11:18The actual like controls are a little janky.
00:11:21As you can see, I can't control this for anything,
00:11:23but by and large, not bad.
00:11:25You know, we can kind of like kamikaze this
00:11:27and see what happens at, you know, 18,000 feet per minute.
00:11:31But yeah, you know, for 66,000 tokens,
00:11:36about 10 minutes, 15 minutes or so, give or take,
00:11:40you know, with the back and forth,
00:11:41I don't think that's bad at all.
00:11:42So now let's take a look at DeepSeek.
00:11:44It took about 10 minutes to do this.
00:11:46And in terms of tokens, 63,000 and 44 cents.
00:11:51So 44 cents, 10 minutes.
00:11:53And here is what DeepSeek came up with for us.
00:11:56I have no idea.
00:12:00What I'm looking at.
00:12:03This is supposed to be third person.
00:12:06This is supposed to be the cockpit.
00:12:07And obviously our first pass with DeepSeek
00:12:11was another disaster.
00:12:13So I'm telling DeepSeek the simulator is a complete mess.
00:12:16The graphics are completely buggy
00:12:17and I cannot fly anything.
00:12:20Please fix.
00:12:21And here's what our second pass looks like.
00:12:24I still have no idea.
00:12:26Absolutely no clue.
00:12:28What the heck DeepSeek is.
00:12:30Oh, hey, there's a plane.
00:12:32Oh, there's something.
00:12:33I, yeah, this is, this is brutal.
00:12:38And to be honest, I feel like even giving it another prompt
00:12:42to do this, I would need to start getting very, very specific
00:12:44about what we're trying to do, which again,
00:12:47like falls pretty short of what we did with Codex.
00:12:49Like it was very, you know, kind of bland prompts.
00:12:51I was able to get something at least close,
00:12:53even on the first pass.
00:12:54Like this clearly it's completely struggling
00:12:57with the graphics.
00:12:58We are just, I don't even know how to describe this,
00:13:01but hey, it was super cheap.
00:13:03So now let's take a look at what Claude Code
00:13:07was able to give us. For reference,
00:13:09it took 13 minutes to actually execute the plan.
00:13:12The plan itself took five minutes.
00:13:13So let's call it 20 minutes to come up with the first pass.
00:13:17And then for total tokens,
00:13:19this run took about 15% plus the 5% before the plan.
00:13:22So we're looking at, well, sorry,
00:13:24we are looking at 11% context plus 5% before.
00:13:28So call it 20 minutes, 150,000 tokens for Claude Code,
00:13:33which is definitely the most expensive
00:13:34and slowest out of all of them.
00:13:36And here is Claude Code's attempt at this.
00:13:39For whatever reason, we are instantly in the air.
00:13:43We are stalling.
00:13:44We are in IFR.
00:13:45I don't know what's happening.
00:13:48We are about to crash something.
00:13:50Can we save this?
00:13:51Can we pull this out of a dive?
00:13:53No, we're stalling, no, we're dead.
00:13:54Okay, that's interesting.
00:13:56Again, it instantly slingshots us into the air.
00:14:00We are in the clouds.
00:14:02We are stalling.
00:14:03I don't know what is happening.
00:14:05We need, we need a second pass.
00:14:08So I wrote upon loading, I'm instantly thrown into the air.
00:14:11It's hard to control.
00:14:12I want to start on the runway and I want it easier to fly.
00:14:15Oh, and by the way, improve those graphics too.
00:14:17So it took about four minutes, but it made some changes.
00:14:20We're going to spawn on the runway.
00:14:22It changed the gear.
00:14:23So now it's tricycle gear and a few other stuff.
00:14:24So let's see what it looks like.
00:14:26Right, so here it is.
00:14:27Again, we are thrown immediately into a fog bank.
00:14:29I'm trying to control this thing.
00:14:31And I just, yeah, there's no controlling this at all.
00:14:33All right, we are going to give,
00:14:34we're going to give Claude Code one more chance here.
00:14:37So I told it it's still instantly slingshotting me
00:14:39into the sky.
00:14:40I said, let's go with a much more arcade type feel
00:14:42with the controls.
00:14:43I think we probably should have done that
00:14:44with the initial prompts for all three.
00:14:46I think going for a more realistic sim-type thing,
00:14:50it really struggles to,
00:14:53I think do that in a way where it's still user-friendly.
00:14:57I think it's probably doing a good job under the hood
00:14:59in terms of like, okay, like angle of attack.
00:15:01All right, you're stalling at this, you know,
00:15:02angle versus the speed and all that.
00:15:04But actually manipulating this from the computer
00:15:07is basically impossible.
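The kind of stall logic described here can be sketched in a few lines. This is an illustrative model only, not what any of the three tools generated: the wing area, lift slope, critical angle, and post-stall falloff are all assumed values, and only the lift equation itself (lift = 0.5 * density * speed² * wing area * lift coefficient) is standard.

```python
RHO = 1.225           # sea-level air density in kg/m^3
WING_AREA = 16.2      # m^2, assumed value for a light aircraft
CL_SLOPE = 0.1        # lift coefficient gained per degree of angle of attack (assumed)
CRITICAL_AOA = 15.0   # degrees; past this the wing is treated as stalled (assumed)
POST_STALL_DROP = 0.08  # lift coefficient lost per degree past the critical angle (assumed)


def lift_coefficient(aoa_deg: float) -> float:
    """Linear lift curve up to the critical angle, then a falloff instead of more lift."""
    if aoa_deg <= CRITICAL_AOA:
        return CL_SLOPE * aoa_deg
    peak = CL_SLOPE * CRITICAL_AOA
    return max(0.0, peak - POST_STALL_DROP * (aoa_deg - CRITICAL_AOA))


def is_stalled(aoa_deg: float) -> bool:
    """A stall depends on angle of attack, not directly on speed."""
    return aoa_deg > CRITICAL_AOA


def lift_newtons(speed_ms: float, aoa_deg: float) -> float:
    """Standard lift equation: 0.5 * rho * v^2 * S * CL."""
    return 0.5 * RHO * speed_ms ** 2 * WING_AREA * lift_coefficient(aoa_deg)


if __name__ == "__main__":
    for aoa in (5.0, 10.0, 15.0, 20.0):
        print(aoa, is_stalled(aoa), round(lift_newtons(30.0, aoa)))
```

An arcade-style mode would typically soften this, for example by clamping the angle of attack below the critical value so the player cannot pull into a stall.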
00:15:09Although I think the fog stuff is really strange.
00:15:12So let's see if after the second round of prompts
00:15:15it's able to do a little bit better
00:15:16because right now GPT 5.5 did much, much better.
00:15:20So Claude Code made some more changes,
00:15:22made it more user-friendly.
00:15:23And let's see if I'm still going
00:15:24for my instrument rating this time.
00:15:26So yep, we're still going.
00:15:28We're still going for instrument rating.
00:15:30We're at minimums here, but you know, I can kind of see it.
00:15:33You know, I can check my instrument panel.
00:15:35All right, we're coming off the runway.
00:15:37Yeah, okay.
00:15:42Can I, why is there a tree in the runway?
00:15:44I'm trying to go up.
00:15:46Can I go up?
00:15:47Can I pitch?
00:15:49Click canvas to lock mouse, what?
00:15:53Oh, we're in the air.
00:15:54Nope, nope, we died.
00:15:57So yeah, I think this one is pretty clear.
00:16:02GPT 5.5, easily the winner, I think.
00:16:06Claude Code was second place.
00:16:08I would give it second place.
00:16:10You know, it definitely struggled
00:16:13even with the prompts we gave it.
00:16:14We didn't give it great prompts, let's be totally honest.
00:16:16I think given more time, better prompts,
00:16:19a few more back and forths,
00:16:20we could have got it to where we want it to go.
00:16:21Like it was, at least it had an aircraft, it had a runway.
00:16:25It had trees in the runway,
00:16:26but it had the actual things we needed
00:16:29versus DeepSeek with OpenCode.
00:16:32I had no idea what was going on there.
00:16:34That was a complete mess.
00:16:35I feel like I would have had to start over
00:16:36from the beginning, like give it a very specific prompt.
00:16:38Like it wasn't even close to being messed with,
00:16:39but GPT 5.5 right off the rip, you know,
00:16:42it was pretty vague prompts.
00:16:44I thought it did really good.
00:16:455.5 also used the total of 66K tokens.
00:16:48We're looking at over here with Opus all together,
00:16:52about 200,000 tokens.
00:16:53So quarter of the tokens, essentially quarter of the cost.
00:16:56And it was a bit faster.
00:16:58I mean, at this point, I don't even care
00:16:59about how OpenCode actually took longer than GPT 5.5 as well.
00:17:03And it just sucked, let's just be honest, it just sucked.
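The token comparison for the flight simulator test is easy to check. A small sketch using the approximate totals quoted above (66K for GPT 5.5, about 200K for Opus 4.7):

```python
# Token totals as quoted in the video for the flight simulator test (approximate).
FLIGHT_SIM_TOKENS = {
    "GPT 5.5 (Codex)": 66_000,
    "Opus 4.7 (Claude Code)": 200_000,
}


def token_ratio(a: str, b: str) -> float:
    """Tokens used by run `a` as a fraction of the tokens used by run `b`."""
    return FLIGHT_SIM_TOKENS[a] / FLIGHT_SIM_TOKENS[b]


if __name__ == "__main__":
    r = token_ratio("GPT 5.5 (Codex)", "Opus 4.7 (Claude Code)")
    print(f"GPT 5.5 used {r:.0%} of the tokens Opus 4.7 used")
```

On these figures the ratio is closer to a third than a quarter, and the cost ratio will not match the token ratio exactly because the two models charge different prices per output token.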
00:17:07Now let's move on to test number two.
00:17:10This time we are going to be asking them
00:17:12to create a landing page that shows off WebGPU shader work
00:17:16using 3JS.
00:17:18Now WebGPU shader work is the kind of stuff you see
00:17:21on awards websites.
00:17:23I'm talking websites like Igloo, this kind of thing,
00:17:26like very high-end graphics.
00:17:28It looks like a video game.
00:17:29It's essentially using your computer's graphics card
00:17:32to render all this stuff.
00:17:34Now I don't expect any of these to get anything even close
00:17:37to what we see here, but I want to see what they can do
00:17:40using essentially the shaders technology.
00:17:42This is definitely a step above your basic
00:17:45SaaS templated landing page.
00:17:46I want to see what they can do and push them
00:17:48to the limits in the world of web design.
00:17:50Now I've given all of them a skill that actually breaks down
00:17:53how to do this sort of thing.
00:17:55So it's not like they're completely in the dark
00:17:57and one also doesn't have an advantage over the other.
00:18:00The only thing I've told them is I want it to feel modern
00:18:02and visually striking, something you would see on awards
00:18:05and to make smart use of GPU compute.
00:18:08So they can pick whatever stack and project structure
00:18:10they like and use good judgment on hero concept,
00:18:13UI and interactions.
00:18:15And just like the first test, they're all on plan mode.
00:18:17So let's get started.
00:18:18Okay, so they all finished their plan and funny enough,
00:18:21none of them asked me any questions,
00:18:22even though we put them in plan mode.
00:18:24So let's take a look at GPT 5.5 first.
00:18:28So it's telling us it's going to do a full bleed
00:18:30interactive GPU driven hero.
00:18:32The concept will be a living signal field
00:18:34with some like dense particle thing it's going to do.
00:18:36We'll see what that ends up looking like.
00:18:38And overall it's a minimal awards style landing copy.
00:18:41Fully interactive WebGPU scene
00:18:43with pointer reactive compute simulation.
00:18:46All right, for DeepSeek it's a pretty short and sweet plan,
00:18:50just like we saw with the flight simulator.
00:18:53Hopefully we get a better output this time,
00:18:54but a hero section with 75,000 GPU compute particles.
00:18:58I am kind of guessing that all of them are going to go
00:19:01for some sort of like particle theme on the hero.
00:19:04So it's going to have mouse interaction, integration.
00:19:08It'll have a one-time initialization.
00:19:10And then we should see stuff like bloom,
00:19:13chromatic aberration, a custom vignette and some film grain.
00:19:16So we'll see what that actually ends up looking like.
00:19:19And then we have Opus 4.7 plan again,
00:19:21going for this particle thing with bloom
00:19:23and it's going to be interactive with the mouse.
00:19:25So we'll see if any of these actually look different
00:19:27because on the surface, all their plans sound very similar.
00:19:29So the first one done was 5.5.
00:19:32It took about six minutes.
00:19:34And in terms of tokens, we've used 107K.
00:19:37So let's see what it built us.
00:19:40And here's what it created for us.
00:19:42Now, this is very bright.
00:19:45So it's hard to even see the actual particles,
00:19:47but you know, as we scroll up and down,
00:19:50it does have an animation going on in the background
00:19:52as well as, you know, some subtle color changes.
00:19:56It looks like right now our mouse is supposed
00:20:00to attract the particles.
00:20:01And we have, I'll move this over here.
00:20:03It gave some options for like repelling it versus drift.
00:20:08But again, it's kind of tough to see it
00:20:11due to how bright it is.
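The attract, repel, and drift modes mentioned here boil down to a per-particle update like the one below. This is a CPU-side sketch for illustration only; on the actual page this logic would run in a GPU compute shader, and the function name, strength, and damping values are assumptions rather than anything taken from the generated code.

```python
import math

Vec2 = tuple[float, float]

# Sign of the force toward the pointer for each interaction mode.
MODE_SIGN = {"attract": 1.0, "repel": -1.0, "drift": 0.0}


def step_particle(pos: Vec2, vel: Vec2, pointer: Vec2, mode: str,
                  dt: float = 1 / 60, strength: float = 2.0,
                  damping: float = 0.98) -> tuple[Vec2, Vec2]:
    """Advance one particle by one frame and return its new (position, velocity)."""
    sign = MODE_SIGN[mode]  # raises KeyError for an unknown mode
    dx, dy = pointer[0] - pos[0], pointer[1] - pos[1]
    dist = math.hypot(dx, dy) or 1e-6  # avoid dividing by zero at the pointer
    # Constant-magnitude acceleration along the unit vector toward the pointer.
    ax, ay = sign * strength * dx / dist, sign * strength * dy / dist
    vx = (vel[0] + ax * dt) * damping
    vy = (vel[1] + ay * dt) * damping
    return (pos[0] + vx * dt, pos[1] + vy * dt), (vx, vy)


if __name__ == "__main__":
    for mode in MODE_SIGN:
        new_pos, _ = step_particle((0.0, 0.0), (0.0, 0.0), (1.0, 0.0), mode)
        print(mode, new_pos)
```

Running the same update over every particle each frame, with the pointer position passed in as a uniform, is the general shape of a pointer-reactive compute simulation.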
00:20:12So I told it it's hard to actually see the particles
00:20:14due to the brightness.
00:20:14It also takes over a lot of the hero text.
00:20:16So can we turn down the brightness a bit
00:20:18and also push it to the right a bit more?
00:20:20Because right now it is kind of overpowering.
00:20:23You can't even really read the text over here on the left
00:20:25due to just how freaking bright these particles are.
00:20:27And here's the update after the second run.
00:20:30It's a little bit better.
00:20:31It isn't as overpowering and leaves some room for the text.
00:20:35Although I will say it's kind of blurry almost,
00:20:39but you know, it's not bad.
00:20:41Like it set out to do what we told it to do
00:20:44given the somewhat vague prompt.
00:20:46So I'm not blown away by sort of the design it came up with,
00:20:49but I'm not like upset about it.
00:20:51Now let's take a look at Claude Code
00:20:52because as we've been doing all this,
00:20:55DeepSeek is still over here in the trenches
00:20:57trying to figure this out.
00:20:58And here's what Claude Code gave us.
00:21:01So kind of nothing.
00:21:06I'm not sure if it's saying the background,
00:21:10I guess the entire background is supposed to be
00:21:14the WebGL, I'm assuming.
00:21:19It's very understated,
00:21:21which I guess is something you could totally do.
00:21:24I mean, like on screen it doesn't look,
00:21:25like it looks kind of cool, but I'll be honest,
00:21:28I was looking for something a little more flashy.
00:21:31So on the second pass,
00:21:31when I told it to make it a bit more flashy,
00:21:34there wasn't a huge difference.
00:21:35Although like it's really subtle.
00:21:38There's kind of like this film grain,
00:21:40almost like this blur that goes from bottom to top.
00:21:43So it's a pretty subtle thing.
00:21:45And you can see here on the bottom,
00:21:47it tracks like the frames per second.
00:21:49It's using 250,000 particles.
00:21:51So, I mean, honestly it looks cool.
00:21:54It's just not super flashy.
00:21:56So it's definitely like a taste thing.
00:21:58Now total tokens on the Claude Code side was about 175,000,
00:22:01and it took just slightly longer than 5.5 inside of Codex.
00:22:05Now let's take a look at DeepSeek,
00:22:07which has taken 116,000 tokens at this point.
00:22:10It took the longest as well,
00:22:12but total costs we're talking again, under a dollar.
00:22:15And here's what it gave us.
00:22:17So it's kind of this particle field thing
00:22:21that somewhat follows my mouse.
00:22:25Interesting.
00:22:27I think it might give you like an epileptic seizure.
00:22:29Honestly, beyond that, it's pretty bland.
00:22:35The flux, you know, X-ray here kind of changes colors,
00:22:39but yeah, pretty much just created this thing.
00:22:43After telling DeepSeek to do another pass,
00:22:45it then came back with this,
00:22:46where now it kind of has like some weird parallax thing.
00:22:49It's got some like blue stuff going on in the background.
00:22:53And now this thing that's like a UFO,
00:22:55which kind of responds to your mouse,
00:22:58but yeah, it's something.
00:23:02And overall, the token count from DeepSeek was 130K tokens
00:23:05coming in at $1.43.
00:23:08So after all those tests, where does that really leave us?
00:23:13So now let's talk about the final results.
00:23:15When it comes to test number one,
00:23:16which was the flight simulator, clear winner.
00:23:18That was GPT 5.5 inside of Codex.
00:23:21It was quicker than Opus 4.7 inside of Claude Code.
00:23:25It was also faster and the end result was by far the best.
00:23:29DeepSeek did terribly in the flight simulator.
00:23:32It wasn't even close to what we were trying to do.
00:23:34I would have had to continue to prompt it,
00:23:35prompt it, prompt it to even get it to like close
00:23:38to the first pass from 5.5. And Opus 4.7 in Claude Code
00:23:43was like, eh, it wasn't awful.
00:23:46Like it really didn't work at the beginning,
00:23:48but after a couple of prompts, you could tell,
00:23:50we could get it to a place where it was equivalent
00:23:52to what GPT 5.5 was doing.
00:23:54That would have taken more prompts.
00:23:55It would have taken more time
00:23:57and ultimately it would be more expensive.
00:23:59So clear winner for 5.5.
00:24:01In terms of the WebGPU landing page,
00:24:03again, DeepSeek struggled here.
00:24:04I was not a fan of this.
00:24:06I don't really know what this is supposed to be.
00:24:08Sure, I didn't give it a super great prompt,
00:24:10but like, is this what we're gonna be getting
00:24:13as a baseline median outcome?
00:24:16If I don't like grab DeepSeek by the reins
00:24:19and really force it to do something, I guess so.
00:24:22Now, when we compare Opus and 5.5,
00:24:24I would have gone with Opus 4.7 in Claude Code
00:24:27with how it handled the WebGPU thing.
00:24:29I think that has to do with sort of a taste kind of deal.
00:24:31Yeah, you could argue the 5.5 was flashier,
00:24:35but I thought it was kind of ugly.
00:24:37Again, in all these tests, we kept the prompts rather vague
00:24:41to see what sort of path it would go down.
00:24:43So I would definitely give Opus the lead here,
00:24:46although it was more expensive
00:24:48and it also took slightly longer.
00:24:50So if they were given a more hands-on prompt
00:24:55that was very specific about what you wanted to do,
00:24:57because 5.5 did what we wanted it to do.
00:24:59Like it did create a WebGPU landing page.
00:25:02I just thought it was ugly.
00:25:04So it still completed the task.
00:25:06It just didn't complete it as well, I think, as Opus.
00:25:08Now, big picture, what does it mean
00:25:09if we take all that together?
00:25:11Well, I think it means great news
00:25:13for anybody who's using agentic coders.
00:25:16We have options, right?
00:25:18You can use Opus in Claude Code,
00:25:20or you can use GPT 5.5 in Codex.
00:25:23You're not wrong with either.
00:25:25I think it's totally a personal preference at this point.
00:25:28And the best part is if you go down the Claude Code route,
00:25:31it pretty much all applies to Codex.
00:25:33If you go down the Codex route,
00:25:34it pretty much all applies to Claude Code.
00:25:37So I don't really think there's vendor lock-in in the sense like,
00:25:40oh, I've only learned about Claude Code.
00:25:42Like I can't go to Codex or vice versa.
00:25:44That's not the case at all.
00:25:45If you're doing this the right way,
00:25:46what you're really learning is AI fundamentals
00:25:48and how to build things.
00:25:49And that applies to both of these guys.
00:25:51And the more competition,
00:25:53the better it is for us, the consumer.
00:25:54Now, as for DeepSeek, eh, I don't know.
00:25:59I wasn't very impressed.
00:26:00This might be a situation where like, okay,
00:26:02like DeepSeek makes sense if we're doing simpler tasks
00:26:04where we just don't need the power of something like Opus,
00:26:06or we just don't need the power of something like GPT 5.5.
00:26:10Because remember, we're talking about something
00:26:11that is eight times cheaper.
00:26:13Sure, I didn't like the WebGPU landing page this
00:26:16thing came up with, but was it eight times worse?
00:26:19Maybe, maybe not.
00:26:21Kind of hard to actually, you know,
00:26:23articulate that and quantify that.
00:26:24But obviously that's something we need to take into account.
00:26:27So, you know, I don't think it's really competition
00:26:30to be frank with 4.7 or 5.5.
00:26:33I think though, if you're doing simpler tasks
00:26:35and you're like very token conscious, very cash conscious,
00:26:38then hey, maybe DeepSeek makes sense for you.
00:26:41So that's all I got for you guys today.
00:26:42I hope that sheds some light on these three models
00:26:45and how they kind of stack up to one another.
00:26:47I think it's a great time to be in the space.
00:26:49More competition is better for everyone.
00:26:51So as always, if you want to get your hands
00:26:53on the Claude Code Masterclass,
00:26:55make sure to check out Chase AI Plus.
00:26:56There's a link to that in the description.
00:26:58And I'll see you around.

Key Takeaway

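GPT 5.5 inside Codex and Opus 4.7 inside Claude Code are both strong choices for agentic coding workflows: GPT 5.5 came out ahead on the flight simulator, Opus 4.7 was preferred for the WebGPU landing page, and the presenter calls the choice between them a matter of personal preference. DeepSeek V4 trailed both but is roughly eight times cheaper.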

Highlights

  • GPT 5.5 costs $30 per 1 million output tokens, while Anthropic's Opus 4.7 costs $25 and DeepSeek V4 costs $3.48.

  • GPT 5.5 achieved an 87.2 score on Terminal Bench 2.0, outperforming Anthropic's unreleased model.

  • DeepSeek V4, despite being an open-weight model with 1.6 trillion parameters, performed poorly in hands-on coding tests for flight simulation and WebGPU graphics.

  • Claude Code was the slowest and most expensive option, consuming roughly 200,000 tokens for the flight simulator project.

  • GPT 5.5 inside Codex demonstrated the best balance of speed, cost, and output quality during multi-iteration coding tasks.

  • Anthropic's Opus 4.7 exhibited significant performance regressions in long-context retrieval tasks between 500,000 and 1 million tokens.

Timeline

Model Benchmarks and Cost Analysis

  • GPT 5.5 is the most expensive model at $30 per million output tokens, compared to $25 for Opus 4.7 and $3.48 for DeepSeek V4.
  • Terminal Bench 2.0 results show GPT 5.5 scoring 87.2, exceeding the performance of the latest internal Anthropic models.
  • Opus 4.7 shows unexpected performance regression in long-context retrieval scenarios exceeding 500,000 tokens.

The release of GPT 5.5 and DeepSeek V4 creates a new landscape for users. While DeepSeek V4 is significantly cheaper, the analysis notes it requires massive hardware for local inference and is primarily accessed via API. Despite OpenAI's higher token costs, the model's efficiency often results in lower total task costs compared to previous versions. Benchmarks indicate that while Opus 4.7 remains competitive in SWE bench tasks, it struggles in long-context retrieval compared to GPT 5.5.
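
The pricing claims above can be checked with a little arithmetic. The following is a minimal sketch, not anything shown in the video: it uses the per-million output prices quoted in the summary and the approximate token counts it reports for the flight simulator task (66,000 for GPT 5.5, roughly 200,000 for Claude Code). The function names are hypothetical, and billing every token at the output rate is a simplifying assumption that ignores input-token pricing, so real bills would differ.

```python
# Output prices in USD per 1 million tokens, as quoted in the summary.
PRICE_PER_MILLION = {
    "GPT 5.5": 30.00,
    "Opus 4.7": 25.00,
    "DeepSeek V4": 3.48,
}


def output_cost(model: str, tokens: int) -> float:
    """Cost in USD of `tokens` tokens billed at the quoted output rate.

    Assumption: every token is billed as an output token.
    """
    return PRICE_PER_MILLION[model] * tokens / 1_000_000


def price_ratio(model: str, baseline: str = "DeepSeek V4") -> float:
    """How many times more expensive `model` is per token than `baseline`."""
    return PRICE_PER_MILLION[model] / PRICE_PER_MILLION[baseline]


if __name__ == "__main__":
    for name in ("GPT 5.5", "Opus 4.7"):
        print(f"{name} is {price_ratio(name):.1f}x the price of DeepSeek V4")

    # Approximate token counts reported for the flight simulator task.
    print(f"GPT 5.5, 66,000 tokens: ${output_cost('GPT 5.5', 66_000):.2f}")
    print(f"Opus 4.7, 200,000 tokens: ${output_cost('Opus 4.7', 200_000):.2f}")
```

Under these assumptions the per-token price gap to DeepSeek V4 lands between about seven and nine times, consistent with the "about eight times cheaper" figure, and a higher per-token price can still yield a lower task cost when far fewer tokens are used.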

Flight Simulator Implementation Test

  • GPT 5.5 completed the flight simulator task efficiently, requiring only 66,000 tokens.
  • DeepSeek V4 failed to produce a functional simulator, resulting in broken graphics and unusable controls.
  • Claude Code provided the most detailed project plan but was the most expensive and slowest to execute.

Each model was tasked with building a Three.js flight simulator in the browser, each inside its own agent harness (GPT 5.5 in Codex, Opus 4.7 in Claude Code, DeepSeek V4 in OpenCode). GPT 5.5 demonstrated the best initial output and was the most responsive to follow-up prompts for adjustments. DeepSeek V4 struggled significantly with basic graphics and structural requirements. Claude Code was thorough in its planning phase, taking five minutes for a detailed roadmap, but execution was hampered by high token usage and performance bugs.

WebGPU Landing Page Performance

  • GPT 5.5 produced a functional WebGPU particle hero section that was visually bright but technically accurate.
  • Claude Code opted for a subtle, understated aesthetic that outperformed the competitors in design preference.
  • DeepSeek V4 again failed to provide a viable solution, resulting in jittery, low-quality graphical output.

The models were tested on their ability to create high-end, award-style WebGPU shader work. GPT 5.5 delivered the required code quickly but needed refinement due to excessive brightness. Claude Code produced a more refined aesthetic, though it was computationally heavier. DeepSeek V4 remained the least effective model, producing visuals that were described as bland and technically unstable.

Comparison Summary and Recommendations

  • GPT 5.5 in Codex and Opus 4.7 in Claude Code are both sound choices for agentic coding; the presenter calls picking between them a matter of personal preference.
  • Opus 4.7 is a viable alternative for users who prefer its design outputs, despite higher costs.
  • DeepSeek V4 is only recommended for users performing simple tasks who are strictly budget-constrained.

The final assessment treats GPT 5.5 in Codex and Opus 4.7 in Claude Code as comparably strong for agentic coding, with skills learned in one harness carrying over to the other. Opus 4.7 was preferred for the design of its WebGPU landing page, though it was more expensive and slightly slower. DeepSeek V4 did not produce competitive results on the more sophisticated development tasks and makes sense mainly for simpler work where cost savings are the primary objective.
