Transcript
00:00:00In the last 24 hours, we have had huge updates
00:00:02to two of the biggest AI models on the planet.
00:00:04First, we got the release of GPT 5.5,
00:00:07which is boasting certain benchmark scores
00:00:10that beat out Claude Mythos.
00:00:12Secondly, we got the release of DeepSeek V4,
00:00:15which is an open source, open weight model
00:00:18that has benchmarks that rival these frontier big players.
00:00:22So with all these new models to choose from,
00:00:24what are you, the average user, supposed to do?
00:00:27Well, today I'm gonna help you answer that question
00:00:29as I pit Opus 4.7, GPT 5.5,
00:00:33and DeepSeek V4 against one another,
00:00:36so you can see which one actually makes sense for you.
00:00:39Now, before we kick off this head-to-head-to-head test
00:00:41between GPT 5.5 inside of Codex,
00:00:45DeepSeek V4 inside of OpenCode,
00:00:47and Opus 4.7 inside of Claude Code,
00:00:51let's first take a quick look at the benchmarks,
00:00:53especially these two latest models
00:00:54that dropped in the last 24 hours.
00:00:56Now let's first talk about cost.
00:00:58Now, DeepSeek V4, as you know,
00:01:00is an open source, open weight model,
00:01:01but that does not mean you can run this on your computer
00:01:04because this thing is huge.
00:01:05I'm talking 1.6 trillion parameters.
00:01:08You need some serious hardware to run this.
00:01:10So we still gotta pay for it.
00:01:11We're still gonna have to use the API,
00:01:13but it is far cheaper than the competition,
00:01:15about eight times cheaper.
00:01:18And of the three models,
00:01:19the brand new GPT 5.5 is actually the most expensive,
00:01:22which is kind of surprising because by and large,
00:01:24OpenAI has been cheaper than its Anthropic competition.
00:01:28In terms of what it will cost you
00:01:30per 1 million tokens of output.
00:01:32For GPT 5.5, it's gonna be $30.
00:01:35For Anthropic, it's going to be $25.
00:01:38And for DeepSeek, it's gonna be $3.48.
00:01:41Now, if we're talking about input tokens,
00:01:44which is a smaller part of the whole,
00:01:46GPT 5.5 and Opus 4.7 are the same.
00:01:49It's going to be $5 per 1 million input.
00:01:53And for DeepSeek, it's about like $1.70.
00:01:57So way cheaper on the input and way cheaper on the output.
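To make those per-million-token prices concrete, here is a minimal sketch of a per-task cost calculation. The prices are the approximate figures quoted above; the function name `task_cost` and the 50,000-input / 10,000-output workload are hypothetical, chosen only for illustration.

```python
# Prices in USD per 1 million tokens, as quoted in the video (approximate).
PRICES = {
    "GPT 5.5": {"input": 5.00, "output": 30.00},
    "Opus 4.7": {"input": 5.00, "output": 25.00},
    "DeepSeek V4": {"input": 1.70, "output": 3.48},
}


def task_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one task for the given model."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 50,000 input tokens and 10,000 output tokens.
for name in PRICES:
    print(f"{name}: ${task_cost(name, 50_000, 10_000):.4f}")
```

The real bill depends on how each model splits a task between input and output tokens, which is why the same task can cost different amounts even at the same list price.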
00:02:01That being said, when it comes to 5.5,
00:02:03this is like twice as expensive as 5.4.
00:02:06However, OpenAI claims that it actually uses way less tokens
00:02:10due to its power.
00:02:11So while it's double the price of 5.4,
00:02:14they say in terms of actual token spend and actual cost,
00:02:17for the same task, it ends up only being like 20%
00:02:20more expensive when it's all said and done.
00:02:21So just have that in the back of your mind.
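The arithmetic behind that claim can be sketched as follows. This assumes the price doubling applies evenly to input and output tokens and that the input/output mix stays the same; both are simplifications, and the "20% more expensive" figure is OpenAI's claim as relayed here, not a measurement. The function name `implied_token_ratio` is hypothetical.

```python
def implied_token_ratio(price_multiplier, cost_multiplier):
    """Token usage of the new model relative to the old one.

    total cost = price per token * tokens used, so
    cost_multiplier = price_multiplier * token_ratio.
    """
    return cost_multiplier / price_multiplier


# Price is 2x, claimed end cost is 1.2x for the same task.
ratio = implied_token_ratio(2.0, 1.2)
print(f"Implied token usage relative to 5.4: {ratio:.0%}")
```

In other words, for the claim to hold, 5.5 would need to finish the same task with roughly 40% fewer tokens than 5.4.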
00:02:24So we've talked about the cost.
00:02:25Now let's talk about the benchmarks.
00:02:26How good are these models on paper?
00:02:27I know we're all kind of numb to benchmarks in general.
00:02:31We need to take them with a grain of salt,
00:02:32but it's still worth taking a look,
00:02:33especially when we're looking at the numbers
00:02:36that are reported by each player on the same benchmark.
00:02:39So there were three in the coding category
00:02:42that all three reported numbers.
00:02:43That was SWE-bench Verified, SWE-bench Pro,
00:02:46and Terminal-Bench 2.0.
00:02:48Now for SWE-bench Verified and SWE-bench Pro,
00:02:50Opus was the winner there.
00:02:52On Terminal-Bench 2.0, GPT was the winner by far at 87.2,
00:02:56which by the way is a higher number
00:02:59than what Anthropic reported for Mythos.
00:03:02Oh, Mythos, sorry.
00:03:03Which is kind of crazy.
00:03:05You know, the super secret model they can't release,
00:03:07apparently does worse on Terminal-Bench 2.0 than GPT 5.5.
00:03:10Now Terminal-Bench 2.0 is the biggest outlier here.
00:03:13Opus 4.7 and V4 Pro are way behind,
00:03:16but take a look at Opus 4.7 versus V4 Pro.
00:03:20It's less than two points while being eight times cheaper.
00:03:23And you see the same sort of story here
00:03:24with SWE-bench Verified and SWE-bench Pro.
00:03:26Yeah, Opus wins.
00:03:28But when we compare the second place with the third place
00:03:31and V4 is always third place,
00:03:33there isn't the huge gap you would expect.
00:03:36I mean, five points isn't nothing, you know,
00:03:38on SWE-bench Verified, 85 to 86.
00:03:41But again, eight times cheaper, open source.
00:03:45You know, there's some actual trade-offs here
00:03:46that we can make if we don't need the most power.
00:03:49Another thing that's interesting to talk about
00:03:51is long context where oddly Opus 4.7 is really bad
00:03:55by the numbers, like significantly worse than 4.6,
00:03:58which kind of blows my mind.
00:04:00And when we're talking about long context
00:04:01where we're trying to retrieve things
00:04:03between 500,000 tokens and 1 million tokens,
00:04:064.7 is actually terrible.
00:04:08And does way worse than DeepSeek and GPT 5.5.
00:04:12Now you can have a whole discussion about
00:04:14why are you even in the 500,000 to 1 million token range to begin with?
00:04:17How many people are actually operating there?
00:04:20because we are hitting context rot no matter what
00:04:22at that place, no matter what model you're using.
00:04:24But it is interesting that for whatever reason,
00:04:26we've seen some regression
00:04:27when it comes to the Anthropic models.
00:04:29But big picture, I think the takeaway is
00:04:325.5 is really strong.
00:04:33It beats Opus 4.7 in certain metrics,
00:04:36loses in certain metrics,
00:04:37but it's an extremely robust model.
00:04:39And on top of that, while V4 Pro is kind of, you know,
00:04:42lagging behind by and large,
00:04:45it's within striking distance while being far cheaper,
00:04:48which again is a great option for your average customer.
00:04:52Because right now it feels like you don't have a lot
00:04:54of options on the open source side that actually can compete.
00:04:56Now let's jump into the actual head to head to head test
00:04:59with all three of these models.
00:05:00And we're using a harness for each of these models.
00:05:02With 5.5, it's going to be Codex.
00:05:04With Opus 4.7, it's going to be Claude Code.
00:05:07And with DeepSeek V4 Pro, I am using OpenCode.
00:05:10And for the first test, what we're going to do is
00:05:11we're going to have them create a flight simulator
00:05:14for us in Three.js that runs in the browser.
00:05:17You can see the prompt right here.
00:05:18I'm saying, I want it to feel good to fly.
00:05:20I want it to have some weight to it.
00:05:21I want some strong visuals and I want it to use whatever
00:05:25structure and tooling it thinks is correct.
00:05:27So it's straightforward enough that they know what to do,
00:05:30yet there's enough leeway so we can see some divergence
00:05:33between the models.
00:05:34And while we are going to look at what they're able
00:05:36to one shot, we are going to go through multiple iterations
00:05:38of this and have follow-on prompts.
00:05:40Because as cool as it is to see how well it does on one shot,
00:05:44that isn't how we really work in real life, is it?
00:05:46I want to see how it does when I give it follow-on prompts
00:05:49and how quickly it takes to get it to something I like.
00:05:52And when we compare these three models,
00:05:54there's really four things I'm going to look at.
00:05:55It's going to be time.
00:05:57How long does it take to build this?
00:05:58Cost, how many tokens are we using?
00:06:01Quality, how good is it?
00:06:02And then four is sort of vibes.
00:06:04And that sort of relates to quality.
00:06:06It's very subjective.
00:06:06Which one do I actually like more?
00:06:09And also of note, all three models, all three harnesses
00:06:11are also using the exact same skills.
00:06:13So let's begin with DeepSeek and the questions it's asking us.
00:06:16It's asking what sort of flight model we want.
00:06:18Let's go with full sim.
00:06:20It's recommending oceans and islands for the terrain.
00:06:22We'll go with that.
00:06:23Let's see how, and then it's asking camera preference.
00:06:25Let's do both.
00:06:26Let's see if it's able to give us a toggle
00:06:27for both the first person and third person.
00:06:29We'll go with its recommended tooling preference.
00:06:32And we'll just go with a low poly model
00:06:33for the aircraft and visuals itself.
00:06:35Now moving over to Codex, same sort of questions.
00:06:38Although it's only asking us three.
00:06:40Saying what kind of flight should this plan optimize for?
00:06:42Let's go with a hard simulation.
00:06:44Which playable experience matters most for the browser?
00:06:48Let's do island takeoff loop.
00:06:50It is kind of interesting how they all have the same one.
00:06:52And what camera and aircraft presentation?
00:06:54I'm gonna do toggle for this as well.
00:06:56And for Claude Code, we'll do study sim learning
00:06:58for the feel, ocean and islands. Input,
00:07:02we will do keyboard and mouse.
00:07:04And we'll let it go to work.
00:07:05So plan mode, by and large, very similar across all three.
00:07:09Pretty much the same questions of like,
00:07:11what do you want the physics to be?
00:07:12What do you want the terrain to be?
00:07:13What do you want the camera angle to be?
00:07:15So no huge difference there.
00:07:17And let's see what they come back with in terms of a plan.
00:07:19All right, so all three plans are complete.
00:07:20So let's go through each of them pretty quickly
00:07:22and see some of the differences.
00:07:24First one we're looking at here is DeepSeek.
00:07:26And it's pretty bare bones in terms of the plan it lays out.
00:07:29So it gives us the project structure
00:07:31and then talks very quickly about flight physics,
00:07:33environment, camera, and HUD overlay,
00:07:35and really just a few bullet points.
00:07:37On the other hand, when we're looking at 5.5 inside of Codex,
00:07:40it's got a summary, key changes,
00:07:43goes into implementation details, the test plan,
00:07:46as well as the assumptions,
00:07:47and it spells all that out for us.
00:07:49And then we have Claude Code's plan, which took the longest.
00:07:50Took it about five minutes, but by far is the most thorough.
00:07:53It's got the context, the stack,
00:07:55the layout, and talks about the flight model.
00:07:57It's going into like the actual different moments,
00:08:00talking about stalls, like the stall buzzer.
00:08:02Like it's going very, very detailed.
00:08:03Goes into the controls, the world, the mod,
00:08:06the actual aircraft we're gonna be using, performance,
00:08:08and just keeps going on and on.
00:08:10So very detailed.
00:08:11So now we're gonna have all three implement their plan,
00:08:14and we'll see what the final result looks like.
00:08:15So GPT 5.5 inside of Codex was the first to finish.
00:08:19So let's see what it looks like.
00:08:20So here's the flight simulator it got us.
00:08:22We have some clouds in the sky.
00:08:26We have what looks like an AOA indicator up there.
00:08:31We have our speed down below,
00:08:34and let's see if we can actually get this thing
00:08:35off the ground.
00:08:36I will note there's no, like, runway.
00:08:38It's just like straight grass.
00:08:39And instead it was gonna be like an island thing.
00:08:42Although when the camera kind of spazzes out,
00:08:45you can see the runway down below there for a second.
00:08:48All right, we're stalling out and we just,
00:08:50we can't even get off the ground, right?
00:08:51So this one's actually just a little,
00:08:54it's actually kind of difficult.
00:08:55So what I'm going to do is I'm going to give it
00:09:00a second prompt asking it to make it a little bit easier
00:09:03to fly, 'cause it has a lot going on here,
00:09:05but this is tough.
00:09:06So I wrote, it is really hard to fly.
00:09:08Can we make this easier to use?
00:09:10AKA a little bit more arcadey.
00:09:12And also the graphics could use some work.
00:09:15So let's see how that does.
00:09:16Now of note, it took 5.5 about seven minutes
00:09:21to create that first pass for us.
00:09:23And it took 63,000 tokens.
00:09:26All right, it said it made it a little bit easier
00:09:28to fly and updated the graphics.
00:09:29So let's see what the second pass looks like.
00:09:32So here's what we got.
00:09:32Graphics definitely look better,
00:09:34but let's see if we can actually get off the runway
00:09:36this time.
00:09:37So, all right, throttle's at a hundred percent,
00:09:4150, 60, seven.
00:09:43What's the rotation speed on a Cessna?
00:09:46All right, 70, 80, 90.
00:09:49We gotta be able to get off the ground now.
00:09:51Okay, wrong way.
00:09:53Let's go, get off the ground, get off the ground.
00:09:56Nope, this is probably gonna stall me out, isn't it?
00:09:58Yeah, stall.
00:09:59Okay, this still needs some work.
00:10:02So let's give Codex one more shot.
00:10:05Let's give 5.5 one more chance
00:10:07to make this actually playable.
00:10:08So I told it I can't even get the aircraft
00:10:10off the ground and enter flight.
00:10:11We definitely need to make it easy to take off
00:10:12and actually fly the thing.
00:10:14Okay, so it says it fixed the takeoff problem.
00:10:16Apparently the brakes started locked on before.
00:10:19I don't know if that's why we weren't able to do it.
00:10:21Oh, it didn't automatically set it to take off.
00:10:24Flaps, yeah, this was,
00:10:25we had this on like super simulator mode.
00:10:29But here is attempt number three at our flight simulator.
00:10:32Let's see how we do.
00:10:34So can we get off the ground?
00:10:36Oh, we're bouncing on the runway
00:10:37with this time at something.
00:10:38All right, cool, we're off the ground.
00:10:41We're actually moving.
00:10:44Let's see if we can get on one of these rings.
00:10:45I mean, the graphics aren't that bad, you know,
00:10:49for something just generated in less than 10 minutes.
00:10:52It seems to be pretty accurate in terms of, you know,
00:10:56it's giving me like my vertical, you know,
00:10:59feet per minute down at the bottom,
00:11:00my actual altitude, the knots, heading, AGL.
00:11:04So like it's relatively sophisticated
00:11:06in terms of tracking everything.
00:11:08I mean, this little indicator in the front,
00:11:10I mean, looks to be like an angle of attack, you know,
00:11:13indicator, which is kind of cool.
00:11:14So it has some good stuff going on.
00:11:18The actual like controls are a little janky.
00:11:21As you can see, I can't control this for anything,
00:11:23but by and large, not bad.
00:11:25You know, we can kind of like kamikaze this
00:11:27and see what happens at, you know, 18,000 feet per minute.
00:11:31But yeah, you know, for 66,000 tokens,
00:11:36about 10 minutes, 15 minutes or so, give or take,
00:11:40you know, with the back and forth,
00:11:41I don't think that's bad at all.
00:11:42So now let's take a look at DeepSeek.
00:11:44It took about 10 minutes to do this.
00:11:46And in terms of tokens, 63,000 and 44 cents.
00:11:51So 44 cents, 10 minutes.
00:11:53And here is what DeepSeek came up with for us.
00:11:56I have no idea.
00:12:00What I'm looking at.
00:12:03This is supposed to be third person.
00:12:06This is supposed to be the cockpit.
00:12:07And obviously our first pass with DeepSeek
00:12:11was another disaster.
00:12:13So I'm telling DeepSeek the simulator is a complete mess.
00:12:16The graphics are completely buggy
00:12:17and I cannot fly anything.
00:12:20Please fix.
00:12:21And here's what our second pass looks like.
00:12:24I still have no idea.
00:12:26Absolutely no clue.
00:12:28What the heck DeepSeek is.
00:12:30Oh, hey, there's a plane.
00:12:32Oh, there's something.
00:12:33I, yeah, this is, this is brutal.
00:12:38And to be honest, I feel like even giving it another prompt
00:12:42to do this, I would need to start getting very, very specific
00:12:44about what we're trying to do, which again,
00:12:47like falls pretty short of what we did with Codex.
00:12:49Like it was very, you know, kind of bland prompts.
00:12:51I was able to get something at least close,
00:12:53even on the first pass.
00:12:54Like this clearly it's completely struggling
00:12:57with the graphics.
00:12:58We are just, I don't even know how to describe this,
00:13:01but hey, it was super cheap.
00:13:03So now let's take a look at what Claude Code
00:13:07was able to give us. For reference,
00:13:09it took 13 minutes to actually execute the plan.
00:13:12The plan itself took five minutes.
00:13:13So let's call it 20 minutes to come up with the first pass.
00:13:17And then for total tokens,
00:13:19this run took about 15% plus the 5% before the plan.
00:13:22So we're looking at, well, sorry,
00:13:24we are looking at 11% context plus 5% before.
00:13:28So call it 20 minutes, 150,000 tokens for Claude Code,
00:13:33which is definitely the most expensive
00:13:34and slowest out of all of them.
00:13:36And here is Claude Code's attempt at this.
00:13:39For whatever reason, we are instantly in the air.
00:13:43We are stalling.
00:13:44We are in IFR.
00:13:45I don't know what's happening.
00:13:48We are about to crash something.
00:13:50Can we save this?
00:13:51Can we pull this out of a dive?
00:13:53No, we're stalling, no, we're dead.
00:13:54Okay, that's interesting.
00:13:56Again, it instantly slingshots us into the air.
00:14:00We are in the clouds.
00:14:02We are stalling.
00:14:03I don't know what is happening.
00:14:05We need, we need a second pass.
00:14:08So I wrote upon loading, I'm instantly thrown into the air.
00:14:11It's hard to control.
00:14:12I want to start on the runway and I want it easier to fly.
00:14:15Oh, and by the way, improve those graphics too.
00:14:17So it took about four minutes, but it made some changes.
00:14:20We're going to spawn on the runway.
00:14:22It changed the gear.
00:14:23So now it's tricycle gear and a few other stuff.
00:14:24So let's see what it looks like.
00:14:26Right, so here it is.
00:14:27Again, we are thrown immediately into a fog bank.
00:14:29I'm trying to control this thing.
00:14:31And I just, yeah, there's no controlling this at all.
00:14:33All right, we are going to give,
00:14:34we're going to give Claude Code one more chance here.
00:14:37So I told it it's still instantly slingshotting me
00:14:39into the sky.
00:14:40I said, let's go with a much more arcade type feel
00:14:42with the controls.
00:14:43I think we probably should have done that
00:14:44with the initial prompts for all three.
00:14:46I think going for a more realistic sim-type thing,
00:14:50it really struggles to,
00:14:53I think do that in a way where it's still user-friendly.
00:14:57I think it's probably doing a good job under the hood
00:14:59in terms of like, okay, like angle of attack.
00:15:01All right, you're stalling at this, you know,
00:15:02angle versus the speed and all that.
00:15:04But actually manipulating this from the computer
00:15:07is basically impossible.
00:15:09Although I think the fog stuff is really strange.
00:15:12So let's see if after the second round of prompts
00:15:15it's able to do a little bit better
00:15:16because right now GPT 5.5 did much, much better.
00:15:20So Claude Code made some more changes,
00:15:22made it more user-friendly.
00:15:23And let's see if I'm still going
00:15:24for my instrument rating this time.
00:15:26So yep, we're still going.
00:15:28We're still going for instrument rating.
00:15:30We're at minimums here, but you know, I can kind of see it.
00:15:33You know, I can check my instrument panel.
00:15:35All right, we're coming off the runway.
00:15:37Yeah, okay.
00:15:42Can I, why is there a tree in the runway?
00:15:44I'm trying to go up.
00:15:46Can I go up?
00:15:47Can I pitch?
00:15:49Click canvas to lock mouse, what?
00:15:53Oh, we're in the air.
00:15:54Nope, nope, we died.
00:15:57So yeah, I think this one is pretty clear.
00:16:02GPT 5.5, easily the winner, I think.
00:16:06Claude Code was second place.
00:16:08I would give it second place.
00:16:10You know, it definitely struggled
00:16:13even with the prompts we gave it.
00:16:14We didn't give it great prompts, let's be totally honest.
00:16:16I think given more time, better prompts,
00:16:19a few more back and forths,
00:16:20we could have got it to where we want it to go.
00:16:21Like it was, at least it had an aircraft, it had a runway.
00:16:25It had trees in the runway,
00:16:26but it had the actual things we needed
00:16:29versus DeepSeek with OpenCode.
00:16:32I had no idea what was going on there.
00:16:34That was a complete mess.
00:16:35I feel like I would have had to start over
00:16:36from the beginning, like give it a very specific prompt.
00:16:38Like it wasn't even close to being messed with,
00:16:39but GPT 5.5 right off the rip, you know,
00:16:42it was pretty vague prompts.
00:16:44I thought it did really good.
00:16:455.5 also used a total of 66K tokens.
00:16:48We're looking at over here with Opus all together,
00:16:52about 200,000 tokens.
00:16:53So about a third of the tokens, essentially a third of the cost.
00:16:56And it was a bit faster.
00:16:58I mean, at this point, I don't even care about OpenCode.
00:16:59It actually took longer than GPT 5.5 as well.
00:17:03And it just sucked, let's just be honest, it just sucked.
00:17:07Now let's move on to test number two.
00:17:10This time we are going to be asking them
00:17:12to create a landing page that shows off WebGPU shader work
00:17:16using Three.js.
00:17:18Now WebGPU shader work is the kind of stuff you see
00:17:21on Awwwards websites.
00:17:23I'm talking websites like Igloo, this kind of thing,
00:17:26like very high-end graphics.
00:17:28It looks like a video game.
00:17:29It's essentially using your computer's graphics card
00:17:32to render all this stuff.
00:17:34Now I don't expect any of these to get anything even close
00:17:37to what we see here, but I want to see what they can do
00:17:40using essentially the shaders technology.
00:17:42This is definitely a step above your basic
00:17:45SaaS templated landing page.
00:17:46I want to see what they can do and push them
00:17:48to the limits in the world of web design.
00:17:50Now I've given all of them a skill that actually breaks down
00:17:53how to do this sort of thing.
00:17:55So it's not like they're completely in the dark
00:17:57and one also doesn't have an advantage over the other.
00:18:00The only thing I've told them is I want it to feel modern
00:18:02and visually striking, something you would see on Awwwards,
00:18:05and to make smart use of GPU compute.
00:18:08So they can pick whatever stack and project structure
00:18:10they like and use good judgment on hero concept,
00:18:13UI and interactions.
00:18:15And just like the first test, they're all on plan mode.
00:18:17So let's get started.
00:18:18Okay, so they all finished their plan and funny enough,
00:18:21none of them asked me any questions,
00:18:22even though we put them in plan mode.
00:18:24So let's take a look at GPT 5.5 first.
00:18:28So it's telling us it's going to do a full bleed
00:18:30interactive GPU driven hero.
00:18:32The concept will be a living signal field
00:18:34with some like dense particle thing it's going to do.
00:18:36We'll see what that ends up looking like.
00:18:38And overall it's a minimal Awwwards-style landing copy.
00:18:41Fully interactive WebGPU scene
00:18:43with pointer reactive compute simulation.
00:18:46All right, for DeepSeek it's a pretty short and sweet plan,
00:18:50just like we saw with the flight simulator.
00:18:53Hopefully we get a better output this time,
00:18:54but a hero section with 75,000 GPU compute particles.
00:18:58I am kind of guessing that all of them are going to go
00:19:01for some sort of like particle theme on the hero.
00:19:04So it's going to have mouse interaction, integration.
00:19:08It'll have a one-time initialization.
00:19:10And then we should see stuff like bloom,
00:19:13chromatic aberration, a custom vignette and some film grain.
00:19:16So we'll see what that actually ends up looking like.
00:19:19And then we have Opus 4.7's plan, again
00:19:21going for this particle thing with bloom
00:19:23and it's going to be interactive with the mouse.
00:19:25So we'll see if any of these actually look different
00:19:27because on the surface, all their plans sound very similar.
00:19:29So the first one done was 5.5.
00:19:32It took about six minutes.
00:19:34And in terms of tokens, we've used 107K.
00:19:37So let's see what it built us.
00:19:40And here's what it created for us.
00:19:42Now, this is very bright.
00:19:45So it's hard to even see the actual particles,
00:19:47but you know, as we scroll up and down,
00:19:50it does have an animation going on in the background
00:19:52as well as, you know, some subtle color changes.
00:19:56It looks like right now our mouse is supposed
00:20:00to attract the particles.
00:20:01And we have, I'll move this over here.
00:20:03It gave some options for like repelling it versus drift.
00:20:08But again, it's kind of tough to see it
00:20:11due to how bright it is.
00:20:12So I told it it's hard to actually see the particles
00:20:14due to the brightness.
00:20:14It also takes over a lot of the hero text.
00:20:16So can we turn down the brightness a bit
00:20:18and also push it to the right a bit more?
00:20:20Because right now it is kind of overpowering.
00:20:23You can't even really read the text over here on the left
00:20:25due to just how freaking bright these particles are.
00:20:27And here's the update after the second run.
00:20:30It's a little bit better.
00:20:31It isn't as overpowering and leaves some room for the text.
00:20:35Although I will say it's kind of blurry almost,
00:20:39but you know, it's not bad.
00:20:41Like it set out to do what we told it to do
00:20:44given the somewhat vague prompt.
00:20:46So I'm not blown away by sort of the design it came up with,
00:20:49but I'm not like upset about it.
00:20:51Now let's take a look at Claude Code
00:20:52because as we've been doing all this,
00:20:55DeepSeek is still over here in the trenches
00:20:57trying to figure this out.
00:20:58And here's what Claude Code gave us.
00:21:01So kind of nothing.
00:21:06I'm not sure if it's saying the background,
00:21:10I guess the entire background is supposed to be
00:21:14the WebGL, I'm assuming.
00:21:19It's very understated,
00:21:21which I guess is something you could totally do.
00:21:24I mean, like on screen it doesn't look,
00:21:25like it looks kind of cool, but I'll be honest,
00:21:28I was looking for something a little more flashy.
00:21:31So on the second pass,
00:21:31when I told it to make it a bit more flashy,
00:21:34there wasn't a huge difference.
00:21:35Although like it's really subtle.
00:21:38There's kind of like this film grain,
00:21:40almost like this blur that goes from bottom to top.
00:21:43So it's a pretty subtle thing.
00:21:45And you can see here on the bottom,
00:21:47it tracks like the frames per second.
00:21:49It's using 250,000 particles.
00:21:51So, I mean, honestly it looks cool.
00:21:54It's just not super flashy.
00:21:56So it's definitely like a taste thing.
00:21:58Now total tokens on the Claude Code side was about 175,000,
00:22:01and it took just slightly longer than 5.5 inside of Codex.
00:22:05Now let's take a look at DeepSeek,
00:22:07which has taken 116,000 tokens at this point.
00:22:10It took the longest as well,
00:22:12but total costs we're talking again, under a dollar.
00:22:15And here's what it gave us.
00:22:17So it's kind of this particle field thing
00:22:21that somewhat follows my mouse.
00:22:25Interesting.
00:22:27I think it might give you like an epileptic seizure.
00:22:29Honestly, beyond that, it's pretty bland.
00:22:35The flux, you know, X-ray here kind of changes colors,
00:22:39but yeah, pretty much just created this thing.
00:22:43After telling DeepSeek to do another pass,
00:22:45it then came back with this,
00:22:46where now it kind of has like some weird parallax thing.
00:22:49It's got some like blue stuff going on in the background.
00:22:53And now this thing that's like a UFO,
00:22:55which kind of responds to your mouse,
00:22:58but yeah, it's something.
00:23:02And overall, the token count from DeepSeek was 130K tokens
00:23:05coming in at $1.43.
00:23:08So after all those tests, where does that really leave us?
00:23:13So now let's talk about the final results.
00:23:15When it comes to test number one,
00:23:16which was the flight simulator, clear winner.
00:23:18That was GPT 5.5 inside of Codex.
00:23:21It was cheaper than Opus 4.7 inside of Claude Code.
00:23:25It was also faster and the end result was by far the best.
00:23:29DeepSeek did terribly in the flight simulator.
00:23:32It wasn't even close to what we were trying to do.
00:23:34I would have had to continue to prompt it,
00:23:35prompt it, prompt it to even get it close
00:23:38to the first pass from 5.5. And Opus 4.7 in Claude Code
00:23:43was like, eh, it wasn't awful.
00:23:46Like it really didn't work at the beginning,
00:23:48but after a couple of prompts, you could tell,
00:23:50we could get it to a place where it was equivalent
00:23:52to what GPT 5.5 was doing.
00:23:54That would have taken more prompts.
00:23:55It would have taken more time
00:23:57and ultimately it would be more expensive.
00:23:59So clear winner for 5.5.
00:24:01In terms of the WebGPU landing page,
00:24:03again, DeepSeek struggled here.
00:24:04I was not a fan of this.
00:24:06I don't really know what this is supposed to be.
00:24:08Sure, I didn't give it a super great prompt,
00:24:10but like, is this what we're gonna be getting
00:24:13as a baseline median outcome?
00:24:16If I don't like grab DeepSeek by the reins
00:24:19and really force it to do something, I guess so.
00:24:22Now, when we compare Opus and 5.5,
00:24:24I would have gone with Opus 4.7 in Claude Code
00:24:27with how it handled the WebGPU thing.
00:24:29I think that has to do with sort of a taste kind of deal.
00:24:31Yeah, you could argue that 5.5 was flashier,
00:24:35but I thought it was kind of ugly.
00:24:37Again, in all these tests, we kept the prompts rather vague
00:24:41to see what sort of path it would go down.
00:24:43So I would definitely give Opus the lead here,
00:24:46although it was more expensive
00:24:48and it also took slightly longer.
00:24:50So if they were given a more hands-on prompt
00:24:55that was very specific about what you wanted to do,
00:24:57because 5.5 did what we wanted it to do.
00:24:59Like it did create a WebGPU landing page.
00:25:02I just thought it was ugly.
00:25:04So it still completed the task.
00:25:06It just didn't complete it as well, I think, as Opus.
00:25:08Now, big picture, what does it mean
00:25:09if we take all that together?
00:25:11Well, I think it means great news
00:25:13for anybody who's using agentic coders.
00:25:16We have options, right?
00:25:18You can use Opus and Claude Code,
00:25:20or you can use GPT 5.5 and Codex.
00:25:23You're not wrong with either.
00:25:25I think it's totally a personal preference at this point.
00:25:28And the best part is if you go down the Claude Code route,
00:25:31it pretty much all applies to Codex.
00:25:33If you go down the Codex route,
00:25:34it pretty much all applies to Claude Code.
00:25:37So I don't really think there's vendor lock-in in the sense like,
00:25:40oh, I've only learned about Claude Code.
00:25:42Like I can't go to Codex or vice versa.
00:25:44That's not the case at all.
00:25:45If you're doing this the right way,
00:25:46what you're really learning is AI fundamentals
00:25:48and how to build things.
00:25:49And that applies to both of these guys.
00:25:51And the more competition,
00:25:53the better it is for us, the consumer.
00:25:54Now, as for DeepSeek, eh, I don't know.
00:25:59I wasn't very impressed.
00:26:00This might be a situation where like, okay,
00:26:02like DeepSeek makes sense if we're doing simpler tasks
00:26:04where we just don't need the power of something like Opus,
00:26:06or we just don't need the power of something like GPT 5.5.
00:26:10Because remember, we're talking about something
00:26:11that is eight times cheaper.
00:26:13Sure, I didn't like the WebGPU landing page
00:26:16this thing came up with, but was it eight times worse?
00:26:19Maybe, maybe not.
00:26:21Kind of hard to actually, you know,
00:26:23articulate that and quantify that.
00:26:24But obviously that's something we need to take into account.
00:26:27So, you know, I don't think it's really competition
00:26:30to be frank with 4.7 or 5.5.
00:26:33I think though, if you're doing simpler tasks
00:26:35and you're like very token conscious, very cash conscious,
00:26:38then hey, maybe DeepSeek makes sense for you.
00:26:41So that's all I got for you guys today.
00:26:42I hope that sheds some light on these three models
00:26:45and how they kind of stack up to one another.
00:26:47I think it's a great time to be in the space.
00:26:49More competition is better for everyone.
00:26:51So as always, if you want to get your hands
00:26:53on the Claude Code Masterclass,
00:26:55make sure to check out Chase AI Plus.
00:26:56There's a link to that in the description.
00:26:58And I'll see you around.