00:00:00The new best model is here: Opus 4.7. It actually looks like a pretty good upgrade. Obviously
00:00:05it's better at coding, but it also has improved vision and self-verification, and it's supposedly
00:00:09better at UIs, making them more tasteful and creative.
00:00:12The downside, though, is that while the cost didn't change, the tokenizer did, so the exact
00:00:17same input prompt could now use up to 35% more tokens, and it also thinks more, so that's even
00:00:22more tokens to burn. There are definitely some really interesting details in this release,
00:00:26and probably a change you want to make to Claude Code now, so let's just jump in, see what's
00:00:30new and test it out.
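To put a number on the tokenizer point above, here is a minimal sketch of the worst-case arithmetic: if the per-token price is unchanged but the same prompt tokenizes to up to 35% more tokens, the input bill for that prompt can rise by up to 35% as well. The function name, the 200,000-token prompt and the $10 per million tokens price are all hypothetical illustration values, not real pricing.

```python
def worst_case_input_cost(old_tokens: int, price_per_million: float,
                          inflation: float = 0.35) -> float:
    """Upper bound on input cost when the same prompt needs more tokens.

    old_tokens: token count of the prompt under the old tokenizer.
    price_per_million: price per one million input tokens (hypothetical value).
    inflation: worst-case growth in token count (0.35 = up to 35% more).
    """
    new_tokens = old_tokens * (1 + inflation)
    return new_tokens / 1_000_000 * price_per_million


# Hypothetical example: a 200,000-token prompt at an assumed $10 per million tokens.
old_cost = 200_000 / 1_000_000 * 10.0
new_cost = worst_case_input_cost(200_000, 10.0)
print(f"before: ${old_cost:.2f}, worst case after: ${new_cost:.2f}")
```

This only covers input tokens; any extra thinking tokens at higher effort levels would come on top of it.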
00:00:31Now I'm actually going to start with the benchmarks because I kinda lied earlier when I said this
00:00:40was the new best model. It's the best publicly available one but these benchmarks also include
00:00:44Mythos, the model so powerful that we're not allowed it yet.
00:00:47According to Anthropic, Opus 4.7 is actually testing new cyber safeguards to block requests
00:00:52that indicate prohibited or high-risk cybersecurity uses and what they learn from that is going
00:00:56to help them work to a broad release of the Mythos class models so hopefully in the future
00:01:00I can make a video on the Mythos release and how it's the end of software development as
00:01:03we know it. So subscribe if you don't want to miss that one.
00:01:06For now I'll go ahead and ignore Mythos and focus on the one that we can actually use which
00:01:10is Opus 4.7 and this has actually made great gains on the benchmarks.
00:01:13Now I won't go into too much detail on these, and you can pause the screen if you want to
00:01:16read the individual ones. You can see on benchmarks like SWE-bench Pro it's actually made a 10%
00:01:21leap over Opus 4.6, and on Verified it's made a 7% one, and that pattern pretty much continues
00:01:26for the rest of the benchmarks, except in cybersecurity, where it actually went slightly down, seemingly
00:01:30related to the safeguards that I mentioned earlier. It seems they're artificially keeping
00:01:34this score low to try and save the world or something.
00:01:37I also found a really interesting benchmark in the system card, where it appears that the
00:01:40long-context performance has taken a nosedive compared to Opus 4.6 when using
00:01:45a needle-in-a-haystack test, so I'm pretty curious how that's going to impact actual usage over
00:01:50time. Outside of the benchmarks, there are also a few other notable improvements that might
00:01:54even change how you use Claude. The first one is better instruction following,
00:01:58which actually means that you might get unexpected results with prompts that you've already used
00:02:01before, as older models interpreted instructions loosely or skipped parts, whereas Opus 4.7 is
00:02:07really focused on taking instructions literally, so you might actually have some prompt tweaking
00:02:11to do. Next, it's got improved multimodal support, so it can accept higher-resolution images, three
00:02:16times that of the older models, so this should make it better at tasks like computer use and
00:02:20data extraction. Its memory use has also improved, so Opus 4.7 should be better at using
00:02:25file-system-based memory, where it remembers important notes across long multi-session work and uses
00:02:30those to move on to new tasks that, as a result, need less upfront context. So maybe that will
00:02:34save me a few tokens, which is pretty important now, as the next change is to the tokenizer
00:02:39and thinking. Opus 4.7 uses an updated tokenizer that improves how the model processes text,
00:02:45but it also means that the same input prompt can use up to 35% more tokens, and when you
00:02:49combine this with the fact that Opus 4.7 thinks more at higher effort levels, this model is
00:02:54really going to burn through some tokens. To make this worse, there's also a new extra
00:02:58high effort level, and it's actually set as the default in Claude Code, so I highly recommend
00:03:02you go and test out the various effort levels and find the one that suits you best, to see
00:03:05if you could possibly downgrade this without noticing an impact. For comparison, the new
00:03:09extra high effort level uses roughly the same amount of tokens as Opus 4.6's max effort
00:03:14level, and the Opus 4.7 high effort level actually outscores Opus 4.6's max effort level with
00:03:19fewer tokens used. So if you're already comfortable with what you had before, I'd use that chart
00:03:24to compare, because I know for me I'm probably going to change this to use the high effort
00:03:27level in most cases. With the TL;DR of what's new out of the way, I'm going to burn through
00:03:31my usage and test this. The first thing I'm going to check is whether it's better at UI design,
00:03:35so I gave it a very simple prompt to create a cafe website with an index.html only, and
00:03:40I'm using the max effort level on all of the models I'm testing. I'm going to try this
00:03:43out in Opus 4.7, 4.6, Gemini 3.1 and GPT 5.4. This is the result I got back from Opus 4.7,
00:03:51and I think it looks pretty nice. It's got a nice sort of cafe feel to it, it's used a
00:03:55nice font, and it's picked up images from Unsplash here. Overall I can't really complain: it's
00:03:59a pretty simple website, it has a nice menu section, everything is actually responsive, and overall,
00:04:04yeah, I'd say it looks pretty good. If we compare this to what Opus 4.6 gave me, you can see it
00:04:09went for a bit of a different style here, but it's got a similar font and a similar menu
00:04:12section, and overall it's a little bit worse, I would say, just because it hasn't used a nice
00:04:16background here and this gradient is not a nice switch at all. But still, I can't complain
00:04:20too much. I'd say Opus 4.7 is only a bit of a step above this. Gemini 3.1, on the other
00:04:25hand, I think gave me my best result. At least, this one is my favorite, so let me know in the
00:04:29comments below what yours is. I just really like that it's got this background that doesn't
00:04:33move when we scroll. I think it's done really well with this image section here in the "Our
00:04:36Story" section. The menu looks similar to the other ones, but again I think this is nicely
00:04:40laid out, and the same with the footer, so I think 3.1 wins on this one for me. Coming
00:04:45in last place, though, is definitely GPT 5.4. This just has such a GPT look and feel to it:
00:04:50it loves these sort of cards where it has a nice blur to them, and it's just not a good
00:04:55cafe website in my opinion. It just looks like every other GPT app that I have ever seen. So
00:04:59Opus 4.7 is definitely good at UI, and it will probably handle it even better given some more
00:05:04direction. At the moment on Design Arena, Opus 4.6 actually takes the lead for websites, so
00:05:09I do expect that 4.7 will take its place. Now obviously that test was a pretty simple
00:05:13one, so next I'm going to give them all a more advanced task. You can see here in Claude Code
00:05:17with Opus 4.6 I'm asking for a personal finance management dashboard that offers a detailed
00:05:21overview of an individual's financial health, with a load of features that I have in the
00:05:25prompt here, and I'm not giving it any indication of the stack that it should use. It is going
00:05:30to pick all of that and start from scratch. Up first we have the result of Opus 4.7, and
00:05:34it did this all in a single prompt in around 20 minutes, and my initial reaction is just
00:05:39wow, this looks really good. The UI is really clean, it's got really nice charts here, everything
00:05:44is laid out nicely, it uses a good color scheme, and to be honest with you there's not much
00:05:48that I would improve about this myself. It has done a fantastic job on the UI side of things,
00:05:53and it also has all of the individual pages that I asked for. We can see all of our accounts,
00:05:57we can see our transactions and our budgets. We can't actually add any new budgets at the
00:06:02moment; it seems that that isn't a feature, and the same with the goals, but we are able
00:06:05to add into our goals here, and the numbers do go up, and it does update the back-end API
00:06:10which it built. The same thing goes for if we send money to people as well, so if I
00:06:14just test paying for my Claude Code subscription here, this should send successfully, and I can
00:06:17see it has been sent, and back on the dashboard my net worth has been updated with that transaction.
00:06:22So everything is working there and it is using a database on the back end, and we also have
00:06:26it showing up in our recent transactions. Looking through the code it generated, everything
00:06:30looks pretty good. It used React and Vite for my front end, so the same thing I would have
00:06:34done, and it also used React Router. Maybe I would have used TanStack, but it doesn't really
00:06:38matter; they're both pretty good options. In all of these you can see everything is laid
00:06:42out neatly, we have all of our individual UI components, and overall the front end is just pretty
00:06:46well done. The place where I will mark it down is in the back end, because we are using
00:06:51an Express server. There's nothing really wrong with that, but I would have gone with something
00:06:54like Bun maybe, or Hono, for just how simple this app is. Also, the way that it's actually
00:06:59storing this data is all in memory, so if I now shut down the back-end service and start
00:07:04it up again, it's going to load in the data from this seed script, and this is just local
00:07:08arrays. It didn't have any database to back this up to. Moving on to what Opus 4.6 gave me,
00:07:13I've got to say straight away, Opus 4.7 definitely did a better job when it comes to the UI design.
00:07:18There's just something about this UI that I don't quite like. I don't know if it's got a
00:07:21bit too much padding, or if it's the fact that it's in light mode whereas the other one was
00:07:24in dark mode; I just definitely prefer the Opus 4.7 one. Overall it's got pretty similar components,
00:07:29though: you can see we've got the cards with our net worth, we've got a net worth trend graph,
00:07:33recent transactions and our financial goals, and we also have the individual pages to track
00:07:38these as well. Besides the UI, we can also test out some of the features, so I'll add a new
00:07:42transaction here. This one is going to be a hundred and fifty dollars for groceries. It
00:07:46does look like we get an update here, and also back on the dashboard my net worth updated
00:07:50as well, so it does seem to be working there. One place Opus 4.6 might have actually beaten Opus
00:07:544.7 in the single prompt is that I can add accounts here, so I just added this account,
00:07:58and the same thing goes for the goals and the budgets, so I also added the education budget.
00:08:03So it looks like Opus 4.6 added in a few more features, but to be honest with you I just
00:08:07asked Opus 4.7 to add them in for me. Obviously, normally you wouldn't be doing a single prompt.
00:08:12Taking a look at the code, Opus 4.6 went down a similar route with a Vite React application, but
00:08:16one interesting thing that I've just noticed is this is using React 19 and React Router
00:08:20DOM 7, whereas Opus 4.7 went with React 18 and also React Router 6, even though I'm pretty
00:08:27sure Opus 4.7 has the newer knowledge cutoff. Besides that, another win for Opus 4.6 is that
00:08:32it did use a database for the back end, so it will be persisting it. You can see it's using
00:08:36a SQLite one here, and we do have some of the databases, so that's definitely a win. But where
00:08:40it loses is it seemingly used JavaScript for all of this project, whereas Opus 4.7 correctly
00:08:45used TypeScript. Next we have the result of GPT 5.4, and to be honest with you I have no
00:08:50idea what it's doing here. This is not a usable UI. It looks really bad in my opinion: everything
00:08:55is really cluttered, I don't like the font, and, yeah, I'm not really going to spend
00:08:59much time on this. This just looks way worse than the Claude ones. I can confirm, though, that
00:09:03it does work when we add in some money, except it just refreshes the entire page as well. It
00:09:07doesn't get much better in the code either. Seemingly GPT 5.4 just didn't want to start
00:09:11a full project from this, so it's just gone with a very simple approach where we just have
00:09:14our index.html, our JavaScript file and our styles, and for the database that's also just
00:09:19a single JavaScript file as well. It's not actually using a database; it's doing it all
00:09:23in memory like Opus 4.7, and again it's also gone with JavaScript for everything instead
00:09:28of TypeScript. As for Gemini 3.1, I'll be honest with you, I had a lot of issues trying to get
00:09:32this app to run and actually had to send multiple follow-up prompts, just because I was curious
00:09:36what this actually looked like, and it kind of looks exactly like the Opus 4.6 one. I don't
00:09:41know if they had the same training data when they were doing the UI, but it's very similar,
00:09:45and none of these features actually work and none of these tabs are clickable. Gemini 3.1
00:09:50probably did the worst, even though 5.4 is up there just because of the way that it created
00:09:54the app. I will say Gemini 3.1 did actually try to take a good approach to this: it actually
00:09:59went with Next.js instead of React Router, which is a pretty good idea because it means you
00:10:02can use the API server routes, and this was a pretty simple app so I'm not opposed to doing
00:10:07that. But I will say it did use Prisma where I would have preferred something like Drizzle.
00:10:10These tests honestly surprised me, because up until now I've been a pretty heavy Codex user
00:10:15and I've moved away from Claude Code, but Opus 4.7 might just claw me back, because it had
00:10:19a really nice UI design and most of the app seemed to work. Obviously it does come down
00:10:24to the prompting quality, and I was giving quite a vague prompt on the stack. I'd normally prompt
00:10:28with the exact things that I want, but still, I am pretty impressed with the result that
00:10:32we got here. I'm curious what you think: what's your model of choice at the moment? Let me
00:10:36know in the comments down below. While you're there, subscribe, and as always, see you in the
00:10:49next one.