Opus 4.7 Is GREAT (except the token usage)

Better Stack

Transcript

00:00:00The new best model is here, Opus 4.7. It actually looks like a pretty good upgrade, obviously
00:00:05it's better at coding but it also has improved vision, self-verification and it's supposedly
00:00:09better at UI, making designs more tasteful and creative.
00:00:12The downside though is that while the cost didn't change, the tokenizer did so the exact
00:00:17same input prompt could now use up to 35% more tokens and it also thinks more so that's even
00:00:22more tokens to burn. There's definitely some really interesting details in this release
00:00:26and probably a change you want to make to Claude Code now so let's just jump in, see what's
00:00:30new and test it out.
00:00:31Now I'm actually going to start with the benchmarks because I kinda lied earlier when I said this
00:00:40was the new best model. It's the best publicly available one but these benchmarks also include
00:00:44Mythos, the model so powerful that we're not allowed to use it yet.
00:00:47According to Anthropic, Opus 4.7 is actually testing new cyber safeguards to block requests
00:00:52that indicate prohibited or high-risk cybersecurity uses and what they learn from that is going
00:00:56to help them work toward a broad release of the Mythos-class models so hopefully in the future
00:01:00I can make a video on the Mythos release and how it's the end of software development as
00:01:03we know it. So subscribe if you don't want to miss that one.
00:01:06For now I'll go ahead and ignore Mythos and focus on the one that we can actually use which
00:01:10is Opus 4.7 and this has actually made great gains on the benchmarks.
00:01:13Now I won't go into too much detail on these and you can pause the screen if you want to
00:01:16read the individual ones. You can see on benchmarks like SWE Bench Pro it's actually made a 10%
00:01:21leap over Opus 4.6 and on verified it's made a 7% one and that pattern pretty much continues
00:01:26for the rest of the benchmarks except in cybersecurity where it actually went slightly down seemingly
00:01:30related to the safeguards that I mentioned earlier; it seems they're artificially keeping
00:01:34this score low to try and save the world or something.
00:01:37I also found a really interesting benchmark in that system card where it appears that the
00:01:40long context performance has seemingly taken a nosedive compared to Opus 4.6 when using
00:01:45a needle in a haystack test so I'm pretty curious how that's going to impact actual usage over
00:01:50time. Outside of the benchmarks there's also a few other notable improvements that might
even change how you use Claude. The first one is that it has better instruction following
00:01:58which actually means that you might have unexpected results with prompts that you've already used
00:02:01before as older models interpreted instructions loosely or skipped parts whereas Opus 4.7 is
00:02:07really focused on taking instructions literally so you might actually have some prompt tweaking
00:02:11to do. Next it's got improved multimodal support so it can accept higher resolution images three
00:02:16times that of the older models so this should make it better at tasks like computer use and
00:02:20data extraction. Its memory use also improved so Opus 4.7 should be better at using file
00:02:25system based memory where it remembers important notes across long multi-session work and uses
00:02:30those to move on to new tasks that as a result need less upfront context. So maybe that will
00:02:34save me a few tokens which is pretty important now as the next change is to the tokenizer
00:02:39and thinking. Opus 4.7 uses an updated tokenizer that improves how the model processes text
00:02:45but it also means that the same input prompt can cost up to 35% more tokens and when you
00:02:49combine this with the fact that Opus 4.7 thinks more at higher effort levels this model is
00:02:54really going to burn through some tokens. To make this worse there's also a new extra
00:02:58high effort level and it's actually set as the default in Claude Code so I highly recommend
00:03:02you go and test out the various effort levels and find the one that suits you best to see
00:03:05if you could possibly downgrade this without noticing an impact. For comparison the new
00:03:09extra high effort level uses roughly the same amount of tokens as Opus 4.6's max effort
00:03:14level and the Opus 4.7 high effort level actually outscores Opus 4.6's max effort level with
00:03:19less tokens used. So if you're already comfortable with what you had before I'd use that chart
00:03:24to compare because I know for me I'm probably going to change this to be using the high effort
level in most cases. With the TL;DR of what's new out of the way I'm going to burn through
00:03:31my usage and test this. The first thing I'm going to check is whether it's better at UI design
00:03:35so I gave it a very simple prompt to create a cafe website with an index.html only and
00:03:40I'm using the max effort level on all of the models I'm testing so I'm going to try this
00:03:43out in Opus 4.7, 4.6, Gemini 3.1 and GPT 5.4. This is the result I got back from Opus 4.7
00:03:51and I think it looks pretty nice it's got a nice sort of cafe feel to it it's used a
00:03:55nice font it's picked up images from Unsplash here. Overall I can't really complain it's
00:03:59a pretty simple website has a nice menu section everything is actually responsive and overall
00:04:04yeah I'd say it looks pretty good. If we compare this to what Opus 4.6 gave me you can see it
00:04:09went for a bit of a different style here but it's got a similar font and a similar menu
00:04:12section and overall it's a little bit worse I would say just because it hasn't used a nice
00:04:16background here and this gradient is not a nice touch at all but still can't complain
00:04:20too much I'd say Opus 4.7 is only a bit of a step above this. Gemini 3.1 on the other
00:04:25hand I think gave me my best result at least this one is my favorite so let me know in the
00:04:29comments below what yours is I just really like that it's got this background that doesn't
00:04:33move when we scroll I think it's done really well with this image section here in the Our
00:04:36Story section the menu looks similar to the other ones but again I think this is nicely
00:04:40laid out and the same with the footer so I think 3.1 wins on this one for me. Coming
00:04:45in last place though is definitely GPT 5.4 this just has such a GPT look and feel to it
00:04:50it loves these sort of cards where it has a nice blur to them and it's just not a good
00:04:55cafe website in my opinion it just looks like every other GPT app that I have ever seen so
00:04:59Opus 4.7 is definitely good at UI and it will probably handle it even better given some more
00:05:04direction. At the moment on Design Arena Opus 4.6 actually takes the lead for websites so
00:05:09I do expect that 4.7 will take its place. Now obviously that test was a pretty simple
00:05:13one so next I'm going to give them all a more advanced task you can see here in Claude code
00:05:17with Opus 4.6 I'm asking for a personal finance management dashboard that offers a detailed
00:05:21overview of an individual's financial health with a load of features that I have in the
00:05:25prompt here and I'm not giving it any indication of the stack that it should use it is going
00:05:30to pick all of that and start from scratch. Up first we have the result of Opus 4.7 and
00:05:34it did this all in a single prompt in around 20 minutes and my initial reaction is just
00:05:39wow this looks really good the UI is really clean it's got really nice charts here everything
00:05:44is laid out nicely it uses a good color scheme and to be honest with you there's not much
00:05:48that I would improve about this myself it has done a fantastic job on the UI side of things
00:05:53and it also has all of the individual pages that I asked for we can see all of our accounts
00:05:57we can see our transactions and our budgets we can't actually add any new budgets at the
00:06:02moment it seems that that isn't a feature and the same with the goals but we are able
00:06:05to add into our goals here and the numbers do go up and it does update the back end API
00:06:10which it built and the same thing goes for if we send money to people as well so if I
00:06:14just test paying for my Claude code subscription here this should send successfully and I can
00:06:17see it has been sent and back on the dashboard my net worth has been updated with that transaction
00:06:22so everything is working there and it is using a database on the back end and we also have
00:06:26it showing up in our recent transactions looking through the code they generated everything
looks pretty good it used React and Vite for my front end so the same thing I would have
00:06:34done and it also used React Router maybe I would have used TanStack but it doesn't really
00:06:38matter they're both pretty good options in all of these you can see everything is laid
00:06:42out neatly we have all of our individual UI components overall the front end is just pretty
00:06:46well done the place where I will mark it off for is in the back end because we are using
00:06:51an express server there's nothing really wrong with that but I would have gone with something
00:06:54like bun maybe or hono for just how simple this app is and also the way that it's actually
00:06:59storing this data is all in memory so if I now shut down the back end service and start
00:07:04it up again it's going to load in the data from this seed script and this is just local
arrays it didn't have any database to back this up. Moving on to what Opus 4.6 gave me,
00:07:13I've got to say straight away Opus 4.7 definitely did a better job when it comes to the UI design
00:07:18there's just something about this UI that I don't quite like I don't know if it's got a
00:07:21bit too much padding or if it's the fact that it's in light mode whereas the other one was
00:07:24in dark mode I just definitely prefer the Opus 4.7 one overall it's got pretty similar components
00:07:29though you can see we've got the cards with our net worth we've got a net worth trend graph
00:07:33recent transactions and our financial goals and we also have the individual pages to track
00:07:38these as well besides the UI we can also test out some of the features so I'll add a new
00:07:42transaction here this one is going to be a hundred and fifty dollars for groceries it
00:07:46does look like we get an update here and also back on the dashboard my net worth updated
as well so it does seem to be working there one place Opus 4.6 might have actually beaten Opus
00:07:544.7 in the single prompt is that I can add accounts here so I just added this account
00:07:58and the same thing goes for the goals and the budget so I also added the education budget
00:08:03so it looks like Opus 4.6 added in a few more features but to be honest with you I just
00:08:07asked Opus 4.7 to add them in for me obviously normally you wouldn't be doing a single prompt
00:08:12taking a look at the code Opus 4.6 went down a similar route with a Vite and React application but
00:08:16one interesting thing that I've just noticed is this is using React 19 and react-router-dom 7
00:08:20whereas Opus 4.7 went with React 18 and also React Router 6 even though I'm pretty
00:08:27sure Opus 4.7 has the newer knowledge cutoff besides that another win for Opus 4.6 is that
00:08:32it did use a database for the back end so it will be persisting data you can see it's using
00:08:36a SQLite one here and we do have some of the database files so that's definitely a win but where
00:08:40it loses is it seemingly used JavaScript for all of this project whereas Opus 4.7 correctly
00:08:45used TypeScript. Next we have the result of GPT 5.4 and to be honest with you I have no
00:08:50idea what it's doing here this is not a usable UI it looks really bad in my opinion everything
00:08:55is really cluttered I don't like the font and yeah I'm not really going to spend
00:08:59much time on this this just looks way worse than the Claude ones I can confirm though that
00:09:03it does work when we add in some money except it just refreshes the entire page as well it
00:09:07doesn't get much better in the code either seemingly GPT 5.4 just didn't want to start
00:09:11a full project from this so it's just gone with a very simple approach where we just have
00:09:14our index.html our JavaScript file and our styles and for the database that's also just
00:09:19a single JavaScript script as well it's not actually using a database it's doing it all
00:09:23in memory like Opus 4.7 and again it's also gone with JavaScript for everything instead
00:09:28of TypeScript as for Gemini 3.1 I'll be honest with you I had a lot of issues trying to get
00:09:32this app to run and actually had to send multiple follow-up prompts just because I was curious
00:09:36what this actually looked like and it kind of looks exactly like the Opus 4.6 one I don't
00:09:41know if they have the same training data when they were doing the UI but it's very similar
00:09:45and none of these features actually work and none of these tabs are clickable Gemini 3.1
00:09:50probably did the worst even though 5.4 is up there just because of the way that it created
00:09:54the app I will say Gemini 3.1 did actually try and take a good approach to this it actually
00:09:59went with Next.js instead of React Router which is a pretty good idea because it means you
00:10:02can use the server API routes and this was a pretty simple app so I'm not opposed to doing
00:10:07that but I will say it did use Prisma where I would have preferred something like Drizzle
00:10:10these tests honestly surprised me because up until now I've been a pretty heavy Codex user
00:10:15and I've moved away from Claude Code but Opus 4.7 might just claw me back because it had
00:10:19a really nice UI design and most of the app seemed to work obviously it does come down
00:10:24to the prompting quality and I was giving quite a vague prompt on the stack I'd normally prompt
00:10:28with the exact things that I want but still I am pretty impressed with the result that
00:10:32we got here I'm curious what you think what's your model of choice at the moment let me
00:10:36know in the comments down below and while you're there subscribe and as always see you in the
next one.

Key Takeaway

Opus 4.7 delivers superior UI design and coding accuracy through stricter, more literal instruction following, though its new tokenizer and high-effort thinking defaults can increase token usage by up to 35% per prompt.

Highlights

Opus 4.7 provides a 10% performance increase on the SWE Bench Pro benchmark and a 7% gain on SWE Bench Verified compared to version 4.6.

Input prompts can cost up to 35% more tokens due to an updated tokenizer, while the new default 'extra high' effort level in Claude Code uses roughly the same tokens as the previous version's maximum setting.

Visual processing capabilities allow for image resolutions three times higher than previous models, improving data extraction and computer-use tasks.

Instruction following is strictly literal in Opus 4.7, requiring users to adjust existing prompts that relied on the loose interpretation of previous versions.

Long-context performance in needle-in-a-haystack tests shows a significant decline compared to version 4.6, despite improvements in multi-session file system memory.

In a single-prompt 20-minute test for a financial dashboard, Opus 4.7 generated a functional React 18/TypeScript application with higher UI quality than GPT 5.4 or Gemini 3.1.

Timeline

Core Upgrades and Token Economics

  • Opus 4.7 introduces improved vision, self-verification, and more creative UI generation capabilities.
  • The updated tokenizer increases token usage by up to 35% for identical input prompts compared to older models.
  • Anthropic is using Opus 4.7 to test cyber safeguards required for the future release of the more powerful Mythos model class.

The release focuses on refining coding and vision performance while introducing a higher cost per interaction. The model processes text differently, meaning the 'token burn' is higher even before accounting for increased internal reasoning. These changes serve as a bridge toward Mythos, a model restricted from public release until cybersecurity risks are mitigated.
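
The token math here is simple but compounds: the same prompt can tokenize into up to 35% more tokens, and any extra internal reasoning multiplies on top of that. A minimal TypeScript sketch of the estimate, where the 1.35 factor is the stated worst case and the thinking multiplier and example prompt size are made-up placeholders:

```typescript
// Rough worst-case estimate of the new token bill for an existing prompt.
// 1.35 = the "up to 35% more tokens" tokenizer figure from the release;
// thinkingMultiplier is a hypothetical knob for extra reasoning tokens.
function estimateTokens(
  oldInputTokens: number,
  tokenizerInflation = 1.35,
  thinkingMultiplier = 1.0,
): number {
  return Math.round(oldInputTokens * tokenizerInflation * thinkingMultiplier);
}

// A 10,000-token prompt could now consume up to 13,500 input tokens
// before any increase in thinking is even accounted for.
console.log(estimateTokens(10_000)); // 13500
```

The same function makes it easy to see how quickly a higher effort level compounds the tokenizer change, e.g. `estimateTokens(10_000, 1.35, 1.5)` for a hypothetical 50% increase in thinking.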

Benchmark Performance and Instruction Following

  • SWE Bench Pro scores increased by 10%, while cybersecurity scores slightly decreased due to intentional safety safeguards.
  • Needle-in-a-haystack tests reveal a performance regression in long-context retrieval compared to Opus 4.6.
  • The model interprets instructions literally, which may break prompts designed for the more flexible interpretations of previous versions.

Benchmark data shows consistent gains in software engineering tasks but highlights a specific weakness in long-context reliability. The shift toward literal instruction following means prompts must be more precise. Safety protocols are intentionally suppressing scores in specific high-risk categories like cybersecurity to prevent misuse.

Vision, Memory, and Effort Levels

  • Multimodal support now handles images with 3x the resolution of previous versions.
  • File system memory improvements allow the model to retain notes across multiple sessions to reduce upfront context needs.
  • Opus 4.7 High effort level outperforms Opus 4.6 Max effort level while using fewer tokens than the new Extra High default.

Higher resolution image support targets improvements in automated computer use and complex data extraction from visuals. Memory management now relies more on persistent file notes, potentially offsetting some token costs over long projects. Users can optimize costs by downgrading from the 'Extra High' default to 'High' effort without losing performance relative to the previous 4.6 model.
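
The downgrade advice amounts to a tiny selection problem: pick the cheapest effort level that still clears your quality bar. A hedged TypeScript sketch, where the token ratios and scores are invented placeholders that only encode the relationships claimed above (High outscoring 4.6 Max with fewer tokens; Extra High costing roughly the same as 4.6 Max):

```typescript
// Hypothetical effort-level figures -- the numbers are illustrative only.
interface EffortLevel {
  name: string;
  relativeTokens: number; // token cost relative to Opus 4.6 Max = 1.0
  score: number;          // benchmark score, higher is better
}

const levels: EffortLevel[] = [
  { name: "low",        relativeTokens: 0.3, score: 62 },
  { name: "medium",     relativeTokens: 0.6, score: 70 },
  { name: "high",       relativeTokens: 0.8, score: 78 },
  { name: "extra-high", relativeTokens: 1.0, score: 80 },
];

// Cheapest level that still meets a target score: the
// "downgrade without noticing an impact" strategy.
function cheapestMeeting(target: number): EffortLevel | undefined {
  return levels
    .filter((l) => l.score >= target)
    .sort((a, b) => a.relativeTokens - b.relativeTokens)[0];
}

console.log(cheapestMeeting(75)?.name); // "high"
```

With these placeholder numbers, anyone happy with 4.6 Max-level quality would land on High rather than the Extra High default, saving roughly 20% of the tokens.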

UI Design Comparison and Competition

  • Opus 4.7 creates responsive websites with integrated assets from Unsplash in a single prompt.
  • Gemini 3.1 produced the most aesthetically pleasing cafe website result in head-to-head testing.
  • GPT 5.4 remains behind in aesthetic variety, often defaulting to a repetitive card-based design language.

Testing across a simple cafe website prompt shows that while Opus 4.7 is a step above 4.6 in font choice and layout, Gemini 3.1 currently leads in visual composition for simple landing pages. GPT 5.4 shows the least progress in design creativity, maintaining a highly recognizable and generic style. Opus 4.7 remains a top contender for UI tasks due to its ability to handle responsive layouts and asset integration.

Full-Stack Development Stress Test

  • Opus 4.7 built a functional financial dashboard with React, Vite, and TypeScript in 20 minutes.
  • The model correctly utilized TypeScript for the entire project, whereas GPT 5.4 and Opus 4.6 defaulted to JavaScript or older React versions.
  • Gemini 3.1 attempted a Next.js architecture but failed to produce clickable tabs or working features in a single prompt.

In a complex task involving a financial dashboard with a backend API, Opus 4.7 provided the best balance of UI quality and code structure. Although it used in-memory storage instead of a persistent SQLite database like 4.6, its use of modern TypeScript made it the most professional output. GPT 5.4 failed to generate a full project structure, providing only basic HTML and CSS files, while Gemini 3.1's more advanced Next.js approach resulted in a broken user interface.
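
The in-memory storage criticism is worth making concrete: if the backend seeds local arrays on startup, every restart silently discards user writes. A minimal TypeScript sketch of that failure mode (not the actual generated code; the types and seed data are hypothetical):

```typescript
// Sketch of an in-memory store like the one Opus 4.7 generated:
// state lives in process memory and is rebuilt from a seed on every start.
interface Transaction { id: number; amount: number; note: string }

const SEED: Transaction[] = [{ id: 1, amount: -5, note: "coffee" }];

class InMemoryStore {
  private rows: Transaction[];
  constructor() {
    // Runs on every server start: all previous writes are gone.
    this.rows = [...SEED];
  }
  add(tx: Transaction): void { this.rows.push(tx); }
  count(): number { return this.rows.length; }
}

let store = new InMemoryStore();
store.add({ id: 2, amount: -20, note: "subscription payment" });
console.log(store.count()); // 2

// "Restarting" the backend reseeds from scratch -- the new transaction
// is lost, which is exactly why 4.6's SQLite-backed store wins on persistence.
store = new InMemoryStore();
console.log(store.count()); // 1
```

A persistent version would only need the constructor to read from disk (SQLite, a JSON file, etc.) instead of copying the seed array.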
