Opus 4.7 Is GREAT (except the token usage)

Better Stack

Transcript

00:00:00The new best model is here, Opus 4.7. It actually looks like a pretty good upgrade, obviously
00:00:05it's better at coding but it also has improved vision, self-verification and it's supposedly
00:00:09better at UI, making designs more tasteful and creative.
00:00:12The downside though is that while the cost didn't change, the tokenizer did so the exact
00:00:17same input prompt could now use up to 35% more tokens and it also thinks more so that's even
00:00:22more tokens to burn. There's definitely some really interesting details in this release
00:00:26and probably a change you want to make to Claude Code now so let's just jump in, see what's
00:00:30new and test it out.
00:00:31Now I'm actually going to start with the benchmarks because I kinda lied earlier when I said this
00:00:40was the new best model. It's the best publicly available one but these benchmarks also include
00:00:44Mythos, the model so powerful that we're not allowed to use it yet.
00:00:47According to Anthropic, Opus 4.7 is actually testing new cyber safeguards to block requests
00:00:52that indicate prohibited or high-risk cybersecurity uses and what they learn from that is going
00:00:56to help them work toward a broad release of the Mythos-class models so hopefully in the future
00:01:00I can make a video on the Mythos release and how it's the end of software development as
00:01:03we know it. So subscribe if you don't want to miss that one.
00:01:06For now I'll go ahead and ignore Mythos and focus on the one that we can actually use which
00:01:10is Opus 4.7 and this has actually made great gains on the benchmarks.
00:01:13Now I won't go into too much detail on these and you can pause the screen if you want to
00:01:16read the individual ones. You can see on benchmarks like SWE Bench Pro it's actually made a 10%
00:01:21leap over Opus 4.6 and on verified it's made a 7% one and that pattern pretty much continues
00:01:26for the rest of the benchmarks except in cybersecurity where it actually went slightly down seemingly
00:01:30related to the safeguards that I mentioned earlier; it seems they're artificially keeping
00:01:34this score low to try and save the world or something.
00:01:37I also found a really interesting benchmark in that system card where it appears that the
00:01:40long context performance has seemingly taken a nosedive compared to Opus 4.6 when using
00:01:45a needle in a haystack test so I'm pretty curious how that's going to impact actual usage over
00:01:50time. Outside of the benchmarks there's also a few other notable improvements that might
even change how you use Claude. The first one is that it has better instruction following
00:01:58which actually means that you might have unexpected results with prompts that you've already used
00:02:01before as older models interpreted instructions loosely or skipped parts whereas Opus 4.7 is
00:02:07really focused on taking instructions literally so you might actually have some prompt tweaking
00:02:11to do. Next it's got improved multimodal support so it can accept higher resolution images three
00:02:16times that of the older models so this should make it better at tasks like computer use and
00:02:20data extraction. Its memory use also improved so Opus 4.7 should be better at using file
00:02:25system based memory where it remembers important notes across long multi-session work and uses
00:02:30those to move on to new tasks that as a result need less upfront context. So maybe that will
00:02:34save me a few tokens which is pretty important now as the next change is to the tokenizer
00:02:39and thinking. Opus 4.7 uses an updated tokenizer that improves how the model processes text
00:02:45but it also means that the same input prompt can cost up to 35% more tokens and when you
00:02:49combine this with the fact that Opus 4.7 thinks more at higher effort levels this model is
00:02:54really going to burn through some tokens. To make this worse there's also a new extra
00:02:58high effort level and it's actually set as the default in Claude Code so I highly recommend
00:03:02you go and test out the various effort levels and find the one that suits you best to see
00:03:05if you could possibly downgrade this without noticing an impact. For comparison the new
00:03:09extra high effort level uses roughly the same amount of tokens as Opus 4.6's max effort
00:03:14level and the Opus 4.7 high effort level actually outscores Opus 4.6's max effort level with
00:03:19less tokens used. So if you're already comfortable with what you had before I'd use that chart
00:03:24to compare because I know for me I'm probably going to change this to be using the high effort
level in most cases. With the TL;DR of what's new out of the way I'm going to burn through
00:03:31my usage and test this. The first thing I'm going to check is whether it's better at UI design
00:03:35so I gave it a very simple prompt to create a cafe website with an index.html only and
00:03:40I'm using the max effort level on all of the models I'm testing so I'm going to try this
00:03:43out in Opus 4.7, 4.6, Gemini 3.1 and GPT 5.4. This is the result I got back from Opus 4.7
00:03:51and I think it looks pretty nice it's got a nice sort of cafe feel to it it's used a
00:03:55nice font it's picked up images from Unsplash here. Overall I can't really complain it's
00:03:59a pretty simple website has a nice menu section everything is actually responsive and overall
00:04:04yeah I'd say it looks pretty good. If we compare this to what Opus 4.6 gave me you can see it
00:04:09went for a bit of a different style here but it's got a similar font and a similar menu
00:04:12section and overall it's a little bit worse I would say just because it hasn't used a nice
00:04:16background here and this gradient is not a nice touch at all but still can't complain
00:04:20too much I'd say Opus 4.7 is only a bit of a step above this. Gemini 3.1 on the other
00:04:25hand I think gave me my best result at least this one is my favorite so let me know in the
00:04:29comments below what yours is I just really like that it's got this background that doesn't
00:04:33move when we scroll I think it's done really well with this image section here in the Our
00:04:36Story section the menu looks similar to the other ones but again I think this is nicely
00:04:40laid out and the same with the footer so I think 3.1 wins on this one for me. Coming
00:04:45in last place though is definitely GPT 5.4 this just has such a GPT look and feel to it
00:04:50it loves these sort of cards where it has a nice blur to them and it's just not a good
00:04:55cafe website in my opinion it just looks like every other GPT app that I have ever seen so
00:04:59Opus 4.7 is definitely good at UI and it will probably handle it even better given some more
00:05:04direction. At the moment on Design Arena Opus 4.6 actually takes the lead for websites so
00:05:09I do expect that 4.7 will take its place. Now obviously that test was a pretty simple
00:05:13one so next I'm going to give them all a more advanced task you can see here in Claude code
00:05:17with Opus 4.6 I'm asking for a personal finance management dashboard that offers a detailed
00:05:21overview of an individual's financial health with a load of features that I have in the
00:05:25prompt here and I'm not giving it any indication of the stack that it should use it is going
00:05:30to pick all of that and start from scratch. Up first we have the result of Opus 4.7 and
00:05:34it did this all in a single prompt in around 20 minutes and my initial reaction is just
00:05:39wow this looks really good the UI is really clean it's got really nice charts here everything
00:05:44is laid out nicely it uses a good color scheme and to be honest with you there's not much
00:05:48that I would improve about this myself it has done a fantastic job on the UI side of things
00:05:53and it also has all of the individual pages that I asked for we can see all of our accounts
00:05:57we can see our transactions and our budgets we can't actually add any new budgets at the
00:06:02moment it seems that that isn't a feature and the same with the goals but we are able
00:06:05to add into our goals here and the numbers do go up and it does update the back end API
00:06:10which it built and the same thing goes for if we send money to people as well so if I
00:06:14just test paying for my Claude code subscription here this should send successfully and I can
00:06:17see it has been sent and back on the dashboard my net worth has been updated with that transaction
00:06:22so everything is working there and it is using a database on the back end and we also have
00:06:26it showing up in our recent transactions looking through the code they generated everything
looks pretty good it used React and Vite for my front end so the same thing I would have
00:06:34done and it also used React Router maybe I would have used TanStack but it doesn't really
00:06:38matter they're both pretty good options in all of these you can see everything is laid
00:06:42out neatly we have all of our individual UI components overall the front end is just pretty
00:06:46well done the place where I will mark it off for is in the back end because we are using
00:06:51an express server there's nothing really wrong with that but I would have gone with something
00:06:54like bun maybe or hono for just how simple this app is and also the way that it's actually
00:06:59storing this data is all in memory so if I now shut down the back end service and start
00:07:04it up again it's going to load in the data from this seed script and this is just local
arrays it didn't have any database to back this up. Moving on to what Opus 4.6 gave me,
00:07:13I've got to say straight away Opus 4.7 definitely did a better job when it comes to the UI design
00:07:18there's just something about this UI that I don't quite like I don't know if it's got a
00:07:21bit too much padding or if it's the fact that it's in light mode whereas the other one was
00:07:24in dark mode I just definitely prefer the Opus 4.7 one overall it's got pretty similar components
00:07:29though you can see we've got the cards with our net worth we've got a net worth trend graph
00:07:33recent transactions and our financial goals and we also have the individual pages to track
00:07:38these as well besides the UI we can also test out some of the features so I'll add a new
00:07:42transaction here this one is going to be a hundred and fifty dollars for groceries it
00:07:46does look like we get an update here and also back on the dashboard my net worth updated
as well so it does seem to be working there one place Opus 4.6 might have actually beaten Opus
00:07:544.7 in the single prompt is that I can add accounts here so I just added this account
00:07:58and the same thing goes for the goals and the budget so I also added the education budget
00:08:03so it looks like Opus 4.6 added in a few more features but to be honest with you I just
00:08:07asked Opus 4.7 to add them in for me obviously normally you wouldn't be doing a single prompt
00:08:12taking a look at the code Opus 4.6 went down a similar route with a Vite and React application but
00:08:16one interesting thing that I've just noticed is this is using React 19 and react-router-dom 7
00:08:20whereas Opus 4.7 went with React 18 and also React Router 6 even though I'm pretty
00:08:27sure Opus 4.7 has the newer knowledge cutoff besides that another win for Opus 4.6 is that
00:08:32it did use a database for the back end so it will be persisting data you can see it's using
00:08:36a SQLite one here and we do have some of the database files so that's definitely a win but where
00:08:40it loses is it seemingly used JavaScript for all of this project whereas Opus 4.7 correctly
00:08:45used TypeScript. Next we have the result of GPT 5.4 and to be honest with you I have no
00:08:50idea what it's doing here this is not a usable UI it looks really bad in my opinion everything
00:08:55is really cluttered I don't like the font and yeah I'm not really going to spend
00:08:59much time on this this just looks way worse than the Claude ones I can confirm though that
00:09:03it does work when we add in some money except it just refreshes the entire page as well it
00:09:07doesn't get much better in the code either seemingly GPT 5.4 just didn't want to start
00:09:11a full project from this so it's just gone with a very simple approach where we just have
00:09:14our index.html our JavaScript file and our styles and for the database that's also just
00:09:19a single JavaScript script as well it's not actually using a database it's doing it all
00:09:23in memory like Opus 4.7 and again it's also gone with JavaScript for everything instead
00:09:28of TypeScript as for Gemini 3.1 I'll be honest with you I had a lot of issues trying to get
00:09:32this app to run and actually had to send multiple follow-up prompts just because I was curious
00:09:36what this actually looked like and it kind of looks exactly like the Opus 4.6 one I don't
00:09:41know if they have the same training data when they were doing the UI but it's very similar
00:09:45and none of these features actually work and none of these tabs are clickable Gemini 3.1
00:09:50probably did the worst even though 5.4 is up there just because of the way that it created
00:09:54the app I will say Gemini 3.1 did actually try and take a good approach to this it actually
00:09:59went with Next.js instead of React Router which is a pretty good idea because it means you
00:10:02can use the server API routes and this was a pretty simple app so I'm not opposed to doing
00:10:07that but I will say it did use Prisma where I would have preferred something like Drizzle
00:10:10these tests honestly surprised me because up until now I've been a pretty heavy Codex user
00:10:15and I've moved away from Claude Code but Opus 4.7 might just claw me back because it had
00:10:19a really nice UI design and most of the app seemed to work obviously it does come down
00:10:24to the prompting quality and I was giving quite a vague prompt on the stack I'd normally prompt
00:10:28with the exact things that I want but still I am pretty impressed with the result that
00:10:32we got here I'm curious what you think what's your model of choice at the moment let me
00:10:36know in the comments down below and while you're there subscribe and as always see you in the
next one.

Key Takeaway

Opus 4.7 delivers superior UI design and coding accuracy through stricter, more literal instruction following, though its new tokenizer and high-effort thinking defaults can increase token usage by up to 35% per prompt.

Highlights

Opus 4.7 provides a 10% performance increase on the SWE Bench Pro benchmark and a 7% gain on SWE Bench Verified compared to version 4.6.

Input prompts can cost up to 35% more tokens due to an updated tokenizer, while the new default 'extra high' effort level in Claude Code uses roughly the same tokens as the previous version's maximum setting.

Visual processing capabilities allow for image resolutions three times higher than previous models, improving data extraction and computer-use tasks.

Instruction following is strictly literal in Opus 4.7, requiring users to adjust existing prompts that relied on the loose interpretation of previous versions.

Long-context performance in needle-in-a-haystack tests shows a significant decline compared to version 4.6, despite improvements in multi-session file system memory.

In a single-prompt 20-minute test for a financial dashboard, Opus 4.7 generated a functional React 18/TypeScript application with higher UI quality than GPT 5.4 or Gemini 3.1.

Timeline

Core Upgrades and Token Economics

  • Opus 4.7 introduces improved vision, self-verification, and more creative UI generation capabilities.
  • The updated tokenizer increases token usage by up to 35% for identical input prompts compared to older models.
  • Anthropic is using Opus 4.7 to test cyber safeguards required for the future release of the more powerful Mythos model class.

The release focuses on refining coding and vision performance while introducing a higher cost per interaction. The model processes text differently, meaning the 'token burn' is higher even before accounting for increased internal reasoning. These changes serve as a bridge toward Mythos, a model restricted from public release until cybersecurity risks are mitigated.
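
The token math here is simple but compounds: the same prompt can tokenize into up to 35% more tokens, and any extra internal reasoning multiplies on top of that. A minimal TypeScript sketch of the estimate, where the 1.35 factor is the stated worst case and the thinking multiplier and example prompt size are made-up placeholders:

```typescript
// Rough worst-case estimate of the new token bill for an existing prompt.
// 1.35 = the "up to 35% more tokens" tokenizer figure from the release;
// thinkingMultiplier is a hypothetical knob for extra reasoning tokens.
function estimateTokens(
  oldInputTokens: number,
  tokenizerInflation = 1.35,
  thinkingMultiplier = 1.0,
): number {
  return Math.round(oldInputTokens * tokenizerInflation * thinkingMultiplier);
}

// A 10,000-token prompt could now consume up to 13,500 input tokens
// before any increase in thinking is even accounted for.
console.log(estimateTokens(10_000)); // 13500
```

The same function makes it easy to see how quickly a higher effort level compounds the tokenizer change, e.g. `estimateTokens(10_000, 1.35, 1.5)` for a hypothetical 50% increase in thinking.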

Benchmark Performance and Instruction Following

  • SWE Bench Pro scores increased by 10%, while cybersecurity scores slightly decreased due to intentional safety safeguards.
  • Needle-in-a-haystack tests reveal a performance regression in long-context retrieval compared to Opus 4.6.
  • The model interprets instructions literally, which may break prompts designed for the more flexible interpretations of previous versions.

Benchmark data shows consistent gains in software engineering tasks but highlights a specific weakness in long-context reliability. The shift toward literal instruction following means prompts must be more precise. Safety protocols are intentionally suppressing scores in specific high-risk categories like cybersecurity to prevent misuse.

Vision, Memory, and Effort Levels

  • Multimodal support now handles images with 3x the resolution of previous versions.
  • File system memory improvements allow the model to retain notes across multiple sessions to reduce upfront context needs.
  • Opus 4.7 High effort level outperforms Opus 4.6 Max effort level while using fewer tokens than the new Extra High default.

Higher resolution image support targets improvements in automated computer use and complex data extraction from visuals. Memory management now relies more on persistent file notes, potentially offsetting some token costs over long projects. Users can optimize costs by downgrading from the 'Extra High' default to 'High' effort without losing performance relative to the previous 4.6 model.
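
The downgrade advice amounts to a tiny selection problem: pick the cheapest effort level that still clears your quality bar. A hedged TypeScript sketch, where the token ratios and scores are invented placeholders that only encode the relationships claimed above (High outscoring 4.6 Max with fewer tokens; Extra High costing roughly the same as 4.6 Max):

```typescript
// Hypothetical effort-level figures -- the numbers are illustrative only.
interface EffortLevel {
  name: string;
  relativeTokens: number; // token cost relative to Opus 4.6 Max = 1.0
  score: number;          // benchmark score, higher is better
}

const levels: EffortLevel[] = [
  { name: "low",        relativeTokens: 0.3, score: 62 },
  { name: "medium",     relativeTokens: 0.6, score: 70 },
  { name: "high",       relativeTokens: 0.8, score: 78 },
  { name: "extra-high", relativeTokens: 1.0, score: 80 },
];

// Cheapest level that still meets a target score: the
// "downgrade without noticing an impact" strategy.
function cheapestMeeting(target: number): EffortLevel | undefined {
  return levels
    .filter((l) => l.score >= target)
    .sort((a, b) => a.relativeTokens - b.relativeTokens)[0];
}

console.log(cheapestMeeting(75)?.name); // "high"
```

With these placeholder numbers, anyone happy with 4.6 Max-level quality would land on High rather than the Extra High default, saving roughly 20% of the tokens.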

UI Design Comparison and Competition

  • Opus 4.7 creates responsive websites with integrated assets from Unsplash in a single prompt.
  • Gemini 3.1 produced the most aesthetically pleasing cafe website result in head-to-head testing.
  • GPT 5.4 remains behind in aesthetic variety, often defaulting to a repetitive card-based design language.

Testing across a simple cafe website prompt shows that while Opus 4.7 is a step above 4.6 in font choice and layout, Gemini 3.1 currently leads in visual composition for simple landing pages. GPT 5.4 shows the least progress in design creativity, maintaining a highly recognizable and generic style. Opus 4.7 remains a top contender for UI tasks due to its ability to handle responsive layouts and asset integration.

Full-Stack Development Stress Test

  • Opus 4.7 built a functional financial dashboard with React, Vite, and TypeScript in 20 minutes.
  • The model correctly utilized TypeScript for the entire project, whereas GPT 5.4 and Opus 4.6 defaulted to JavaScript or older React versions.
  • Gemini 3.1 attempted a Next.js architecture but failed to produce clickable tabs or working features in a single prompt.

In a complex task involving a financial dashboard with a backend API, Opus 4.7 provided the best balance of UI quality and code structure. Although it used in-memory storage instead of a persistent SQLite database like 4.6, its use of modern TypeScript made it the most professional output. GPT 5.4 failed to generate a full project structure, providing only basic HTML and CSS files, while Gemini 3.1's more advanced Next.js approach resulted in a broken user interface.
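
The in-memory storage criticism is worth making concrete: if the backend seeds local arrays on startup, every restart silently discards user writes. A minimal TypeScript sketch of that failure mode (not the actual generated code; the types and seed data are hypothetical):

```typescript
// Sketch of an in-memory store like the one Opus 4.7 generated:
// state lives in process memory and is rebuilt from a seed on every start.
interface Transaction { id: number; amount: number; note: string }

const SEED: Transaction[] = [{ id: 1, amount: -5, note: "coffee" }];

class InMemoryStore {
  private rows: Transaction[];
  constructor() {
    // Runs on every server start: all previous writes are gone.
    this.rows = [...SEED];
  }
  add(tx: Transaction): void { this.rows.push(tx); }
  count(): number { return this.rows.length; }
}

let store = new InMemoryStore();
store.add({ id: 2, amount: -20, note: "subscription payment" });
console.log(store.count()); // 2

// "Restarting" the backend reseeds from scratch -- the new transaction
// is lost, which is exactly why 4.6's SQLite-backed store wins on persistence.
store = new InMemoryStore();
console.log(store.count()); // 1
```

A persistent version would only need the constructor to read from disk (SQLite, a JSON file, etc.) instead of copying the seed array.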
