Claude Dropped Opus 4.7 and It's Not Even Close

CChase AI

Transcript

00:00:00 So Opus 4.7 just released, and by the numbers,
00:00:04 this is a massive upgrade. So let's dive in. First things first:
00:00:08 the benchmarks. Now, they do show Mythos over here on the right,
00:00:12 just to tease us about things that don't exist yet.
00:00:15 But what I really want to pay attention to is 4.7 versus 4.6, because who knows
00:00:20 when Mythos is going to be available, and by the numbers,
00:00:23 this is a very solid leap forward, especially in things like coding.
00:00:28 If we take a look at agentic coding, we see a jump from 53 to 64,
00:00:32 from 80 to 87,
00:00:34 and then from 65 to 69 on the three big tests: SWE-bench
00:00:39 Pro, SWE-bench Verified, and Terminal Bench 2.0.
00:00:42 The only place where we see Opus 4.7's benchmarks
00:00:46 not on top of all the other models
00:00:49 (except for Mythos) is agentic search, where we look at GPT 5.4
00:00:54 at 89.3 versus Opus 4.7,
00:00:57 which oddly enough has dropped versus 4.6. Which, you know,
00:01:01 when you see things like that,
00:01:02 where they show benchmarks where it's gone down from Opus 4.6,
00:01:06 you wonder if they kind of just insert those. It's like, "Oh no,
00:01:08 these benchmarks are actually legit, guys. We wouldn't lie about this. See,
00:01:11 see this thing?"
00:01:12 But GPT 5.4 is ahead in agentic search, and you also see it ahead in graduate-level
00:01:17 reasoning. Now, another area where we see a massive improvement is visual reasoning.
00:01:21 So we jump from 69 to 82,
00:01:25 and that might have something to do with the fact that this model has way better
00:01:29 vision.
00:01:29 So they are telling us that the images you put into Opus 4.7 are 3x
00:01:34 the resolution now, which is huge
00:01:36 if you're doing anything with diagrams or small text,
00:01:38 and we see those same sorts of numbers reflected here in these graphs.
00:01:42 So, improvements in knowledge work and vision, and a huge jump in document reasoning,
00:01:46 57.1 to 80.6, which is a huge plus
00:01:50 if you're someone who uses something like Cowork,
00:01:52 you're using this in an office scenario, and all you do all day is feed it
00:01:55 documents. Long-context reasoning is also a big one.
00:01:57 We constantly harp on this channel about context rot and the idea that we need to
00:02:02 be very focused on session management. I don't think that changes at all. I mean,
00:02:07 going from 71 to 75 is great.
00:02:09 I don't think you should change how aggressively you clear (i.e., anytime you're at 20%
00:02:13 or 25% of the context window, you should be clearing), but this is an improvement.
00:02:17 We love to see this. And this one is also interesting:
00:02:19 this coding benchmark that has to do with multimodal. So they're coding,
00:02:22 but this also includes cases where they're throwing it context that has stuff
00:02:25 like images. And I don't think this is any surprise,
00:02:28 and I think a lot of that has to do with the resolution.
00:02:30 Now, besides the model itself, they did a few more updates.
00:02:32 The biggest one is more effort control. So now there is a level, Extra High,
00:02:37 probably stolen from OpenAI, between High and Max.
00:02:40 And on top of that, Claude Code now defaults to Extra High.
00:02:44 I think that's probably in response to a lot of people claiming that Opus 4.6 was
00:02:48 nerfed. And then Boris Cherny, the creator of Opus, well, not the creator of Opus,
00:02:52 the creator of Claude Code, came out and said, well,
00:02:54 actually, we moved the default reasoning level, the default effort level,
00:02:58 to Medium. So the fact that they came out with Extra High,
00:03:01 I think, is a response to that, in order to make it quote-unquote better and
00:03:05 try harder, yet not pushing people to Max, because then it swings to the other side
00:03:10 and everyone complains that their usage is filling up. And remember,
00:03:12 if you want to change that,
00:03:13 all you need to do is run /effort and then set your level.
00:03:16 The higher resolution is also on the API.
00:03:19 And then they've also released the new /ultra_review slash command,
00:03:24 so it gets a dedicated review session. On top of that,
00:03:28 they've extended auto mode as well. And if you don't know about auto mode,
00:03:31 it's basically just an alternative to --dangerously-skip-permissions. Now,
00:03:34 one thing they note here is that Opus 4.7 is going to use more tokens
00:03:39 than 4.6.
00:03:40 They explicitly state that Opus 4.7 uses an updated tokenizer and improves how
00:03:45 it processes text, but that this increases the number of tokens on the input,
00:03:50 roughly 1 to 1.35 times, depending on the content type.
00:03:54 And then secondly, Opus 4.7 thinks more at higher effort levels.
00:03:58 So remember that, because they're setting the default effort to Extra High
00:04:03 when before it was on Medium, and Opus 4.7 uses more tokens.
00:04:07 So if you've been on Medium this whole time,
00:04:09 you never changed it, and you were already hitting usage limits on
00:04:13 4.6, be wary of this. Understand that you could definitely run into usage issues
00:04:18 if you're someone who's already doing that,
00:04:19 because now it's going to use even more tokens.
00:04:21 What's also interesting is that they've removed extended thinking as well.
00:04:25 And if you want to read more and get kind of a deep dive on this migration,
00:04:28 they put out an entire thing in the documentation.
00:04:30 So, all in all, it looks like a really solid upgrade,
00:04:32 and I'm excited to jump in there and test it myself.

Key Takeaway

Opus 4.7 delivers significant performance gains in coding and visual reasoning, aided by 3x higher image resolution and a new 'Extra High' default effort level, though it consumes up to 35% more input tokens per request.

Highlights

Opus 4.7 increases image resolution by 3x compared to version 4.6.

Agentic coding performance rose from 53 to 64 on SWE-bench Pro and from 80 to 87 on SWE-bench Verified.

Document reasoning scores jumped from 57.1 to 80.6, aiding office-related document analysis tasks.

The default effort level for Claude Code is now set to a new 'Extra High' setting.

Token usage increases by 1 to 1.35 times due to an updated tokenizer and higher default reasoning levels.

Visual reasoning benchmarks improved from 69 to 82 because of the enhanced image processing capabilities.

Timeline

Core Benchmarks and Performance Gains

  • Agentic coding scores reached 64 on SWE-bench Pro and 87 on SWE-bench Verified.
  • GPT 5.4 maintains a lead over Opus 4.7 in agentic search and graduate-level reasoning benchmarks.
  • Terminal Bench 2.0 scores improved from 65 to 69.

Performance leaps are most visible in coding-related tasks. While most metrics show growth, agentic search and graduate reasoning remain areas where competitors hold a slight edge. The small dip in agentic search scores compared to version 4.6 suggests a rigorous and transparent benchmarking process.

Vision and Document Reasoning Enhancements

  • Image resolution in Opus 4.7 is 3x higher than previous iterations.
  • Document reasoning benchmarks climbed from 57.1 to 80.6.
  • Multimodal coding benchmarks show a direct positive correlation with increased visual resolution.

Higher resolution allows the model to process small text and complex diagrams more accurately. This improvement translates directly to better performance in office scenarios where users feed the model numerous documents. Long context reasoning also saw a modest increase from 71 to 75, though session management remains necessary to avoid context rot.
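Note that "3x resolution" is ambiguous: it could mean 3x per side (9x the pixel budget) or 3x the total pixels, and the announcement as described here does not say which. A quick sketch of the two readings (pure arithmetic on a hypothetical 1024x1024 image, not a documented spec):

```python
# "3x resolution" can mean 3x each dimension (9x the pixels) or 3x total pixels.
# Neither reading is confirmed; this just shows how far apart they are.

base = (1024, 1024)                      # hypothetical source image
pixels = base[0] * base[1]

linear_3x = (base[0] * 3) * (base[1] * 3)  # 3x per side
area_3x = pixels * 3                        # 3x total pixel budget

print(linear_3x // pixels)  # 9 -> the linear reading is 9x the pixels
print(area_3x // pixels)    # 3
```

The gap matters for cost estimates, since vision token usage typically scales with the number of pixels actually processed.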

Effort Controls and Interface Updates

  • A new 'Extra High' effort level sits between the existing 'High' and 'Max' settings.
  • Claude Code now defaults to the Extra High effort level instead of Medium.
  • A new /ultra_review command provides a dedicated session for code review.

The introduction of the Extra High level addresses user feedback regarding previous perceived performance drops. This setting encourages more thorough reasoning without hitting the usage limits as quickly as the Max setting. Users can manually adjust these settings using the /effort command.

Token Usage and Migration Details

  • The updated tokenizer increases input token counts by 1 to 1.35 times.
  • Higher effort levels lead to increased 'thinking' and higher token consumption.
  • Extended thinking features have been removed in this version.

The combination of a new tokenizer and higher default effort levels means users will reach rate limits faster than in version 4.6. This is particularly relevant for users who previously relied on the default Medium setting. Detailed migration steps and documentation are available for those transitioning API workflows.
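The compounding effect can be sketched with back-of-the-envelope arithmetic. The 1.0-1.35x input multiplier comes from the migration notes quoted above; the effort multiplier below is a made-up illustration, since no figure for extra "thinking" tokens is given:

```python
# Rough estimate of how the 4.7 changes compound on a fixed usage budget.
# tokenizer_multiplier (1.0-1.35x) is from the migration notes;
# effort_multiplier is a hypothetical stand-in for extra thinking tokens.

def estimated_tokens(tokens_on_46: int,
                     tokenizer_multiplier: float = 1.35,
                     effort_multiplier: float = 1.5) -> int:
    """Worst-case 4.7 estimate for a prompt that cost tokens_on_46 on 4.6."""
    return round(tokens_on_46 * tokenizer_multiplier * effort_multiplier)

# A 10,000-token 4.6 prompt could cost ~13,500 input tokens on 4.7
# from the tokenizer change alone, before any extra thinking:
print(estimated_tokens(10_000, effort_multiplier=1.0))  # 13500
```

Even at the bottom of the stated range, users who sat on the old Medium default will see both factors stack, which is why the transcript's warning about hitting limits sooner is worth taking seriously.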
