Anthropic Drops The Opus 4.8 BOMB

Englishالعربية Deutsch Español Français हिन्दी Bahasa Indonesia 日本語 한국어 Português Русский 中文

Computing/SoftwareBusiness NewsInternet Technology

Transcript

00:00:00Anthropic just released Claude Opus 4.8 today.

00:00:02So in this video, I'm gonna very quickly run you through

00:00:05what's changed and what you need to be paying attention to

00:00:08with this brand new model.

00:00:09So let's just jump into the benchmarks right away.

00:00:12So we have Opus 4.8 over here highlighted

00:00:14and compared to Opus 4.7, GPT 5.5, and Gemini 3.1 Pro,

00:00:20Opus pretty much clears them all in every single category

00:00:24except agentic terminal coding,

00:00:26which is the Terminal Bench 2.1.

00:00:28There, it scores a 74.6,

00:00:30which is still a huge leap forward from Opus 4.7,

00:00:34yet it still falls behind GPT 5.5.

00:00:37But everything else, the SWE Bench Pro,

00:00:40multidisciplinary reasoning, agentic computer use,

00:00:42knowledge work, as well as agentic financial analysis,

00:00:45it pulls ahead of the rest of the pack.

00:00:47Now we all take benchmarks with a large grain of salt

00:00:49at this point, but it is nice to see these large leaps forward

00:00:53from what they reported with Opus 4.7,

00:00:56really not that long ago.

00:00:57I mean, what, it was just a few months ago,

00:00:584.7 was released and we already have 4.8

00:01:01and we're going up from 64 to 69 on agentic coding.

00:01:04Like, this is good stuff.

00:01:05Now one of the big improvements of 4.8 versus 4.7,

00:01:08according to Anthropic, is its honesty.

00:01:11And by honesty, we are saying that this AI model,

00:01:14when you tell it to do something,

00:01:15if it can't do it or if it hasn't done it,

00:01:18it's actually going to tell you.

00:01:19This is a really big deal

00:01:20if you've used these models at all

00:01:22over these last few years,

00:01:22where you tell it to do something like,

00:01:24hey, take a look at this giant transcript

00:01:27and actually read it and tell me what you did.

00:01:29And then when you look at its output

00:01:31and you actually interrogate it,

00:01:32it'll say something like,

00:01:33well, I actually just kind of summarized it.

00:01:35I didn't read the whole thing.

00:01:35Like, this is a major problem.

00:01:37And if you've been using AI for any sort of real work,

00:01:40you know how important it is to create all these tests,

00:01:42to actually like make sure it does what it says it's doing.

00:01:46But Anthropik is saying,

00:01:47hey, this might not be an issue as much with 4.8

00:01:50versus some of the previous models.

00:01:51Specifically, they say,

00:01:52according to their evaluations,

00:01:54which you can take a look at inside of their system card,

00:01:56which is about 250 pages long,

00:01:59they say it shows that Opus 4.8

00:02:01is around four times less likely than its predecessor

00:02:04to allow flaws in code it has written to pass unremarked.

00:02:07So again, it's going to be much more honest

00:02:09about what's not working versus what is,

00:02:12and it's not going to gaslight you.

00:02:13They also assess that 4.8 has rates of misaligned behavior

00:02:16such as deception or cooperation with misuse

00:02:18that are substantially lower than Opus 4.7

00:02:21and are similar to Mythos.

00:02:24And you can see that misaligned behavior right here

00:02:25where Opus 4.7 and especially Sonnet 4.6

00:02:28would have some of these tendencies,

00:02:31and we don't really see that as much with Mythos

00:02:33or Opus 4.8.

00:02:35Now, beyond the model itself,

00:02:36there's a few more updates Anthropik has pushed forward.

00:02:39The first one is dynamic workflows.

00:02:41Now, dynamic workflows is similar to goals.

00:02:43The idea is that we can now put clock code

00:02:45on a very complex task,

00:02:47and it's going to work on it over time,

00:02:50spawning tens to hundreds of parallel agents

00:02:52in a single session

00:02:53to make sure the work is actually completed.

00:02:56As you well know, there's a lot of problems

00:02:57that even if you do something in plan mode

00:02:59and break it out into a bunch of tasks

00:03:00are just too much for clock code to handle at once.

00:03:03This dynamic workflows is the answer to that problem,

00:03:05and I'll be doing a deep dive

00:03:06on dynamic workflows very shortly.

00:03:09But if you want to try it today,

00:03:11there's two real options.

00:03:12The first is to use plain language

00:03:13and say, hey, Claude, create a dynamic workflow,

00:03:15or switch on the new Claude code-specific setting

00:03:18called UltraCode.

00:03:20Another big change for Claude.ai,

00:03:22the actual chatbot and cowork,

00:03:24this isn't really the case with code,

00:03:26is that they now have more controls

00:03:27when it comes to selecting how much effort

00:03:30Claude puts into the response, right?

00:03:31We've had this with Claude code for a while

00:03:33with like high versus extra high versus max.

00:03:35Well, that's now inside of things

00:03:36like Claude.ai and cowork.

00:03:38And lastly, if you're someone

00:03:39who's been using the Messages API,

00:03:41it now accepts system entries inside the message array.

00:03:44This is really nice

00:03:45because you can update Claude's instructions mid-task.

00:03:47This is kind of similar to Codex

00:03:50and like the steer feature

00:03:51versus the queue feature

00:03:52when you give it an additional prompt.

00:03:54Of note, Opus also defaults to high effort,

00:03:57not extra high.

00:03:59Remember with Opus 4.7

00:04:00where they showed us that graph,

00:04:01they were telling us,

00:04:03hey, extra high is kind of where you want to go.

00:04:05So just understand 4.8 is on high

00:04:07and you still have two levels above that you can go

00:04:09if you want to get a little more effort

00:04:11from this new model.

00:04:12And in case you're wondering about token usage,

00:04:14they have increased rate limits in Claude code

00:04:16to accommodate the higher token usage

00:04:18of higher effort levels,

00:04:20which is really nice.

00:04:21So that's your down and dirty overview

00:04:22of the brand new Claude Opus 4.8.

00:04:24Remember, it has the exact same pricing

00:04:25as Opus 4.7,

00:04:26so you're not paying anything extra

00:04:28for this new power as well.

00:04:29As always, let me know what you thought.

00:04:31Make sure to check out Chase AI Plus

00:04:33in the linked comment

00:04:34if you want to get your hands

00:04:35on my Claude Code Masterclass

00:04:36and I'll see you around.

Key Takeaway

Claude Opus 4.8 introduces significant performance improvements in reasoning and honesty, alongside new dynamic workflows that allow for autonomous execution of complex, multi-agent tasks at no extra cost.

Highlights

Claude Opus 4.8 improves on its predecessor, Opus 4.7, by outperforming it across almost all benchmarks, including SWE Bench Pro and agentic computer use.
The model achieves a 69 score on agentic coding benchmarks, an increase from 64 in the previous version.
Internal evaluations indicate that Opus 4.8 is four times less likely to let coding flaws pass unremarked compared to Opus 4.7.
New dynamic workflows enable the spawning of tens to hundreds of parallel agents to complete complex, long-running tasks in a single session.
Claude.ai and Cowork now feature effort-level controls, allowing users to select effort settings from high to max, similar to previous capabilities in Claude Code.
The Messages API now accepts system entries within the message array, allowing users to update instructions mid-task.
Opus 4.8 maintains the exact same pricing as Opus 4.7.

Timeline

Performance and Benchmarks

Opus 4.8 outperforms previous models including Opus 4.7, GPT 5.5, and Gemini 3.1 Pro in most categories.
Agentic terminal coding performance reached a score of 74.6, remaining behind GPT 5.5 but showing marked improvement over previous iterations.

Benchmarks show wide-ranging gains for the new model in areas like software engineering, knowledge work, and financial analysis. While the model trails competitors in specific terminal coding benchmarks, it demonstrates consistent, incremental progress over the 4.7 release.

Improved Honesty and Alignment

Opus 4.8 shows a four-fold reduction in unremarked coding flaws compared to Opus 4.7.
Rates of misaligned behavior, such as deception or cooperation with misuse, are substantially lower than in previous models.

The updated model prioritizes transparency regarding task completion, specifically when it cannot perform a requested action. Data from the 250-page system card supports these findings, placing the model's alignment profile closer to the Mythos model.

New Features and API Updates

Dynamic workflows allow the model to spawn hundreds of parallel agents to complete complex tasks.
Effort-level controls (high to max) are now available directly in Claude.ai and Cowork.
The Messages API now supports updating instructions mid-task through system entries in the message array.

Dynamic workflows address limitations in previous plan-based execution, providing a way to handle tasks too large for a single linear prompt. Users can trigger this behavior using plain language or by enabling the 'UltraCode' setting in Claude Code, with increased rate limits to accommodate the higher token usage of these high-effort levels.

Community Posts

No posts yet. Be the first to write about this video!

Write about this video