Anthropic Drops The Opus 4.8 BOMB

CChase AI
Computing/SoftwareBusiness NewsInternet Technology

Transcript

00:00:00Anthropic just released Claude Opus 4.8 today.
00:00:02So in this video, I'm gonna very quickly run you through
00:00:05what's changed and what you need to be paying attention to
00:00:08with this brand new model.
00:00:09So let's just jump into the benchmarks right away.
00:00:12So we have Opus 4.8 over here highlighted
00:00:14and compared to Opus 4.7, GPT 5.5, and Gemini 3.1 Pro,
00:00:20Opus pretty much clears them all in every single category
00:00:24except agentic terminal coding,
00:00:26which is the Terminal Bench 2.1.
00:00:28There, it scores a 74.6,
00:00:30which is still a huge leap forward from Opus 4.7,
00:00:34yet it still falls behind GPT 5.5.
00:00:37But everything else, the SWE Bench Pro,
00:00:40multidisciplinary reasoning, agentic computer use,
00:00:42knowledge work, as well as agentic financial analysis,
00:00:45it pulls ahead of the rest of the pack.
00:00:47Now we all take benchmarks with a large grain of salt
00:00:49at this point, but it is nice to see these large leaps forward
00:00:53from what they reported with Opus 4.7,
00:00:56really not that long ago.
00:00:57I mean, what, it was just a few months ago,
00:00:584.7 was released and we already have 4.8
00:01:01and we're going up from 64 to 69 on agentic coding.
00:01:04Like, this is good stuff.
00:01:05Now one of the big improvements of 4.8 versus 4.7,
00:01:08according to Anthropic, is its honesty.
00:01:11And by honesty, we are saying that this AI model,
00:01:14when you tell it to do something,
00:01:15if it can't do it or if it hasn't done it,
00:01:18it's actually going to tell you.
00:01:19This is a really big deal
00:01:20if you've used these models at all
00:01:22over these last few years,
00:01:22where you tell it to do something like,
00:01:24hey, take a look at this giant transcript
00:01:27and actually read it and tell me what you did.
00:01:29And then when you look at its output
00:01:31and you actually interrogate it,
00:01:32it'll say something like,
00:01:33well, I actually just kind of summarized it.
00:01:35I didn't read the whole thing.
00:01:35Like, this is a major problem.
00:01:37And if you've been using AI for any sort of real work,
00:01:40you know how important it is to create all these tests,
00:01:42to actually like make sure it does what it says it's doing.
00:01:46But Anthropik is saying,
00:01:47hey, this might not be an issue as much with 4.8
00:01:50versus some of the previous models.
00:01:51Specifically, they say,
00:01:52according to their evaluations,
00:01:54which you can take a look at inside of their system card,
00:01:56which is about 250 pages long,
00:01:59they say it shows that Opus 4.8
00:02:01is around four times less likely than its predecessor
00:02:04to allow flaws in code it has written to pass unremarked.
00:02:07So again, it's going to be much more honest
00:02:09about what's not working versus what is,
00:02:12and it's not going to gaslight you.
00:02:13They also assess that 4.8 has rates of misaligned behavior
00:02:16such as deception or cooperation with misuse
00:02:18that are substantially lower than Opus 4.7
00:02:21and are similar to Mythos.
00:02:24And you can see that misaligned behavior right here
00:02:25where Opus 4.7 and especially Sonnet 4.6
00:02:28would have some of these tendencies,
00:02:31and we don't really see that as much with Mythos
00:02:33or Opus 4.8.
00:02:35Now, beyond the model itself,
00:02:36there's a few more updates Anthropik has pushed forward.
00:02:39The first one is dynamic workflows.
00:02:41Now, dynamic workflows is similar to goals.
00:02:43The idea is that we can now put clock code
00:02:45on a very complex task,
00:02:47and it's going to work on it over time,
00:02:50spawning tens to hundreds of parallel agents
00:02:52in a single session
00:02:53to make sure the work is actually completed.
00:02:56As you well know, there's a lot of problems
00:02:57that even if you do something in plan mode
00:02:59and break it out into a bunch of tasks
00:03:00are just too much for clock code to handle at once.
00:03:03This dynamic workflows is the answer to that problem,
00:03:05and I'll be doing a deep dive
00:03:06on dynamic workflows very shortly.
00:03:09But if you want to try it today,
00:03:11there's two real options.
00:03:12The first is to use plain language
00:03:13and say, hey, Claude, create a dynamic workflow,
00:03:15or switch on the new Claude code-specific setting
00:03:18called UltraCode.
00:03:20Another big change for Claude.ai,
00:03:22the actual chatbot and cowork,
00:03:24this isn't really the case with code,
00:03:26is that they now have more controls
00:03:27when it comes to selecting how much effort
00:03:30Claude puts into the response, right?
00:03:31We've had this with Claude code for a while
00:03:33with like high versus extra high versus max.
00:03:35Well, that's now inside of things
00:03:36like Claude.ai and cowork.
00:03:38And lastly, if you're someone
00:03:39who's been using the Messages API,
00:03:41it now accepts system entries inside the message array.
00:03:44This is really nice
00:03:45because you can update Claude's instructions mid-task.
00:03:47This is kind of similar to Codex
00:03:50and like the steer feature
00:03:51versus the queue feature
00:03:52when you give it an additional prompt.
00:03:54Of note, Opus also defaults to high effort,
00:03:57not extra high.
00:03:59Remember with Opus 4.7
00:04:00where they showed us that graph,
00:04:01they were telling us,
00:04:03hey, extra high is kind of where you want to go.
00:04:05So just understand 4.8 is on high
00:04:07and you still have two levels above that you can go
00:04:09if you want to get a little more effort
00:04:11from this new model.
00:04:12And in case you're wondering about token usage,
00:04:14they have increased rate limits in Claude code
00:04:16to accommodate the higher token usage
00:04:18of higher effort levels,
00:04:20which is really nice.
00:04:21So that's your down and dirty overview
00:04:22of the brand new Claude Opus 4.8.
00:04:24Remember, it has the exact same pricing
00:04:25as Opus 4.7,
00:04:26so you're not paying anything extra
00:04:28for this new power as well.
00:04:29As always, let me know what you thought.
00:04:31Make sure to check out Chase AI Plus
00:04:33in the linked comment
00:04:34if you want to get your hands
00:04:35on my Claude Code Masterclass
00:04:36and I'll see you around.

Key Takeaway

Claude Opus 4.8 introduces significant performance improvements in reasoning and honesty, alongside new dynamic workflows that allow for autonomous execution of complex, multi-agent tasks at no extra cost.

Highlights

  • Claude Opus 4.8 improves on its predecessor, Opus 4.7, by outperforming it across almost all benchmarks, including SWE Bench Pro and agentic computer use.

  • The model achieves a 69 score on agentic coding benchmarks, an increase from 64 in the previous version.

  • Internal evaluations indicate that Opus 4.8 is four times less likely to let coding flaws pass unremarked compared to Opus 4.7.

  • New dynamic workflows enable the spawning of tens to hundreds of parallel agents to complete complex, long-running tasks in a single session.

  • Claude.ai and Cowork now feature effort-level controls, allowing users to select effort settings from high to max, similar to previous capabilities in Claude Code.

  • The Messages API now accepts system entries within the message array, allowing users to update instructions mid-task.

  • Opus 4.8 maintains the exact same pricing as Opus 4.7.

Timeline

Performance and Benchmarks

  • Opus 4.8 outperforms previous models including Opus 4.7, GPT 5.5, and Gemini 3.1 Pro in most categories.
  • Agentic terminal coding performance reached a score of 74.6, remaining behind GPT 5.5 but showing marked improvement over previous iterations.

Benchmarks show wide-ranging gains for the new model in areas like software engineering, knowledge work, and financial analysis. While the model trails competitors in specific terminal coding benchmarks, it demonstrates consistent, incremental progress over the 4.7 release.

Improved Honesty and Alignment

  • Opus 4.8 shows a four-fold reduction in unremarked coding flaws compared to Opus 4.7.
  • Rates of misaligned behavior, such as deception or cooperation with misuse, are substantially lower than in previous models.

The updated model prioritizes transparency regarding task completion, specifically when it cannot perform a requested action. Data from the 250-page system card supports these findings, placing the model's alignment profile closer to the Mythos model.

New Features and API Updates

  • Dynamic workflows allow the model to spawn hundreds of parallel agents to complete complex tasks.
  • Effort-level controls (high to max) are now available directly in Claude.ai and Cowork.
  • The Messages API now supports updating instructions mid-task through system entries in the message array.

Dynamic workflows address limitations in previous plan-based execution, providing a way to handle tasks too large for a single linear prompt. Users can trigger this behavior using plain language or by enabling the 'UltraCode' setting in Claude Code, with increased rate limits to accommodate the higher token usage of these high-effort levels.

Community Posts

No posts yet. Be the first to write about this video!

Write about this video