Transcript
00:00:00Anthropic just released Claude Opus 4.8 today.
00:00:02So in this video, I'm gonna very quickly run you through
00:00:05what's changed and what you need to be paying attention to
00:00:08with this brand new model.
00:00:09So let's just jump into the benchmarks right away.
00:00:12So we have Opus 4.8 over here highlighted
00:00:14and compared to Opus 4.7, GPT 5.5, and Gemini 3.1 Pro,
00:00:20Opus pretty much clears them all in every single category
00:00:24except agentic terminal coding,
00:00:26which is the Terminal Bench 2.1.
00:00:28There, it scores a 74.6,
00:00:30which is still a huge leap forward from Opus 4.7,
00:00:34yet it still falls behind GPT 5.5.
00:00:37But everything else, the SWE Bench Pro,
00:00:40multidisciplinary reasoning, agentic computer use,
00:00:42knowledge work, as well as agentic financial analysis,
00:00:45it pulls ahead of the rest of the pack.
00:00:47Now we all take benchmarks with a large grain of salt
00:00:49at this point, but it is nice to see these large leaps forward
00:00:53from what they reported with Opus 4.7,
00:00:56really not that long ago.
00:00:57I mean, what, it was just a few months ago,
00:00:584.7 was released and we already have 4.8
00:01:01and we're going up from 64 to 69 on agentic coding.
00:01:04Like, this is good stuff.
00:01:05Now one of the big improvements of 4.8 versus 4.7,
00:01:08according to Anthropic, is its honesty.
00:01:11And by honesty, we are saying that this AI model,
00:01:14when you tell it to do something,
00:01:15if it can't do it or if it hasn't done it,
00:01:18it's actually going to tell you.
00:01:19This is a really big deal
00:01:20if you've used these models at all
00:01:22over these last few years,
00:01:22where you tell it to do something like,
00:01:24hey, take a look at this giant transcript
00:01:27and actually read it and tell me what you did.
00:01:29And then when you look at its output
00:01:31and you actually interrogate it,
00:01:32it'll say something like,
00:01:33well, I actually just kind of summarized it.
00:01:35I didn't read the whole thing.
00:01:35Like, this is a major problem.
00:01:37And if you've been using AI for any sort of real work,
00:01:40you know how important it is to create all these tests,
00:01:42to actually like make sure it does what it says it's doing.
00:01:46But Anthropik is saying,
00:01:47hey, this might not be an issue as much with 4.8
00:01:50versus some of the previous models.
00:01:51Specifically, they say,
00:01:52according to their evaluations,
00:01:54which you can take a look at inside of their system card,
00:01:56which is about 250 pages long,
00:01:59they say it shows that Opus 4.8
00:02:01is around four times less likely than its predecessor
00:02:04to allow flaws in code it has written to pass unremarked.
00:02:07So again, it's going to be much more honest
00:02:09about what's not working versus what is,
00:02:12and it's not going to gaslight you.
00:02:13They also assess that 4.8 has rates of misaligned behavior
00:02:16such as deception or cooperation with misuse
00:02:18that are substantially lower than Opus 4.7
00:02:21and are similar to Mythos.
00:02:24And you can see that misaligned behavior right here
00:02:25where Opus 4.7 and especially Sonnet 4.6
00:02:28would have some of these tendencies,
00:02:31and we don't really see that as much with Mythos
00:02:33or Opus 4.8.
00:02:35Now, beyond the model itself,
00:02:36there's a few more updates Anthropik has pushed forward.
00:02:39The first one is dynamic workflows.
00:02:41Now, dynamic workflows is similar to goals.
00:02:43The idea is that we can now put clock code
00:02:45on a very complex task,
00:02:47and it's going to work on it over time,
00:02:50spawning tens to hundreds of parallel agents
00:02:52in a single session
00:02:53to make sure the work is actually completed.
00:02:56As you well know, there's a lot of problems
00:02:57that even if you do something in plan mode
00:02:59and break it out into a bunch of tasks
00:03:00are just too much for clock code to handle at once.
00:03:03This dynamic workflows is the answer to that problem,
00:03:05and I'll be doing a deep dive
00:03:06on dynamic workflows very shortly.
00:03:09But if you want to try it today,
00:03:11there's two real options.
00:03:12The first is to use plain language
00:03:13and say, hey, Claude, create a dynamic workflow,
00:03:15or switch on the new Claude code-specific setting
00:03:18called UltraCode.
00:03:20Another big change for Claude.ai,
00:03:22the actual chatbot and cowork,
00:03:24this isn't really the case with code,
00:03:26is that they now have more controls
00:03:27when it comes to selecting how much effort
00:03:30Claude puts into the response, right?
00:03:31We've had this with Claude code for a while
00:03:33with like high versus extra high versus max.
00:03:35Well, that's now inside of things
00:03:36like Claude.ai and cowork.
00:03:38And lastly, if you're someone
00:03:39who's been using the Messages API,
00:03:41it now accepts system entries inside the message array.
00:03:44This is really nice
00:03:45because you can update Claude's instructions mid-task.
00:03:47This is kind of similar to Codex
00:03:50and like the steer feature
00:03:51versus the queue feature
00:03:52when you give it an additional prompt.
00:03:54Of note, Opus also defaults to high effort,
00:03:57not extra high.
00:03:59Remember with Opus 4.7
00:04:00where they showed us that graph,
00:04:01they were telling us,
00:04:03hey, extra high is kind of where you want to go.
00:04:05So just understand 4.8 is on high
00:04:07and you still have two levels above that you can go
00:04:09if you want to get a little more effort
00:04:11from this new model.
00:04:12And in case you're wondering about token usage,
00:04:14they have increased rate limits in Claude code
00:04:16to accommodate the higher token usage
00:04:18of higher effort levels,
00:04:20which is really nice.
00:04:21So that's your down and dirty overview
00:04:22of the brand new Claude Opus 4.8.
00:04:24Remember, it has the exact same pricing
00:04:25as Opus 4.7,
00:04:26so you're not paying anything extra
00:04:28for this new power as well.
00:04:29As always, let me know what you thought.
00:04:31Make sure to check out Chase AI Plus
00:04:33in the linked comment
00:04:34if you want to get your hands
00:04:35on my Claude Code Masterclass
00:04:36and I'll see you around.
Community Posts
No posts yet. Be the first to write about this video!
Write about this video