Transcript
00:00:00Claude Mythos is finally here.
00:00:01Anthropic just dropped a new model called Fable 5,
00:00:03which is a Mythos-class model,
00:00:05just with loads of safeguards built in,
00:00:07but it exceeds any model they've ever released,
00:00:09and possibly anyone.
00:00:11It is state-of-the-art on nearly every single benchmark.
00:00:13Obviously, though, this is definitely going to cost you,
00:00:16and they've done something a little interesting
00:00:17with the pricing here
00:00:18that I don't think too many people are going to be happy with.
00:00:25Now, normally, I don't like to spend too long
00:00:27on the benchmarks, but this table is kind of insane.
00:00:30The jumps that this model is making
00:00:31on some of these benchmarks,
00:00:32and the fact that it's ahead on nearly every single one.
00:00:35You can see it has a 10% jump in agentic coding
00:00:37on SWE Bench Pro,
00:00:39and it's basically 20% ahead of GPT 5.5,
00:00:42and it made similar leaps on the Frontier Code benchmark.
00:00:44Frontier Code is actually a new benchmark from Cognition,
00:00:47the guys behind Devin,
00:00:48that essentially tests whether maintainers
00:00:49would actually merge the code that this model produces.
00:00:52On this chart, you can see Fable 5 is ahead
00:00:54of every other model,
00:00:55even at a medium reasoning effort,
00:00:57but I also think you can see this model
00:00:58is going to be super expensive.
00:01:00It's also marginally better at computer use,
00:01:02not a massive leap,
00:01:03and the same goes for Terminal Bench at the bottom,
00:01:05but again, as you can see,
00:01:06it's a leader in nearly every single category.
00:01:09One of the biggest things, though,
00:01:10that's becoming more and more relevant
00:01:11is long-running tasks.
00:01:12Fable 5 can apparently work for longer
00:01:14than any other model,
00:01:15and they had Stripe test this out,
00:01:17and apparently it performed a codebase-wide migration
00:01:18of a 50 million line Ruby codebase
00:01:21in a single day.
00:01:22Probably helped out by the fact that it's gotten
00:01:24much better at memory and long context as well.
00:01:26It can apparently stay focused across millions
00:01:28of tokens in long-running tasks,
00:01:29and it improves its own outputs
00:01:31by using its own notes.
00:01:32Now, besides just coding,
00:01:33its vision capabilities are pretty awesome as well.
00:01:36Apparently, it can beat Pokemon Fire Red
00:01:37with a minimal vision-only harness now,
00:01:39whereas previously they had to give it additional tools,
00:01:42and it still barely beat it,
00:01:43but now it has no issue.
00:01:45It will also apparently happily one-shot a website
00:01:47from a screenshot.
00:01:48I actually tested this out using the Linear website,
00:01:50and it genuinely got a bit confusing for me
00:01:52which one is which here,
00:01:53but the one on the right is the one
00:01:55that Fable 5 generated
00:01:56from just a screenshot of the Linear website.
00:01:58It didn't use web search or anything like that,
00:02:00I just gave it a full screenshot of this webpage,
00:02:02and I would say it's done a pretty awesome job at it.
00:02:05All of the screenshots, everything,
00:02:06have been generated with code,
00:02:08and you can see it's done a very, very good job.
00:02:10It's things like the SVG animations
00:02:12that aren't going to be perfect,
00:02:14but overall, I would say I'm pretty happy
00:02:15with the way that it's recreated this website,
00:02:18and it's nailed pretty much every section,
00:02:20or at least got me to a point
00:02:21where I could then iterate on it
00:02:22to get it exactly how I want.
00:02:24While we're here,
00:02:24I also decided to test these models
00:02:25on building me a front-end and a back-end
00:02:27for a finance dashboard app
00:02:28from a completely empty folder in one shot,
00:02:31and this is what Fable 5 gave me.
00:02:33I have tested everything,
00:02:34everything is working,
00:02:35it talks to the API,
00:02:37and overall, the design does look really nice.
00:02:39It is really usable,
00:02:40but it is that aesthetic
00:02:41that Claude models seem to be giving recently.
00:02:43We can see that in the result
00:02:44that Opus 4.8 gave me as well.
00:02:45Again, I think this site looks really nice,
00:02:47and to be honest with you,
00:02:48I'd argue this looks better than the Fable 5 one,
00:02:50but again, it has that aesthetic
00:02:51that Claude has been trained on,
00:02:53but that is also my fault.
00:02:54I didn't prompt it toward any particular design.
00:02:56I'm sure if I did,
00:02:57it would have done a great job.
00:02:58If we compare this to what GPT 5.5 gave me,
00:03:00though,
00:03:01you can see it's just not even close.
00:03:03This was from a single prompt,
00:03:04the exact same prompt,
00:03:05and they're just miles behind in UI design,
00:03:07in my opinion.
00:03:08I really hope the next GPT model
00:03:10does something about this.
00:03:11Fable 5 actually surprised me on that test
00:03:13by being the quickest.
00:03:14It took around eight minutes
00:03:15to finish that finance dashboard,
00:03:17whereas Opus took 12 minutes,
00:03:18and GPT 5.5 took 15 minutes
00:03:20to make that abomination.
00:03:22Besides just my demos,
00:03:23one of my favorite ones was Anthropic
00:03:24showing Fable 5 building a 3D-printable CAD model
00:03:27in a browser-based CAD editor
00:03:28that Fable 5 itself also made.
00:03:31Like, building your own mini-software
00:03:32is just so achievable now,
00:03:34and the same thing goes for drugs.
00:03:36Apparently this model is really good at drug design,
00:03:38but you probably don't need to know about that one,
00:03:40and yes, it's definitely safeguarded,
00:03:43as is basically anything
00:03:44that goes near cybersecurity,
00:03:45unless you're one of the enterprises
00:03:46in that special program.
00:03:48Fable 5 is apparently going to be really cautious,
00:03:51which means it's going to have
00:03:51a fair few false positives,
00:03:53apparently less than 5% of messages,
00:03:55but that still seems pretty high to me,
00:03:57and I've actually run into Opus safeguards before,
00:03:59so this one is probably going to be worse.
00:04:01Apparently though,
00:04:02instead of just saying no outright,
00:04:04it will try and send your request
00:04:05to Opus 4.8 first
00:04:06to see if it's safe for that model to do the work,
00:04:09but again, I've run into these safeguards before,
00:04:11so I'm not too sure how well that's going to work.
00:04:13This benchmark actually shows off
00:04:14just how insane those safeguards might be.
00:04:17Testing it on cyber evaluations,
00:04:19Fable 5 with its safeguards
00:04:20passes zero of these tests.
00:04:22It just flat out refuses to do anything,
00:04:24and as I said earlier,
00:04:25if Opus sometimes rejects me
00:04:27with an 88% pass rate on this test,
00:04:29I see a lot of people
00:04:30running into safeguards with Mythos.
00:04:32The final thing to discuss then
00:04:33is the pricing,
00:04:34and this is where things get a little interesting.
00:04:37It's $10 for a million input tokens,
00:04:39and $50 for a million output tokens,
00:04:41which I don't actually think is too bad,
00:04:42it's not the worst that we've ever seen,
00:04:44but what I don't particularly like
00:04:45is this next block.
00:04:47Fable 5 is available from today
00:04:48in Pro, Max, Team, and Enterprise plans,
00:04:50but then in a couple of weeks
00:04:52on June 23rd,
00:04:53they're essentially going to rug pull
00:04:54and take those models away,
00:04:56and after that,
00:04:56it's going to require usage credits.
00:04:58Then after this,
00:04:59they say they're going to add these models
00:05:01back into those plans
00:05:02at some undetermined date.
00:05:04It just seems like an odd way of doing things,
00:05:05and I suppose their goal
00:05:06is to get you hooked on these models,
00:05:08and then take them away from you,
00:05:09and make you spend more money on them,
00:05:11and I think it signals
00:05:12just how expensive these models are
00:05:13for them to run.
00:05:14Oh, and it also uses your limits
00:05:16twice as fast as Opus,
00:05:17so I probably wouldn't set this
00:05:18as your primary model
00:05:19unless you're some kind of billionaire.
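To put the API pricing quoted above in concrete terms, here is a rough sketch of the arithmetic at $10 per million input tokens and $50 per million output tokens. The function name and the example token counts are made up for illustration; only the two prices come from the video.

```python
# Prices quoted in the video, in USD per million tokens.
INPUT_PRICE_PER_MILLION = 10.0
OUTPUT_PRICE_PER_MILLION = 50.0


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Assumed example: a long session with 2M input and 400k output tokens.
    print(f"${estimate_cost(2_000_000, 400_000):.2f}")
```

With those assumed token counts, input and output each come to about $20, so output tokens can dominate the bill even when there are far fewer of them.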
00:05:21The final footnote
00:05:21that I do think is interesting
00:05:23is their new data retention policy.
00:05:25To use these models,
00:05:25they actually require 30-day retention
00:05:27of all traffic
00:05:28on both first- and third-party tools,
00:05:30and supposedly no training
00:05:31is going to be done on this data,
00:05:33it's just again to try
00:05:34and block security threats.
00:05:35So there we go,
00:05:36Mythos is finally here.
00:05:37What do you think about this model release
00:05:39and the future of software?
00:05:40Let me know in the comments down below.
00:05:41While you're there, subscribe,
00:05:42and as always,
00:05:43see you in the next one.
00:05:44Bye.