Claude Mythos is FINALLY here (Fable 5)

Better Stack
Computers/Software, Economic News, AI/Future Technology

Transcript

00:00:00Claude Mythos is finally here.
00:00:01Anthropic just dropped a new model called Fable 5,
00:00:03which is a Mythos-class model,
00:00:05just with loads of safeguards built in,
00:00:07but it exceeds any model they've ever released,
00:00:09and possibly anyone.
00:00:11It is state-of-the-art on nearly every single benchmark.
00:00:13Obviously, though, this is definitely going to cost you,
00:00:16and they've done something a little interesting
00:00:17with the pricing here
00:00:18that I don't think too many people are going to be happy with.
00:00:25Now, normally, I don't like to spend too long
00:00:27on the benchmarks, but this table is kind of insane.
00:00:30The jumps that this model is making
00:00:31on some of these benchmarks,
00:00:32and the fact that it's ahead on nearly every single one.
00:00:35You can see it has a 10% jump in agentic coding
00:00:37on SWE Bench Pro,
00:00:39and it's basically 20% ahead of GPT 5.5,
00:00:42and it made similar leaps on the Frontier Code benchmark.
00:00:44Frontier Code is actually a new benchmark from Cognition,
00:00:47the guys behind Devin,
00:00:48that essentially tests whether maintainers
00:00:49would actually merge the code that this model produces.
00:00:52On this chart, you can see Fable 5 is ahead
00:00:54of every other model,
00:00:55even at a medium reasoning effort,
00:00:57but I also think you can see this model
00:00:58is going to be super expensive.
00:01:00It's also marginally better at computer use,
00:01:02not a massive leap,
00:01:03and the same goes for Terminal Bench at the bottom,
00:01:05but again, as you can see,
00:01:06it's a leader in nearly every single category.
00:01:09One of the biggest things, though,
00:01:10that's becoming more and more relevant
00:01:11is long-running tasks.
00:01:12Fable 5 can apparently work for longer
00:01:14than any other model,
00:01:15and they had Stripe test this out,
00:01:17and apparently it performed a codebase-wide migration
00:01:18of a 50 million line Ruby codebase
00:01:21in a single day.
00:01:22Probably helped out by the fact that it's gotten
00:01:24much better at memory and long context as well.
00:01:26It can apparently stay focused across millions
00:01:28of tokens in long-running tasks,
00:01:29and it improves its own outputs
00:01:31by using its own notes.
00:01:32Now, besides just coding,
00:01:33its vision capabilities are pretty awesome as well.
00:01:36Apparently, it can beat Pokemon Fire Red
00:01:37with a minimal vision-only harness now,
00:01:39whereas previously they had to give it additional tools,
00:01:42and it still barely beat it,
00:01:43but now it has no issue.
00:01:45It will also apparently happily one-shot a website
00:01:47from a screenshot.
00:01:48I actually tested this out using the Linear website,
00:01:50and it genuinely got a bit confusing for me
00:01:52which one is which here,
00:01:53but the one on the right is the one
00:01:55that Fable 5 generated
00:01:56from just a screenshot of the Linear website.
00:01:58It didn't use web search or anything like that,
00:02:00I just gave it a full screenshot of this webpage,
00:02:02and I would say it's done a pretty awesome job at it.
00:02:05All of the screenshots, everything,
00:02:06have been generated with code,
00:02:08and you can see it's done a very, very good job.
00:02:10It's things like the SVG animations
00:02:12that aren't going to be perfect,
00:02:14but overall, I would say I'm pretty happy
00:02:15with the way that it's recreated this website,
00:02:18and it's nailed pretty much every section,
00:02:20or at least got me to a point
00:02:21where I could then iterate on it
00:02:22to get it exactly how I want.
00:02:24While we're here,
00:02:24I also decided to test these models
00:02:25on building me a front-end and a back-end
00:02:27for a finance dashboard app
00:02:28from a completely empty folder in one shot,
00:02:31and this is what Fable 5 gave me.
00:02:33I have tested everything,
00:02:34everything is working,
00:02:35it talks to the API,
00:02:37and overall, the design does look really nice.
00:02:39It is really usable,
00:02:40but it is that aesthetic
00:02:41that Claude models seem to be giving recently.
00:02:43We can see that in the result
00:02:44that Opus 4.8 gave me as well.
00:02:45Again, I think this site looks really nice,
00:02:47and to be honest with you,
00:02:48I'd argue this looks better than the Fable 5 one,
00:02:50but again, it has that aesthetic
00:02:51that Claude has been trained on,
00:02:53but that is also my fault.
00:02:54I didn't prompt this to go in any particular design.
00:02:56I'm sure if I did,
00:02:57it would have done a great job.
00:02:58If we compare this to what GPT 5.5 gave me,
00:03:00though,
00:03:01you can see it's just not even close.
00:03:03This was from a single prompt,
00:03:04the exact same prompt,
00:03:05and they're just miles behind in UI design,
00:03:07in my opinion.
00:03:08I really hope the next GPT model
00:03:10does something about this.
00:03:11Fable 5 actually surprised me on that test
00:03:13by being the quickest.
00:03:14It took around eight minutes
00:03:15to finish that finance dashboard,
00:03:17whereas Opus took 12 minutes,
00:03:18and GPT 5.5 took 15 minutes
00:03:20to make that abomination.
00:03:22Besides just my demos,
00:03:23one of my favorite ones was Anthropic,
00:03:24showing Fable 5 building a 3D printable CAD model
00:03:27in a browser-based CAD editor
00:03:28that Fable 5 itself also made.
00:03:31Like, building your own mini-software
00:03:32is just so achievable now,
00:03:34and the same thing goes for drugs.
00:03:36Apparently this model is really good at drug design,
00:03:38but you probably don't need to know about that one,
00:03:40and yes, it's definitely safeguarded,
00:03:43as is basically anything
00:03:44that goes near cybersecurity,
00:03:45unless you're one of the enterprises
00:03:46in that special program.
00:03:48Fable 5 is apparently going to be really cautious,
00:03:51which means it's going to have
00:03:51a fair few false positives,
00:03:53apparently less than 5% of messages,
00:03:55but that still seems pretty high to me,
00:03:57and I've actually run into Opus safeguards before,
00:03:59so this one is probably going to be worse.
00:04:01Apparently though,
00:04:02instead of just saying no outright,
00:04:04it will try and send your request
00:04:05to Opus 4.8 first
00:04:06to see if it's safe for that model to do the work,
00:04:09but again, I've run into these safeguards before,
00:04:11so I'm not too sure how well that's going to work.
00:04:13This benchmark actually shows off
00:04:14just how insane those safeguards might be.
00:04:17Testing it on cyber evaluations,
00:04:19Fable 5 with its safeguards
00:04:20passes zero of these tests.
00:04:22It just flat out refuses to do anything,
00:04:24and as I said earlier,
00:04:25if Opus sometimes rejects me
00:04:27with an 88% pass rate on this test,
00:04:29I see a lot of people
00:04:30running into safeguards with Mythos.
00:04:32The final thing to discuss then
00:04:33is the pricing,
00:04:34and this is where things get a little interesting.
00:04:37It's $10 for a million input tokens,
00:04:39and $50 for a million output tokens,
00:04:41which I don't actually think is too bad,
00:04:42it's not the worst that we've ever seen,
00:04:44but what I don't particularly like
00:04:45is this next block.
00:04:47Fable 5 is available from today
00:04:48in Pro, Max, Team, and Enterprise plans,
00:04:50but then in a couple of weeks
00:04:52on June 23rd,
00:04:53they're essentially going to rug pull
00:04:54and take those models away,
00:04:56and after that,
00:04:56it's going to require usage credits.
00:04:58Then after this,
00:04:59they say they're going to add these models
00:05:01back into those plans
00:05:02at some undetermined date.
00:05:04It just seems like an odd way of doing things,
00:05:05and I suppose their goal
00:05:06is to get you hooked on these models,
00:05:08and then take them away from you,
00:05:09and make you spend more money on them,
00:05:11and I think it signals
00:05:12just how expensive these models are
00:05:13for them to run.
00:05:14Oh, and it also uses your limits
00:05:16twice as fast as Opus,
00:05:17so I probably wouldn't set this
00:05:18as your primary model
00:05:19unless you're some kind of billionaire.
00:05:21The final footnote
00:05:21that I do think is interesting
00:05:23is their new data retention policy.
00:05:25To use these models,
00:05:25they actually require 30-day retention
00:05:27of all traffic
00:05:28on both first- and third-party tools,
00:05:30and supposedly no training
00:05:31is going to be done on this data,
00:05:33it's just again to try
00:05:34and block security threats.
00:05:35So there we go,
00:05:36Mythos is finally here.
00:05:37What do you think about this model release
00:05:39and the future of software?
00:05:40Let me know in the comments down below.
00:05:41While you're there, subscribe,
00:05:42and as always,
00:05:43see you in the next one.
00:05:44Bye.

Key Takeaway

Anthropic's Fable 5 delivers state-of-the-art coding and reasoning performance but introduces stringent security-focused data retention policies and a costly, credit-based consumption model.

Highlights

  • Fable 5 posts a 10% jump on SWE Bench Pro, where it sits roughly 20% ahead of GPT 5.5, and makes similar gains on the Frontier Code benchmark.

  • A 50 million line Ruby codebase migration was completed in a single day using Fable 5.

  • Fable 5 is priced at $10 per million input tokens and $50 per million output tokens.

  • The model requires 30-day traffic retention for all first- and third-party tools as a security measure.

  • Fable 5 recreated a website from a single screenshot and, from a single prompt, built a finance dashboard front end and back end in approximately 8 minutes.

  • Usage credits will be required for access starting June 23rd, temporarily removing the model from the Pro, Max, Team, and Enterprise plans.
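
The per-token prices quoted in the video can be turned into a rough per-request estimate. The sketch below is only arithmetic on the two quoted rates ($10 per million input tokens, $50 per million output tokens); the function name, constant names, and the example workload are hypothetical, and it ignores usage credits, plan limits, and any discounts the video does not describe.

```python
# Rates as quoted in the video, in USD per million tokens.
INPUT_PRICE_PER_MTOK = 10.0
OUTPUT_PRICE_PER_MTOK = 50.0


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the quoted rates.

    Hypothetical helper for illustration; not part of any official SDK.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_PRICE_PER_MTOK
             + output_tokens * OUTPUT_PRICE_PER_MTOK)
    return total / 1_000_000


if __name__ == "__main__":
    # Assumed workload: 200k tokens of context in, 20k tokens generated.
    print(f"${estimate_cost(200_000, 20_000):.2f}")
```

At these rates output tokens cost five times as much as input tokens, so long generations dominate the bill even when the prompt is much larger than the response.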

Timeline

Performance Benchmarks and Coding Capabilities

  • Fable 5 leads in nearly every industry-standard coding benchmark.
  • The model maintains focus across millions of tokens for long-running tasks.
  • Stripe validated the model's ability to handle massive codebase migrations.

Fable 5 sets a new standard in coding benchmarks, specifically showing a 10% improvement on SWE Bench Pro. It excels at long-context tasks, demonstrated by its ability to perform a codebase-wide migration of a 50 million line Ruby project in one day. The model's reasoning capabilities are enhanced by its ability to refine outputs using its own generated notes.

Vision, UI Generation, and Tooling

  • Fable 5 reconstructs functional websites directly from single screenshots.
  • The model generated a complete full-stack application in around 8 minutes.
  • Vision capabilities now support complex tasks like beating video games without additional tools.

The model demonstrates significant vision capabilities, successfully recreating complex websites like Linear from static screenshots without external search tools. It also generated a complete finance dashboard, including both front end and back end, in approximately 8 minutes, faster than Opus 4.8 (12 minutes) and GPT 5.5 (15 minutes). The presenter rated its UI design well ahead of GPT 5.5's but slightly preferred the Opus 4.8 result.

Safeguards, Pricing, and Data Policies

  • Cybersecurity-related tasks trigger aggressive safety refusals.
  • All user traffic requires 30-day retention for threat monitoring.
  • Access moves to a credit-based model on June 23rd.

Strict safeguards lead the model to refuse cybersecurity tasks: with safeguards enabled, it passes zero of the cyber evaluations shown. False positives are reportedly expected on fewer than 5% of messages, and a flagged request is first sent to Opus 4.8 rather than refused outright. To mitigate risks, Anthropic requires a 30-day retention period for all traffic. Starting June 23rd, access shifts from plan inclusion to usage credits, and on plans the model uses up limits twice as fast as Opus.
