How Is This Almost as Good as Opus?

Better Stack

Transcript

00:00:00Minimax just dropped M2.5, a coding model that nearly beats Claude Opus 4.6, but costs one tenth as much.
00:00:07It launched just the other day; it has open weights, 230 billion parameters, and it's built for agent workflows.
00:00:14If you're building AI agents, co-pilots, or automation tools, this will change your costs overnight.
00:00:19And the wild part is not just the benchmarks, it's also the price.
00:00:23We have videos coming out all the time, be sure to subscribe.
00:00:31Minimax M2.5 is a mixture of experts model that has 230 billion total parameters, but only 10 billion are active when it runs.
00:00:39So you get a huge model without paying for the whole thing every time.
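To make that "huge model, small active slice" idea concrete, here's a toy mixture-of-experts sketch in Python. The expert count, sizes, and top-1 routing are illustrative assumptions, not M2.5's actual architecture; the point is just that a router picks a small subset of experts per token, so only a fraction of the total parameters do work on any given step.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 23   # toy stand-in: imagine total params split across many experts
TOP_K = 1        # only the top expert fires per token (cf. ~10B active of 230B)
D = 8            # toy hidden size

# Each expert is a small feed-forward weight matrix; the router scores them.
experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
router = rng.standard_normal((D, N_EXPERTS))

def moe_forward(x):
    """Route the token vector x to its top-k experts only."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]            # indices of chosen experts
    w = np.exp(logits[top] - logits[top].max())  # softmax over chosen experts
    w /= w.sum()
    # Only the chosen experts' weights are touched for this token.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top)), top

x = rng.standard_normal(D)
y, used = moe_forward(x)
print(f"active experts this token: {len(used)} of {N_EXPERTS}")
```

Every token still sees the full router, but compute and memory bandwidth scale with the active experts, which is why inference pricing can stay close to a 10B-class model.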
00:00:43It's built for real-world development workflows, using Python, Java, Rust, multi-file refactors, tool calling loops, even Word and Excel automation.
00:00:53Now there are two versions of this: Standard, which runs at 50 tokens per second, and Lightning, which runs at 100 tokens per second.
00:01:01It's multilingual, and it's fully open weights on Hugging Face.
00:01:05That means you can fine-tune it, run it on-prem, and avoid lock-in, and this is where things start to get interesting for agents.
00:01:12I ran the same prompt on both Opus and Minimax to build out a full-stack Kanban board.
00:01:18Nothing too crazy here, just enough to really get them to build something to see how they compare.
00:01:23The exact prompt that I used I put in the description if you guys want to read over it, but first we're going to look here at the Opus version, which took about 4 minutes to run.
00:01:31We get as we would expect, I didn't have to prompt it again, this was the final output.
00:01:37Everything here is super smooth, it works really well, and the UI also looks pretty good for being a starter.
00:01:44Drag and drop works as it should, and editing tasks also works as it should. I really like this little label here with the correct folder, and it changes as we drag them. That's a cool bonus.
00:01:55All in all, Opus did a really good job here, that's kind of what I expected going into this.
00:02:00Now, on to Minimax. This took about 8 minutes to finish, maybe because I imported it into Cursor instead of running it on their site, but I wanted it in Cursor.
00:02:10While it did take longer, it cost one tenth of the price, so I'm not going to argue with that.
00:02:14All in all, it did a really good job off only one prompt. The UI lacks a bit compared to Opus, but we still have the same functionality.
00:02:22I can create tasks, drag and drop them into the correct column, so all that works great.
00:02:27The only thing it did not do is add that little label that I liked onto each card as Opus did.
00:02:33Another point it didn't get right was the ability to edit the description of the box.
00:02:38If I edit the description, you see here, nothing changes.
00:02:42So I would have to run this a second time to get that to do what it needs to do, basically.
00:02:48Now that's still okay, because again, one tenth the cost.
00:02:51Now let's talk about what actually matters to developers. M2.5 uses reinforcement learning for task decomposition.
00:02:58So it breaks problems down better, which leads to 20% fewer tool calls and 5% less token waste.
00:03:06If you've built agents before, you know tool calls are where things start to get expensive and they can lead to a mess.
00:03:13It also handles multi-file edits, run-debug-fix loops, those types of things, switching between tools without falling apart.
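The run-debug-fix loop described above is the core of most coding agents. Here's a minimal, simulated sketch of that loop; `model_propose_patch` and `run_tests` are hypothetical stand-ins (a real agent would call the model's tool-calling API and execute the project's test suite), wired to converge after a couple of attempts so the control flow is runnable.

```python
def model_propose_patch(error_log, attempt):
    # Hypothetical stand-in for an LLM call: a real agent would send the
    # failing test output back to the model and apply the returned diff.
    return "fixed" if attempt >= 2 else "still-broken"

def run_tests(code):
    # Stand-in for actually executing the test suite on the current code.
    return (True, "") if code == "fixed" else (False, "AssertionError: ...")

def agent_loop(max_attempts=5):
    """Run tests; on failure, ask the model for a patch and retry."""
    code = "initial"
    for attempt in range(1, max_attempts + 1):
        ok, log = run_tests(code)
        if ok:
            return attempt, code
        code = model_propose_patch(log, attempt)
    return max_attempts, code

attempts, final = agent_loop()
print(f"converged after {attempts} test runs")  # each retry = more tool calls
```

Each extra iteration is another round of tool calls and tokens, which is why the claimed 20% reduction in tool calls translates fairly directly into cost.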
00:03:21On search benchmarks, it reduces search rounds by 20% compared to their previous M2.1.
00:03:27It supports caching too, which means repeated queries can cost less over time.
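Here's a back-of-the-envelope sketch of why prompt caching lowers repeat-query cost. The 10%-of-normal rate for cached tokens is an assumption for illustration only, not Minimax's published discount; the input price is the Standard tier's $0.15 per million tokens from later in the video.

```python
IN_PRICE = 0.15 / 1_000_000  # M2.5 Standard input price per token
CACHE_DISCOUNT = 0.10        # ASSUMED: cached tokens billed at 10% of normal

def request_cost(prompt_tokens, cached_prefix_tokens):
    """Cost of one request when part of the prompt prefix is cache-hit."""
    fresh = prompt_tokens - cached_prefix_tokens
    return fresh * IN_PRICE + cached_prefix_tokens * IN_PRICE * CACHE_DISCOUNT

cold = request_cost(50_000, 0)        # first call: nothing cached
warm = request_cost(50_000, 45_000)   # repeat call reusing a 45k-token prefix
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

For agents that resend the same system prompt and repo context on every step, most of the input tokens become cached-prefix hits, so the per-step cost drops sharply after the first call.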
00:03:32You can plug it right into Ollama, local clusters, GitHub automations, or your CI pipelines.
00:03:37Now benchmarks, right? I'm comparing this to Opus here.
00:03:40Well, on SWE bench verified, M2.5 scored over 80%.
00:03:45Claude Opus 4.6 scored slightly higher, at just over 80% as well. That's a really small gap here.
00:03:52On the multi-SWE bench, it scores over 51% topping other open models.
00:03:58And on DROID, it actually beats Opus by just 0.2%, right? So it depends on where you look here.
00:04:05Now speed. It's 37% faster than their previous model. It still took 8 minutes here, okay?
00:04:11Opus 4.6 averages a slightly faster speed, but the two become about identical when you run M2.5 in the right setup.
00:04:18So what does this mean for you? Well, it could mean a few things.
00:04:20It could mean fewer retries, cleaner CI runs, less token churn, more merged pull requests.
00:04:26And in agentic task performance, it's matching things like GPT-5 or Gemini 3 Pro territory,
00:04:32but with open weights, right? So now let's talk about the part that changes things,
00:04:37which really here, even if it took longer, is the pricing.
00:04:40M2.5 standard costs $0.15 per million input tokens and $1.20 per million output tokens.
00:04:47Lightning is double that. So $0.30 per million input, $2.40 per million output.
00:04:53Running lightning at 100 tokens per second for an hour, that's about a dollar.
00:04:56If you run standard, which I actually did here, it's about 30 cents per hour.
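Those per-hour figures check out with simple arithmetic. This sketch counts output tokens only, at sustained full throughput (both simplifying assumptions; input tokens would add a little on top), using the prices quoted above.

```python
SECONDS_PER_HOUR = 3600

def hourly_output_cost(tokens_per_sec, price_per_million_output):
    """Cost of one hour of generation, output tokens only."""
    tokens = tokens_per_sec * SECONDS_PER_HOUR
    return tokens / 1_000_000 * price_per_million_output

lightning = hourly_output_cost(100, 2.40)  # 360k tokens/hr at $2.40/M
standard = hourly_output_cost(50, 1.20)    # 180k tokens/hr at $1.20/M
print(f"Lightning: ${lightning:.2f}/hr  Standard: ${standard:.2f}/hr")
```

That lands at roughly $0.86/hour for Lightning and $0.22/hour for Standard, matching the "about a dollar" and "about 30 cents" figures once input tokens are included.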
00:05:00Now compare that to Claude Opus 4.6. Huge difference.
00:05:04$5 per million input tokens, $25 per million output tokens.
00:05:09Per SWE task, it's roughly 10% of Opus costs, helped by efficiency and fewer tool calls.
00:05:15There's also the free API tier, which is live right now. I did pay for this,
00:05:20okay, but they do have that. And that's where the economics really start to shift.
00:05:24So should you switch from Opus 4.6? Well, performance wise, they're nearly identical.
00:05:30Took a bit longer, right? I was on the standard, not lightning, but they're kind of identical.
00:05:34Task completion time is basically the same. Reasoning depth was comparable.
00:05:39Cost wise, though, that's massively cheaper. So you tell me there.
00:05:43It also uses 20% fewer tool calls and wastes fewer tokens, like I said earlier.
00:05:47So flexibility-wise, it's open weights, which means you can deploy it locally and fine-tune it.
00:05:52And then Opus still does have an edge at the very top end of premium intelligence.
00:05:57So, right, that's the premium model that we're still working with.
00:06:00Now here's why this matters, because now you can run agents at scale without that price burden.
00:06:05Because M2.5 has a 59% win rate on advanced agent benchmarks, you can build autonomous
00:06:12repo bots, run persistent coding agents, automate enterprise workflows, right? It's not perfect,
00:06:17but it's really, really good for what we saw here. And the pricing is going to allow you to really
experiment and put it to the full test. And Minimax is shipping fast, moving from a months cadence
to a weeks cadence here. Ollama and GitHub integrations are already ramping up.
00:06:32Minimax M2.5 delivers Opus-level coding performance at a budget price with open weights. That
00:06:38combination is rare, but in 2026, who knows what we're going to see. You can test it out for free over on
00:06:43Minimax or run it on Ollama or pick up an API like I did. Is this the new default model for
00:06:48developer agents? I guess we're going to see how that plays out. We'll see you in another video.

Key Takeaway

Minimax M2.5 provides a high-performance, open-weight alternative to top-tier proprietary models like Claude Opus 4.6, offering comparable coding capabilities at a fraction of the cost for developer agents.

Highlights

Minimax M2.5 is a new 230-billion parameter mixture-of-experts model designed specifically for coding and agent workflows.

The model offers performance nearly identical to Claude Opus 4.6 but at approximately 1/10th of the operational cost.

M2.5 features open weights on Hugging Face, allowing developers to fine-tune, run on-premise, and avoid vendor lock-in.

Advanced reinforcement learning enables better task decomposition, resulting in 20% fewer tool calls and 5% less token waste.

Benchmark scores are highly competitive, with the model achieving over 80% on SWE-bench Verified and beating Opus by 0.2% on DROID.

Two versions are available: 'Standard' at 50 tokens per second and 'Lightning' at 100 tokens per second.

The pricing for the Standard tier is significantly lower than competitors at $0.15 per million input tokens.

Timeline

Introduction to Minimax M2.5

The speaker introduces the new Minimax M2.5 coding model, highlighting its launch as a competitor to Claude Opus 4.6. This model features open weights and 230 billion parameters, specifically optimized for agent workflows and automation. The primary value proposition is its significantly lower cost, which is described as being one-tenth that of its rivals. It is positioned as a game-changer for building AI agents, co-pilots, and general automation tools. The speaker also emphasizes that regular updates and more videos on this topic are available for subscribers.

Model Architecture and Core Features

Minimax M2.5 uses a mixture of experts (MoE) architecture where only 10 billion of its 230 billion total parameters are active at any given time. This design allows for a large model's capabilities without the typical associated costs of running such a massive system. It supports real-world development workflows across languages like Python, Java, and Rust, including complex tasks like multi-file refactoring. There are two distinct performance tiers: Standard at 50 tokens per second and Lightning at 100 tokens per second. Being open-weight on Hugging Face, it allows for local deployment and fine-tuning to prevent vendor lock-in.

Direct Performance Comparison: Kanban Board Test

The speaker conducts a head-to-head comparison by prompting both Claude Opus and Minimax M2.5 to build a full-stack Kanban board. While Opus completed the task in 4 minutes with a very polished UI and specific features like automatic folder labels, Minimax took about 8 minutes. Despite the longer completion time, Minimax's output was functional, supporting task creation and drag-and-drop actions. The speaker notes that the Minimax UI was slightly less refined and missed some small details like the folder labels and description editing. However, the significantly lower cost of Minimax makes these minor differences acceptable for most development use cases.

Technical Improvements and Benchmarks

A deep dive into the technical improvements reveals that M2.5 uses reinforcement learning for better task decomposition. This results in 20% fewer tool calls and 5% less token waste, which is crucial for maintaining efficient agentic workflows. On benchmarks like SWE-bench Verified, both models scored over 80%, indicating a very narrow performance gap. On the DROID benchmark, Minimax actually surpassed Opus by a margin of 0.2%, proving its high-end reasoning capabilities. Speed-wise, the new model is 37% faster than the previous M2.1 version, though actual completion times can vary based on the execution environment.

Pricing Structure and ROI Analysis

Pricing is the most significant differentiator, with the M2.5 Standard costing only $0.15 per million input tokens compared to $5.00 for Claude Opus. The Lightning version is priced at $0.30 per million input, which is still vastly cheaper than the $25.00 per million output tokens charged by competitors. Running the Standard tier for an entire hour costs only about 30 cents, making it highly economical for persistent coding agents. The speaker emphasizes that this shift in economics allows for much more experimentation and large-scale deployment. A free API tier is also available for those who want to test the model without immediate commitment.

Final Verdict and Strategic Outlook

In conclusion, while Claude Opus 4.6 maintains a slight edge in 'premium intelligence,' Minimax M2.5 is nearly identical in reasoning depth and task completion. The open-weight nature and 59% win rate on advanced agent benchmarks make it an ideal choice for autonomous repo bots and enterprise workflows. The speaker notes that Minimax is shipping updates at an impressive pace, moving on a weekly rather than monthly basis. Integrations with popular tools like Ollama and GitHub are already ramping up, suggesting it could become the new default for developers. The video ends by encouraging viewers to test the model and consider the new possibilities in the 2026 AI landscape.
