I Cut My AI Agent Costs 70% With One Change (Manifest)
BBetter Stack
Computing/SoftwareSmall Business/StartupsInternet Technology
Transcript
00:00:00This is Manifest. I switched to it for a weekend and my token costs dropped by 70%.
00:00:05Same agent, same tasks, just better routing. If you're building AI agents, there's a good chance
00:00:11you're paying way more than you should. Most requests don't need GPT-4-0 or Claude Opus,
00:00:17but that's exactly what they're hitting anyway. So our agent ends up using expensive models for
00:00:22basic stuff like classification, routing, summaries, and that's how your bill quietly
00:00:27becomes three to five times higher than it should be. How does Manifest even work? Let's find out.
00:00:37Here's where things break down. Agents don't just make a few calls, they make thousands of these calls.
00:00:44And most of those calls are really simple. Pick a tool, summarize a chunk, classify input. But if
00:00:50everything goes to the best model, you're paying a premium price for rather basic work. So you could
00:00:57try to fix it, I guess by writing routing logic, and now your code is full of all these if-else
00:01:02statements that break the second your prompts change. Okay, yes, we could just use OpenRouter,
00:01:08sure, but there's a fee with that. And then your prompts actually leave the machine. I guess there's
00:01:13also something called Lite LLM you could try, which is solid, but you still have to manage routing
00:01:18manually. So the real problem isn't access to models, it's choosing the right one every single time.
00:01:25And that, ladies and gentlemen, is what Manifest does. It sits between your agent and your models.
00:01:31You send one request, it scores it across 23 dimensions, and roots it to the cheapest model
00:01:36that can handle it. There's no rewrites in just one endpoint. If you enjoy coding tools and tips like
00:01:41this, be sure to subscribe. We have videos coming out all the time. All right, sweet. Now let me show you.
00:01:47Same agent, same task. I spin up Manifest with Docker here, simple curl command, Docker Compose up,
00:01:55and now I point my OpenAI endpoint to it. That's the only change here. Now I can link different ones
00:02:01here, as you can see, Anthropic, OpenAI, Olama. I chose OpenAI, dropped in my key, and I linked in
00:02:08Olama so it can go between the two. And now we're going to run this Python script. You can see I'm using
00:02:12the Manifest API key here. That's the only key we need because Manifest has the other ones, okay?
00:02:18So when we run this, the agent starts working. And instead of sending everything to an expensive
00:02:24model, Manifest makes a decision. This one's simple. Root it cheaper. Now jump back here. Our dashboard
00:02:31updates in real time, showing us token usage, cost per agent, and budget tracking. The key number
00:02:38can change, but it can be anywhere up to 70% cheaper. Same output, lower cost, and because
00:02:44this runs locally, your prompts don't leave your machine just to be rooted. This didn't take a whole
00:02:50lot of time or resources, so it's something worth integrating into your flow, especially if you're
00:02:55building and using AI. Okay, so now what actually happens here? You can think of Manifest as like a
00:03:00controller, right? Your agent sends one request in, Manifest decides where it should actually go,
00:03:07so that could be an API model, could be a subscription, a local model, a llama or llama CPP.
00:03:14It supports hundreds of models across tons of providers, but here's the important part to all
00:03:19this. It doesn't call another LLM to decide. That would be counterintuitive, so it would just be
00:03:25slow and expensive. Instead, it uses deterministic scoring, so rooting happens under two milliseconds.
00:03:32No added latency to any of this. Manifest just sits in the middle, and it makes better decisions,
00:03:38and it's clearly built for agents. Open call plugin, multi-agent tracking, we have those, and we even
00:03:44have observability built in. The biggest savings don't come from hard prompts. They come from all the
00:03:50small ones. Really just the boring calls our agents make constantly. Okay, so real quick, how is this
00:03:56different from tools that we already know, so I'm going to compare this really quickly? I mentioned
00:04:01OpenRouter earlier. So OpenRouter gives you one cloud endpoint, but your traffic still leaves your
00:04:06system. Manifest can run fully self-hosted. Then we have the tool I mentioned of Lite LLM. This gives
00:04:13you a unified interface, but rooting is still something you have to control manually. Manifest handles
00:04:19routing automatically. There's also routing intelligence. Now, where Manifest scores requests across 23
00:04:25dimensions, that is their version of routing intelligence. Other things like this rely on failover
00:04:31or rules. Then we have subscriptions. Yes. So while you don't actually pay for Manifest, you still
00:04:38obviously need things like an OpenAI or Clawed API key, right? Now, agent focus is something where
00:04:46Manifest actually stands out. It's built for multi-agent workflows. So the difference is simple.
00:04:51If you want access, just use OpenRouter, right? If you want control, there's Lite LLM. But if your
00:04:57problem is actually cost from agents, because we're making all these API calls, Manifest is built for
00:05:03that. There are countless tools to bring down your costs. You just need to find them, and this is one
00:05:08of the ways. Now, being honest here, because it's great, but with an AI tool, you're going to get some
00:05:14things that might have you just honestly scratching your head. First, the good. Where the first would
00:05:19be savings, especially with subscription routing. You're using plans you already pay for instead of
00:05:26paying per token again. Then the fallbacks, right? If something fails, your agent keeps going, which is
00:05:33a huge win. Then we have the dashboard. The dashboard is great because you can actually see where your money
00:05:38is going across different models, per agent, per task, all in real time. And it works with existing
00:05:45clients without any big rewrites. But like I said, there are things that we would expect a tool like
00:05:50this such to have. And you know, there's things like your scoring is going to be opinionated, right?
00:05:56AI. Okay. So sometimes it routes cheaper than you'd expect. You can override that, but you need to know
00:06:02it's happening in the background. Setup also isn't zero because you're still managing keys and wiring
00:06:07providers, but it was dead simple. And devs still want more SDKs, more storage options, and more
00:06:13features. So yeah, it's really cool, but it's still infrastructure. It's not perfect. Some things need
00:06:19to be tweaked. It's definitely worth it if you run agents every day, or if your agents make lots of
00:06:25small calls. Heck, even if you care about keeping prompts local, this is great, but maybe not if you
00:06:32want zero setup. In that case, something like open router is simpler, but for most of us devs building
00:06:38agents, this is one of the fastest ways to reduce your cost because you don't change your agent. We keep
00:06:44everything. You just change how it routes together. Same inputs, same outputs, lower bill. And that's the
00:06:50key here. If you enjoy coding tools and tips like this, be sure to subscribe to the BetterStack channel.
00:06:54We'll see you in another video.
Community Posts
No posts yet. Be the first to write about this video!
Write about this video