I Cut My AI Agent Costs 70% With One Change (Manifest)

BBetter Stack
Computing/SoftwareSmall Business/StartupsInternet Technology

Transcript

00:00:00This is Manifest. I switched to it for a weekend and my token costs dropped by 70%.
00:00:05Same agent, same tasks, just better routing. If you're building AI agents, there's a good chance
00:00:11you're paying way more than you should. Most requests don't need GPT-4-0 or Claude Opus,
00:00:17but that's exactly what they're hitting anyway. So our agent ends up using expensive models for
00:00:22basic stuff like classification, routing, summaries, and that's how your bill quietly
00:00:27becomes three to five times higher than it should be. How does Manifest even work? Let's find out.
00:00:37Here's where things break down. Agents don't just make a few calls, they make thousands of these calls.
00:00:44And most of those calls are really simple. Pick a tool, summarize a chunk, classify input. But if
00:00:50everything goes to the best model, you're paying a premium price for rather basic work. So you could
00:00:57try to fix it, I guess by writing routing logic, and now your code is full of all these if-else
00:01:02statements that break the second your prompts change. Okay, yes, we could just use OpenRouter,
00:01:08sure, but there's a fee with that. And then your prompts actually leave the machine. I guess there's
00:01:13also something called Lite LLM you could try, which is solid, but you still have to manage routing
00:01:18manually. So the real problem isn't access to models, it's choosing the right one every single time.
00:01:25And that, ladies and gentlemen, is what Manifest does. It sits between your agent and your models.
00:01:31You send one request, it scores it across 23 dimensions, and roots it to the cheapest model
00:01:36that can handle it. There's no rewrites in just one endpoint. If you enjoy coding tools and tips like
00:01:41this, be sure to subscribe. We have videos coming out all the time. All right, sweet. Now let me show you.
00:01:47Same agent, same task. I spin up Manifest with Docker here, simple curl command, Docker Compose up,
00:01:55and now I point my OpenAI endpoint to it. That's the only change here. Now I can link different ones
00:02:01here, as you can see, Anthropic, OpenAI, Olama. I chose OpenAI, dropped in my key, and I linked in
00:02:08Olama so it can go between the two. And now we're going to run this Python script. You can see I'm using
00:02:12the Manifest API key here. That's the only key we need because Manifest has the other ones, okay?
00:02:18So when we run this, the agent starts working. And instead of sending everything to an expensive
00:02:24model, Manifest makes a decision. This one's simple. Root it cheaper. Now jump back here. Our dashboard
00:02:31updates in real time, showing us token usage, cost per agent, and budget tracking. The key number
00:02:38can change, but it can be anywhere up to 70% cheaper. Same output, lower cost, and because
00:02:44this runs locally, your prompts don't leave your machine just to be rooted. This didn't take a whole
00:02:50lot of time or resources, so it's something worth integrating into your flow, especially if you're
00:02:55building and using AI. Okay, so now what actually happens here? You can think of Manifest as like a
00:03:00controller, right? Your agent sends one request in, Manifest decides where it should actually go,
00:03:07so that could be an API model, could be a subscription, a local model, a llama or llama CPP.
00:03:14It supports hundreds of models across tons of providers, but here's the important part to all
00:03:19this. It doesn't call another LLM to decide. That would be counterintuitive, so it would just be
00:03:25slow and expensive. Instead, it uses deterministic scoring, so rooting happens under two milliseconds.
00:03:32No added latency to any of this. Manifest just sits in the middle, and it makes better decisions,
00:03:38and it's clearly built for agents. Open call plugin, multi-agent tracking, we have those, and we even
00:03:44have observability built in. The biggest savings don't come from hard prompts. They come from all the
00:03:50small ones. Really just the boring calls our agents make constantly. Okay, so real quick, how is this
00:03:56different from tools that we already know, so I'm going to compare this really quickly? I mentioned
00:04:01OpenRouter earlier. So OpenRouter gives you one cloud endpoint, but your traffic still leaves your
00:04:06system. Manifest can run fully self-hosted. Then we have the tool I mentioned of Lite LLM. This gives
00:04:13you a unified interface, but rooting is still something you have to control manually. Manifest handles
00:04:19routing automatically. There's also routing intelligence. Now, where Manifest scores requests across 23
00:04:25dimensions, that is their version of routing intelligence. Other things like this rely on failover
00:04:31or rules. Then we have subscriptions. Yes. So while you don't actually pay for Manifest, you still
00:04:38obviously need things like an OpenAI or Clawed API key, right? Now, agent focus is something where
00:04:46Manifest actually stands out. It's built for multi-agent workflows. So the difference is simple.
00:04:51If you want access, just use OpenRouter, right? If you want control, there's Lite LLM. But if your
00:04:57problem is actually cost from agents, because we're making all these API calls, Manifest is built for
00:05:03that. There are countless tools to bring down your costs. You just need to find them, and this is one
00:05:08of the ways. Now, being honest here, because it's great, but with an AI tool, you're going to get some
00:05:14things that might have you just honestly scratching your head. First, the good. Where the first would
00:05:19be savings, especially with subscription routing. You're using plans you already pay for instead of
00:05:26paying per token again. Then the fallbacks, right? If something fails, your agent keeps going, which is
00:05:33a huge win. Then we have the dashboard. The dashboard is great because you can actually see where your money
00:05:38is going across different models, per agent, per task, all in real time. And it works with existing
00:05:45clients without any big rewrites. But like I said, there are things that we would expect a tool like
00:05:50this such to have. And you know, there's things like your scoring is going to be opinionated, right?
00:05:56AI. Okay. So sometimes it routes cheaper than you'd expect. You can override that, but you need to know
00:06:02it's happening in the background. Setup also isn't zero because you're still managing keys and wiring
00:06:07providers, but it was dead simple. And devs still want more SDKs, more storage options, and more
00:06:13features. So yeah, it's really cool, but it's still infrastructure. It's not perfect. Some things need
00:06:19to be tweaked. It's definitely worth it if you run agents every day, or if your agents make lots of
00:06:25small calls. Heck, even if you care about keeping prompts local, this is great, but maybe not if you
00:06:32want zero setup. In that case, something like open router is simpler, but for most of us devs building
00:06:38agents, this is one of the fastest ways to reduce your cost because you don't change your agent. We keep
00:06:44everything. You just change how it routes together. Same inputs, same outputs, lower bill. And that's the
00:06:50key here. If you enjoy coding tools and tips like this, be sure to subscribe to the BetterStack channel.
00:06:54We'll see you in another video.

Key Takeaway

Implementing the local routing tool Manifest cuts AI agent token costs by up to 70% by automatically directing requests to the most cost-effective model using deterministic scoring.

Highlights

  • Integrating the Manifest tool reduces AI agent token costs by up to 70% without changing agents or tasks.

  • Manifest sits between the agent and LLMs, using deterministic scoring across 23 dimensions to route requests to the cheapest model capable of handling the task.

  • Routing decisions occur in under two milliseconds, avoiding the latency and cost of using an LLM to determine model selection.

  • Manifest supports self-hosting, keeping prompts local, and provides real-time observability for token usage and budget tracking.

  • Unlike manual routing or open-cloud endpoints, Manifest automates the routing process specifically for multi-agent workflows.

Timeline

Inefficiency in current agent workflows

  • AI agents frequently overuse expensive models like GPT-4o or Claude Opus for basic tasks.
  • Routine agent operations include classification, routing, and summarization, which do not require premium model capabilities.
  • Attempting manual routing via if-else code statements often leads to fragility when prompts change.
  • External routing services often introduce fees and require sending prompt data outside local infrastructure.

Agent workflows typically involve thousands of small, simple calls that are disproportionately expensive when routed to top-tier models. Maintaining routing logic manually within the application code is unreliable, and existing alternatives like OpenRouter or LiteLLM either require external cloud traffic or manual management of routing rules.

Manifest as a routing controller

  • Manifest acts as a middleware controller, scoring each request across 23 dimensions to determine the optimal model.
  • Routing decisions are made via deterministic scoring rather than relying on another LLM, keeping latency under two milliseconds.
  • Installation involves running the tool via Docker and pointing the existing agent endpoint to it.
  • The system supports multiple providers including Anthropic, OpenAI, and Ollama, while providing real-time dashboard analytics.

By acting as a localized, intelligent gatekeeper, Manifest intercepts agent requests and routes them to the most efficient model—whether API-based or local. This approach minimizes cost without modifying the agent's core code. The deterministic nature of the scoring engine ensures the routing process does not introduce delays.

Comparison and practical considerations

  • Manifest allows fully self-hosted operation, distinguishing it from cloud-reliant services like OpenRouter.
  • The tool provides automated routing intelligence for multi-agent workflows, unlike the manual configuration required by LiteLLM.
  • Real-time dashboards track token usage and costs per agent, task, and model.
  • Users may need to manually override opinionated routing decisions, and some infrastructure setup for providers and keys is required.

While offering significant cost savings and control, Manifest is infrastructure-based and requires initial setup of keys and providers. It serves as a solution for production environments where agents execute frequent, small calls, enabling the use of cheaper local or subscription-based models for tasks that do not require maximum reasoning power.

Community Posts

No posts yet. Be the first to write about this video!

Write about this video