I Cut My AI Agent Costs 70% With One Change (Manifest)

Englishالعربية Deutsch Español Français हिन्दी Bahasa Indonesia 日本語 한국어 Português Русский 中文

Computing/SoftwareSmall Business/StartupsInternet Technology

Transcript

00:00:00This is Manifest. I switched to it for a weekend and my token costs dropped by 70%.

00:00:05Same agent, same tasks, just better routing. If you're building AI agents, there's a good chance

00:00:11you're paying way more than you should. Most requests don't need GPT-4-0 or Claude Opus,

00:00:17but that's exactly what they're hitting anyway. So our agent ends up using expensive models for

00:00:22basic stuff like classification, routing, summaries, and that's how your bill quietly

00:00:27becomes three to five times higher than it should be. How does Manifest even work? Let's find out.

00:00:37Here's where things break down. Agents don't just make a few calls, they make thousands of these calls.

00:00:44And most of those calls are really simple. Pick a tool, summarize a chunk, classify input. But if

00:00:50everything goes to the best model, you're paying a premium price for rather basic work. So you could

00:00:57try to fix it, I guess by writing routing logic, and now your code is full of all these if-else

00:01:02statements that break the second your prompts change. Okay, yes, we could just use OpenRouter,

00:01:08sure, but there's a fee with that. And then your prompts actually leave the machine. I guess there's

00:01:13also something called Lite LLM you could try, which is solid, but you still have to manage routing

00:01:18manually. So the real problem isn't access to models, it's choosing the right one every single time.

00:01:25And that, ladies and gentlemen, is what Manifest does. It sits between your agent and your models.

00:01:31You send one request, it scores it across 23 dimensions, and roots it to the cheapest model

00:01:36that can handle it. There's no rewrites in just one endpoint. If you enjoy coding tools and tips like

00:01:41this, be sure to subscribe. We have videos coming out all the time. All right, sweet. Now let me show you.

00:01:47Same agent, same task. I spin up Manifest with Docker here, simple curl command, Docker Compose up,

00:01:55and now I point my OpenAI endpoint to it. That's the only change here. Now I can link different ones

00:02:01here, as you can see, Anthropic, OpenAI, Olama. I chose OpenAI, dropped in my key, and I linked in

00:02:08Olama so it can go between the two. And now we're going to run this Python script. You can see I'm using

00:02:12the Manifest API key here. That's the only key we need because Manifest has the other ones, okay?

00:02:18So when we run this, the agent starts working. And instead of sending everything to an expensive

00:02:24model, Manifest makes a decision. This one's simple. Root it cheaper. Now jump back here. Our dashboard

00:02:31updates in real time, showing us token usage, cost per agent, and budget tracking. The key number

00:02:38can change, but it can be anywhere up to 70% cheaper. Same output, lower cost, and because

00:02:44this runs locally, your prompts don't leave your machine just to be rooted. This didn't take a whole

00:02:50lot of time or resources, so it's something worth integrating into your flow, especially if you're

00:02:55building and using AI. Okay, so now what actually happens here? You can think of Manifest as like a

00:03:00controller, right? Your agent sends one request in, Manifest decides where it should actually go,

00:03:07so that could be an API model, could be a subscription, a local model, a llama or llama CPP.

00:03:14It supports hundreds of models across tons of providers, but here's the important part to all

00:03:19this. It doesn't call another LLM to decide. That would be counterintuitive, so it would just be

00:03:25slow and expensive. Instead, it uses deterministic scoring, so rooting happens under two milliseconds.

00:03:32No added latency to any of this. Manifest just sits in the middle, and it makes better decisions,

00:03:38and it's clearly built for agents. Open call plugin, multi-agent tracking, we have those, and we even

00:03:44have observability built in. The biggest savings don't come from hard prompts. They come from all the

00:03:50small ones. Really just the boring calls our agents make constantly. Okay, so real quick, how is this

00:03:56different from tools that we already know, so I'm going to compare this really quickly? I mentioned

00:04:01OpenRouter earlier. So OpenRouter gives you one cloud endpoint, but your traffic still leaves your

00:04:06system. Manifest can run fully self-hosted. Then we have the tool I mentioned of Lite LLM. This gives

00:04:13you a unified interface, but rooting is still something you have to control manually. Manifest handles

00:04:19routing automatically. There's also routing intelligence. Now, where Manifest scores requests across 23

00:04:25dimensions, that is their version of routing intelligence. Other things like this rely on failover

00:04:31or rules. Then we have subscriptions. Yes. So while you don't actually pay for Manifest, you still

00:04:38obviously need things like an OpenAI or Clawed API key, right? Now, agent focus is something where

00:04:46Manifest actually stands out. It's built for multi-agent workflows. So the difference is simple.

00:04:51If you want access, just use OpenRouter, right? If you want control, there's Lite LLM. But if your

00:04:57problem is actually cost from agents, because we're making all these API calls, Manifest is built for

00:05:03that. There are countless tools to bring down your costs. You just need to find them, and this is one

00:05:08of the ways. Now, being honest here, because it's great, but with an AI tool, you're going to get some

00:05:14things that might have you just honestly scratching your head. First, the good. Where the first would

00:05:19be savings, especially with subscription routing. You're using plans you already pay for instead of

00:05:26paying per token again. Then the fallbacks, right? If something fails, your agent keeps going, which is

00:05:33a huge win. Then we have the dashboard. The dashboard is great because you can actually see where your money

00:05:38is going across different models, per agent, per task, all in real time. And it works with existing

00:05:45clients without any big rewrites. But like I said, there are things that we would expect a tool like

00:05:50this such to have. And you know, there's things like your scoring is going to be opinionated, right?

00:05:56AI. Okay. So sometimes it routes cheaper than you'd expect. You can override that, but you need to know

00:06:02it's happening in the background. Setup also isn't zero because you're still managing keys and wiring

00:06:07providers, but it was dead simple. And devs still want more SDKs, more storage options, and more

00:06:13features. So yeah, it's really cool, but it's still infrastructure. It's not perfect. Some things need

00:06:19to be tweaked. It's definitely worth it if you run agents every day, or if your agents make lots of

00:06:25small calls. Heck, even if you care about keeping prompts local, this is great, but maybe not if you

00:06:32want zero setup. In that case, something like open router is simpler, but for most of us devs building

00:06:38agents, this is one of the fastest ways to reduce your cost because you don't change your agent. We keep

00:06:44everything. You just change how it routes together. Same inputs, same outputs, lower bill. And that's the

00:06:50key here. If you enjoy coding tools and tips like this, be sure to subscribe to the BetterStack channel.

00:06:54We'll see you in another video.

Key Takeaway

Implementing the local routing tool Manifest cuts AI agent token costs by up to 70% by automatically directing requests to the most cost-effective model using deterministic scoring.

Highlights

Integrating the Manifest tool reduces AI agent token costs by up to 70% without changing agents or tasks.
Manifest sits between the agent and LLMs, using deterministic scoring across 23 dimensions to route requests to the cheapest model capable of handling the task.
Routing decisions occur in under two milliseconds, avoiding the latency and cost of using an LLM to determine model selection.
Manifest supports self-hosting, keeping prompts local, and provides real-time observability for token usage and budget tracking.
Unlike manual routing or open-cloud endpoints, Manifest automates the routing process specifically for multi-agent workflows.

Timeline

Inefficiency in current agent workflows

AI agents frequently overuse expensive models like GPT-4o or Claude Opus for basic tasks.
Routine agent operations include classification, routing, and summarization, which do not require premium model capabilities.
Attempting manual routing via if-else code statements often leads to fragility when prompts change.
External routing services often introduce fees and require sending prompt data outside local infrastructure.

Agent workflows typically involve thousands of small, simple calls that are disproportionately expensive when routed to top-tier models. Maintaining routing logic manually within the application code is unreliable, and existing alternatives like OpenRouter or LiteLLM either require external cloud traffic or manual management of routing rules.

Manifest as a routing controller

Manifest acts as a middleware controller, scoring each request across 23 dimensions to determine the optimal model.
Routing decisions are made via deterministic scoring rather than relying on another LLM, keeping latency under two milliseconds.
Installation involves running the tool via Docker and pointing the existing agent endpoint to it.
The system supports multiple providers including Anthropic, OpenAI, and Ollama, while providing real-time dashboard analytics.

By acting as a localized, intelligent gatekeeper, Manifest intercepts agent requests and routes them to the most efficient model—whether API-based or local. This approach minimizes cost without modifying the agent's core code. The deterministic nature of the scoring engine ensures the routing process does not introduce delays.

Comparison and practical considerations

Manifest allows fully self-hosted operation, distinguishing it from cloud-reliant services like OpenRouter.
The tool provides automated routing intelligence for multi-agent workflows, unlike the manual configuration required by LiteLLM.
Real-time dashboards track token usage and costs per agent, task, and model.
Users may need to manually override opinionated routing decisions, and some infrastructure setup for providers and keys is required.

While offering significant cost savings and control, Manifest is infrastructure-based and requires initial setup of keys and providers. It serves as a solution for production environments where agents execute frequent, small calls, enabling the use of cheaper local or subscription-based models for tasks that do not require maximum reasoning power.

Community Posts

No posts yet. Be the first to write about this video!

Write about this video