Headroom: The Netflix Tool That Makes AI Agents 10x Cheaper

BBetter Stack
컴퓨터/소프트웨어창업/스타트업AI/미래기술

Transcript

00:00:00This is Headroom, an open source tool that compresses everything your AI agent reads,
00:00:04so that's tool calls, code files and RAG, before it reaches the LLM, meaning you can reduce tokens
00:00:09by 60 or even 95% to get the exact same answer. And the clever part is, it's reversible, so the
00:00:14model can ask for the full information back whenever it actually needs it. But compression
00:00:18usually means you lose something, so how do you remove most of the context and still get the right
00:00:23answer? This is genuinely an interesting question, so hit subscribe and let's find out.
00:00:31If you've ever used a harness like ClaudeCode, you know it uses a lot of tokens. Every tool call
00:00:35could dump huge JSON logs, which are mostly noise, detracting from the important information,
00:00:40and all of this gets stuffed into the context window, which is what you're paying for.
00:00:45Especially if you use something like Opus on UltraCode mode, which runs dynamic workflows,
00:00:50spinning up parallel sub-agents with no token cap. This is why Tejas Chopra, a senior dev at Netflix,
00:00:57created Headroom, which works by detecting the content type and keeping the important information.
00:01:01So for JSON arrays, it keeps the anomalies and edge cases, it has a code compressor that reads the
00:01:06actual syntax tree, and when it reads build logs, it keeps the failures and throws away the passing tests.
00:01:11But here's the interesting part. For plain text, Headroom uses its own model called CompressBase,
00:01:17which Tejas trained himself just for compression, and this model runs locally on your machine.
00:01:22Headroom claims it's already saved users around $700,000 in tokens,
00:01:26and what's really clever is that it leaves a breadcrumb in the compressed text,
00:01:30containing a hash that the model can use to retrieve the uncompressed data if it ever needs it.
00:01:35Now, if you've watched James' video on Caveman, that also reduces context,
00:01:39but from the opposite direction, and I'll explain more of that later on in the video.
00:01:43But for now, let's see a basic example of Headroom to understand how it works.
00:01:46Now, Headroom works by using a Python server that sits between your app,
00:01:50so this could be crawled code, and for example, the anthropic servers.
00:01:54So when a tool call result comes back, the proxy compresses it using Rust under the hood,
00:01:59and just sends the compressed version to the API.
00:02:01So you can install the server with pip, but I'm going to use uv and make sure the version of Python
00:02:06is 3.12, because it won't work on newer versions than that.
00:02:09Then run the headroom proxy command from this library, which triggers the proxy on this port.
00:02:14Headroom also has a TypeScript or Python SDK,
00:02:17and for the demo, we're going to use the Python one to create an app using the clawed SDK.
00:02:22So we can install both like this, and then we're ready to go through the app.
00:02:25Now, the plan is to show you how to use Headroom with clawed code later on,
00:02:29but I just wanted to show you how it works behind the scenes first.
00:02:32So for this app, we have a user prompt to read all the log files and find out the error,
00:02:36as well as the root cause. And from here, we're going to fake the tool call.
00:02:40So we're going to get clawed to make a bash tool call to cat the server log file,
00:02:44which contains a bunch of fake logs and is imported up here.
00:02:47And then we're going to return the tool call results.
00:02:49Now, the reason we're not just giving Headroom the text file directly
00:02:52is because it only compresses tool call output.
00:02:54So here we specify the model and below it, we're using the headroom compress function
00:02:59to take the message with the model for accurate token counting.
00:03:02Headroom does not actually use Haiku.
00:03:04And then we give it the base URL of the proxy.
00:03:06And then we have a bunch of control logs for testing purposes,
00:03:08showing you the message before and after headroom,
00:03:11and some more control logs showing the percentage saving.
00:03:13And after that, we pass the compressed message from headroom into clawed code,
00:03:17which also contains the user prompt.
00:03:18So now if we run that file, we can see headroom has saved 98% of the tokens.
00:03:23So here are the tokens before and here are the tokens afterwards.
00:03:26So it saves over 17,000 tokens.
00:03:28And it's obvious to see when we look at the before and after.
00:03:31So if we scroll up, this is the before, so this is what normally is sent to clawed code.
00:03:35We get the user prompt, the tool call and the tool response, which is the whole log file.
00:03:39And if we look here at what headroom sends, we can see we get the same user message and tool call,
00:03:43but the tool response is way less.
00:03:45And what it's done here is used statistical compression to drop redundant tokens.
00:03:50So it's removed 419 similar info logs and compressed them to a summary.
00:03:54Now here we can see below headroom tells clawed that this is the compressed output.
00:03:58It can retrieve it using this hash.
00:04:00Now here we see one of the immediate disadvantages of headroom is that clawed thinks it doesn't have
00:04:05enough information to complete the task, but it definitely does.
00:04:08So what we're going to do is run our file again.
00:04:10And we can see that this time we still have the 98% savings, but we have way more information from clawed.
00:04:16Let's try another demo.
00:04:17As usual, we need to run the headroom proxy, but this time I'm giving it more parameters.
00:04:21So here we can see I'm adding the ML value, which uses the compress model locally for compressing plain text.
00:04:26And I've added code to make available the code aware compressor.
00:04:30And then I've added the code aware flag to turn it on.
00:04:32So now we can see it's enabled here.
00:04:34Then I'm going to run clawed code, but first I'm going to set the base URL to the proxy.
00:04:39And so with that in place, I'm going to give clawed a prompt of read every single TS file in this project
00:04:44and give me a deep overview of what this project is doing with citations to the relevant code.
00:04:49And after a while, it gives me a response telling me it's read all the TypeScript files
00:04:53across the five packages and it's given me a default overview.
00:04:56But if we run the context slash command, which I've done earlier, we can see it's used 89.1k tokens.
00:05:02Now I actually went ahead and ran a similar prompt in clawed without using headroom.
00:05:06And if we scroll down to the bottom and see where we caused the context sub command,
00:05:10this is used a bit more tokens.
00:05:11Now, I'm not sure why it's chose to use the opus 1 million context window here.
00:05:16n has chosen the 200k context window here, but we can curl this endpoint on format with jq
00:05:21to see exactly where the compression was from the proxy.
00:05:23Now, this contains a lot of information, so it took me a while to find it.
00:05:26But if we scroll up, we can see how many tokens were saved by the headroom compression
00:05:30and even see how much money the compression saved us.
00:05:32Now, of course, all of this was just from one prompt.
00:05:35But imagine if I had multiple clawed code sessions running and I had headroom compressing all the tool
00:05:39calls. Imagine just how many more tokens I would save.
00:05:42I also want to point out that when I ran the exact prompt with low efforts on opus,
00:05:46headroom didn't actually make any token savings.
00:05:49It was only when I moved from low to medium that the token savings were visible.
00:05:53So maybe if I was on high, x-high or even max, then it would save even more tokens.
00:05:57But anyway, that was a quick overview of headroom.
00:06:00And of course, there are so many more features I could have gone through,
00:06:03like cross-agent memory, which lets clawed, codex and other harnesses
00:06:07share the exact same compressed context.
00:06:09Headroom Learn, which mines your failed sessions to figure out what it compressed
00:06:12too hard and learns so it doesn't do the same mistake in the future,
00:06:15as well as integrations with popular SDKs.
00:06:18But there is one kind of important thing to consider about Headroom.
00:06:21Each time the model doesn't get the information it needs
00:06:24and asks Headroom to provide the full data, it makes a second round trip,
00:06:28which kind of means you end up using more tokens with Headroom in some cases than without.
00:06:33But I guess this is the advantage of using the Headroom Learn feature,
00:06:36which tries to prevent that from happening more and more in the future.
00:06:39But remember when I spoke about Caveman earlier on in the video?
00:06:42Well, Caveman reduces tokens by instructing the model to respond in short fragments,
00:06:46dropping filler words and so on.
00:06:48But as you've just seen in the demo, Headroom shrinks what the model reads
00:06:51before it even gets to the model.
00:06:52So one cuts the output while the other one cuts the input,
00:06:56which means technically you can use them together for maximum token saving,
00:07:00if you really care about saving tokens that much.

Key Takeaway

Headroom acts as a proxy between applications and LLMs to compress input data by up to 95% using content-specific algorithms, significantly lowering agent operational costs while maintaining the ability to retrieve full context on demand.

Highlights

  • Headroom reduces AI agent context tokens by 60% to 95% by compressing tool calls, code files, and RAG data before they reach the LLM.

  • The tool is reversible, allowing the model to retrieve uncompressed information via a hash embedded in the compressed text if more detail is required.

  • Headroom uses specialized compressors for different content types, including syntax tree analysis for code and anomaly detection for JSON arrays.

  • The tool utilizes a locally running model, CompressBase, for plain text compression to maintain privacy and reduce processing latency.

  • Headroom has documented cumulative savings of approximately $700,000 in token costs for its users.

  • Combining Headroom's input-side compression with output-side reduction methods like Caveman can further maximize token efficiency.

Timeline

Token Context and Compression Logic

  • LLM agents frequently exhaust context windows and inflate costs by processing redundant JSON logs and raw code files.
  • Headroom functions as a proxy that intercepts data, applies content-specific compression, and sends only the essential information to the model.
  • The system retains high-value data like anomalies in JSON, syntax trees in code, and failure logs in build outputs.

AI agents often struggle with vast amounts of noisy data, such as voluminous JSON logs from tool calls, which directly increases costs. Headroom addresses this by intelligently filtering input based on data type. It employs Rust-based logic to perform statistical compression and uses a custom locally run model, CompressBase, for plain text. A critical feature is the inclusion of a hash as a 'breadcrumb' in the compressed text, ensuring the LLM can trigger a retrieval of the full data whenever necessary.

Technical Implementation and Demo

  • Headroom requires a Python server acting as a proxy between the application and the API provider, such as Anthropic.
  • The tool achieves up to 98% token reduction in practical scenarios, specifically when processing extensive log files.
  • Token savings are highly dependent on the agent's intensity settings, showing greater efficiency in medium and high-effort modes compared to low-effort modes.

Deployment involves running a Python proxy server on a specified port and utilizing the Headroom SDK within the application code. In a demo scenario, processing large server logs resulted in a 98% reduction in token count compared to raw input. While the tool allows for granular control through flags like 'code-aware' compression, it does not guarantee immediate savings on all model configurations, particularly with low-effort settings where context is already sparse.

Advanced Features and Trade-offs

  • Headroom includes a 'Learn' feature that analyzes failed sessions to adjust compression aggressiveness and minimize future errors.
  • Cross-agent memory allows different harnesses like ClaudeCode and Codex to share identical compressed contexts.
  • Compression creates a potential overhead where the LLM may trigger a second round-trip to retrieve full data, potentially increasing costs in specific failure cases.

The system offers advanced capabilities beyond simple proxy compression, including adaptive learning to refine its filtering logic based on past failures. Despite these advantages, the fundamental trade-off is the risk of excessive compression; if a model cannot derive an answer from the summary, it must request the original data, resulting in extra API round-trips. However, users can combine Headroom with other reduction methods like Caveman—which compresses output—to optimize both input and output sides of the interaction.

Community Posts

No posts yet. Be the first to write about this video!

Write about this video