Transcript
00:00:00This is Headroom, an open source tool that compresses everything your AI agent reads,
00:00:04so that's tool calls, code files and RAG, before it reaches the LLM, meaning you can reduce tokens
00:00:09by 60 or even 95% to get the exact same answer. And the clever part is, it's reversible, so the
00:00:14model can ask for the full information back whenever it actually needs it. But compression
00:00:18usually means you lose something, so how do you remove most of the context and still get the right
00:00:23answer? This is genuinely an interesting question, so hit subscribe and let's find out.
00:00:31If you've ever used a harness like Claude Code, you know it uses a lot of tokens. Every tool call
00:00:35could dump huge JSON logs, which are mostly noise, detracting from the important information,
00:00:40and all of this gets stuffed into the context window, which is what you're paying for.
00:00:45Especially if you use something like Opus on UltraCode mode, which runs dynamic workflows,
00:00:50spinning up parallel sub-agents with no token cap. This is why Tejas Chopra, a senior dev at Netflix,
00:00:57created Headroom, which works by detecting the content type and keeping the important information.
00:01:01So for JSON arrays, it keeps the anomalies and edge cases, it has a code compressor that reads the
00:01:06actual syntax tree, and when it reads build logs, it keeps the failures and throws away the passing tests.
00:01:11But here's the interesting part. For plain text, Headroom uses its own model called CompressBase,
00:01:17which Tejas trained himself just for compression, and this model runs locally on your machine.
00:01:22Headroom claims it's already saved users around $700,000 in tokens,
00:01:26and what's really clever is that it leaves a breadcrumb in the compressed text,
00:01:30containing a hash that the model can use to retrieve the uncompressed data if it ever needs it.
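To make the breadcrumb idea concrete, here is a minimal sketch of reversible compression, not Headroom's actual implementation. The in-memory store, the function names and the marker format are all assumptions made up for illustration; the only idea taken from the video is that the compressed text carries a hash that maps back to the original.

```python
import hashlib

# Hypothetical in-memory store mapping a short hash to the original text.
_store: dict[str, str] = {}


def compress_with_breadcrumb(original: str, summary: str) -> str:
    """Return the summary plus a marker that holds a hash of the original."""
    key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:12]
    _store[key] = original
    # The marker format below is invented for this sketch.
    return f"{summary}\n[compressed: retrieve full output with hash={key}]"


def retrieve(key: str) -> str:
    """Look up the uncompressed text for a hash taken from a breadcrumb."""
    return _store[key]
```

The model only ever sees the short summary and the marker, and a retrieval call with the hash brings back the full text when the summary is not enough.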
00:01:35Now, if you've watched James' video on Caveman, that also reduces context,
00:01:39but from the opposite direction, and I'll explain more of that later on in the video.
00:01:43But for now, let's see a basic example of Headroom to understand how it works.
00:01:46Now, Headroom works by using a Python server that sits between your app,
00:01:50so this could be Claude Code, and, for example, the Anthropic servers.
00:01:54So when a tool call result comes back, the proxy compresses it using Rust under the hood,
00:01:59and just sends the compressed version to the API.
00:02:01So you can install the server with pip, but I'm going to use uv and make sure the version of Python
00:02:06is 3.12, because it won't work on newer versions than that.
00:02:09Then run the headroom proxy command from this library, which triggers the proxy on this port.
00:02:14Headroom also has a TypeScript or Python SDK,
00:02:17and for the demo, we're going to use the Python one to create an app using the Claude SDK.
00:02:22So we can install both like this, and then we're ready to go through the app.
00:02:25Now, the plan is to show you how to use Headroom with Claude Code later on,
00:02:29but I just wanted to show you how it works behind the scenes first.
00:02:32So for this app, we have a user prompt to read all the log files and find out the error,
00:02:36as well as the root cause. And from here, we're going to fake the tool call.
00:02:40So we're going to get Claude to make a bash tool call to cat the server log file,
00:02:44which contains a bunch of fake logs and is imported up here.
00:02:47And then we're going to return the tool call results.
00:02:49Now, the reason we're not just giving Headroom the text file directly
00:02:52is because it only compresses tool call output.
00:02:54So here we specify the model and below it, we're using the headroom compress function
00:02:59to take the message with the model for accurate token counting.
00:03:02Headroom does not actually use Haiku.
00:03:04And then we give it the base URL of the proxy.
00:03:06And then we have a bunch of console logs for testing purposes,
00:03:08showing you the message before and after Headroom,
00:03:11and some more console logs showing the percentage saving.
00:03:13And after that, we pass the compressed message from Headroom into Claude Code,
00:03:17which also contains the user prompt.
00:03:18So now if we run that file, we can see Headroom has saved 98% of the tokens.
00:03:23So here are the tokens before and here are the tokens afterwards.
00:03:26So it saves over 17,000 tokens.
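The percentage saving printed by the demo is simple arithmetic. As a sketch, with token counts that are hypothetical and only chosen to land near the figures mentioned here (the exact before and after counts are not stated):

```python
def percent_saved(tokens_before: int, tokens_after: int) -> float:
    """Share of the original tokens that compression removed, as a percentage."""
    return (tokens_before - tokens_after) / tokens_before * 100


# Hypothetical counts: 17,500 tokens before, 350 after.
print(f"{percent_saved(17_500, 350):.0f}% saved")
```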
00:03:28And it's obvious to see when we look at the before and after.
00:03:31So if we scroll up, this is the before, so this is what normally is sent to Claude Code.
00:03:35We get the user prompt, the tool call and the tool response, which is the whole log file.
00:03:39And if we look here at what Headroom sends, we can see we get the same user message and tool call,
00:03:43but the tool response is way less.
00:03:45And what it's done here is used statistical compression to drop redundant tokens.
00:03:50So it's removed 419 similar info logs and compressed them to a summary.
00:03:54Now here we can see below that Headroom tells Claude that this is the compressed output,
00:03:58and that it can retrieve the original using this hash.
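As a rough illustration of collapsing repetitive log lines into a summary, here is a small sketch. It is not the statistical compression Headroom uses; it is a plain rule that drops INFO and DEBUG lines and leaves a count behind. The log shape (a level keyword somewhere in each line) and the summary wording are assumptions.

```python
import re

# Assumed log shape for this sketch: each line contains a level keyword.
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARN|WARNING|ERROR|FATAL)\b")


def collapse_info_logs(log_text: str) -> str:
    """Keep warnings, errors and unrecognised lines; replace the rest with a count."""
    kept = []
    dropped = 0
    for line in log_text.splitlines():
        match = LEVEL_RE.search(line)
        level = match.group(1) if match else None
        if level in ("DEBUG", "INFO"):
            dropped += 1
        else:
            kept.append(line)
    if dropped:
        kept.append(f"[{dropped} INFO/DEBUG lines omitted]")
    return "\n".join(kept)
```

Lines without a recognisable level are kept, on the assumption that anything unusual is more likely to matter than routine output.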
00:04:00Now here we see one of the immediate disadvantages of Headroom is that Claude thinks it doesn't have
00:04:05enough information to complete the task, but it definitely does.
00:04:08So what we're going to do is run our file again.
00:04:10And we can see that this time we still have the 98% savings, but we have way more information from Claude.
00:04:16Let's try another demo.
00:04:17As usual, we need to run the headroom proxy, but this time I'm giving it more parameters.
00:04:21So here we can see I'm adding the ML value, which uses the compress model locally for compressing plain text.
00:04:26And I've added code to make available the code aware compressor.
00:04:30And then I've added the code aware flag to turn it on.
00:04:32So now we can see it's enabled here.
00:04:34Then I'm going to run Claude Code, but first I'm going to set the base URL to the proxy.
00:04:39And so with that in place, I'm going to give Claude the prompt "read every single TS file in this project
00:04:44and give me a deep overview of what this project is doing with citations to the relevant code".
00:04:49And after a while, it gives me a response telling me it's read all the TypeScript files
00:04:53across the five packages and it's given me a detailed overview.
00:04:56But if we run the context slash command, which I've done earlier, we can see it's used 89.1k tokens.
00:05:02Now I actually went ahead and ran a similar prompt in Claude without using Headroom.
00:05:06And if we scroll down to the bottom and see where we called the context slash command,
00:05:10this has used a few more tokens.
00:05:11Now, I'm not sure why it chose to use the Opus 1 million context window here
00:05:16and has chosen the 200k context window here, but we can curl this endpoint and format it with jq
00:05:21to see exactly what compression the proxy applied.
00:05:23Now, this contains a lot of information, so it took me a while to find it.
00:05:26But if we scroll up, we can see how many tokens were saved by the Headroom compression
00:05:30and even see how much money the compression saved us.
00:05:32Now, of course, all of this was just from one prompt.
00:05:35But imagine if I had multiple Claude Code sessions running and I had Headroom compressing all the tool
00:05:39calls. Imagine just how many more tokens I would save.
00:05:42I also want to point out that when I ran the exact prompt with low effort on Opus,
00:05:46Headroom didn't actually make any token savings.
00:05:49It was only when I moved from low to medium that the token savings were visible.
00:05:53So maybe if I was on high, x-high or even max, then it would save even more tokens.
00:05:57But anyway, that was a quick overview of Headroom.
00:06:00And of course, there are so many more features I could have gone through,
00:06:03like cross-agent memory, which lets Claude, Codex and other harnesses
00:06:07share the exact same compressed context.
00:06:09Headroom Learn, which mines your failed sessions to figure out what it compressed
00:06:12too hard and learns so it doesn't make the same mistake in the future,
00:06:15as well as integrations with popular SDKs.
00:06:18But there is one kind of important thing to consider about Headroom.
00:06:21Each time the model doesn't get the information it needs
00:06:24and asks Headroom to provide the full data, it makes a second round trip,
00:06:28which kind of means you end up using more tokens with Headroom in some cases than without.
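To put a rough number on that trade-off, here is a back-of-the-envelope sketch. The cost model is an assumption, not something Headroom reports: the compressed result is always sent, and with some probability the model asks for the original, which costs the full result plus some overhead for the extra round trip.

```python
def expected_tokens(full: int, compressed: int, overhead: int, p_retrieve: float) -> float:
    """Expected input tokens per tool result when compression is on.

    Assumed model: the compressed result is always sent; with probability
    p_retrieve the full result is sent as well, plus round-trip overhead.
    """
    return compressed + p_retrieve * (full + overhead)


def break_even_probability(full: int, compressed: int, overhead: int) -> float:
    """Retrieval rate above which compression costs more than sending the full result."""
    return (full - compressed) / (full + overhead)


# Hypothetical numbers: 10,000-token result, 200 tokens compressed, 500 tokens overhead.
print(expected_tokens(10_000, 200, 500, 0.1))
print(break_even_probability(10_000, 200, 500))
```

Under this model, any single result that triggers a retrieval costs more than it would have without compression, but compression still wins on average unless retrievals happen on most results.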
00:06:33But I guess this is the advantage of using the Headroom Learn feature,
00:06:36which tries to prevent that from happening more and more in the future.
00:06:39But remember when I spoke about Caveman earlier on in the video?
00:06:42Well, Caveman reduces tokens by instructing the model to respond in short fragments,
00:06:46dropping filler words and so on.
00:06:48But as you've just seen in the demo, Headroom shrinks what the model reads
00:06:51before it even gets to the model.
00:06:52So one cuts the output while the other one cuts the input,
00:06:56which means technically you can use them together for maximum token saving,
00:07:00if you really care about saving tokens that much.