00:00:00Making Claude Code talk like a caveman might not only save you tokens.
00:00:04It could actually improve your performance as well. Now on the surface,
00:00:07this sounds like a complete meme. We have a GitHub repo called caveman.
00:00:12That's gotten 5,000 stars in 72 hours.
00:00:15And all it does is force Claude Code to talk like a Neanderthal.
00:00:19It trims out all the filler. The idea is that by making it more concise,
00:00:24we save a ton of tokens in the process,
00:00:27but buried in this repo is a link to this research paper that just came out a few
00:00:31weeks ago,
00:00:31which tells us if we force our large language models to be more concise,
00:00:36we don't only save tokens, but we can dramatically improve their performance.
00:00:40So today I'm going to break down this entire caveman skill.
00:00:42I'm going to explain what it actually buys you because the numbers in the repo
00:00:46are a little misleading and we're going to talk through this research paper so you
00:00:50can understand what this actually means for you. So this is caveman,
00:00:54our "why say many word when few word do trick" repo.
00:00:58Now, right off the bat, what is it doing? Pretty simple,
00:01:02cutting out the filler. Claude Code now talks like a caveman.
00:01:07It gives us some before-and-after examples, shows us the token difference, and even
00:01:11has a full benchmark list showing the tasks it gave Claude Code,
00:01:15like "explain React re-render bug," the normal tokens being used,
00:01:21the caveman tokens, and the amount saved.
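To give a flavor, here's a made-up before/after in the same spirit (my own illustration, not one of the repo's actual examples):

```text
Before: "Great question! React re-renders a child component whenever it
receives new props, so passing a fresh object literal on every render
creates a new reference each time and triggers an unnecessary re-render."

After:  "New object literal each render. New reference. Child re-renders.
Memoize it."
```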
00:01:21Now the numbers put forth in this repo are kind of insane.
00:01:23So they are claiming that with this skill,
00:01:26we are going to cut 75% of output tokens while keeping full technical
00:01:30accuracy.
00:01:31Now, caveman does not change how Claude Code reasons under the hood.
00:01:35It doesn't change how it actually generates code. None of that gets changed.
00:01:38It's just the output. What you see as a response.
00:01:41It also includes a companion tool that compresses your memory files.
00:01:45(think CLAUDE.md) into caveman speak.
00:01:47And that is supposed to reduce our input tokens by 45% every session.
00:01:52Now let's be clear. You are not cutting 75% of your output tokens at large,
00:01:57and 45% of your input tokens at large at all. That is completely not true.
00:02:01Even though we can see these things that say, Hey,
00:02:03it saves 87% of tokens on how it explains a React re-render bug.
00:02:07The response you get back from Claude Code, the response itself,
00:02:11the text is just a small portion of the output tokens at large,
00:02:15just like the memory files,
00:02:17like CLAUDE.md, is just a small portion of the input at large.
00:02:21So let's be very clear about what this is actually buying us on a token scale.
00:02:25You are not saving 80% of your total tokens. And to make it a little more clear,
00:02:28let's break down your average hundred-thousand-token Claude Code session. Now,
00:02:32I understand every session is a little different, but just work with me here.
00:02:36We have a hundred thousand token session, and it's broken up into two parts.
00:02:40The input, which is the lion's share,
00:02:42that's 75,000 tokens, and the output, which is 25,000.
00:02:46Now caveman is claiming we're going to reduce output by 75%.
00:02:51That is not true. If we take a look at output, it's really in three parts, right?
00:02:56We have tool calls, taking up a portion of it, code blocks,
00:02:59like the actual code generation, taking a portion of it.
00:03:02And then the actual prose responses, this response,
00:03:06that text response itself, that's what caveman is adjusting.
00:03:10That's what it's reducing. It can reduce 75% of that. You know,
00:03:13if we go down here, we can see, okay,
00:03:16so normally the prose takes up 6K tokens. With caveman,
00:03:20We save 4,000 tokens. So we get a 4% reduction. That's still really good.
00:03:25If we're saving 4% of our total tokens over the course of the week,
00:03:29that certainly adds up,
00:03:30especially in the current environment where we are all so conscious of our usage.
00:03:33But understand, this is not 87%. It's 70 or
00:03:3860% of one portion of one portion of the total session.
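To make the math concrete, here's a quick sketch using the video's illustrative numbers (the 100K session split, the 6K prose figure, and the 4K savings are the video's own example figures, not measurements):

```python
# Rough token math using the video's illustrative numbers.
session_total = 100_000   # a typical Claude Code session, per the video
prose_tokens = 6_000      # the prose portion of the output (video's figure)
caveman_saved = 4_000     # tokens caveman trims from that prose

pct_of_prose = caveman_saved / prose_tokens * 100      # share of the prose
pct_of_session = caveman_saved / session_total * 100   # share of the whole session

print(f"~{pct_of_prose:.0f}% of the prose, but only {pct_of_session:.0f}% of the session")
# prints: ~67% of the prose, but only 4% of the session
```

In other words, a headline figure like "87% saved" only applies to the prose slice, and the whole-session savings is what actually shows up on your usage.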
00:03:43Furthermore,
00:03:44if you look at the inputs, where it talks about the caveman compression saving 45%,
00:03:49again, not really.
00:03:50We're talking about the system prompt area and only certain parts of the system
00:03:54prompt. So total here, right? We're saving what? Maybe a thousand tokens,
00:03:58maybe 2,000 tokens. And again, over the course of an entire session,
00:04:03if we say 5,000 tokens, 5% of every session, that's great, good stuff,
00:04:07but it's not these gaudy numbers. So understand that going in,
00:04:13this is an on-the-margin play, not a total game-changer.
00:04:15You're not going to be able to go from, basically, the 5x Max plan to the 20x Max
00:04:19plan because we're saving 75%. No, no, no, no,
00:04:22but there's still tons of value to be had here and even more value to be
00:04:25extracted. Once we take a look at the study, it's kind of buried in here.
00:04:29There's one little section dedicated to it,
00:04:31but this is a study called "Brevity Constraints
00:04:34Reverse Performance Hierarchies in Language Models."
00:04:36And this came out in early March of this year.
00:04:38So I will put a link to the study down in the description if you want to check it
00:04:41out, but let's just talk about it really quick because this is really interesting.
00:04:45Because the idea and the expectation is bigger model,
00:04:49better than smaller model always. Well,
00:04:53not exactly, not according to this study.
00:04:56So in this study they evaluated 31 models across 1500
00:05:01problems,
00:05:02and they identified the mechanism as spontaneous scale-dependent verbosity that
00:05:07introduces errors through over-elaboration. What the heck does that mean?
00:05:11That means on nearly 8% of the problems across these 1500 problems and
00:05:1631 models, the larger language models,
00:05:19the ones with more parameters, underperformed smaller ones by 28
00:05:24percentage points, despite having a hundred times more parameters in some cases.
00:05:28So you had scenarios where, again, this is all with open-weight models,
00:05:32You had a 2 billion parameter model outperforming a 400 billion parameter
00:05:37model. This happened multiple times. This is crazy.
00:05:41Why is this? Well,
00:05:43they posit that the reason why is because these large
00:05:49language models talk too damn much.
00:05:51They are over-verbose to the point that they pretty much spin themselves into
00:05:55circles and get the wrong answer because of it. And in the study,
00:05:58they found that by constraining large models to produce brief responses,
00:06:02caveman responses, improves accuracy by 26 percentage points and reduces
00:06:07performance gaps by up to two-thirds.
00:06:09And in many cases by forcing these large language models to become more concise,
00:06:14more caveman-like, it completely switched that dynamic: the models that were
00:06:18losing to smaller models before were suddenly defeating them.
00:06:21That's kind of wild, especially in context of this GitHub repo. Now,
00:06:26obviously these are open-weight models. This isn't Opus 4.6.
00:06:29This isn't Codex 5.4.
00:06:30Do these frontier models exhibit this exact same sort of behavior?
00:06:34We don't necessarily know for sure,
00:06:36but if you've seen any of these studies you understand usually what you see here
00:06:40tends to be repeated on some level with the frontier models.
00:06:44Maybe it's not this extreme, but there's probably something to it.
00:06:47Now the rest of the study goes into a lot of detail about how they run the tests,
00:06:51how they're trying to break out correlation versus causation and why they think
00:06:55this is a problem. And like I said before,
00:06:57they hypothesize that large models generate excessively verbose responses that
00:07:02obscure correct reasoning, a phenomenon they termed overthinking.
00:07:06It's just trying to put too much out there.
00:07:07Instead of just giving you the answer and getting out of its own way,
00:07:10it literally talks itself into the wrong answer.
00:07:13And they specifically say the learned tendency towards thoroughness becomes
00:07:17counterproductive, introducing error accumulation.
00:07:21Brevity constraints help large models dramatically while barely affecting the
00:07:25smaller models. And an obvious question you should have is, well, why,
00:07:28why is this even the case? Why are these larger models having this issue?
00:07:31They point towards reinforcement learning.
00:07:34So when you train a new model,
00:07:36so imagine Opus 5.0 is in the process of being trained.
00:07:40Part of what they do is reinforcement learning.
00:07:42Now I don't know if Anthropic does it specifically,
00:07:44but this is how it's done for many models.
00:07:45Essentially they take the new model and they bring in a human to grade its
00:07:50answers. They show the human multiple answers, and the human says,
00:07:52"I like this one more than this one." And they're saying in the study,
00:07:55chances are humans tend to like more verbose answers, more thorough answers.
00:08:00And because of that,
00:08:01these larger models are essentially trained to be more verbose rather than
00:08:05concise and even correct in some instances.
00:08:08But the big takeaway here is that brevity constraints completely reversed
00:08:12the performance hierarchies. So where they were losing before,
00:08:14now they were winning simply by telling them be more concise.
00:08:18They didn't change how they thought; they didn't change anything under the hood.
00:08:20They just said, be a caveman. Now, they weren't literally using this GitHub repo,
00:08:25but same exact thing.
00:08:28So this is why I think this is actually kind of interesting,
00:08:31not just a complete meme,
00:08:32beyond the fact that there are some token positives here.
00:08:37Saving 5% of tokens is nothing to laugh at,
00:08:39especially if you aren't on a Max 20 plan.
00:08:41But if there's a potential scenario where we're actually getting better outputs
00:08:44because of it, especially on more straightforward questions,
00:08:47because if you dive into that study,
00:08:49it actually breaks out which questions ran into this issue and showed
00:08:53this dynamic. It's interesting, very interesting,
00:08:56which is why I think this is kind of worth looking at.
00:08:58And it's also super simple to use. It's just a set of skills.
00:09:02Installing this is literally one line, and then you run it.
00:09:06We either invoke it with /caveman, or just say something like
00:09:09"talk like a caveman," "caveman mode," or "less tokens, please." There are also levels to it.
00:09:13So we can go like ultra caveman, right? We like just came out of the ocean.
00:09:17We barely can stand up straight. And then we have lighter modes, too.
00:09:21So you can get different levels of caveman across the tiers.
00:09:24And it isn't a blanket thing,
00:09:25either. Things like error messages are quoted exactly. And again,
00:09:29anything to do with code, anything to do with generation,
00:09:31anything under the hood stays the same. We're not changing how it really thinks.
00:09:35So overall, I think this is worth trying out. It's a single skill.
00:09:37It saves tokens and there's no real downside. And based on the study,
00:09:42there's actually potential upside here in terms of outputs.
00:09:45And if you don't like the whole caveman thing,
00:09:48I think this points towards at the very least putting some sort of line in your
00:09:52CLAUDE.md that says: be concise, no filler,
00:09:56straight to the point, use less words,
00:09:59because clearly there's an advantage to that, not just in tokens,
00:10:06but, like we saw, potentially in the actual answers it gives us.
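As a sketch, such a CLAUDE.md section might look like this (hypothetical wording; not taken from the repo or from Anthropic's docs):

```markdown
## Response style
- Be concise. No filler, no preamble, no recaps.
- Answer first; elaborate only if asked.
- Quote error messages and code exactly; compress everything else.
```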
00:10:06So that's where I'm going to leave you guys for today.
00:10:07What looked on the surface like a complete meme project,
00:10:11caveman Claude, actually has some weight to it and some actual
00:10:15scientific rigor behind the why,
00:10:17which I think makes this something actually worth implementing.
00:10:21So as always, let me know in the comments, what you thought,
00:10:25make sure to check out Chase AI Plus
00:10:26if you want to get your hands on my Claude Code masterclass.
00:10:29I've got more updates dropping in that space in the next couple of days.
00:10:33But besides that, I'll see you guys around.