Caveman Claude Code Is the New Meta (Here's the Science)

CChase AI

Transcript

00:00:00Making Claude Code talk like a caveman might not only save you tokens.
00:00:04It could actually improve your performance as well. Now on the surface,
00:00:07this sounds like a complete meme. We have a GitHub repo called caveman.
00:00:12That's gotten 5,000 stars in 72 hours.
00:00:15And all it does is force Claude Code to talk like a Neanderthal.
00:00:19It trims out all the filler. The idea is that by making it more concise,
00:00:24we save a ton of tokens in the process,
00:00:27but buried in this repo is a link to this research paper that just came out a few
00:00:31weeks ago,
00:00:31which tells us if we force our large language models to be more concise,
00:00:36we don't only save tokens, but we can dramatically improve their performance.
00:00:40So today I'm going to break down this entire caveman skill.
00:00:42I'm going to explain what it actually buys you because the numbers in the repo
00:00:46are a little misleading and we're going to talk through this research paper so you
00:00:50can understand what this actually means for you. So this is caveman,
00:00:54our why say many word when few word do trick repo.
00:00:58Now, right off the bat, what is it doing? Pretty simple,
00:01:02cutting out the filler from Claude Code. Now it talks like a caveman.
00:01:07It gives us some before-and-after examples, shows us the token difference, and even
00:01:11has a full benchmark list showing the tasks it gave Claude Code,
00:01:15like "explain a React re-render bug," the normal tokens being used,
00:01:19the caveman tokens, and the amount saved.
00:01:21Now the numbers put forth in this repo are kind of insane.
00:01:23So they are claiming that with this skill,
00:01:26we are going to cut 75% of output tokens while keeping full technical
00:01:30accuracy.
00:01:31This caveman does not change how Claude Code reasons under the hood.
00:01:35It doesn't change how it actually generates code. None of that gets changed.
00:01:38It's just the output. What you see as a response.
00:01:41It also includes a companion tool that compresses your memory files,
00:01:45think CLAUDE.md, into caveman speak.
00:01:47And that is supposed to reduce our input tokens by 45% every session.
00:01:52Now let's be clear. You are not cutting 75% of your output tokens at large,
00:01:57and 45% of your input tokens at large at all. That is completely not true.
00:02:01Even though we can see these things that say, Hey,
00:02:03it saves 87% of tokens on how it explains a React re-render bug.
00:02:07The response you get back from Claude Code, the text itself,
00:02:11is just a small portion of the output tokens at large,
00:02:15just like the memory files,
00:02:17like CLAUDE.md are just a small portion of the input at large.
00:02:21So let's be very clear about what this is actually buying us on a token scale.
00:02:25You are not saving 80% of your total tokens. And to make it a little more clear,
00:02:28let's break down your average hundred-thousand-token Claude Code session. Now,
00:02:32I understand every session is a little different, but just work with me here.
00:02:36We have a hundred thousand token session, and it's broken up into two parts.
00:02:40The input, which is the lion's share at 75,000 tokens,
00:02:42and the output, which is the remaining 25,000.
00:02:46Now caveman is claiming we're going to reduce output by 75%.
00:02:51That is not true. If we take a look at output, it's really in three parts, right?
00:02:56We have tool calls, taking up a portion of it, code blocks,
00:02:59like the actual code generation, taking a portion of it.
00:03:02And then the actual prose responses,
00:03:06the text it writes back to you. That's what caveman is adjusting.
00:03:10That's what it's reducing. It can reduce 75% of that. You know,
00:03:13if we go down here, we can see, okay,
00:03:16so normally the prose takes up 6K tokens. With caveman,
00:03:20we save 4,000 tokens. So we get a 4% reduction. That's still really good.
00:03:25If we're saving 4% of our total tokens over the course of the week,
00:03:29that certainly adds up,
00:03:30especially in the current environment where we are all so conscious of our usage.
00:03:33But understand this is not 87% of your total tokens. It's 60-70%
00:03:38of one portion of one portion of the total session.
00:03:43Furthermore,
00:03:44if you look at the inputs and it talks about the caveman compression saving 45%,
00:03:49again, not really.
00:03:50We're talking about the system prompt area and only certain parts of the system
00:03:54prompt. So total here, right? We're saving what? Maybe a thousand tokens,
00:03:58maybe 2,000 tokens. And over the course, again, of an entire session,
00:04:03if we say 5,000 tokens, 5% of every session, that's great, good stuff,
00:04:07but it's not these gaudy numbers. So understand that going in,
00:04:13this is an on-the-margin play. This isn't a total game-changer.
00:04:15You're not going to be able to go from basically the 5x Max plan to the 20x Max
00:04:19plan because we're saving 75%. No, no, no, no,
00:04:22but there's still tons of value to be had here and even more value to be
00:04:25extracted. Once we take a look at the study, it's kind of buried in here.
00:04:29There's one little section dedicated to it,
00:04:31but this is a study called "Brevity Constraints Reverse Performance
00:04:34Hierarchies in Language Models."
00:04:36And this came out in early March of this year.
00:04:38So I will put a link to the study down in the description if you want to check it
00:04:41out, but let's just talk about it really quick because this is really interesting.
00:04:45Because the idea and the expectation is bigger model,
00:04:49better than smaller model always. Well,
00:04:53not exactly, not according to this study.
00:04:56So in this study they evaluated 31 models across 1500
00:05:01problems,
00:05:02and they identified the mechanism as spontaneous scale-dependent verbosity that
00:05:07introduces errors through over-elaboration. What the heck does that mean?
00:05:11That means on nearly 8% of the problems across these 1500 problems and
00:05:1631 models, the larger language models,
00:05:19the ones with more parameters underperformed smaller ones by 28
00:05:24percentage points, despite a hundred times more parameters in some cases.
00:05:28So you had scenarios where again, this is with all open weight models.
00:05:32You had a 2 billion parameter model outperforming a 400 billion parameter
00:05:37model. This happened multiple times. This is crazy.
00:05:41Why is this? Well,
00:05:43they posit that the reason why is because these large
00:05:49language models talk too damn much.
00:05:51They are over verbose to the point that they pretty much spin themselves into
00:05:55circles and get the wrong answer because of it. And in the study,
00:05:58they found that constraining large models to produce brief responses,
00:06:02caveman responses, improves accuracy by 26 percentage points and reduces
00:06:07performance gaps by up to two thirds.
00:06:09And in many cases by forcing these large language models to become more concise,
00:06:14more caveman-like, it completely switched that dynamic: the smaller models
00:06:18they were losing to before, they were now defeating.
00:06:21That's kind of wild, especially in context of this GitHub repo. Now,
00:06:26obviously these are open-weight models. This isn't Opus 4.6.
00:06:29This isn't Codex 5.4.
00:06:30Do these frontier models exhibit this exact same sort of behavior?
00:06:34We don't necessarily know for sure,
00:06:36but if you've seen any of these studies you understand usually what you see here
00:06:40tends to be repeated on some level with the frontier models.
00:06:44Maybe it's not this extreme, but there's probably something to it.
00:06:47Now the rest of the study goes into a lot of detail about how they run the tests,
00:06:51how they're trying to break out correlation versus causation and why they think
00:06:55this is a problem. And like I said before,
00:06:57they hypothesize that large models generate excessively verbose responses that
00:07:02obscure correct reasoning, a phenomenon they termed overthinking.
00:07:06It's just trying to put too much out there.
00:07:07Instead of just giving you the answer and getting out of its own way,
00:07:10it talks itself into the wrong answer literally.
00:07:13And they specifically say the learned tendency towards thoroughness becomes
00:07:17counterproductive, introducing error accumulation.
00:07:21Brevity constraints help large models dramatically while barely affecting the
00:07:25smaller models. And an obvious question you should have is, well, why,
00:07:28why is this even the case? Why are these larger models having this issue?
00:07:31They point towards reinforcement learning.
00:07:34So when you train a new model,
00:07:36so imagine Opus 5.0 is in the process of being trained.
00:07:40Part of what they do is reinforcement learning.
00:07:42Now I don't know if Anthropic does it specifically,
00:07:44but this is how it's done for many models.
00:07:45Essentially they take the new model and they bring in a human to grade its
00:07:50answers. They show the human multiple answers, and the human says,
00:07:52"I like this one more than this one." And they're saying in the study,
00:07:55chances are humans tend to like more verbose answers, more thorough answers.
00:08:00And because of that,
00:08:01these larger models are essentially trained to be more verbose rather than
00:08:05concise and even correct in some instances.
00:08:08But the big takeaway here is this is that brevity constraints completely reversed
00:08:12the performance hierarchies. So where they were losing before,
00:08:14now they were winning simply by telling them be more concise.
00:08:18They didn't change how the models thought; they didn't change anything under the hood.
00:08:20They just said, be a caveman. Now, they weren't literally using this GitHub repo,
00:08:25but same exact thing.
00:08:28So this is why I think this is actually kind of interesting,
00:08:31not just a complete meme, you know,
00:08:32beyond the fact that there are some token positives here,
00:08:37saving 5% of tokens is nothing to laugh at,
00:08:39especially if you aren't on a Max 20x plan.
00:08:41But if there's a potential scenario where we're actually getting better outputs
00:08:44because of it, especially on more straightforward questions,
00:08:47because if you dive into that study,
00:08:49it kind of breaks out like which questions they kind of had this issue with in
00:08:53this dynamic. It's interesting, very interesting,
00:08:56which is why I think this is kind of worth looking at.
00:08:58And it's also super simple to use. It's just a set of skills.
00:09:02Installing this literally is one line and then running it.
00:09:06We either invoke it with /caveman, or just say something like
00:09:09"talk like a caveman," "caveman mode," or "less tokens, please." There's also levels to it.
00:09:13So we can go like ultra caveman, right? We like just came out of the ocean.
00:09:17We can barely stand up straight. And then we have everything down to Lite.
00:09:21So you can get different levels of caveman throughout the ages.
00:09:24And it isn't a blanket thing,
00:09:25either. Things like error messages are quoted exactly. And again,
00:09:29anything to do with code, anything to do with generation,
00:09:31anything under the hood stays the same. We're not changing how it really thinks.
00:09:35So overall, I think this is worth trying out. It's a single skill.
00:09:37It saves tokens and there's no real downside. And based on the study,
00:09:42there's actually potential upside here in terms of outputs.
00:09:45And if you don't like the whole caveman thing,
00:09:48I think this points towards, at the very least, putting some sort of line in your
00:09:52CLAUDE.md that says: be concise, no filler,
00:09:56straight to the point, use less words,
00:09:59because clearly there's an advantage to that, not just in tokens,
00:10:03but like we saw potentially the actual answers it gives us.
00:10:06So that's where I'm going to leave you guys for today.
00:10:07What looked like on the surface to be just like a complete meme project,
00:10:11caveman Claude actually has some weight to it and some actual, you know,
00:10:15scientific rigor behind the why,
00:10:17which I think actually makes this something worth implementing.
00:10:21So as always, let me know in the comments, what you thought,
00:10:25make sure to check out CChase AI Plus
00:10:26if you want to get your hands on my Claude Code masterclass,
00:10:29got more updates dropping in that space in the next couple of days.
00:10:33But besides that, I'll see you guys around.

Key Takeaway

Implementing brevity constraints like Caveman mode trims total session token usage by roughly 5% and, per the cited study, can improve LLM accuracy by up to 26 percentage points by preventing larger models from overthinking themselves into errors.

Highlights

Caveman Claude forces LLMs to speak in concise, Neanderthal-style prose to eliminate linguistic filler and save tokens.

A research study covering 31 models and 1,500 problems found that on roughly 8% of problems, larger models underperform smaller ones by 28 percentage points due to scale-dependent verbosity.

Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds.

The Caveman skill reduces total session tokens by approximately 5% by targeting prose responses and memory file inputs.

Human preference for thoroughness during reinforcement learning trains larger models to be over-verbose, leading to error accumulation termed overthinking.

Brevity constraints can reverse performance hierarchies, allowing a 2 billion parameter model to outperform a 400 billion parameter model.

Timeline

The Caveman Claude Repository and Token Savings

  • The Caveman GitHub repository gained 5,000 stars in 72 hours by forcing Claude to use concise, filler-free language.
  • The tool includes a companion feature that compresses memory files like CLAUDE.md into caveman speak to reduce input tokens.
  • Initial claims of 75% output and 45% input token savings are based on specific text blocks rather than total session volume.

While the repository promotes high percentage reductions, these numbers apply specifically to prose responses and system prompts. In a standard coding session, actual savings are more modest because code blocks and tool calls remain unchanged. The primary value lies in trimming the conversational overhead that occupies a fraction of the total token window.

Token Math in a Standard Claude Session

  • A typical 100,000 token session consists of roughly 75,000 input tokens and 25,000 output tokens.
  • Caveman mode yields a 4% to 5% total token reduction by targeting only the prose portion of the output.
  • Targeting the system prompt and memory files results in an additional saving of 1,000 to 2,000 input tokens per session.

Total session tokens are dominated by input data and generated code, which Caveman mode does not alter to ensure technical accuracy. By reducing a 6,000-token prose response by 75%, the user saves roughly 4,000 tokens. While this does not allow for a 20x increase in usage limits, it provides a consistent marginal gain for heavy users.
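The arithmetic above can be sketched in a few lines. Every figure is the illustrative number from this breakdown, not a measurement of any real session:

```python
# Back-of-envelope token math from the video's example session.
# All numbers below are the illustrative figures quoted in the video.

session_total = 100_000   # typical Claude Code session
input_tokens  = 75_000    # system prompt, memory files, history
output_tokens = 25_000    # tool calls + code blocks + prose

prose_saved = 4_000       # caveman trimming a ~6K-token prose response
input_saved = 1_000       # compressed CLAUDE.md / system-prompt text

total_saved = prose_saved + input_saved
share = total_saved / session_total
print(f"saved {total_saved} of {session_total} tokens ({share:.0%})")
# → saved 5000 of 100000 tokens (5%)
```

The point of laying it out this way is that the headline 75% applies only to `prose_saved`'s slice, so the session-level figure lands near 5%.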

The Science of Brevity Constraints and Accuracy

  • A March 2026 study titled 'Brevity Constraints Reverse Performance Hierarchies in Language Models' identifies over-elaboration as a primary cause of LLM error.
  • Large language models often spin themselves into circles, talking until they arrive at a wrong answer despite having 100x more parameters than smaller models.
  • Forcing brevity essentially resets the performance hierarchy, allowing larger models to reclaim their dominant status by getting out of their own way.

The research evaluated 31 open-weight models and found that scale-dependent verbosity introduces errors in approximately 8% of problems. When these models are restricted to caveman-like responses, their reasoning becomes clearer and more accurate. This phenomenon suggests that frontier models likely suffer from similar 'overthinking' issues that can be mitigated through prompting constraints.

Why LLMs Suffer From Overthinking

  • Reinforcement learning from human feedback (RLHF) encourages models to be thorough because humans tend to rate verbose answers more highly.
  • The learned tendency toward thoroughness becomes counterproductive, introducing a cumulative error effect as the model generates more text.
  • Brevity constraints significantly help large models while having almost no impact on the performance of smaller, less verbose models.

The training process creates a bias where 'longer' is equated with 'better' or 'more helpful' by human graders. This bias forces models to prioritize word count and detail over directness, which often leads to logical lapses. By manually enforcing a brevity constraint, users bypass this trained behavior and force the model to prioritize its core reasoning capabilities.

Implementation and Practical Application

  • The Caveman skill is installed via a single command and can be toggled using slash commands or natural language triggers.
  • Multiple levels of brevity exist, ranging from 'Ultra Caveman' to 'Lite' modes, to suit different user preferences.
  • Adding 'be concise' or 'no filler' to a system configuration file like CLAUDE.md provides similar benefits without using the specific GitHub repo.

Technical generation, such as code blocks and error messages, is quoted exactly and remains unaffected by the brevity filter. Even if users choose not to adopt the caveman persona, the underlying science suggests that any prompt requiring directness will likely yield higher quality answers. Implementing these constraints is a low-risk strategy with potential upsides in both cost and performance.
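For those skipping the repo entirely, the advice above reduces to a few memory-file lines. A hypothetical CLAUDE.md fragment might look like this; the wording is illustrative, not the caveman skill's actual rules:

```markdown
## Response style
<!-- Illustrative brevity rules, not the caveman repo's actual skill text -->
- Be concise: no filler, no preamble, no restating the question.
- Lead with the answer; elaborate only when asked.
- Quote error messages and code exactly; never compress them.
```

The last rule mirrors the caveman skill's own exception: brevity applies to prose, never to code or quoted errors.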
