$20,000. 2 Weeks. 16 Claude Agents. Anthropic's First AI-Built C Compiler

Better Stack

Transcript

00:00:00Anthropic has just done something huge: they let 16 Claude agents loose to build
00:00:05a C compiler, and after running 24/7 for two weeks, the agents actually built one that could compile
00:00:11the Linux kernel and even run Doom, which is super impressive and definitely wasn't possible
00:00:16with the older versions of Opus 4. But people are calling this achievement clickbait and
00:00:22a half-truth because of the questionable techniques Anthropic used to get this result.
00:00:28So, did Anthropic cheat? Hit subscribe and let's find out.
00:00:31We'll split this video into three parts. First, we'll go through how the experiment was set up,
00:00:37then we'll go through the key findings, which I think every developer will learn a lot from,
and finally, we'll go through whether the results are valid, because I really have some opinions
on how Anthropic was able to build this compiler. Okay, this experiment was carried out by Nicholas
00:00:52Carlini, who in my opinion is a very intelligent human being. I mean, let's take a look at how he
00:00:58set this up. So the actual project lived in a directory called Upstream, which was mounted
00:01:03to 16 different Docker containers. I know there are only four here, but let's imagine there are 16.
00:01:08And each one of these Docker containers contained a version of Claude code running Opus 4.6 and
00:01:15would clone the Upstream repo to Workspace and would make all the changes in Workspace and then
00:01:21push to Upstream. This was really clever because each agent could work in isolation without
00:01:27affecting the work of the other agents. Now, if there was ever a merge conflict, then Claude
would be clever enough to resolve it and then push the result back up to Upstream. Each agent would pick from
a pool of tasks. Now, I'm not sure if these tasks were generated by a human or generated by the agent
00:01:44based on running some tests, but there were some tasks that existed with names and each agent would
take a new task, and each new task would start in a fresh session. So in order to keep
00:01:56these agents running for a long time, a Ralph loop was used and so the agents would work on the task,
00:02:02finish the task, push to Upstream, then pick a new task with a fresh session and keep doing that over
00:02:08and over again. Now, if you've watched our video on Ralph, you'll know that the key to having long
00:02:13running agents is to have clearly defined tasks. But if you have 16 agents running at the same time,
how do you prevent them from picking the exact same task? Task lock-in. The way this works is that somewhere (the author doesn't say where) a list of tasks exists. An agent picks a task, creates a text file matching the name of that task, makes a commit to lock the task so only that agent can work on it, and pushes it to the Upstream repo. If another agent picks the same task and makes the same text file, Git will reject its push because that file already exists, and it will have to work on a different task. And this is the basis
00:02:53of how Carlini stress tested the ability of long running agents powered by Opus 4.6 and the results
00:03:00are truly amazing. But from doing this experiment, he has found some interesting things that I think
00:03:07every single developer can learn from. The first thing is to build a test harness or a script that
00:03:12runs different types of tests because when Nick was running the experiment, yes, we're on first name
00:03:17terms now, he experienced Claude breaking existing features whenever a new feature was worked on.
00:03:23So he built a testing harness consisting of high quality tests from popular open source repos like
SQLite, libjpeg and Redis. And to prevent context pollution, he made sure the test harness only outputted logs that were useful to the agent, so basically error logs, and wrote every other type of log to a file that Claude could look into whenever it needed. However, with thousands
00:03:47of tests, it would take agents hours to run the whole test suite when they could be using that
00:03:52time to do something else. So this is where Nick did something really clever. He added a fast flag
to his testing harness, meaning an agent would only run 1% or 10% of the total tests, based on a figure that Nick wanted. If each of the 16 agents ran 10%, that would be 160% of the tests in aggregate, which is more than enough, and the overlap isn't a bad thing. The way it worked is that the specific tests run by each agent were randomised, but the seed number was fixed, making the selection deterministic and reproducible. So between them, the agents would get through the whole test suite much faster than if any one of them were running the whole thing by itself. The next
00:04:36point is also clever, but a bit of a controversial one since it's to make use of existing technology.
00:04:41So far, each agent has been running unit tests from a bunch of existing open source projects,
00:04:46which was working well, splitting them up into 1% or 10% chunks. But when it came to compiling the
00:04:53Linux kernel, since these source files aren't individual unit tests, things became a bit
difficult, because each agent would try to compile the whole thing, hit the same error, try to fix it, and overwrite the other agents' fixes. The way Nick got around this was, again, to have each agent run a percentage of the compilation and have GCC, the GNU Compiler Collection, handle the rest. Nick called GCC the oracle, since the Linux kernel should compile
perfectly with it. So each agent compiled a different section of the Linux kernel with its own compiler and the rest with GCC; if something broke, it was definitely the agent's compiler and not GCC, and therefore the agent would fix its own bug instead of chasing a bug from another agent. Now this is controversial, because it's using an existing compiler to do
00:05:46something that Claude was asked to do from scratch. But we'll talk more about this towards the end of
00:05:51the video. Let's move on to the next point, which is to give your agent memory. Since new tasks are
worked on by fresh Claude sessions that have pretty much no memory of what was done before them,
Nick found it useful to update the README file and to have different progress files with instructions
00:06:09of where things left off and the progress of the project so that new sessions would have a good
00:06:13base to start off from and not introduce bugs that have already been fixed before. And the final more
00:06:18obvious point is to give your agents different roles. The beauty of having multiple agents work
on a code base in parallel is that multiple things can be done to the same code at
the same time. So when new code wasn't being written, Nick gave agents unique roles, like
00:06:35one to check for duplicated code, another to find a way of making the code as performant as possible,
00:06:40and he even got one to critique the design from the perspective of a Rust developer who I hope didn't
00:06:45announce to the other agents that it was a Rust developer. But as successful as this project was,
00:06:51the real question is, did Anthropic cheat to get this result? Well, kind of. So the task was to
00:06:57build a C compiler from scratch and the agent didn't have access to the internet, so it came
00:07:03up with all the code itself. Or did it? Because it did have access to the test suites of open
00:07:10source projects and it had access to the compiled version of GCC. So technically it could have poked
00:07:16and prodded the GCC compiler, giving it inputs and inspecting the outputs and using that to direct the
00:07:24design of its own compiler written in Rust. But to be fair, if I was building a C compiler from
00:07:31scratch, I would do the same thing. I would look at existing compilers, see how they were implemented
00:07:36and use that to shape the direction of my own compiler. Now, if I was building a compiler for
a brand new language, then of course things would be much more difficult, and maybe this would be a really
00:07:47good test to do for Claude to see if it's actually good at creating compilers from scratch. Maybe
00:07:53that's another idea for Nick to try out, but let's move on to talk about the autonomous nature of the
00:07:57experiment since that was also tested. And to be fair, yes, Claude did write all of the code, but
00:08:04it had some heavy steering from a human. A human decided on what test suite to run. A human started
the loop and decided to use Rust. A human was the one that built the test harness and gave agents
00:08:16specific roles. So while this is far from someone telling Claude to build a compiler and leaving it
00:08:22to run forever and ever, I wouldn't say the code was written by an agent that was a hundred percent
00:08:28autonomous because how good would the compiler have been if a human wasn't involved in the first place?
00:08:33And even with all the systems in place designed by a human, the Claude C compiler did have some
00:08:39key limitations. For example, it used the assembler and linker from GCC because the one it created was
00:08:46too buggy. It also needed GCC's 16-bit x86 compiler in order to boot up Linux. And to top that all off,
the code wasn't very efficient. The most optimized version of Claude's compiler was less
00:09:00performant than the least optimized version of the GCC compiler. So it looks like developers
00:09:05aren't going anywhere anytime soon, or at least for now.

Key Takeaway

While Anthropic's Claude agents successfully built a complex C compiler capable of running Doom, the project's success relied heavily on human-designed infrastructure, existing compiler oracles, and significant computational costs.

Highlights

Anthropic researcher Nicholas Carlini used 16 Claude agents (Opus 4.6) running 24/7 for two weeks to build a functional C compiler.

The AI-built compiler successfully compiled the Linux kernel and ran the game Doom, showcasing high-level code generation capabilities.

Strategic use of a 'test harness' with randomized seeds allowed the agents to run massive test suites 16 times faster than a single agent.

The experiment utilized an 'Oracle' (GCC) to provide ground truth for compiling the Linux kernel, effectively isolating AI-specific errors.

Despite the success, critics label the project a 'half-truth' due to heavy human steering and reliance on existing GCC components for assembly and linking.

Performance remains a hurdle, as the most optimized version of the Claude-built compiler was slower than the least optimized version of GCC.

Timeline

Introduction and the $20,000 Experiment

The video introduces Anthropic's ambitious experiment where 16 Claude agents were tasked with building a C compiler from scratch over a two-week period. This project cost approximately $20,000 and resulted in a compiler capable of running the Linux kernel and Doom, an achievement impossible with previous AI models. However, the speaker notes that the AI community is divided, with some calling the results 'clickbait' due to the specific techniques used. The introduction sets the stage for a three-part analysis covering the technical setup, key findings for developers, and a final verdict on the project's validity. This context is crucial for understanding whether the AI truly achieved autonomy or was heavily assisted by human intervention.

Technical Setup: Docker, Ralph Loops, and Task Lock-in

Researcher Nicholas Carlini designed a sophisticated environment where 16 Docker containers accessed a shared 'Upstream' repository to facilitate parallel development. To maintain continuous operation, a 'Ralph loop' was implemented, allowing agents to pick tasks, complete them, and immediately start fresh sessions to avoid context degradation. A clever 'task lock-in' mechanism prevented agents from duplicating work by creating specific text files that acted as flags within the Git repository. This setup ensured that each agent could work in isolation while Claude's internal logic handled any resulting merge conflicts during the push to the main repo. This section highlights the engineering complexity required to manage a swarm of LLMs working on a single, large-scale codebase.

Optimization Strategies: Test Harnesses and Deterministic Scaling

To prevent the AI from breaking existing features, Carlini integrated a robust test harness utilizing high-quality tests from open-source projects like SQLite and Redis. A major innovation was the use of a 'fast flag' that assigned each agent a randomized subset (1% to 10%) of the total test suite. By using a consistent seed number, the process remained deterministic while ensuring that the collective work of 16 agents covered the entire test suite much faster. This approach prevented context pollution by only showing agents relevant error logs rather than overwhelming them with thousands of successful test results. These insights serve as a blueprint for developers looking to scale AI-driven software testing and quality assurance.
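One way the "fast flag" sampling above can work is to draw a fixed fraction of the suite with a seeded random generator, so the selection is random but fully reproducible. Mixing a run index into the seed is my assumption for illustration; the video doesn't detail the exact scheme Carlini used.

```python
# Illustrative sketch of a "--fast" test selector: sample a reproducible
# fraction of the suite. Same seed and run index => same subset; a new
# run index => a different slice, so repeated runs cover the whole suite.
import random

def fast_subset(tests, fraction=0.10, seed=42, run_index=0):
    """Return a deterministic random `fraction` of `tests`."""
    rng = random.Random(seed * 1_000_003 + run_index)
    k = max(1, int(len(tests) * fraction))
    return sorted(rng.sample(tests, k))

suite = [f"test_{i:04d}" for i in range(1000)]
a = fast_subset(suite, run_index=0)
b = fast_subset(suite, run_index=0)   # identical to `a`
c = fast_subset(suite, run_index=1)   # a different reproducible slice
```

With 16 agents each taking a different 10% slice per run, the collective pass over the suite finishes far sooner than any single agent running all 1,000 tests, which is the scaling effect the section describes.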

The GCC Oracle and Knowledge Management

One of the most controversial aspects of the project was the use of GCC as an 'Oracle' to help the agents compile the Linux kernel. When an agent's code failed, the system would compare the result against GCC; if GCC succeeded, the agent knew the bug was internal to its own compiler design. To manage long-term project memory, the human lead instructed agents to update 'README' and progress files, ensuring new sessions understood previous milestones. Furthermore, agents were assigned specialized roles, such as performance optimization or code duplication checking, and one even acted as a Rust developer to critique the design. This collaborative 'agentic' workflow allowed for simultaneous improvements across different facets of the compiler's architecture.
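The oracle split above can be sketched as a simple partition: agent i of n compiles only its own slice of the sources with the Claude-built compiler and leaves the rest to GCC, so a build failure points at exactly one agent's compiler. The round-robin partitioning here is my assumption; the video doesn't specify how the split was chosen.

```python
# Toy sketch of the GCC-as-oracle split: partition the source files into a
# slice the agent compiles itself and a remainder handed to the trusted
# oracle compiler. Round-robin assignment keeps the agents' own slices
# disjoint, so each failure has a single suspect.
def split_sources(files, agent_index, num_agents):
    """Partition `files` into (compile-myself, leave-to-GCC) lists."""
    mine, oracle = [], []
    for i, f in enumerate(sorted(files)):
        (mine if i % num_agents == agent_index else oracle).append(f)
    return mine, oracle

files = ["kernel/sched.c", "kernel/fork.c", "kernel/exit.c", "kernel/pid.c"]
m0, g0 = split_sources(files, agent_index=0, num_agents=2)
m1, g1 = split_sources(files, agent_index=1, num_agents=2)
# Every file is compiled exactly once in each agent's build, and the two
# agents' own slices don't overlap, so fixes can't clobber each other.
```

The payoff is the isolation property the section describes: if agent 0's build breaks, the bug must be in its own slice (`m0`), because everything in `g0` went through the known-good oracle.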

Final Verdict: Autonomy vs. Human Steering

The video concludes by addressing whether Anthropic 'cheated' by providing the agents with access to GCC and existing test suites. While the agents wrote the actual Rust code for the compiler, they relied on GCC's assembler and linker because the AI-generated versions were too buggy to function. The speaker emphasizes that the project was not truly autonomous, as a human defined the tasks, built the testing infrastructure, and directed the high-level strategy. Additionally, the final product's performance was significantly lower than that of traditional compilers, with the most optimized AI version trailing behind the least optimized GCC build. Ultimately, while impressive, the experiment suggests that human developers are still essential for high-stakes, high-performance systems programming.
