$20,000. 2 Weeks. 16 Claude Agents. Anthropic's First AI-Built C Compiler

Better Stack

Transcript

00:00:00Anthropic has just done something huge: they let 16 Claude agents loose to build
00:00:05a C compiler, and after running 24/7 for two weeks, the agents actually built one that could compile
00:00:11the Linux kernel and even run Doom, which is super impressive and definitely wasn't possible
00:00:16with the older versions of Opus 4. But people are calling this achievement clickbait and
00:00:22a half-truth because of the questionable techniques Anthropic used to get this result.
00:00:28So, did Anthropic cheat? Hit subscribe and let's find out.
00:00:31We'll split this video into three parts. First, we'll go through how the experiment was set up,
00:00:37then we'll go through the key findings, which I think every developer will learn a lot from,
and finally, we'll go through whether the results are valid, because I really have some opinions
on how Anthropic was able to build this compiler. Okay, this experiment was carried out by Nicholas
00:00:52Carlini, who in my opinion is a very intelligent human being. I mean, let's take a look at how he
00:00:58set this up. So the actual project lived in a directory called Upstream, which was mounted
00:01:03to 16 different Docker containers. I know there are only four here, but let's imagine there are 16.
00:01:08And each one of these Docker containers contained a version of Claude code running Opus 4.6 and
00:01:15would clone the Upstream repo to Workspace and would make all the changes in Workspace and then
00:01:21push to Upstream. This was really clever because each agent could work in isolation without
00:01:27affecting the work of the other agents. Now, if there was ever a merge conflict, then Claude
would be clever enough to resolve it and then push the result back up to Upstream. Each agent would pick from
a pool of tasks. Now, I'm not sure if these tasks were generated by a human or generated by the agent
00:01:44based on running some tests, but there were some tasks that existed with names and each agent would
take a new task, and each new task would start in a fresh session. So in order to keep
00:01:56these agents running for a long time, a Ralph loop was used and so the agents would work on the task,
00:02:02finish the task, push to Upstream, then pick a new task with a fresh session and keep doing that over
00:02:08and over again. Now, if you've watched our video on Ralph, you'll know that the key to having long
00:02:13running agents is to have clearly defined tasks. But if you have 16 agents running at the same time,
how do you prevent them from picking the exact same task? Task lock-in. The way this works is that somewhere (the author doesn't say where) a list of tasks exists. An agent picks a task, creates a text file matching the name of that task, makes a commit to lock the task so only that agent can work on it, and pushes it to the Upstream repo. If another agent picks the same task and makes the same text file, Git will reject its push because that file already exists, and it will have to work on a different task. And this is the basis
00:02:53of how Carlini stress tested the ability of long running agents powered by Opus 4.6 and the results
00:03:00are truly amazing. But from doing this experiment, he has found some interesting things that I think
00:03:07every single developer can learn from. The first thing is to build a test harness or a script that
00:03:12runs different types of tests because when Nick was running the experiment, yes, we're on first name
00:03:17terms now, he experienced Claude breaking existing features whenever a new feature was worked on.
00:03:23So he built a testing harness consisting of high quality tests from popular open source repos like
SQLite, libjpeg and Redis. And to prevent context pollution, he made sure the test harness only outputted logs that were useful to the agent, so basically error logs, and wrote every other type of log to a file that Claude could look into whenever it needed. However, with thousands
00:03:47of tests, it would take agents hours to run the whole test suite when they could be using that
00:03:52time to do something else. So this is where Nick did something really clever. He added a fast flag
to his testing harness, meaning an agent would only run 1% or 10% of the total tests, based on a figure that Nick wanted. If each of the 16 agents ran 10%, that would be 160% of the tests in aggregate, which is more than enough, and the overlap isn't a bad thing. The way it worked is that the specific tests run by each agent were randomised, but the seed number was fixed, making the selection deterministic and reproducible. So between them, the agents would get through the whole test suite much faster than if any one of them were running the whole thing by itself. The next
00:04:36point is also clever, but a bit of a controversial one since it's to make use of existing technology.
00:04:41So far, each agent has been running unit tests from a bunch of existing open source projects,
00:04:46which was working well, splitting them up into 1% or 10% chunks. But when it came to compiling the
00:04:53Linux kernel, since these source files aren't individual unit tests, things became a bit
difficult, because each agent would try to compile the whole thing, hit the same error, try to fix it, and overwrite the other agents' fixes. The way Nick got around this was, again, to have each agent run a percentage of the compilation and have GCC, the GNU Compiler Collection, handle the rest. Nick called GCC the oracle, since the Linux kernel should compile
perfectly with it. So each agent compiled a different section of the Linux kernel with its own compiler and the rest with GCC; if something broke, it was definitely the agent's compiler and not GCC, and therefore the agent would fix its own bug instead of chasing a bug from another agent. Now this is controversial, because it's using an existing compiler to do
00:05:46something that Claude was asked to do from scratch. But we'll talk more about this towards the end of
00:05:51the video. Let's move on to the next point, which is to give your agent memory. Since new tasks are
worked on by fresh Claude sessions that have pretty much no memory of what was done before them,
Nick found it useful to update the README file and to have different progress files with instructions
00:06:09of where things left off and the progress of the project so that new sessions would have a good
00:06:13base to start off from and not introduce bugs that have already been fixed before. And the final more
00:06:18obvious point is to give your agents different roles. The beauty of having multiple agents work
on a code base in parallel is that multiple things can be done to the same code at
the same time. So when new code wasn't being written, Nick gave agents unique roles, like
00:06:35one to check for duplicated code, another to find a way of making the code as performant as possible,
00:06:40and he even got one to critique the design from the perspective of a Rust developer who I hope didn't
00:06:45announce to the other agents that it was a Rust developer. But as successful as this project was,
00:06:51the real question is, did Anthropic cheat to get this result? Well, kind of. So the task was to
00:06:57build a C compiler from scratch and the agent didn't have access to the internet, so it came
00:07:03up with all the code itself. Or did it? Because it did have access to the test suites of open
00:07:10source projects and it had access to the compiled version of GCC. So technically it could have poked
00:07:16and prodded the GCC compiler, giving it inputs and inspecting the outputs and using that to direct the
00:07:24design of its own compiler written in Rust. But to be fair, if I was building a C compiler from
00:07:31scratch, I would do the same thing. I would look at existing compilers, see how they were implemented
00:07:36and use that to shape the direction of my own compiler. Now, if I was building a compiler for
a brand new language, then of course things would be much more difficult, and maybe this would be a really
00:07:47good test to do for Claude to see if it's actually good at creating compilers from scratch. Maybe
00:07:53that's another idea for Nick to try out, but let's move on to talk about the autonomous nature of the
00:07:57experiment since that was also tested. And to be fair, yes, Claude did write all of the code, but
00:08:04it had some heavy steering from a human. A human decided on what test suite to run. A human started
the loop and decided to use Rust. A human was the one that built the test harness and gave agents
00:08:16specific roles. So while this is far from someone telling Claude to build a compiler and leaving it
00:08:22to run forever and ever, I wouldn't say the code was written by an agent that was a hundred percent
00:08:28autonomous because how good would the compiler have been if a human wasn't involved in the first place?
00:08:33And even with all the systems in place designed by a human, the Claude C compiler did have some
00:08:39key limitations. For example, it used the assembler and linker from GCC because the one it created was
00:08:46too buggy. It also needed GCC's 16-bit x86 compiler in order to boot up Linux. And to top that all off,
the code wasn't very efficient. The most optimized version of Claude's compiler was less
00:09:00performant than the least optimized version of the GCC compiler. So it looks like developers
00:09:05aren't going anywhere anytime soon, or at least for now.

Key Takeaway

While Anthropic's Claude agents successfully built a complex C compiler capable of running Doom, the project's success relied heavily on human-designed infrastructure, existing compiler oracles, and significant computational costs.

Highlights

Anthropic researcher Nicholas Carlini used 16 Claude agents (Opus 4.6) running 24/7 for two weeks to build a functional C compiler.

The AI-built compiler successfully compiled the Linux kernel and ran the game Doom, showcasing high-level code generation capabilities.

Strategic use of a 'test harness' with randomized seeds allowed the agents to run massive test suites 16 times faster than a single agent.

The experiment utilized an 'Oracle' (GCC) to provide ground truth for compiling the Linux kernel, effectively isolating AI-specific errors.

Despite the success, critics label the project a 'half-truth' due to heavy human steering and reliance on existing GCC components for assembly and linking.

Performance remains a hurdle, as the most optimized version of the Claude-built compiler was slower than the least optimized version of GCC.

Timeline

Introduction and the $20,000 Experiment

The video introduces Anthropic's ambitious experiment where 16 Claude agents were tasked with building a C compiler from scratch over a two-week period. This project cost approximately $20,000 and resulted in a compiler capable of running the Linux kernel and Doom, an achievement impossible with previous AI models. However, the speaker notes that the AI community is divided, with some calling the results 'clickbait' due to the specific techniques used. The introduction sets the stage for a three-part analysis covering the technical setup, key findings for developers, and a final verdict on the project's validity. This context is crucial for understanding whether the AI truly achieved autonomy or was heavily assisted by human intervention.

Technical Setup: Docker, Ralph Loops, and Task Lock-in

Researcher Nicholas Carlini designed a sophisticated environment where 16 Docker containers accessed a shared 'Upstream' repository to facilitate parallel development. To maintain continuous operation, a 'Ralph loop' was implemented, allowing agents to pick tasks, complete them, and immediately start fresh sessions to avoid context degradation. A clever 'task lock-in' mechanism prevented agents from duplicating work by creating specific text files that acted as flags within the Git repository. This setup ensured that each agent could work in isolation while Claude's internal logic handled any resulting merge conflicts during the push to the main repo. This section highlights the engineering complexity required to manage a swarm of LLMs working on a single, large-scale codebase.

Optimization Strategies: Test Harnesses and Deterministic Scaling

To prevent the AI from breaking existing features, Carlini integrated a robust test harness utilizing high-quality tests from open-source projects like SQLite and Redis. A major innovation was the use of a 'fast flag' that assigned each agent a randomized subset (1% to 10%) of the total test suite. By using a consistent seed number, the process remained deterministic while ensuring that the collective work of 16 agents covered the entire test suite much faster. This approach prevented context pollution by only showing agents relevant error logs rather than overwhelming them with thousands of successful test results. These insights serve as a blueprint for developers looking to scale AI-driven software testing and quality assurance.
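One way the "fast flag" sampling above can work is to draw a fixed fraction of the suite with a seeded random generator, so the selection is random but fully reproducible. Mixing a run index into the seed is my assumption for illustration; the video doesn't detail the exact scheme Carlini used.

```python
# Illustrative sketch of a "--fast" test selector: sample a reproducible
# fraction of the suite. Same seed and run index => same subset; a new
# run index => a different slice, so repeated runs cover the whole suite.
import random

def fast_subset(tests, fraction=0.10, seed=42, run_index=0):
    """Return a deterministic random `fraction` of `tests`."""
    rng = random.Random(seed * 1_000_003 + run_index)
    k = max(1, int(len(tests) * fraction))
    return sorted(rng.sample(tests, k))

suite = [f"test_{i:04d}" for i in range(1000)]
a = fast_subset(suite, run_index=0)
b = fast_subset(suite, run_index=0)   # identical to `a`
c = fast_subset(suite, run_index=1)   # a different reproducible slice
```

With 16 agents each taking a different 10% slice per run, the collective pass over the suite finishes far sooner than any single agent running all 1,000 tests, which is the scaling effect the section describes.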

The GCC Oracle and Knowledge Management

One of the most controversial aspects of the project was the use of GCC as an 'Oracle' to help the agents compile the Linux kernel. When an agent's code failed, the system would compare the result against GCC; if GCC succeeded, the agent knew the bug was internal to its own compiler design. To manage long-term project memory, the human lead instructed agents to update 'README' and progress files, ensuring new sessions understood previous milestones. Furthermore, agents were assigned specialized roles, such as performance optimization or code duplication checking, and one even acted as a Rust developer to critique the design. This collaborative 'agentic' workflow allowed for simultaneous improvements across different facets of the compiler's architecture.
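The oracle split above can be sketched as a simple partition: agent i of n compiles only its own slice of the sources with the Claude-built compiler and leaves the rest to GCC, so a build failure points at exactly one agent's compiler. The round-robin partitioning here is my assumption; the video doesn't specify how the split was chosen.

```python
# Toy sketch of the GCC-as-oracle split: partition the source files into a
# slice the agent compiles itself and a remainder handed to the trusted
# oracle compiler. Round-robin assignment keeps the agents' own slices
# disjoint, so each failure has a single suspect.
def split_sources(files, agent_index, num_agents):
    """Partition `files` into (compile-myself, leave-to-GCC) lists."""
    mine, oracle = [], []
    for i, f in enumerate(sorted(files)):
        (mine if i % num_agents == agent_index else oracle).append(f)
    return mine, oracle

files = ["kernel/sched.c", "kernel/fork.c", "kernel/exit.c", "kernel/pid.c"]
m0, g0 = split_sources(files, agent_index=0, num_agents=2)
m1, g1 = split_sources(files, agent_index=1, num_agents=2)
# Every file is compiled exactly once in each agent's build, and the two
# agents' own slices don't overlap, so fixes can't clobber each other.
```

The payoff is the isolation property the section describes: if agent 0's build breaks, the bug must be in its own slice (`m0`), because everything in `g0` went through the known-good oracle.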

Final Verdict: Autonomy vs. Human Steering

The video concludes by addressing whether Anthropic 'cheated' by providing the agents with access to GCC and existing test suites. While the agents wrote the actual Rust code for the compiler, they relied on GCC's assembler and linker because the AI-generated versions were too buggy to function. The speaker emphasizes that the project was not truly autonomous, as a human defined the tasks, built the testing infrastructure, and directed the high-level strategy. Additionally, the final product's performance was significantly lower than that of traditional compilers, with the most optimized AI version trailing behind the least optimized GCC build. Ultimately, while impressive, the experiment suggests that human developers are still essential for high-stakes, high-performance systems programming.
