Anthropic have just done something huge: they let 16 Claude agents loose to build a C compiler, and after running 24/7 for two weeks, the agents actually built one that could compile the Linux kernel and even run Doom, which is seriously impressive and definitely wasn't possible with the older versions of Opus 4. But people are calling this achievement clickbait and a half-truth because of the questionable techniques Anthropic used to get the result. So, did Anthropic cheat? Hit subscribe and let's find out.
We'll split this video into three parts. First, we'll go through how the experiment was set up; then we'll go through the key findings, which I think every developer can learn a lot from; and finally, we'll look at whether the results are valid, because I have some strong opinions on how Anthropic was able to build this compiler.

Okay, this experiment was carried out by Nicholas Carlini, who in my opinion is a very intelligent human being. I mean, just look at how he set this up. The actual project lived in a directory called Upstream, which was mounted into 16 different Docker containers. I know there are only four on screen here, but let's imagine there are 16.
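To make the setup concrete, here's a minimal sketch of how a fleet like this might be launched. The image name, mount paths, and container names are my own illustrative guesses, not Anthropic's actual configuration.

```python
# Hypothetical sketch of launching 16 agent containers that all share
# one mounted Upstream repo. Image name and paths are invented.

def docker_run_command(agent_id: int, upstream_dir: str = "/srv/upstream") -> list[str]:
    """Build the `docker run` invocation for one agent's container."""
    return [
        "docker", "run", "-d",
        "--name", f"claude-agent-{agent_id}",
        # Every container mounts the same Upstream repo, so all agents
        # share one git history to clone from and push back to.
        "-v", f"{upstream_dir}:/upstream",
        "claude-code:latest",  # hypothetical image name
    ]

commands = [docker_run_command(i) for i in range(16)]
print(len(commands), commands[0][4])  # prints: 16 claude-agent-0
```

The commands are only built here, not executed; the point is simply that every container sees the same shared repository.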
Each of these Docker containers ran an instance of Claude Code on Opus 4.6. An agent would clone the Upstream repo into its own Workspace directory, make all of its changes in Workspace, and then push back to Upstream. This was really clever, because each agent could work in isolation without affecting the work of the other agents. And if there was ever a merge conflict, Claude was clever enough to resolve it and push the result back up to Upstream.

Each agent would pick from a pool of tasks. Now, I'm not sure whether these tasks were generated by a human or by the agents themselves based on running some tests, but there was a list of named tasks, and each agent would take a new task, and whenever it took a new task it would start a fresh session. To keep these agents running for a long time, a Ralph loop was used: an agent would work on its task, finish it, push to Upstream, then pick a new task in a fresh session, and keep doing that over and over again. If you've watched our video on Ralph, you'll know that the key to long-running agents is clearly defined tasks. But if you have 16 agents running at the same time, how do you prevent them from picking the exact same task? Task lock-in.

The way this works is that somewhere (the author doesn't say exactly where) a list of tasks exists. When an agent picks a task, it creates a text file named after that task and commits it, locking the task so only that agent can work on it, then pushes the commit to the Upstream repo. If another agent picks the same task and creates the same text file, its push to Upstream will be rejected, because the remote already has the lock commit; once it pulls and sees the lock file already exists, it has to move on to a different task.

And this is the basis of how Carlini stress-tested the ability of long-running agents powered by Opus 4.6, and the results are truly amazing. But from running this experiment, he found some interesting things that I think every single developer can learn from.

The first lesson is to build a test harness: a script that runs different types of tests. When Nick was running the experiment (yes, we're on first-name terms now), he saw Claude breaking existing features whenever a new feature was worked on.
So he built a testing harness consisting of high-quality tests from popular open-source projects like SQLite, libjpeg, and Redis. And to prevent context pollution, he made sure the harness only printed logs that were useful to the agent, basically just the error logs, and wrote every other kind of log to a file that Claude could dig into whenever it needed. However, with thousands of tests, it would take an agent hours to run the whole suite, time it could be using to do something else. So this is where Nick did something really clever: he added a fast flag to his testing harness, so that an agent would only run 1% or 10% of the total tests, whichever figure Nick wanted. If each of the 16 agents ran 10%, that would be 160% of the tests in aggregate, which is more than enough, and the overlap isn't a bad thing. The way it worked is that the specific tests run by each agent were randomised, but the seed was fixed, making the selection deterministic. So each agent would run the same pseudo-random subset, and over many sessions the fleet would eventually work through the whole suite much faster than if every agent ran the whole thing by itself.

The next point is also clever, but a bit controversial, since it's about making use of existing technology.
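The fast flag's deterministic sampling is easy to sketch: seed a PRNG identically everywhere, and every agent draws the same subset. The suite size, fraction, and test names below are illustrative, not Nick's actual numbers.

```python
import random

def fast_subset(all_tests: list[str], fraction: float, seed: int) -> list[str]:
    """Pick a pseudo-random fraction of the suite. A fixed seed makes the
    'random' choice deterministic, so every agent gets the same subset."""
    rng = random.Random(seed)
    k = max(1, int(len(all_tests) * fraction))
    return sorted(rng.sample(all_tests, k))

suite = [f"test_{i:04d}" for i in range(1000)]  # made-up test names
agent_a = fast_subset(suite, fraction=0.10, seed=42)
agent_b = fast_subset(suite, fraction=0.10, seed=42)
print(len(agent_a), agent_a == agent_b)  # prints: 100 True
```

Because the seed, not the agent, decides the sample, two agents comparing results are always talking about the same 10% of the suite.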
So far, each agent had been running unit tests from a bunch of existing open-source projects, which worked well when split into 1% or 10% chunks. But when it came to compiling the Linux kernel, things got difficult, because kernel source files aren't individual unit tests. Each agent would try to compile the whole thing, hit the same error, try to fix it, and overwrite the other agents' fixes. The way Nick got around this was, again, to have each agent run only a percentage of the compilation and have GCC, the GNU compiler, handle the rest. Nick called GCC the oracle, since the Linux kernel should compile perfectly with it. So if an agent compiled one section of the kernel with its own compiler (a different section for each agent) and the rest with GCC, then when something broke, it was definitely the agent's compiler and not GCC, and the agent could fix its own bug instead of chasing a bug introduced by another agent. Now, this is controversial, because it uses an existing compiler to do something Claude was asked to do from scratch. But we'll talk more about this towards the end of
the video. Let's move on to the next point: give your agents memory. Since new tasks are worked on by fresh Claude sessions that have almost no memory of what came before, Nick found it useful to keep the README updated and to maintain progress files recording where things left off and how far the project had got, so that new sessions had a solid base to start from and wouldn't reintroduce bugs that had already been fixed.

The final, more obvious point is to give your agents different roles. The beauty of having multiple agents work on a codebase in parallel is that several things can be done to the exact same piece of code at the same time. So when new code wasn't being written, Nick gave agents unique roles: one checked for duplicated code, another looked for ways to make the code as performant as possible, and he even had one critique the design from the perspective of a Rust developer, who I hope didn't announce to the other agents that it was a Rust developer. But as successful as this project was,
the real question is: did Anthropic cheat? Well, kind of. The task was to build a C compiler from scratch, and the agents didn't have access to the internet, so they came up with all the code themselves. Or did they? Because they did have access to the test suites of open-source projects, and they had access to a compiled GCC binary. So technically, an agent could have poked and prodded the GCC compiler, feeding it inputs and inspecting the outputs, and used that to steer the design of its own compiler, which was written in Rust. But to be fair, if I were building a C compiler from scratch, I would do the same thing: I'd look at existing compilers, see how they were implemented, and use that to shape the direction of my own. Now, if it were a compiler for a brand-new language, things would be much more difficult, and maybe that would be a really good test of whether Claude is actually good at creating compilers from scratch. Maybe that's another idea for Nick to try out.

But let's move on to the autonomous nature of the experiment, since that was also being tested. And to be fair, yes, Claude did write all of the code, but it had some heavy steering from a human. A human decided which test suites to run. A human started the loop and decided to use Rust. A human built the test harness and gave agents specific roles. So while this is far from someone telling Claude to build a compiler and leaving it to run forever and ever, I wouldn't say the code was written by an agent that was a hundred percent autonomous, because how good would the compiler have been if a human hadn't been involved in the first place?
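As an aside, the GCC-as-oracle arrangement from earlier boils down to a deterministic partition of source files between the agent's compiler and GCC. Here's a minimal sketch, with invented file names and an invented compiler name ("claudecc"):

```python
import zlib

def compiler_for(source_file: str, agent_id: int, n_agents: int = 16) -> str:
    """Assign each kernel source file to exactly one agent's slice.
    That agent compiles it with the Claude-built compiler ('claudecc',
    an invented name); every other agent leaves it to the GCC oracle,
    so a breakage always points at the owning agent's own compiler."""
    owner = zlib.crc32(source_file.encode()) % n_agents  # stable, unsalted hash
    return "claudecc" if owner == agent_id else "gcc"

files = ["kernel/sched/core.c", "mm/memory.c", "fs/ext4/inode.c"]
for f in files:
    print(f, "->", compiler_for(f, agent_id=3))
```

Nothing else GCC-specific is modelled here; the point is just that a fixed partition lets blame be assigned unambiguously when a build breaks.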
And even with all the human-designed systems in place, the Claude C compiler did have some key limitations. For example, it used GCC's assembler and linker, because the ones it created were too buggy. It also needed GCC to handle the kernel's 16-bit x86 code in order to boot Linux. And to top it all off, the generated code wasn't very efficient: the most optimized output of Claude's compiler was less performant than the least optimized output of GCC. So it looks like developers aren't going anywhere anytime soon, or at least not for now.