Anthropic Just Revealed The Truth About Agent Harnesses

AI LABS

Transcript

00:00:00Over the past few months we have covered many AI coding frameworks including BMAD, GSD, SpecKit, and Superpowers, and a lot of you actually started using them.
00:00:08But Anthropic just ran experiments on their own harness, removing components one by one, and measuring what actually mattered.
00:00:14Their conclusion was that most of it is now dead weight.
00:00:17Every component in a framework encodes an assumption about what the model cannot do on its own, and with Opus 4.6, those assumptions have gone stale.
00:00:25We went through the whole thing and mapped out what still matters, what you can strip out, and what your setup should actually look like now.
00:00:32Agent harnesses play an important role in making agents work substantially better over long horizons.
00:00:37Anthropic has already released an agent harness, which we covered in detail in a previous video, explaining how to set it up and use it.
00:00:43We have also covered other frameworks in that same context, and while their implementations differ, they are all trying to do the same thing.
00:00:50But when these frameworks were released, the models were not as capable as Opus 4.6 is now.
00:00:55For example, frameworks like GSD are focused on context isolation, but that is not a problem with Opus 4.6.
00:01:01Not only because of the million-token context window, but also for another reason that we will talk about in a bit.
00:01:06Therefore, a lot of previously implemented frameworks are now overhead given the new model capabilities.
00:01:11Anthropic actually ran experiments testing out different aspects of the harness, removing each one and measuring its impact.
00:01:17From their findings, they concluded that all an agent harness actually needs is agents for planning, generation, and evaluation.
00:01:24The rest are just ways of doing things that become dead weight given how capable the models are now.
00:01:29The core theory is that every component in an agent harness, no matter which one you are using, relies on the same principle.
00:01:35Each component encodes an assumption about what the model can do on its own.
00:01:38These assumptions should be stress tested because they may be incorrect, and they will go stale as the model improves, and that's what they did throughout the article.
00:01:46Therefore, with the evolution of the models, your harness should also evolve, and if you are working with the same principles laid out a few months back, you are not keeping up.
00:01:54Planning is the first step that remains unchanged across every framework, but the way you plan has to change for more capable models.
00:02:01Anthropic's previous long-running harnesses required the user to provide a detailed spec up front.
00:02:06Frameworks like BMAD and SpecKit literally shard the task into smaller fragments and microtasks that help the AI agent implement it with ease.
00:02:14And these weren't just small tasks, they were literally detailed steps that agents just had to follow without thinking.
00:02:20This is because at that time, the models were not capable enough and needed to be microguided so that they could perform the way you wanted.
00:02:27But with Opus 4.5 and 4.6, this has changed.
00:02:30When Anthropic tested this, they found that if the planner tried to specify micro-technical details up front, a single error would cascade through every level of implementation, making it hard for the agent to deviate and fix issues on its own.
00:02:43It all relied on how well-written the plan was.
00:02:45Therefore, planning has now become high-level rather than a detailed technical implementation.
00:02:50Agents are much smarter on their own now and you just have to tell them what deliverables are needed.
00:02:55They can figure out the path toward that on their own.
00:02:57With this shift, planning approaches like those in BMAD and SpecKit are no longer as relevant.
00:03:02You can limit BMAD to the planning phase, up to PRD generation, with no need to go into the technical sharding process.
00:03:08As we have mentioned before, PRD generation with BMAD is effective because it has specialized agents for understanding product requirements better than Claude would have done on its own.
00:03:18This is because those agents have the external context for specific tasks added in by the author.
00:03:23Alternatively, you can use the questioning session from Superpowers since it was actually intended to identify edge cases, which can be more effective than multi-level task documentation.
00:03:32But the core problem with overly detailed planning is that it locks the agent down and does not leave room for the AI to make discoveries and figure things out on its own.
00:03:40Anthropic has also given an example plan that was generated by the planner agent, which you can use to set up your own planner agent.
00:03:46It clearly outlines that the plan should go big on scope and push the boundaries of whatever app idea you provide.
00:03:52The core idea is to keep the project at the product level, not the implementation level.
00:03:56This matters because if it tries to plan out the implementation within the project plan, it becomes too focused on technical details and may fail to deliver what is actually needed for a complete product.
00:04:06Now you might think that Claude's own plan mode already does similar planning by asking questions and providing a detailed plan.
00:04:12But here is the difference. Even though Claude has a planning agent, it still focuses heavily on implementation details and does not truly operate at the product level, which goes against Anthropic's findings.
00:04:22Therefore, once you have this in place, you can simply ask Claude to use the agent you created to plan your app, and it will generate a complete plan and document it in your folder as it progresses.
00:04:31This plan includes a full feature breakdown at the product level, and with each phase, it includes user stories that show what the user's perspective looks like.
00:04:40This helps Claude implement the correct workflows that users actually expect.
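To make the shift concrete, here is a minimal sketch of what a product-level plan might look like as data. The structure and field names are our own invention for illustration, not Anthropic's actual plan format: the point is that the plan records features and user stories, never implementation steps.

```python
from dataclasses import dataclass, field

@dataclass
class UserStory:
    """What the user should be able to do, from their own perspective."""
    as_a: str
    i_want: str
    so_that: str

@dataclass
class Feature:
    """A product-level deliverable; deliberately no implementation details."""
    name: str
    goal: str
    stories: list[UserStory] = field(default_factory=list)

# The plan stays at the product level: deliverables, not technical steps.
plan = [
    Feature(
        name="Task board",
        goal="Users organize their work visually",
        stories=[UserStory("project member",
                           "to drag tasks between columns",
                           "I can track progress at a glance")],
    ),
]

# Every feature ships with at least one user story the evaluator can check.
assert all(f.stories for f in plan)
```

The implementer is then free to choose the technical path toward each deliverable on its own, which is exactly the room for discovery that detailed sharding takes away.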
00:04:44But before we move forward, let's have a word from our sponsor, Minimax.
00:04:47Setting up AI agents is a nightmare. API keys, server configs, Docker setups, and after all that, your assistant forgets everything the moment you close the tab.
00:04:56The solution is MaxClaw, a cloud-powered AI at your fingertips.
00:04:59No setup, no headaches, you can deploy your own OpenClaw.
00:05:02Just click deploy, and you're live in under 10 seconds. It builds websites, writes code, runs research, and automates your busy work all from simple text prompts.
00:05:12MaxClaw connects directly to Telegram, Slack, Discord, and more, letting you automate workflows, browse the web, and even generate images or videos all from a simple chat.
00:05:21It is part of Minimax Agent, an AI-native workspace where everyone becomes an agent designer.
00:05:27It works on Mac and Windows, powered by M2.7, which matches Claude Opus 4.6 on SWE-bench.
00:05:33Stop wrestling with complex setups, let MaxClaw handle it, and click the link in the pinned comment to get started.
00:05:39The agent that writes the code should not be the one evaluating it.
00:05:42This is the second most common problem, and it is not usually discussed much.
00:05:46Self-evaluation is problematic because if you use the same agent that wrote the code to evaluate it, it tends to respond very confidently and praise its own work, even when the quality is clearly subpar.
00:05:56This might be easier to manage for tasks that have quantitative metrics, like whether the APIs that were implemented are actually working.
00:06:03But this problem becomes much more pronounced for tasks that do not have clearly verifiable outcomes.
00:06:08The biggest example of this is the UI.
00:06:10What constitutes a good UI is subjective, and AI might not fully grasp your intentions.
00:06:15It may consider its own implementation as well done, even if it does not meet your standards.
00:06:19This issue was already recognized by the creators of multiple frameworks, and they implemented their own evaluation mechanisms to address it.
00:06:26All of the frameworks we have covered, like GSD, BMAD, and Superpowers, ensure that the same agent that wrote the code does not get to evaluate its quality.
00:06:34This approach significantly improves the accuracy and reliability of the agent's evaluations.
00:06:39Therefore, whether you are using an existing framework or building on your own, you need to ensure that the evaluator is completely separate from the implementer.
00:06:47Before implementation begins, both the generator and evaluator agents negotiate a contract, agreeing on what "done" looks like for the work.
00:06:54This helps because both agents clearly know what to achieve and what to verify.
00:06:58With high-level planning, there still needs to be actionable, implementable steps.
00:07:02But during testing with the harness, they tried removing the sprint contract.
00:07:06They found that Opus 4.5 was less efficient in this scenario because the evaluator still had to step in to catch issues.
00:07:12But with Opus 4.6, the model's capabilities had improved so much that the contract was not necessary.
00:07:18The generative agent was capable enough to handle most of the work on its own.
00:07:22Therefore, for smaller models like Sonnet or Haiku, you still need to document tasks.
00:07:27Break them down properly into sprint structures and have each agent agree on what "complete" looks like.
00:07:32But with more capable models, you can rely on Opus to execute the high-level plan without these additional steps.
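For the smaller-model case where a contract still helps, here is a toy sketch of what negotiating "done" might look like. The field names and criteria are invented for illustration, not Anthropic's actual contract format: the generator and evaluator agree on a checklist up front, and the evaluator approves only when every agreed item passes.

```python
# Hypothetical "definition of done" contract negotiated before implementation.
# Structure and field names are illustrative, not a real harness format.
contract = {
    "feature": "user login",
    "done_when": [
        "valid credentials create a session",
        "invalid credentials show an error without leaking detail",
        "session persists across a page reload",
    ],
}

def evaluate(contract: dict, passed: set[str]) -> bool:
    """The evaluator approves only when every agreed criterion passes."""
    return all(c in passed for c in contract["done_when"])

# A partial implementation is rejected until the whole contract is satisfied.
assert not evaluate(contract, {"valid credentials create a session"})
assert evaluate(contract, set(contract["done_when"]))
```

With Opus 4.6 this negotiation step can be skipped, but for Sonnet- or Haiku-class models the explicit checklist keeps both agents aligned on the same target.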
00:07:38Now we said that there is a reason why context isolation matters.
00:07:42This is because smaller models experience context anxiety, a phenomenon where models start losing coherence on lengthy tasks as their context window fills up.
00:07:51When this happens, they wrap up work prematurely and claim they have implemented tasks correctly, even when they have not.
00:07:57The solution that helped was a context reset, clearing their context windows before starting implementation.
00:08:02Since the context was cleared, they could rely on a task breakdown documented externally, which persisted across context resets.
00:08:08But the models exhibited so much context anxiety that compaction alone was not enough.
00:08:13They needed additional measures to prevent problems on longer tasks.
00:08:17Starting with Opus 4.5, however, models no longer exhibit this behavior.
00:08:21These agents can run continuously across an entire session, and the way Claude handles compaction is sufficient for their functioning.
00:08:28Therefore, context resets are no longer necessary, and detailed task breakdowns like those in BMAD and SpecKit are not needed either, with high-level guidance alone being enough.
00:08:37The generator agent is the main implementer that actually builds the app feature by feature.
00:08:42It takes the specs from the plan and continuously implements them, while integrating with Git for version control.
00:08:47The generator works in coordination with the evaluator agent.
00:08:50After building a feature, it hands it over for testing and receives feedback to improve its implementation.
00:08:56Its workflow is organized into several steps, understanding the task, implementing it, and refining the implementation.
00:09:02Even within the implementation phase, work is divided into four sub-phases covering different aspects.
00:09:07It follows the design direction, verifies its work, and then hands it to the evaluator.
00:09:11This creates a structured, step-by-step pattern, enabling the agent to implement an entire app independently and systematically.
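The generator's loop can be sketched roughly as below. The function names and the placeholder bodies are our own, inferred from the description above rather than taken from the harness itself: understand the task, implement it, refine it, then hand the result to the evaluator.

```python
# Rough sketch of the generator's per-feature loop (structure inferred from
# the video's description; the helper bodies are placeholders, not real code).
def understand(spec: str) -> str:
    return f"plan for {spec}"            # 1. understand the task

def implement(task: str) -> str:
    return f"code implementing {task}"   # 2. implement it

def refine(draft: str) -> str:
    return draft + " (self-checked)"     # 3. refine and self-verify

def generate_feature(spec: str) -> dict:
    draft = refine(implement(understand(spec)))
    # 4. hand off: the evaluator takes over from here
    return {"spec": spec, "code": draft, "status": "ready_for_eval"}

result = generate_feature("task board drag-and-drop")
assert result["status"] == "ready_for_eval"
```

In the real harness each step is an LLM call with Git integration rather than a string transform, but the step-by-step hand-off pattern is the same.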
00:09:18The evaluator agent acts as the adversary to the generator.
00:09:21Its job is to ensure the app is implemented correctly, not by doing a generic "find bugs" pass, but by approaching it critically from the perspective that bugs exist.
00:09:21It can use tools like Playwright to test the app by simulating user interactions, identify bugs based on predefined criteria, and send feedback back to the generator.
00:09:39By reading the plan, the evaluator gains a clear understanding of what "done" should look like and checks everything thoroughly before approving it.
00:09:46Each framework has its own validator, but the approaches differ significantly.
00:09:50BMAD uses specialized code review and QA agents that generate and run tests, evaluating the code from multiple angles.
00:09:57GSD uses a verifier sub-agent that checks the implementation against the existing plan and produces a documentation report.
00:10:04Superpowers relies on fresh sub-agents and enforces strict TDD, where no code can be written before the test cases.
00:10:10If the agent tries to bypass this, it is blocked.
00:10:13SpecKit treats specs as the source of truth and allows the agent to verify code against the documentation.
00:10:18But none of these frameworks provide a scoring mechanism with the level of rigor Anthropic was aiming for.
00:10:24Therefore, the evaluator in Anthropic's harness is the closest to Ralph Loop's strict implementation enforcement for Claude, ensuring the agent actually delivers what is needed with a proper graded evaluation mechanism.
00:10:35Also, if you are enjoying our content, consider pressing the like button, because it helps us create more content like this and reach more people.
00:10:43The agent has no means to know what the right output looks like for you, especially in cases where the implementation is not quantifiable.
00:10:49Therefore, you use graded evaluation mechanisms so that the agent knows what the right output looks like to you.
00:10:54When Anthropic gave an example for the evaluation metrics for the front-end, they mentioned that the AI tends to converge on similar outputs most of the time.
00:11:02They set four grading criteria for both the generator and evaluator agents.
00:11:06The first is quality of the design, instructing it to check whether the overall feel is coherent or just separate components strung together.
00:11:12Then originality, which is one of the main ones because AI tends to default to the same purple and white gradient pattern for most UIs.
00:11:19This goes against how humans design: for a human, each design choice is deliberate, which makes it easy to tell when a website does not look good.
00:11:27The third is craft, the minor details like typography, spacing consistency, and color harmony, where AI tends to settle for contrast ratios that are technically balanced rather than going for a more creative look.
00:11:37And the last is functionality, because in terms of UI, each component plays a visual role in enhancing the user experience.
00:11:44Claude already scores well on craft and functionality, but the other two are its most common struggles, and the prompts need to push it to its best capability by emphasizing quality in the design.
00:11:54Therefore, when you are building your app, you can set up similar criteria for as many features as you want, like code architecture, the front-end, UX user flows, and more.
00:12:02Mention each part in the criteria and give it a dedicated score, so that the model can weigh its importance and see how well it performs.
00:12:10These files are referenced in the evaluator agent because the evaluator's job is to score, so it knows what rubric it should be following.
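A weighted rubric of this kind is straightforward to sketch. The four criteria come from the video, but the weights and the 0-10 scale below are our own assumptions, not Anthropic's actual numbers:

```python
# Illustrative weighted rubric; criteria are from the video, weights are ours.
RUBRIC = {
    "design_quality": 0.3,  # coherent overall feel, not stitched-together parts
    "originality":    0.3,  # avoid the default purple/white gradient look
    "craft":          0.2,  # typography, spacing consistency, color harmony
    "functionality":  0.2,  # every component serves the user experience
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into one graded result."""
    assert set(scores) == set(RUBRIC), "score every criterion in the rubric"
    return sum(scores[name] * weight for name, weight in RUBRIC.items())

score = weighted_score({"design_quality": 8, "originality": 5,
                        "craft": 9, "functionality": 9})
assert abs(score - 7.5) < 1e-9  # 8*0.3 + 5*0.3 + 9*0.2 + 9*0.2
```

The graded number, rather than a bare pass/fail, is what lets the evaluator tell the generator not just that the work fell short, but where and by how much.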
00:12:17Given everything we have covered, you might wonder what you should actually do now.
00:12:21If you want a framework so that your setup is easier, go for GSD, because GSD inherently uses the planner, generator, evaluator loop by default, but its evaluator just matches the code against the existing plans and relies on user acceptance testing.
00:12:35It uses a pass/fail mechanism, not a scored implementation. Therefore, you can take the best parts of the Anthropic framework and combine them with GSD, for example swapping in the evaluator agent along with the grading criteria so that the agent knows what the right implementation is.
00:12:49But if you want to use Anthropic's framework and set it up on your own, you can implement it by creating agents based on their respective roles and have them work together using agent teams.
00:12:58You can use one agent team member as a generator and another as an evaluator.
00:13:03The reason for using agent teams is that they can communicate with each other, while sub-agents cannot and would have to write to a document, creating overhead.
00:13:10Therefore, Claude creates the tasks from the high-level plan and creates both agents at the same time, where one is implementing while the other is running tests using the Playwright MCP with the browser, waiting for updates from the generator so that it can start the testing process.
00:13:24The evaluator keeps verifying the work and communicates the issues with the generator and they work in coordination to implement the whole app that matches your standards.
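The coordination pattern described here can be modeled in miniature with two threads passing messages, standing in for agent-team communication. This is a toy analogy, not Claude's actual agent API: the generator announces each finished feature, then blocks until the evaluator reports back, exactly mirroring the "waiting for updates" loop above.

```python
import queue
import threading

# Toy model of generator/evaluator coordination via direct messages.
updates: queue.Queue = queue.Queue()   # generator -> evaluator
feedback: queue.Queue = queue.Queue()  # evaluator -> generator
reviews: list[str] = []

def generator() -> None:
    for feature in ["login", "task board"]:
        updates.put(feature)  # announce a finished feature
        feedback.get()        # block until the evaluator's verdict arrives
    updates.put("DONE")       # signal that all features are implemented

def evaluator() -> None:
    while (feature := updates.get()) != "DONE":
        reviews.append(feature)               # run tests on the feature here
        feedback.put(f"reviewed: {feature}")  # send the verdict back

t = threading.Thread(target=generator)
t.start()
evaluator()
t.join()

assert reviews == ["login", "task board"]
```

The key property the toy preserves is that neither side writes to a shared document: each message goes directly to the other agent, which is the overhead saving the video attributes to agent teams over sub-agents.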
00:13:33All the agents used here, along with all the resources for this video and our previous videos, are available in AI Labs Pro, where you can download and use them for your own projects.
00:13:43If you've found value in what we do and want to support the channel, this is the best way to do it. The link's in the description.
00:13:48That brings us to the end of this video. If you'd like to support the channel and help us keep making videos like this, you can do so by using the super thanks button below.
00:13:57As always, thank you for watching and I'll see you in the next one.

Key Takeaway

Modern agent harnesses for Opus 4.6 should strip away micro-task sharding and context resets in favor of a lean, three-agent loop focused on high-level product goals and rigorous, graded evaluation.

Highlights

Opus 4.6 eliminates the need for context isolation and context resets because it no longer exhibits context anxiety or coherence loss on lengthy tasks.

Anthropic's testing reveals that detailed technical sharding and micro-task documentation are now dead weight that causes errors to cascade through implementation levels.

The updated agent harness architecture requires only three core components: high-level planning, generation, and a separate evaluator agent.

Effective UI evaluation relies on four specific grading criteria: design quality, originality, craft (typography and spacing), and functionality.

Generator and evaluator agents must be separate entities because self-evaluation leads to false confidence and failure to identify subjective UI flaws.

Agent teams are superior to sub-agents for harness construction because they allow direct communication without the overhead of writing to external documents.

Timeline

The Obsolescence of Legacy Agent Frameworks

  • Every component in an AI framework encodes an assumption about model limitations that may no longer exist.
  • Opus 4.6 features a million-token context window that renders previous context isolation strategies unnecessary.
  • Frameworks like BMAD and GSD now carry overhead that conflicts with the increased capabilities of the latest models.

Anthropic conducted experiments by removing harness components one by one to measure their actual impact on performance. They concluded that many features designed for older models, such as complex sharding, have become dead weight. Stress testing these assumptions is necessary because as models improve, the principles used to guide them a few months ago quickly go stale.

Shift from Technical Sharding to High-Level Planning

  • Detailed technical specs provided up front cause errors to cascade through every level of implementation.
  • Planning must focus on product-level deliverables rather than micro-technical implementation steps.
  • Effective plans include full feature breakdowns and user stories to ensure correct workflow implementation.

Previous harnesses required micro-guiding because models could not think for themselves, but Opus 4.5 and 4.6 perform better when given room to make discoveries. Overly detailed planning locks the agent down and prevents it from fixing issues on its own. The new approach uses a planner agent to define the scope and push boundaries at the product level, allowing the implementer to figure out the technical path independently.

The Necessity of Independent Evaluation

  • A single agent cannot accurately evaluate its own work because it responds with excessive confidence even when quality is subpar.
  • The generator and evaluator agents should negotiate a contract to agree on what constitutes a completed task before work begins.
  • Opus 4.6 is capable enough to execute high-level plans without the strict sprint contracts required by smaller models like Sonnet or Haiku.

Self-evaluation is particularly problematic for subjective tasks like UI design where there are no quantitative metrics. While earlier models required documented tasks and external task breakdowns to function, the latest models can run continuously across an entire session. This eliminates the need for context resets, which were previously used to solve 'context anxiety' where models would wrap up work prematurely as windows filled up.

Adversarial Evaluation and Graded Metrics

  • The evaluator agent acts as an adversary that approaches the code with the assumption that bugs already exist.
  • Standard pass/fail mechanisms are insufficient for high-quality UI and architecture implementation.
  • Graded evaluation mechanisms use specific rubrics for design quality, originality, craft, and functionality to push model capabilities.

The evaluator uses tools like Playwright to simulate user interactions and verify the app against the high-level plan. Anthropic's harness specifically addresses the tendency of AI to default to generic purple and white gradient patterns by grading originality and craft, including typography and spacing consistency. By assigning dedicated scores to different features, the model can identify which aspects of the project require more focus based on the rubric.

Implementation Strategies for Agent Teams

  • GSD can be updated by replacing its pass/fail evaluator with the Anthropic graded evaluation mechanism.
  • Agent teams are the preferred implementation method because they support real-time communication between members.
  • The evaluator runs tests using the Playwright MCP while the generator implements features to create a coordinated workflow.

Users can build their own harness by creating specialized roles within an agent team rather than using sub-agents, which create overhead by writing to documents. In this setup, the evaluator waits for updates from the generator and immediately begins testing in a browser environment. This coordinated loop ensures the final application matches high standards for both code architecture and user experience.
