wtf is Harness Engineer & why is it important

AI Jason

Transcript

00:00:00Thanks to HubSpot for sponsoring this video.
00:00:03So something really big happened in December 2025.
00:00:07And most people didn't even realize it.
00:00:09Andrej Karpathy tweeted about this last week.
00:00:10"It's very hard to communicate how much programming has changed due to AI in the last two months,
00:00:15specifically since last December."
00:00:17And Greg from OpenAI also talked about this.
00:00:20Since December, there have been step-function improvements in what the models and tools are capable of.
00:00:24And a few engineers have told him that their job has fundamentally changed since December
00:00:282025.
00:00:29So what actually happened in December 2025?
00:00:32In short, the latest models introduced then are finally ready for fully autonomous,
00:00:37long-running tasks.
00:00:38With AI, the ultimate dream has always been that while we are sleeping, AI can just work on
00:00:43tasks fully autonomously, 24/7.
00:00:46Even back in 2023, the most popular project, if you remember, was AutoGPT.
00:00:50It was the first time those fully autonomous agent systems were introduced.
00:00:54It had a fairly basic and simple architecture: it used GPT-4 as the model to autonomously
00:00:59break down a list of tasks based on the user's goal, with simple memory storage to store
00:01:03the results.
00:01:04And people were doing some pretty crazy stuff, like giving it a goal such as "make $100,000" and
00:01:08letting it loop through tasks infinitely until completed.
00:01:11Back then, the system would just break and fail miserably because the models simply weren't ready.
00:01:15But since December last year, this really changed.
00:01:18The models have significantly higher quality, long-term coherence, and they can power through
00:01:22much larger and longer tasks.
00:01:24And we saw all sorts of different experiments come out of the industry.
00:01:28First, in January, we got this super hot concept called the Ralph loop, a most basic
00:01:33and simple agent iteration loop that forces the model to work longer so it can take on more complex
00:01:37tasks.
00:01:38We just run the model in a loop with some simple condition checks, but already we started seeing
00:01:42a difference.
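The basic pattern can be sketched in a few lines. This is a hedged illustration, not any specific product's implementation: `call_model` is a hypothetical stub standing in for a real agent/model invocation, and the completion condition is a trivial counter.

```python
# Minimal sketch of a "Ralph loop": re-run the agent until a simple
# completion condition holds. `call_model` is a hypothetical stand-in
# for a real model/agent session; here it just simulates progress.

def call_model(state):
    # Pretend each fresh agent session completes one more task.
    state["done"] += 1
    return state

def ralph_loop(goal, total_tasks, max_iterations=100):
    state = {"goal": goal, "done": 0}
    for _ in range(max_iterations):
        state = call_model(state)          # one fresh agent session
        if state["done"] >= total_tasks:   # simple condition check
            return state
    return state                           # gave up after max_iterations

result = ralph_loop("build a todo app", total_tasks=5)
```

The loop itself is trivial; the interesting part is that models since December can make real incremental progress on each pass.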
00:01:43And one week later, Cursor also released their experiment where they used GPT-5.2 to autonomously
00:01:49build a browser from scratch with 3 million lines of code.
00:01:52And Anthropic also released an experiment where they got a team of Claude Code agents
00:01:57to work autonomously on a C compiler from scratch for two weeks.
00:02:01In the end, it delivered a functional version with zero manual coding.
00:02:05It can even compile and run Doom as well.
00:02:08Around the same time, OpenClaw started gaining attention and had explosive growth like we'd never
00:02:13seen before.
00:02:14And it was very difficult to understand what was going on with OpenClaw, because from the outside
00:02:18it's very easy to categorize OpenClaw as just another Manus, but one that lives inside your own
00:02:23computer and can also be accessed from Telegram.
00:02:27So why is this so popular?
00:02:29Only later, after I used it deeply, did I realize that the real difference is that OpenClaw represents
00:02:35a type of always-on, long-running, fully autonomous agent that is very different from
00:02:40all the other agent systems we used before, where a human is the main driver prompting for
00:02:45the next action.
00:02:46OpenClaw is always-on, and it is proactive.
00:02:49And this autonomous feeling is created by a fairly simple architecture: it has a memory and
00:02:53context layer, with triggers and cron jobs to automatically take actions, and full
00:02:58computer access, which is a powerful environment for it to operate in.
00:03:02And I believe OpenClaw is the first project that really opened up the biggest paradigm
00:03:06shift of 2026: we are moving from co-pilot, simple task-based agent systems to long-running,
00:03:13fully autonomous agents.
00:03:15Something that's always-on, always ready, autonomously delivering super complex, coordinated work.
00:03:20This is a critical shift you have to understand.
00:03:22The models today are actually much more powerful than you think, as long as you design the right
00:03:27system to unlock them.
00:03:28And this is the crux of what I want to talk about today:
00:03:30harness engineering to enable long-running autonomous systems.
00:03:34If this is the first time you've heard about harness engineering, it is an evolution of what
00:03:38we've previously talked about: context engineering and prompt engineering.
00:03:41Previously, we really focused on how to optimize the prompts within the effective context window
00:03:46to get the model to perform its best for a single agent loop session.
00:03:49But harness engineering is really focused on long-running tasks, which means: how do
00:03:53you design a system that can work across different sessions and multiple different agents?
00:03:57And how do you design the right workflow to make sure the relevant context will be retrieved
00:04:01for each session, and the right set of tooling to extract the most out of the models?
00:04:05This is a fairly new concept, but the good thing is that the industry has already converged on
00:04:09some best practices that you can use, from Anthropic, Vercel, LangChain, and many others.
00:04:14We'll go through them one by one so you can see the patterns.
00:04:16But before we dive in: with this paradigm shift to fully autonomous agents, one of the biggest
00:04:21opportunities for the next 6-12 months is to build an OpenClaw for a certain vertical.
00:04:25That means you deeply investigate and understand the end-to-end workflow of a certain vertical,
00:04:29and build an autonomous agent with the correct environment and tooling to enable the end-to-end process.
00:04:34That's why I want to introduce you to this awesome research HubSpot did: the AI Adoption
00:04:39in Email Marketing report.
00:04:40It is a fascinating report for understanding, in a vertical like email marketing, where people
00:04:44actually use AI today and what the gaps are.
00:04:47This report showcases clear workflows and opportunities in email marketing that you
00:04:51can potentially automate.
00:04:52They surveyed hundreds of email marketers from top companies to understand exactly how AI
00:04:57is reshaping their workflows.
00:04:58They talk about why marketers are still doing a lot of heavy editing, what causes
00:05:03it, as well as the biggest challenges they face today when implementing AI in
00:05:06email marketing.
00:05:07And each of these is a big opportunity for you to build a fully autonomous agent.
00:05:11They even dive into the specific KPIs marketers care most about, where AI has shown proven
00:05:15results,
00:05:16as well as what exactly email marketers really want from AI.
00:05:20So if you are a builder thinking about the next big agent product to build, I highly
00:05:24recommend you go check out this awesome resource.
00:05:27I have put the link in the description below for you to download for free.
00:05:30And thanks HubSpot for sponsoring this video.
00:05:32Now let's get back to harness engineering for long-running agent systems.
00:05:36At a high level, there are three learnings I took away from those.
00:05:39One is that for long-running task agents, the critical part of system design is creating
00:05:44a legible environment where each sub-agent or session can actually understand where things
00:05:49are at.
00:05:50Most likely there are workflows you can put in place to enforce the legibility of the environment.
00:05:54And I'll explain a bit more on that.
00:05:56The second is that verification is critical.
00:05:58You can improve system output significantly by allowing it to verify its work effectively,
00:06:03with a faster feedback loop.
00:06:04And third is that we need to trust the model more, instead of building specialized tooling
00:06:08that wraps a lot of reasoning and logic prematurely.
00:06:11We should give the model maximum context with generic tooling that it natively understands, and let
00:06:16it just explore like a human.
00:06:17And I'll unpack those three things one by one as we go through each blog here.
00:06:20First is Anthropic's blog on effective harnesses for long-running agents.
00:06:24They experimented with using the Claude Code SDK to build a specialized agent for super long-
00:06:29running tasks, like building a clone of the claude.ai website.
00:06:32The very first failure they observed is that agents tend to do too much at once.
00:06:37Essentially, the agent will always try to one-shot the whole app.
00:06:40And this led to the model running out of context in the middle of its implementation, leaving
00:06:45the next session to start with a feature half implemented or half documented.
00:06:49Then the agent would have to guess what actually happened and spend substantial time trying
00:06:52to get the basic app working again.
00:06:55And the second failure they observed is that agents tend to declare the job complete prematurely.
00:07:00You've probably experienced this a few times yourself as well.
00:07:02Claude Code or Cursor will just claim the PR or feature is completed.
00:07:05But once you test it, it actually doesn't work.
00:07:07So their approach to solving those default model failure behaviors is, firstly, to set up an initial
00:07:12environment that lays the foundation for all the features the given prompt requires, which
00:07:16sets up the agent to work step by step, feature by feature.
00:07:20This is kind of similar to the plan or PRD approach that we normally take.
00:07:23The second is to prompt each agent to make incremental progress towards its goal
00:07:27while also leaving the environment in a clean state at the end of each session.
00:07:32What they did is design this two-part solution.
00:07:35They have this initializer agent that uses a specialized prompt to ask the model to set
00:07:40up the initial environment with an init.sh script, which will set up the dev server, for example,
00:07:45so that the next model doesn't need to worry about those things.
00:07:48And also a claude-progress.txt file that keeps logs of what the agent has done, as well as an initial
00:07:53git commit that shows what files have been added.
00:07:55Then a coding agent, for each subsequent session, asks the model to make incremental progress,
00:08:01then leave structured updates.
00:08:02And all those efforts really serve one purpose: defining an
00:08:07environment where agents can quickly understand the state of the work when starting with a fresh
00:08:11context window.
00:08:13So the workflow is that the initializer agent first sets up an environment, or
00:08:17you could call it a documentation system, to track and maintain the overall plan.
00:08:21And the environment they designed here starts with a feature-list document to
00:08:25prevent the agent from one-shotting the whole app or prematurely considering the project complete.
00:08:30They get the initializer agent to break down the project into over 200 features
00:08:34and log them in a local JSON file that looks something like this, where each task has a detailed spec
00:08:39as well as a pass or fail state.
00:08:41By default, all tasks are marked as fail.
00:08:43This forces the model to always look at the overall project goal and progress, pick the highest-priority
00:08:49task, and do the next thing.
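A minimal sketch of what such a feature-list file and selection rule could look like. Field names and the `next_feature` helper are illustrative, not Anthropic's actual schema:

```python
import json

# Hypothetical feature list in the spirit of the harness described above:
# every feature starts as "fail", and each session picks the highest-priority
# failing feature to work on next. Field names are invented for illustration.
features = [
    {"id": 1, "priority": 1, "spec": "User can log in with email", "status": "fail"},
    {"id": 2, "priority": 2, "spec": "User can create a document", "status": "fail"},
    {"id": 3, "priority": 1, "spec": "Pages render without errors", "status": "pass"},
]

def next_feature(feature_list):
    """Pick the highest-priority feature that is still failing."""
    failing = [f for f in feature_list if f["status"] == "fail"]
    return min(failing, key=lambda f: f["priority"]) if failing else None

# Persist the list so the next session can read it with a fresh context window.
with open("features.json", "w") as fh:
    json.dump(features, fh, indent=2)

picked = next_feature(features)
```

Because the file lives in the repository, any later session can read it to recover the overall plan without relying on conversation history.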
00:08:50But to make this workflow work, they also needed a way to force the model to leave the environment
00:08:55in a clean state after making code changes. In their experiments, they found the best way
00:08:59is to ask the model to commit progress to git with a descriptive commit message and write
00:09:05a summary of its progress in the progress file. But documentation and the context
00:09:08environment itself is not enough, because the model by default has a tendency to mark something
00:09:13as completed without proper testing. At the beginning, they were just prompting Claude
00:09:17Code to always run tests after the code change, doing unit tests or API tests against
00:09:22the dev server.
00:09:23But all those things would often fail to recognize that a feature is not working end to end.
00:09:27Things really started changing when they gave the model proper tooling to do end-
00:09:30to-end tests by itself, like Puppeteer MCP or Chrome DevTools, where the agent was able to
00:09:35identify and fix bugs that were not directly obvious from the code itself.
00:09:39So basically, they set up a structure where the initializer agent breaks
00:09:43down the user's goal into a list of features, alongside an init.sh to run the dev
00:09:47server, and progress files.
00:09:49The next coding agent can then just read the feature list to understand
00:09:53the overall project plan and pick up high-priority tasks, and read the progress file and git log to understand
00:09:57where things are at.
00:09:59Then it runs init.sh to start the dev server immediately and does an end-to-end test to verify the environment
00:10:04is clean, so it gets the full picture and a faster feedback loop each time a new session
00:10:09and context window starts.
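Putting the pieces together, one coding-agent session could be sketched like this. All helpers are hypothetical stand-ins; a real harness would shell out to git, run init.sh, and invoke the model:

```python
# Sketch of one coding-agent session in a harness like the one described above.
# Everything here is a simplified stand-in for real harness steps.

def run_session(features, progress_log):
    # 1. Understand where things are at from the environment (feature list + log).
    todo = [f for f in features if f["status"] == "fail"]
    if not todo:
        return features, progress_log  # nothing left to do

    task = todo[0]
    # 2. (Real harness: run init.sh here to boot the dev server.)
    # 3. Make incremental progress on exactly one feature, then verify it
    #    end to end before flipping its status.
    task["status"] = "pass"

    # 4. Leave the environment in a clean state: log what happened
    #    (real harness: also `git commit` with a descriptive message).
    progress_log.append(f"session: completed feature {task['id']}")
    return features, progress_log

features = [{"id": 1, "status": "fail"}, {"id": 2, "status": "fail"}]
log = []
features, log = run_session(features, log)
```

Each session touches one feature and records its progress, so the next session starting with a fresh context window can pick up exactly where this one left off.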
00:10:10In OpenAI's blog, they talk about very similar things:
00:10:13you have to make sure your application environment is legible.
00:10:16They make the whole repository the knowledge system of record.
00:10:19Initially, they put everything in a gigantic agents.md file, and it failed in predictable ways because it's
00:10:23just too much context for any agent to manage and maintain.
00:10:27So what they did is design a proper doc environment structure and treat the agents.md file as a table
00:10:32of contents.
00:10:33They set up this documentation system covering the architecture, the design docs, the execution
00:10:37plan, DB schema, product specs, front-end plan, security, and many more, and
00:10:42put this table of contents into the agents.md file so the agent can actually retrieve the relevant
00:10:47information when needed.
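As a rough illustration of the table-of-contents idea (the file names below are invented, not OpenAI's actual layout), an agents.md might look like:

```markdown
<!-- agents.md used as a table of contents (illustrative file names) -->
# Project documentation index
- docs/architecture.md     — layered domain architecture and boundaries
- docs/design.md           — design docs and decisions
- docs/execution-plan.md   — current execution plan
- docs/db-schema.md        — database schema
- docs/product-specs.md    — product specs
- docs/frontend-plan.md    — front-end plan
- docs/security.md         — security notes
```

The agent only loads the index by default and opens individual docs on demand, which is what enables the progressive disclosure discussed next.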
00:10:49And this enables progressive disclosure, and OpenAI actually takes it even further.
00:10:53They push not only the code knowledge, but also Google Docs, Slack messages, and all the
00:10:58other fragmented information, feeding the data into the repository as repository-local versioned
00:11:03artifacts.
00:11:04So the agent can also retrieve them, because from the agent's point of view, if anything cannot be accessed
00:11:09in the environment, then effectively it doesn't exist.
00:11:11But again, documentation by itself didn't keep a fully agent-generated codebase coherent.
00:11:16They also introduced certain programmatic workflows to enforce invariants.
00:11:20For example, they layered the domain architecture with explicit cross-cutting boundaries, which
00:11:25allowed them to enforce those rules with custom checks, linters, and structural tests, which
00:11:29can be automatically triggered at every git pre-commit.
00:11:33This type of architecture is usually something you would postpone until you have hundreds of engineers
00:11:37at a traditional software company, but with coding agents it is an early prerequisite.
00:11:41Within those boundaries, you allow teams and agents significant freedom in how solutions
00:11:46are expressed, without micromanaging or worrying that the architecture is going to drift.
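A toy version of such a structural check, of the kind that could run from a pre-commit hook, might look like this. The layer names and rules are invented for illustration:

```python
# Hypothetical pre-commit style structural check: forbid imports that cross
# layer boundaries (e.g. the "domain" layer must not depend on "ui").
# Layer names and rules are illustrative, not OpenAI's actual setup.

FORBIDDEN = {"domain": {"ui", "api"}, "api": {"ui"}}

def boundary_violations(module_layer, imported_layers):
    """Return the imported layers that `module_layer` may not depend on."""
    return sorted(FORBIDDEN.get(module_layer, set()) & set(imported_layers))

# A domain module importing from ui should fail the pre-commit check.
violations = boundary_violations("domain", ["stdlib", "ui"])
exit_code = 1 if violations else 0  # non-zero exit blocks the commit
```

Because the check is mechanical, it can be run on every agent-generated commit without a human in the loop.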
00:11:49Meanwhile, they also improved the codebase a lot.
00:11:52For example, they made the app bootable per git worktree, so Codex can just launch and
00:11:55drive many different instances.
00:11:57And they also wired the Chrome DevTools Protocol into the agent runtime so that the agent can
00:12:01reproduce bugs and validate fixes via DOM snapshots, screenshots, and navigation.
00:12:05With the environment and workflow set up, the repository finally crossed a minimum
00:12:09threshold where Codex can drive a new feature end to end.
00:12:13Every time Codex receives a single prompt, the agent will validate the
00:12:17current state of the codebase, reproduce a reported bug, record a video to demonstrate
00:12:21the failure, implement the fix, validate the fix by driving the application, record a second
00:12:25video demonstrating the resolution, and eventually merge the change.
00:12:29So those two blogs showcase very good learnings and the necessary harness systems you need to put
00:12:32in place for a fully autonomous system.
00:12:34Meanwhile, there are also other learnings.
00:12:36Quite often when we build agents, especially vertical-specific agents, our tendency is to
00:12:40build specialized tooling to do domain-specific tasks.
00:12:43The learning here is that large language models almost always work better with generic tools
00:12:47that they natively understand.
00:12:49Vercel released an awesome article about how they redesigned their text-to-SQL agent.
00:12:53They had spent months building a sophisticated internal text-to-SQL agent, D0, with specialized
00:12:58tools, heavy prompt engineering, and careful context management.
00:13:02But as many of us have experienced before, those types of systems kind of work but are very fragile,
00:13:06slow, and require constant maintenance,
00:13:09because every time a new edge case happens, you need to inject a new prompt into the agent.
00:13:12But later they tried one thing that totally changed the trajectory.
00:13:15They deleted most of the specialized tools from the agent, down to a single bash command tool.
00:13:20And with this much simpler architecture, the agent actually performed 3.5 times faster with
00:13:2537% fewer tokens, and the success rate increased from 80% to 100%.
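The "one generic tool" idea can be sketched as follows. The tool schema below is illustrative, not Vercel's or any provider's actual API; the only capability exposed to the agent is running a shell command:

```python
import subprocess

# Hypothetical schema for the single generic tool exposed to the agent,
# in the spirit of the learning above. Instead of bespoke search/lint/query
# tools, everything goes through one bash command.
BASH_TOOL = {
    "name": "bash",
    "description": "Run a shell command and return its stdout.",
    "parameters": {"command": {"type": "string"}},
}

def run_bash(command, timeout=30):
    """The only tool the agent needs: grep, tail, npm, etc. all go through here."""
    out = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return out.stdout

result = run_bash("echo hello")
```

The model already knows these shell commands from training data, so it composes them far more reliably than it fills in bespoke JSON tool schemas.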
00:13:30A similar learning has been shared by the Anthropic team as well, where they talk about how, instead
00:13:34of having specialized search, lint, and execute tools, they just have one bash tool where
00:13:38they can run grep, tail, npm, npm run lint.
00:13:41And fundamentally, I think it's because the large language model is much more familiar
00:13:45with those code-native tools, which have billions of training tokens, versus bespoke tool-calling
00:13:49JSON that it needs to generate.
00:13:51And I've talked about this in the programmatic tool calling video that I released last week.
00:13:55And I believe similar fundamental principles apply here, but the foundation of those simple architectures
00:13:59is again a good context and documentation environment where the model can use generic tools
00:14:05to retrieve context progressively.
00:14:06And it is the same case with OpenClaw.
00:14:09One reason OpenClaw is so interesting is that it has a surprisingly simple but effective
00:14:13context environment.
00:14:15It has a list of documentation files to store core information. With this foundation,
00:14:18it only has the most basic tooling, like read, write, and edit files, run bash commands,
00:14:23and send messages.
00:14:24All the rest comes from giving the agent an environment to retrieve relevant context, plus a big skills
00:14:29library to expand its capabilities.
00:14:31So those are three practical learnings about how to do harness engineering for long-running,
00:14:35complex agents:
00:14:36set up a legible context environment to enable each session to grab context effectively,
00:14:41design the right workflow and tooling so the model can verify its work effectively and drive a faster
00:14:46feedback loop, and trust the agent with generic tools that it natively understands.
00:14:50If you're interested, I'm going to share more in depth about how I take these learnings
00:14:54and transform them into a development lifecycle process.
00:14:58In AI Builder Club, we have courses and walkthroughs about vibe coding and building production
00:15:02agents.
00:15:03And every week, myself and industry experts share the latest practical learnings.
00:15:08So if you're interested in learning what I'm learning every day, you can click on the link
00:15:12below to join the community.
00:15:13I hope you enjoyed this video.
00:15:14Thank you and I'll see you next time.

Description

Get free AI Adoption in Email Marketing Report: https://clickhubspot.com/cb84e9 🔗 Links - Join AI Builder Club: https://www.aibuilderclub.com/ - Try Superdesign: http://superdesign.dev/ - Follow me on twitter: https://twitter.com/jasonzhou1993
