wtf is Harness Engineer & why is it important

AI Jason

Transcript

00:00:00Thanks to HubSpot for sponsoring this video.
00:00:03So something really big happened in December 2025.
00:00:07And most people didn't even realize it.
00:00:09Andrej Karpathy tweeted about this last week.
00:00:10"It's very hard to communicate how much programming has changed due to AI in the last two months,
00:00:15specifically since last December."
00:00:17And Greg from OpenAI also talked about this.
00:00:20Since December, there have been step-function improvements in what the models and tools are capable of.
00:00:24And a few engineers have told him that their job has fundamentally changed since December
00:00:282025.
00:00:29So what actually happened in December 2025?
00:00:32In short, the latest models introduced then are finally ready for fully autonomous,
00:00:37long-running tasks.
00:00:38With AI, the ultimate dream has always been that while we are sleeping, AI can just work on
00:00:43tasks fully autonomously, 24/7.
00:00:46Even back in 2023, the most popular project, if you remember, was AutoGPT.
00:00:50It was the first time those fully autonomous agent systems were introduced.
00:00:54It had a fairly basic and simple architecture: it used GPT-4 as the model to autonomously
00:00:59break down a list of tasks based on the user's goal, with simple memory storage to store
00:01:03the results.
00:01:04And people were doing some pretty crazy stuff, like giving it a goal such as "make $100,000" and
00:01:08letting it loop through tasks infinitely until completed.
00:01:11Back then, the system would just break and fail miserably because the models simply weren't ready.
00:01:15But since December last year, this really changed.
00:01:18The models have significantly higher quality, long-term coherence, and they can power through
00:01:22much larger and longer tasks.
00:01:24And we saw all sorts of different experiments come out of the industry.
00:01:28First, in January, we got this super hot concept called the Ralph loop, a most basic
00:01:33and simple agent iteration loop that forces the model to work longer so it can take on more complex
00:01:37tasks.
00:01:38We just run the model in a loop with some simple condition checks, but already we started seeing
00:01:42a difference.
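The basic pattern can be sketched in a few lines. This is a hedged illustration, not any specific product's implementation: `call_model` is a hypothetical stub standing in for a real agent/model invocation, and the completion condition is a trivial counter.

```python
# Minimal sketch of a "Ralph loop": re-run the agent until a simple
# completion condition holds. `call_model` is a hypothetical stand-in
# for a real model/agent session; here it just simulates progress.

def call_model(state):
    # Pretend each fresh agent session completes one more task.
    state["done"] += 1
    return state

def ralph_loop(goal, total_tasks, max_iterations=100):
    state = {"goal": goal, "done": 0}
    for _ in range(max_iterations):
        state = call_model(state)          # one fresh agent session
        if state["done"] >= total_tasks:   # simple condition check
            return state
    return state                           # gave up after max_iterations

result = ralph_loop("build a todo app", total_tasks=5)
```

The loop itself is trivial; the interesting part is that models since December can make real incremental progress on each pass.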
00:01:43And one week later, Cursor also released their experiment where they used GPT-5.2 to autonomously
00:01:49build a browser from scratch with 3 million lines of code.
00:01:52And Anthropic also released an experiment where they got a team of Claude Code agents
00:01:57to work autonomously on a C compiler from scratch for two weeks.
00:02:01In the end, it delivered a functional version with zero manual coding.
00:02:05It can even compile and run Doom as well.
00:02:08Around the same time, OpenClaw started gaining attention and had explosive growth like we'd never
00:02:13seen before.
00:02:14And it was very difficult to understand what was going on with OpenClaw, because from the outside
00:02:18it's very easy to categorize OpenClaw as just another Manus, but one that lives inside your own
00:02:23computer and can also be accessed from Telegram.
00:02:27So why is this so popular?
00:02:29Only later, after I used it deeply, did I realize that the real difference is that OpenClaw represents
00:02:35a type of always-on, long-running, fully autonomous agent that is very different from
00:02:40all the other agent systems we used before, where a human is the main driver prompting for
00:02:45the next action.
00:02:46OpenClaw is always-on, and it is proactive.
00:02:49And this autonomous feeling is created by a fairly simple architecture: it has a memory and
00:02:53context layer, with triggers and cron jobs to automatically take actions, and full
00:02:58computer access, which is a powerful environment for it to operate in.
00:03:02And I believe OpenClaw is the first project that really opened up the biggest paradigm
00:03:06shift of 2026: we are moving from co-pilot, simple task-based agent systems to long-running,
00:03:13fully autonomous agents.
00:03:15Something that's always-on, always ready, autonomously delivering super complex, coordinated work.
00:03:20This is a critical shift you have to understand.
00:03:22The models today are actually much more powerful than you think, as long as you design the right
00:03:27system to unlock them.
00:03:28And this is the crux of what I want to talk about today:
00:03:30harness engineering to enable long-running autonomous systems.
00:03:34If this is the first time you've heard about harness engineering, it is an evolution of what
00:03:38we've previously talked about: context engineering and prompt engineering.
00:03:41Previously, we really focused on how to optimize the prompts within the effective context window
00:03:46to get the model to perform its best for a single agent loop session.
00:03:49But harness engineering is really focused on long-running tasks, which means: how do
00:03:53you design a system that can work across different sessions and multiple different agents?
00:03:57And how do you design the right workflow to make sure the relevant context will be retrieved
00:04:01for each session, and the right set of tooling to extract the most out of the models?
00:04:05This is a fairly new concept, but the good thing is that the industry has already converged on
00:04:09some best practices that you can use, from Anthropic, Vercel, LangChain, and many others.
00:04:14We'll go through them one by one so you can see the patterns.
00:04:16But before we dive in: with this paradigm shift to fully autonomous agents, one of the biggest
00:04:21opportunities for the next 6-12 months is to build an OpenClaw for a certain vertical.
00:04:25That means you deeply investigate and understand the end-to-end workflow of a certain vertical,
00:04:29and build an autonomous agent with the correct environment and tooling to enable the end-to-end process.
00:04:34That's why I want to introduce you to this awesome research HubSpot did: the AI Adoption
00:04:39in Email Marketing report.
00:04:40It is a fascinating report for understanding, in a vertical like email marketing, where people
00:04:44actually use AI today and what the gaps are.
00:04:47This report showcases clear workflows and opportunities in email marketing that you
00:04:51can potentially automate.
00:04:52They surveyed hundreds of email marketers from top companies to understand exactly how AI
00:04:57is reshaping their workflows.
00:04:58They talk about why marketers are still doing a lot of heavy editing, what causes
00:05:03it, as well as the biggest challenges they face today when implementing AI in
00:05:06email marketing.
00:05:07And each of these is a big opportunity for you to build a fully autonomous agent.
00:05:11They even dive into the specific KPIs marketers care most about, where AI has shown proven
00:05:15results,
00:05:16as well as what exactly email marketers really want from AI.
00:05:20So if you are a builder thinking about the next big agent product to build, I highly
00:05:24recommend you go check out this awesome resource.
00:05:27I have put the link in the description below for you to download for free.
00:05:30And thanks HubSpot for sponsoring this video.
00:05:32Now let's get back to harness engineering for long-running agent systems.
00:05:36At a high level, there are three learnings I took away from those.
00:05:39One is that for long-running task agents, the critical part of system design is creating
00:05:44a legible environment where each sub-agent or session can actually understand where things
00:05:49are at.
00:05:50Most likely there are workflows you can put in place to enforce the legibility of the environment.
00:05:54And I'll explain a bit more on that.
00:05:56The second is that verification is critical.
00:05:58You can improve system output significantly by allowing it to verify its work effectively,
00:06:03with a faster feedback loop.
00:06:04And third is that we need to trust the model more, instead of building specialized tooling
00:06:08that wraps a lot of reasoning and logic prematurely.
00:06:11We should give the model maximum context with generic tooling that it natively understands, and let
00:06:16it just explore like a human.
00:06:17And I'll unpack those three things one by one as we go through each blog here.
00:06:20First is Anthropic's blog on effective harnesses for long-running agents.
00:06:24They experimented with using the Claude Code SDK to build a specialized agent for super long-
00:06:29running tasks, like building a clone of the claude.ai website.
00:06:32The very first failure they observed is that agents tend to do too much at once.
00:06:37Essentially, the agent will always try to one-shot the whole app.
00:06:40And this led to the model running out of context in the middle of its implementation, leaving
00:06:45the next session to start with a feature half implemented or half documented.
00:06:49Then the agent would have to guess what actually happened and spend substantial time trying
00:06:52to get the basic app working again.
00:06:55And the second failure they observed is that agents tend to declare the job complete prematurely.
00:07:00You've probably experienced this a few times yourself as well.
00:07:02Claude Code or Cursor will just claim the PR or feature is completed.
00:07:05But once you test it, it actually doesn't work.
00:07:07So their approach to solving those default model failure behaviors is, firstly, to set up an initial
00:07:12environment that lays the foundation for all the features the given prompt requires, which
00:07:16sets up the agent to work step by step, feature by feature.
00:07:20This is kind of similar to the plan or PRD approach that we normally take.
00:07:23The second is to prompt each agent to make incremental progress towards its goal
00:07:27while also leaving the environment in a clean state at the end of each session.
00:07:32What they did is design this two-part solution.
00:07:35They have this initializer agent that uses a specialized prompt to ask the model to set
00:07:40up the initial environment with an init.sh script, which will set up the dev server, for example,
00:07:45so that the next model doesn't need to worry about those things.
00:07:48And also a claude-progress.txt file that keeps logs of what the agent has done, as well as an initial
00:07:53git commit that shows what files have been added.
00:07:55Then a coding agent, for each subsequent session, asks the model to make incremental progress,
00:08:01then leave structured updates.
00:08:02And all those efforts really serve one purpose: defining an
00:08:07environment where agents can quickly understand the state of the work when starting with a fresh
00:08:11context window.
00:08:13So the workflow is that the initializer agent first sets up an environment, or
00:08:17you could call it a documentation system, to track and maintain the overall plan.
00:08:21And the environment they designed here starts with a feature-list document to
00:08:25prevent the agent from one-shotting the whole app or prematurely considering the project complete.
00:08:30They get the initializer agent to break down the project into over 200 features
00:08:34and log them in a local JSON file that looks something like this, where each task has a detailed spec
00:08:39as well as a pass or fail state.
00:08:41By default, all tasks are marked as fail.
00:08:43This forces the model to always look at the overall project goal and progress, pick the highest-priority
00:08:49task, and do the next thing.
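A minimal sketch of what such a feature-list file and selection rule could look like. Field names and the `next_feature` helper are illustrative, not Anthropic's actual schema:

```python
import json

# Hypothetical feature list in the spirit of the harness described above:
# every feature starts as "fail", and each session picks the highest-priority
# failing feature to work on next. Field names are invented for illustration.
features = [
    {"id": 1, "priority": 1, "spec": "User can log in with email", "status": "fail"},
    {"id": 2, "priority": 2, "spec": "User can create a document", "status": "fail"},
    {"id": 3, "priority": 1, "spec": "Pages render without errors", "status": "pass"},
]

def next_feature(feature_list):
    """Pick the highest-priority feature that is still failing."""
    failing = [f for f in feature_list if f["status"] == "fail"]
    return min(failing, key=lambda f: f["priority"]) if failing else None

# Persist the list so the next session can read it with a fresh context window.
with open("features.json", "w") as fh:
    json.dump(features, fh, indent=2)

picked = next_feature(features)
```

Because the file lives in the repository, any later session can read it to recover the overall plan without relying on conversation history.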
00:08:50But to make this workflow work, they also needed a way to force the model to leave the environment
00:08:55in a clean state after making code changes. In their experiments, they found the best way
00:08:59is to ask the model to commit progress to git with a descriptive commit message and write
00:09:05a summary of its progress in the progress file. But documentation and the context
00:09:08environment itself is not enough, because the model by default has a tendency to mark something
00:09:13as completed without proper testing. At the beginning, they were just prompting Claude
00:09:17Code to always run tests after the code change, doing unit tests or API tests against
00:09:22the dev server.
00:09:23But all those things would often fail to recognize that a feature is not working end to end.
00:09:27Things really started changing when they gave the model proper tooling to do end-
00:09:30to-end tests by itself, like Puppeteer MCP or Chrome DevTools, where the agent was able to
00:09:35identify and fix bugs that were not directly obvious from the code itself.
00:09:39So basically, they set up a structure where the initializer agent breaks
00:09:43down the user's goal into a list of features, alongside an init.sh to run the dev
00:09:47server, and progress files.
00:09:49The next coding agent can then just read the feature list to understand
00:09:53the overall project plan and pick up high-priority tasks, and read the progress file and git log to understand
00:09:57where things are at.
00:09:59Then it runs init.sh to start the dev server immediately and does an end-to-end test to verify the environment
00:10:04is clean, so it gets the full picture and a faster feedback loop each time a new session
00:10:09and context window starts.
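Putting the pieces together, one coding-agent session could be sketched like this. All helpers are hypothetical stand-ins; a real harness would shell out to git, run init.sh, and invoke the model:

```python
# Sketch of one coding-agent session in a harness like the one described above.
# Everything here is a simplified stand-in for real harness steps.

def run_session(features, progress_log):
    # 1. Understand where things are at from the environment (feature list + log).
    todo = [f for f in features if f["status"] == "fail"]
    if not todo:
        return features, progress_log  # nothing left to do

    task = todo[0]
    # 2. (Real harness: run init.sh here to boot the dev server.)
    # 3. Make incremental progress on exactly one feature, then verify it
    #    end to end before flipping its status.
    task["status"] = "pass"

    # 4. Leave the environment in a clean state: log what happened
    #    (real harness: also `git commit` with a descriptive message).
    progress_log.append(f"session: completed feature {task['id']}")
    return features, progress_log

features = [{"id": 1, "status": "fail"}, {"id": 2, "status": "fail"}]
log = []
features, log = run_session(features, log)
```

Each session touches one feature and records its progress, so the next session starting with a fresh context window can pick up exactly where this one left off.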
00:10:10In OpenAI's blog, they talk about very similar things:
00:10:13you have to make sure your application environment is legible.
00:10:16They make the whole repository the knowledge system of record.
00:10:19Initially, they put everything in a gigantic agents.md file, and it failed in predictable ways because it's
00:10:23just too much context for any agent to manage and maintain.
00:10:27So what they did is design a proper doc environment structure and treat the agents.md file as a table
00:10:32of contents.
00:10:33They set up this documentation system covering the architecture, the design docs, the execution
00:10:37plan, DB schema, product specs, front-end plan, security, and many more, and
00:10:42put this table of contents into the agents.md file so the agent can actually retrieve the relevant
00:10:47information when needed.
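As a rough illustration of the table-of-contents idea (the file names below are invented, not OpenAI's actual layout), an agents.md might look like:

```markdown
<!-- agents.md used as a table of contents (illustrative file names) -->
# Project documentation index
- docs/architecture.md     — layered domain architecture and boundaries
- docs/design.md           — design docs and decisions
- docs/execution-plan.md   — current execution plan
- docs/db-schema.md        — database schema
- docs/product-specs.md    — product specs
- docs/frontend-plan.md    — front-end plan
- docs/security.md         — security notes
```

The agent only loads the index by default and opens individual docs on demand, which is what enables the progressive disclosure discussed next.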
00:10:49And this enables progressive disclosure, and OpenAI actually takes it even further.
00:10:53They push not only the code knowledge, but also Google Docs, Slack messages, and all the
00:10:58other fragmented information, feeding the data into the repository as repository-local versioned
00:11:03artifacts.
00:11:04So the agent can also retrieve them, because from the agent's point of view, if anything cannot be accessed
00:11:09in the environment, then effectively it doesn't exist.
00:11:11But again, documentation by itself didn't keep a fully agent-generated codebase coherent.
00:11:16They also introduced certain programmatic workflows to enforce invariants.
00:11:20For example, they layered the domain architecture with explicit cross-cutting boundaries, which
00:11:25allowed them to enforce those rules with custom checks, linters, and structural tests, which
00:11:29can be automatically triggered at every git pre-commit.
00:11:33This type of architecture is usually something you would postpone until you have hundreds of engineers
00:11:37at a traditional software company, but with coding agents it is an early prerequisite.
00:11:41Within those boundaries, you allow teams and agents significant freedom in how solutions
00:11:46are expressed, without micromanaging or worrying that the architecture is going to drift.
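A toy version of such a structural check, of the kind that could run from a pre-commit hook, might look like this. The layer names and rules are invented for illustration:

```python
# Hypothetical pre-commit style structural check: forbid imports that cross
# layer boundaries (e.g. the "domain" layer must not depend on "ui").
# Layer names and rules are illustrative, not OpenAI's actual setup.

FORBIDDEN = {"domain": {"ui", "api"}, "api": {"ui"}}

def boundary_violations(module_layer, imported_layers):
    """Return the imported layers that `module_layer` may not depend on."""
    return sorted(FORBIDDEN.get(module_layer, set()) & set(imported_layers))

# A domain module importing from ui should fail the pre-commit check.
violations = boundary_violations("domain", ["stdlib", "ui"])
exit_code = 1 if violations else 0  # non-zero exit blocks the commit
```

Because the check is mechanical, it can be run on every agent-generated commit without a human in the loop.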
00:11:49Meanwhile, they also improved the codebase a lot.
00:11:52For example, they made the app bootable per git worktree, so Codex can just launch and
00:11:55drive many different instances.
00:11:57And they also wired the Chrome DevTools Protocol into the agent runtime so that the agent can
00:12:01reproduce bugs and validate fixes via DOM snapshots, screenshots, and navigation.
00:12:05With the environment and workflow set up, the repository finally crossed a minimum
00:12:09threshold where Codex can drive a new feature end to end.
00:12:13Every time Codex receives a single prompt, the agent will validate the
00:12:17current state of the codebase, reproduce a reported bug, record a video to demonstrate
00:12:21the failure, implement the fix, validate the fix by driving the application, record a second
00:12:25video demonstrating the resolution, and eventually merge the change.
00:12:29So those two blogs showcase very good learnings and the necessary harness systems you need to put
00:12:32in place for a fully autonomous system.
00:12:34Meanwhile, there are also other learnings.
00:12:36Quite often when we build agents, especially vertical-specific agents, our tendency is to
00:12:40build specialized tooling to do domain-specific tasks.
00:12:43The learning here is that large language models almost always work better with generic tools
00:12:47that they natively understand.
00:12:49Vercel released an awesome article about how they redesigned their text-to-SQL agent.
00:12:53They had spent months building a sophisticated internal text-to-SQL agent, D0, with specialized
00:12:58tools, heavy prompt engineering, and careful context management.
00:13:02But as many of us have experienced before, those types of systems kind of work but are very fragile,
00:13:06slow, and require constant maintenance,
00:13:09because every time a new edge case happens, you need to inject a new prompt into the agent.
00:13:12But later they tried one thing that totally changed the trajectory.
00:13:15They deleted most of the specialized tools from the agent, down to a single bash command tool.
00:13:20And with this much simpler architecture, the agent actually performed 3.5 times faster with
00:13:2537% fewer tokens, and the success rate increased from 80% to 100%.
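The "one generic tool" idea can be sketched as follows. The tool schema below is illustrative, not Vercel's or any provider's actual API; the only capability exposed to the agent is running a shell command:

```python
import subprocess

# Hypothetical schema for the single generic tool exposed to the agent,
# in the spirit of the learning above. Instead of bespoke search/lint/query
# tools, everything goes through one bash command.
BASH_TOOL = {
    "name": "bash",
    "description": "Run a shell command and return its stdout.",
    "parameters": {"command": {"type": "string"}},
}

def run_bash(command, timeout=30):
    """The only tool the agent needs: grep, tail, npm, etc. all go through here."""
    out = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return out.stdout

result = run_bash("echo hello")
```

The model already knows these shell commands from training data, so it composes them far more reliably than it fills in bespoke JSON tool schemas.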
00:13:30A similar learning has been shared by the Anthropic team as well, where they talk about how, instead
00:13:34of having specialized search, lint, and execute tools, they just have one bash tool where
00:13:38they can run grep, tail, npm, npm run lint.
00:13:41And fundamentally, I think it's because the large language model is much more familiar
00:13:45with those code-native tools, which have billions of training tokens, versus bespoke tool-calling
00:13:49JSON that it needs to generate.
00:13:51And I've talked about this in the programmatic tool calling video that I released last week.
00:13:55And I believe similar fundamental principles apply here, but the foundation of those simple architectures
00:13:59is again a good context and documentation environment where the model can use generic tools
00:14:05to retrieve context progressively.
00:14:06And it is the same case with OpenClaw.
00:14:09One reason OpenClaw is so interesting is that it has a surprisingly simple but effective
00:14:13context environment.
00:14:15It has a list of documentation files to store core information. With this foundation,
00:14:18it only has the most basic tooling, like read, write, and edit files, run bash commands,
00:14:23and send messages.
00:14:24All the rest comes from giving the agent an environment to retrieve relevant context, plus a big skills
00:14:29library to expand its capabilities.
00:14:31So those are three practical learnings about how to do harness engineering for long-running,
00:14:35complex agents:
00:14:36set up a legible context environment to enable each session to grab context effectively,
00:14:41design the right workflow and tooling so the model can verify its work effectively and drive a faster
00:14:46feedback loop, and trust the agent with generic tools that it natively understands.
00:14:50If you're interested, I'm going to share more in depth about how I take these learnings
00:14:54and transform them into a development lifecycle process.
00:14:58In AI Builder Club, we have courses and walkthroughs about vibe coding and building production
00:15:02agents.
00:15:03And every week, myself and industry experts share the latest practical learnings.
00:15:08So if you're interested in learning what I'm learning every day, you can click on the link
00:15:12below to join the community.
00:15:13I hope you enjoyed this video.
00:15:14Thank you and I'll see you next time.

Description

Get free AI Adoption in Email Marketing Report: https://clickhubspot.com/cb84e9 🔗 Links - Join AI Builder Club: https://www.aibuilderclub.com/ - Try Superdesign: http://superdesign.dev/ - Follow me on twitter: https://twitter.com/jasonzhou1993
