00:00:00So, okay.
00:00:02What's the best AI model right now?
00:00:04Claude, GPT, Gemini.
00:00:07And honestly, I think that's the wrong question.
00:00:11Like, completely the wrong question.
00:00:14Just real quick, I'm Daniel.
00:00:16I've been deep in iOS dev for over eight years now.
00:00:20Started out freelancing, designing UIs,
00:00:24bouncing from client to client,
00:00:25shipping other people's ideas
00:00:27while trying to figure out my own.
00:00:28And then after WWDC 25, I just went all in, solo.
00:00:33No more clients, no safety net.
00:00:36Since then, I've crafted over 15 of my own apps,
00:00:39all SwiftUI, all built in public.
00:00:41And right now, honestly, every ounce of energy I've got
00:00:44goes into making this solo studio
00:00:46into something that actually lasts.
00:00:49Not another round of quick MVPs or AI-generated slop,
00:00:52but real apps that hold up at scale.
00:00:55And yeah, all of that process,
00:00:57the whole messy journey, lives on Crafters Lab.
00:01:00It's at crafterslab.dev,
00:01:01and it's not some tutorial graveyard or AI clone factory.
00:01:06It's genuinely my home base,
00:01:08built for solo devs who use AI like a real teammate.
00:01:12Not like a vending machine you poke when you're stuck
00:01:14and hope for the best.
00:01:16If you care about the craft,
00:01:18if you're serious about leveling up
00:01:20and building things that actually last,
00:01:23yeah, you'd feel right at home.
00:01:24And hey, if you're still on Patreon,
00:01:26huge thanks for that, but heads up.
00:01:29Everything's moved over to crafterslab.dev.
00:01:32That's where the whole crew is now.
00:01:33Come build with us.
00:01:35So here's what got me thinking about all this.
00:01:38There was a study that came out recently.
00:01:41Researchers published this benchmark called Epic's Agent.
00:01:45And what makes it different from every other benchmark
00:01:49you see people arguing about online
00:01:51is that it tests agents on real professional work,
00:01:55not coding puzzles, not multiple choice.
00:01:58We're talking actual tasks that consultants, lawyers,
00:02:03analysts do on a daily basis.
00:02:05Each one takes a human about one to two hours to complete.
00:02:08So they ran every major frontier model through it.
00:02:11The best one completed those tasks
00:02:13about 24% of the time, one in four.
00:02:17And after eight attempts with the same model,
00:02:20it only climbed to around 40%.
00:02:23Now keep in mind, these are the same models
00:02:26scoring above 90% on the benchmarks
00:02:29everyone loses their minds over.
00:02:32So either those benchmarks are off
00:02:33or we're measuring the wrong thing.
00:02:36And I think it's the second one, right?
00:02:37But okay, so here's where it gets real for us.
00:02:41The researchers actually dug into why the agents failed.
00:02:46And the answer wasn't that the models are dumb.
00:02:49They had all the knowledge they needed.
00:02:51They could reason through the problems just fine.
00:02:54The failures were almost entirely
00:02:56about execution and orchestration.
00:03:00The agents would get lost after too many steps.
00:03:02They'd loop back to approaches that already failed.
00:03:05They just lose track of what they were even supposed
00:03:09to be doing in the first place.
00:03:11And if you're a solo dev using Claude Code
00:03:14or Cursor every day, yeah, you've been there.
00:03:18You've watched the agent spiral, retry the same
00:03:21broken thing three times,
00:03:23completely forget the context from 20 steps ago.
00:03:26And you're sitting there like,
00:03:28maybe I should switch to Opus.
00:03:30Maybe I need a different provider,
00:03:32but the data is saying that's not it.
00:03:34The model isn't the bottleneck.
00:03:36It's everything wrapped around it.
00:03:38And there's a word for that.
00:03:40And I think it's gonna define 2026
00:03:43the way agents define 2025.
00:03:46The word is harness.
00:03:47An agent harness is all the infrastructure
00:03:50around the model: what it can see,
00:03:52what tools it has access to,
00:03:54how it recovers when things go sideways,
00:03:56how it keeps track of what it's doing over a long session.
00:03:59OpenAI literally published a blog post
00:04:02called Harness Engineering.
00:04:04Anthropic put out a whole guide on building effective
00:04:07harnesses for long running agents.
00:04:09Manus, the AI company Meta just acquired,
00:04:13they published their context engineering lessons
00:04:16after rebuilding their entire agent framework
00:04:19five times in six months, five times.
00:04:22And they're all saying the exact same thing.
00:04:24The harness is where the real engineering work lives,
00:04:27not the model.
00:04:28Okay, so, and this is the part that honestly surprised me
00:04:32because it runs completely counter
00:04:34to how most of us think about building with these tools.
00:04:38So there's this story from Vercel.
00:04:41They had a text to SQL agent.
00:04:43You ask a question, it writes a SQL query,
00:04:46and they built it the way most people build agents, right?
00:04:49Gave it a bunch of specialized tools,
00:04:51one for understanding the database schema,
00:04:54one for writing queries, one for validating results.
00:04:58All this error handling wrapped around it,
00:05:01and it worked about 80% of the time.
00:05:04Then they tried something kind of radical.
00:05:06They removed 80% of the tools, just ripped them out,
00:05:11gave the agent basic stuff, run bash commands, read files,
00:05:15standard command line tools like grep and cat,
00:05:18the kind of stuff you or I would actually use.
00:05:20And accuracy went from 80% to 100%.
00:05:25It used 40% fewer tokens,
00:05:28and it was three and a half times faster.
00:05:31Not gonna lie, that's kind of wild, right?
00:05:33And the engineer who built it said something
00:05:36that really stuck with me.
00:05:38Models are getting smarter.
00:05:40Context windows are getting larger.
00:05:42So maybe the best agent architecture
00:05:44is almost no architecture at all.
00:05:46And that just flips everything, you know what I mean?
00:05:50Because the instinct, especially when you're solo
00:05:54and you're trying to make this thing reliable,
00:05:57is to keep adding more tools, more guardrails,
00:06:01more routing logic.
00:06:02You think more structure is gonna help,
00:06:04but those tools weren't helping the model.
00:06:06They were getting in the way.
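To make the "almost no architecture" idea concrete, here's a rough sketch of a stripped-down tool surface (my own illustration, not Vercel's actual code): just file access and a shell, where everything else, grep, cat, build tools, comes free through bash.

```python
# A minimal tool surface in the "almost no architecture" spirit.
# Names and signatures are illustrative, not any product's internals.
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def edit_file(path: str, old: str, new: str) -> str:
    text = Path(path).read_text()
    if old not in text:
        return "edit failed: old text not found"  # let the model recover
    Path(path).write_text(text.replace(old, new, 1))
    return "ok"

def run_bash(command: str) -> str:
    # grep, cat, builds, tests -- all reachable through this one tool.
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=30)
    return proc.stdout + proc.stderr

TOOLS = {"read_file": read_file, "write_file": write_file,
         "edit_file": edit_file, "run_bash": run_bash}
```

An agent loop then just dispatches whichever tool call the model picks into `TOOLS`; the intelligence stays in the model, not in a routing pipeline.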
00:06:08And this isn't an isolated thing either.
00:06:10Manus went through the exact same realization.
00:06:13They rebuilt their entire agent framework
00:06:16five times in six months,
00:06:19and their biggest performance gains
00:06:21didn't come from adding features.
00:06:23They came from removing them.
00:06:25They ripped out complex document retrieval,
00:06:28killed the fancy routing logic,
00:06:29replaced management agents with simple structured handoffs.
00:06:34Every iteration, the thing got simpler and it got better.
00:06:37And here's the part I think every solo dev
00:06:40running long Claude Code sessions needs to hear.
00:06:42Manus found that their agent averaged
00:06:45around 50 tool calls per task.
00:06:49That's a lot of steps.
00:06:50And even with models that technically support
00:06:53huge context windows,
00:06:54performance just degrades past a certain point.
00:06:58The model doesn't suddenly forget everything.
00:07:01It's more like the signal gets buried under noise.
00:07:04Your important instructions from the start of the session
00:07:07get lost under hundreds of intermediate results.
00:07:10So their fix was dead simple.
00:07:12They started treating the file system
00:07:14as the model's external memory.
00:07:17Instead of cramming everything into the context window,
00:07:20the agent writes key info to a file
00:07:23and reads it back when needed.
00:07:25And yeah, if you use Claude Code,
00:07:27you've literally seen this.
00:07:29The CLAUDE.md files, the to-do lists, the progress tracking,
00:07:34that's this exact pattern playing out
00:07:36in your terminal every day.
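A minimal sketch of that external-memory move (the pattern, not Manus's implementation; all names are invented): big intermediate results go to disk, and only a short stub stays in the model's context.

```python
# Sketch of "file system as external memory": large tool results are
# written to disk; only a path and a short preview stay in context.
from pathlib import Path

MEMORY_DIR = Path("agent_memory")
MEMORY_DIR.mkdir(exist_ok=True)

def offload(step: int, result: str, max_inline: int = 200) -> str:
    """Store a big intermediate result on disk; return a compact stub
    that goes into the model's context instead of the full text."""
    if len(result) <= max_inline:
        return result  # small enough to keep inline
    path = MEMORY_DIR / f"step_{step:03d}.txt"
    path.write_text(result)
    # The stub the model sees: where the data lives plus a preview.
    return f"[saved to {path}] preview: {result[:max_inline]}..."

def recall(stub: str) -> str:
    """Read the full result back only when the agent needs it."""
    if stub.startswith("[saved to "):
        path = Path(stub[len("[saved to "):].split("]")[0])
        return path.read_text()
    return stub
```

The context window only ever carries the stub, so a 50-tool-call session doesn't bury the original instructions under hundreds of full intermediate results.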
00:07:37All right, so remember what I said
00:07:40about everyone converging on the same idea?
00:07:44Because when you look
00:07:45at the three most successful agent systems right now,
00:07:49they all arrived at the same place
00:07:51from completely different directions.
00:07:53Codex from OpenAI, it's got this layered approach.
00:07:57An orchestrator that plans,
00:07:59an executor that handles individual tasks,
00:08:02and a recovery layer that catches failures.
00:08:06It's robust.
00:08:07You can hand it stuff and walk away.
00:08:09That's one philosophy.
00:08:10Claude Code, and I use this every single day.
00:08:14The core is literally just four tools.
00:08:16Read a file, write a file, edit a file,
00:08:19run a bash command, that's it.
00:08:21Most of the intelligence lives in the model itself.
00:08:23The harness stays minimal.
00:08:25And when you need more, extensibility comes through MCP
00:08:28and skills that the agent picks up as needed.
00:08:30And then Manus landed on what I'd call
00:08:33reduce, offload, isolate, actively shrink the context,
00:08:38use the file system for memory,
00:08:40spin up subagents for heavy tasks,
00:08:43and just bring back the summary.
00:08:45Three totally different approaches,
00:08:47all converging on the same insight.
00:08:50The harness matters more than the model.
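That "isolate" step can be sketched like this. It's a sketch of the idea, not anyone's actual code: `call_model` is a stand-in for whatever LLM client you use, and the point is that the heavy work happens in a throwaway context while only the summary re-enters the main one.

```python
# Sketch of the "isolate" pattern: run a heavy task in a subagent with
# its own fresh context, and hand only a short summary back.
from typing import Callable

def run_subagent(task: str,
                 call_model: Callable[[list], str],
                 summarize_to: int = 500) -> str:
    """Execute `task` in an isolated context; return only a summary."""
    sub_context = [{"role": "user", "content": task}]  # fresh context
    full_result = call_model(sub_context)              # heavy work here
    if len(full_result) <= summarize_to:
        return full_result  # already small enough to hand back
    # A second call compresses the result before it re-enters the main
    # agent, so the main context only grows by the summary's length.
    return call_model([{
        "role": "user",
        "content": f"Summarize in under {summarize_to} characters:\n"
                   + full_result,
    }])
```

The main agent's context never sees the subagent's intermediate steps, which is exactly the "reduce, offload, isolate" move: the mess stays local.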
00:08:52And for solo devs,
00:08:55this changes what you should actually
00:08:57be spending your time on.
00:08:59Because, you know, we don't have infinite hours.
00:09:01Every hour you spend on Reddit debating
00:09:05Claude versus GPT is an hour you're not shipping.
00:09:08And there's this idea from Richard Sutton,
00:09:11one of the creators of reinforcement learning,
00:09:14called the bitter lesson.
00:09:16The core argument is that
00:09:18approaches which scale with compute
00:09:21always end up beating approaches
00:08:23that rely on hand-engineered knowledge.
00:08:26Applied to what we're doing,
00:08:27that means something very specific.
00:09:29As models get smarter,
00:09:31your harness should get simpler,
00:09:33not more complex.
00:09:34If you're adding more hand-coded logic,
00:09:36more custom pipelines with every model upgrade,
00:09:40you're swimming against the current.
00:09:42And honestly, that over-engineering
00:09:44is probably why your agent keeps breaking.
00:09:47So here's what I'd actually try.
00:09:49First, do the Vercel experiment yourself.
00:09:52If you've got any kind of agent set up,
00:09:54strip it down, remove the specialized tools,
00:09:57give it a bash terminal and basic file access
00:10:00and just see what happens.
00:10:02The model is probably smarter
00:10:03than the tool pipeline you built around it.
00:10:06Second, add a progress file.
00:10:08Have your agent maintain a running to-do list
00:10:10that it updates after each step.
00:10:13It reads the file at the start of each action,
00:10:15writes to it at the end.
00:10:17This is exactly what Claude Code does
00:10:19with those markdown files.
00:10:20And it's the same pattern Manus landed on
00:10:22after five complete rewrites.
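One hypothetical way to wire the progress-file pattern up, with a made-up `PROGRESS.md` checklist format: the agent calls `next_task()` before each action and `mark_done()` after, so the plan survives outside the context window.

```python
# Sketch of the "progress file" pattern: a markdown to-do list the agent
# re-reads before every step and updates after. File name and format
# are my own choices, not a standard.
from pathlib import Path

PROGRESS = Path("PROGRESS.md")

def load_tasks() -> list:
    """Parse '- [ ] task' / '- [x] task' lines into (done, text) pairs."""
    if not PROGRESS.exists():
        return []
    tasks = []
    for line in PROGRESS.read_text().splitlines():
        if line.startswith("- ["):
            tasks.append((line[3] == "x", line[6:]))
    return tasks

def save_tasks(tasks: list) -> None:
    lines = [f"- [{'x' if done else ' '}] {text}" for done, text in tasks]
    PROGRESS.write_text("\n".join(lines) + "\n")

def next_task():
    """What the agent reads at the start of each action."""
    for done, text in load_tasks():
        if not done:
            return text
    return None

def mark_done(text: str) -> None:
    """What the agent writes at the end of each action."""
    save_tasks([(done or t == text, t) for done, t in load_tasks()])
```

Even if the session compacts or restarts, the checklist on disk is the source of truth, not whatever survived in the context.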
00:10:24I actually have a whole system for this
00:10:26wired up in the lab with all my agent instructions
00:10:29and .md templates, ready to go if you're curious.
00:10:33And third, start learning about MCP and skills.
00:10:37These give the model clean, standardized ways
00:10:40to work with external tools
00:10:42without you having to hard code every integration.
00:10:44That's where the extensibility lives now.
00:10:462025 was the year of agents.
00:10:50And for the most part, yeah, that happened.
00:10:53But 2026, I think 2026 is the year of harnesses.
00:10:58And the same model, the exact same model
00:11:03behaves completely differently in Claude Code
00:11:06compared to Cursor or compared to Codex.
00:11:08So choose your harness carefully,
00:11:11whether you're using a coding agent or building one.
00:11:14And so, yeah, if you're still here,
00:11:17honestly, you're a legend.
00:11:18And look, I know the model discourse is loud right now.
00:11:22Every week there's a new drop, a new benchmark,
00:11:24a new thread about which one is king.
00:11:27But the actual data, the actual engineering
00:11:30coming out of the companies building this stuff,
00:11:32it's all pointing somewhere else.
00:11:34The harness is where the wins are.
00:11:37And as solo devs, that's actually great news
00:11:40because building a better harness
00:11:42is something you can do right now today
00:11:45without waiting for the next model release.
00:11:47And if you wanna go deeper into how I actually
00:11:51set all this up, the .md files, the agent workflows,
00:11:56how I wire everything together for my own apps,
00:11:59come check out crafterslab.dev.
00:12:02It's not some tutorial dump or another AI content farm.
00:12:06It's genuinely my home base built for solo devs
00:12:09who treat AI like a real teammate
00:12:11and actually care about what they ship.
00:12:13Inside, you get full walkthroughs,
00:12:15real short video tutorials, a bunch of Claude Code skills
00:12:19you can grab and use right away,
00:12:21and downloadable resources you can drop
00:12:24straight into your projects.
00:12:26Members riff in the comments, ask followups,
00:12:29go back and forth.
00:12:30It's a real conversation, not some one-way content feed.
00:12:34But the real core is the Notion team spaces,
00:12:37my live playbook. You get a front row seat
00:12:40to how I run every single app I'm building,
00:12:42the actual .md files I use on real projects,
00:12:46the prompt library, the docs I'm writing as I go,
00:12:49all the automations running behind the scenes,
00:12:51nothing polished for the camera, just the real process,
00:12:55messy parts and all. And there's Swift Brain,
00:12:58a curated Swift and SwiftUI library
00:13:01I've been building out for years, deep dive keynotes,
00:13:04private talks I spent real money curating,
00:13:07the kind of material that's not floating around
00:13:10in public training data.
00:13:11This is what I actually use to build custom MCPs
00:13:16to set up skills for Claude Code, for Cursor, all of it,
00:13:20always experimenting, always sharing what sticks,
00:13:23and then Ops Lab.
00:13:25That's where all the AI agent instructions live,
00:13:28the Notion templates, the Claude Code skills,
00:13:31the workflows, automations all wired up
00:13:33and ready for you to copy, tear apart,
00:13:36totally break and rebuild your own way.
00:13:38The whole point is keeping the indie stack connected
00:13:41so you're never really building alone,
00:13:44even if you're solo at the keyboard.
00:13:46So yeah, if you wanna get in while the crew's still small
00:13:49and prices are locked, now is kind of the sweet spot.
00:13:52It feels way more like a behind the scenes dev lounge
00:13:55than some giant faceless forum.
00:13:57I'd genuinely love to see you in there.
00:14:00Trade some takes on this harness stuff,
00:14:02maybe learn something from what you're building next.
00:14:05Keep crafting, keep experimenting,
00:14:08and don't let the benchmark noise distract you
00:14:10from what actually matters.
00:14:12Peace.