I Hated Every Coding Agent, So I Built My Own — Mario Zechner (Pi)


Transcript

00:00:00[MUSIC PLAYING]
00:00:02Hi, my name is Mario.
00:00:04I hail from the land of Arnold Schwarzenegger,
00:00:06which you probably haven't noticed yet
00:00:09based on my very good English.
00:00:12I want to preface this with we've
00:00:13been running around with our four-year-old the entire day
00:00:16through London.
00:00:17So we went to dinosaurs, mummies, Nandos, obviously,
00:00:24and stuff I have already forgotten.
00:00:26I'm very, very tired.
00:00:28And if you don't understand anything I say,
00:00:31just raise your hand and say, grandpa, wake up.
00:00:36The reason I'm here is actually another person,
00:00:39which is here in Cockneyville today.
00:00:40Let's call him Shteter Pineburger.
00:00:44Back in 2025, I think somewhere around April,
00:00:53he told me and Armin Ronacher, whom you might also know
00:00:58from Flask fame and Sentry fame: dude, those coding agents,
00:01:02they actually work now.
00:01:04And I was like, oh, shut the fuck up.
00:01:06Sorry, I'm also using swear words.
00:01:09Totally not.
00:01:10And a month later, we teamed up at this flat for 24 hours
00:01:13overnight and just let ourselves get immersed by the clankers,
00:01:19by the vibe code, and by the vibe slop.
00:01:21And since then, none of us have really
00:01:23been sleeping anymore, basically.
00:01:27So we were building stuff, lots of stuff, most of which
00:01:32we actually never used, because that's the new thing in 2025,
00:01:36'26.
00:01:37We build a lot of stuff, but we don't build a lot of stuff
00:01:39we actually use.
00:01:40We wrote a lot of stuff.
00:01:42And eventually, that culminated in me thinking,
00:01:46hey, I hate all the existing coding agents or harnesses.
00:01:50How hard can it be to write one myself?
00:01:53And Peter was like, oh, I just want to do a thing.
00:01:56Nobody's probably going to hear about it.
00:01:58And it's going to be a personal assistant,
00:02:01because that's what I've always wanted to have.
00:02:03Most of you probably know how his story went.
00:02:05So today, I'm going to tell you my much less impressive story.
00:02:08But I hope I can transport a couple of learnings,
00:02:11as we see in the industry, that I was able to gather
00:02:16in the past couple of months.
00:02:17So Pi.
00:02:19In the beginning, there was Claude Code.
00:02:21Actually, there was copy and pasting from ChatGPT.
00:02:25We all did that in the beginning, 2023.
00:02:27Then there was-- who remembers the original GitHub Copilot?
00:02:32Yeah, actually, how many of you are engineers?
00:02:35How many of you are using coding agents,
00:02:37like Cursor, Claude Code?
00:02:39OK.
00:02:40Popularity contest, Claude Code?
00:02:43Codex CLI?
00:02:45Cursor?
00:02:48Open--
00:02:48[INAUDIBLE]
00:02:49Yeah.
00:02:50Open code?
00:02:50Anti-gravity.
00:02:51Oh, that's not a lot.
00:02:52Anybody using this?
00:02:55I like you.
00:02:56We're going to have a beer later.
00:02:58Anyway, so this was basically what happened in 2025
00:03:03and before.
00:03:04Started with copy and pasting from ChatGPT.
00:03:06It's all mostly broken.
00:03:07It's mostly single functions, stuff you don't want to write.
00:03:10Then you got GitHub Copilot inside of your Visual Studio
00:03:13Code, where you just tap, tap, tap to happiness,
00:03:15which did work sometimes, mostly didn't.
00:03:17Sometimes it will also just [INAUDIBLE] recite GPL code,
00:03:22like John Carmack's inverse square root
00:03:25and stuff like that, which was a lot of fun.
00:03:29And then there was Aider.
00:03:30Anybody remember Aider?
00:03:31Yes.
00:03:32Old people.
00:03:33Hello.
00:03:33Yeah.
00:03:37You have gray hair.
00:03:37You obviously know Aider.
00:03:41There was also AutoGPT.
00:03:43Probably not a lot.
00:03:44Yeah, OK.
00:03:45He knows all the things.
00:03:48And then eventually there was Claude Code.
00:03:51I think they released it in November,
00:03:52actually, as a beta in 2024.
00:03:55But it really only became used more, say again?
00:03:59Only February.
00:04:01Yeah, February, March, something like that, 2025.
00:04:03And I was like, I love it.
00:04:05It's awesome.
00:04:06The Claude Code team is also awesome.
00:04:07They're on socials.
00:04:08And they're all very good people and very talented people.
00:04:13And they basically created the entire genre.
00:04:15I know there were precursors like Aider and AutoGPT,
00:04:18but nothing did this.
00:04:20And this was basically the whole agentic search thing.
00:04:22So instead of, like, Cursor going into your code base,
00:04:25indexing things, constructing ASTs, and indexing that as well.
00:04:29And it's kind of not really working.
00:04:31They just said, eh.
00:04:33We reinforcement trained our models
00:04:35to just use file tools, bash tools,
00:04:37to explore your code base ad hoc and find the places that it
00:04:41needs to find to understand the code and then modify the code.
00:04:44And this worked so well that, yeah, we
00:04:46stopped sleeping because we all of a sudden
00:04:48could produce so much more code than we could before by hand.
00:04:52Back then, it was simple and predictable
00:04:54and actually fit my workflow perfectly.
00:04:57Fine.
00:04:58But then they fell into the trap to which most of us
00:05:05probably fall.
00:05:06The clankers can write so much code.
00:05:08Why not just let it write all the features you could ever
00:05:11imagine, right?
00:05:11Isn't that great?
00:05:12Let's just add this feature, and that feature,
00:05:14and this feature, and that feature.
00:05:15And eventually, you end up with Homer Simpson's--
00:05:18I don't even know what it's called.
00:05:20I call it a spaceship.
00:05:21And Claude Code is now a spaceship.
00:05:23It does so many things that you actually probably ever
00:05:26use like 5% of what it offers.
00:05:28You only know about 10% in total.
00:05:30And the rest, the 90% that's left over,
00:05:33that's kind of like the dark matter of AI and agents.
00:05:36Nobody knows what it's actually doing.
00:05:37And I personally find this not to be very helpful
00:05:40because I still think that you kind of need
00:05:43to know what the agent is doing.
00:05:45This guy might disagree to some degree.
00:05:49And we're here at Tessl, and they also
00:05:51like context management or context engineering,
00:05:54as we've called it.
00:05:55And I eventually found that Claude Code was not
00:05:58a good tool when it comes to observability
00:06:01and actually managing your context.
00:06:04Then there was also this.
00:06:06Who likes this about Claude Code, like the immense amounts
00:06:09of flicker, unexplainable flicker?
00:06:10Well, actually, I know how to explain it and why it happens,
00:06:13but they still haven't fixed it.
00:06:15Here's Thariq.
00:06:16He's really great.
00:06:16I love him.
00:06:17He's their DevRel guy, mostly on Twitter, and he's amazing.
00:06:21But sometimes he also says questionable stuff
00:06:24like, our terminal user interface is now a game engine.
00:06:27Now, you have to know I have a game development background.
00:06:30That's where I come from.
00:06:31And if I read something like this,
00:06:32then it kind of hurts me a little bit
00:06:34because it's a freaking terminal user interface, dude.
00:06:37It's not a game engine.
00:06:38Trust me.
00:06:39The only reason you think it's a game engine
00:06:41is because you're using React in your terminal interface,
00:06:44and it takes like 12 milliseconds
00:06:45to relay out your entire user interface graph.
00:06:49Just don't do that, man.
00:06:51It's not a game engine, right?
00:06:54And then Mitchell, who is writing Ghostty,
00:06:56was like, dude, that's offensive, man.
00:06:59Like, don't blame it on Ghostty or any other terminal.
00:07:02Your code is garbage.
00:07:04Terminals can render at like hundreds
00:07:05of frames per second, sub-milliseconds per frame.
00:07:09So don't do that, right?
00:07:12And then they eventually fixed the flicker.
00:07:15But then other stuff happened.
00:07:16So it's like they fully gave in to the vibe coding.
00:07:20And you can feel it every day when you use Claude Code.
00:07:23Now, again, I do not want to diminish their efforts
00:07:27and their results.
00:07:28Claude Code is still the category leader for a good reason.
00:07:30They invented this thing, and they're doing a great job.
00:07:32I personally am just an old person
00:07:34who likes predictable simple tools.
00:07:37And this just didn't fit my workflows and my needs anymore.
00:07:41So yeah.
00:07:42Also, they do a lot of stuff in the background,
00:07:44manipulating your context.
00:07:46I built a bunch of tools in summer 2025
00:07:50that would allow me to intercept requests being made
00:07:52to their back end from Claude Code and find out
00:07:55what kind of little additional text
00:07:58gets injected into your context behind your back.
00:08:00And all of that was very detrimental
00:08:01and also changed all the time.
00:08:04Like every day or second day, there
00:08:06would be a new release where this changed what
00:08:08gets injected at what point, which would basically mess
00:08:11with your existing workflows.
00:08:13It was just not a stable tool.
00:08:14And now I understand it from their perspective.
00:08:16They need to experiment.
00:08:17And they have a huge user base.
00:08:18And it's really hard to experiment
00:08:19when you have a huge user base.
00:08:21But they did not care.
00:08:23So all of us had to suffer.
00:08:25You're working with this new tool.
00:08:27You try to create predictable workflows.
00:08:31And then the tool vendor changes a tiny little thing
00:08:35under the hood that makes the LLM go
00:08:36crazy with your existing workflows.
00:08:38That's just not sustainable.
00:08:39I need control over that.
00:08:40I can't rely on them providing me a stable kind of thing.
00:08:46So I believe, as a consequence of the UI design,
00:08:52they need to reduce the amount of visibility you have.
00:08:54I personally don't like that too much.
00:08:56But that's just a personal preference.
00:08:57I understand that most people will
00:08:58be happy with the amount of information
00:09:00that Claude Code will present you.
00:09:03There is zero model choice, obviously,
00:09:06because it's an anthropic native tool, so to speak.
00:09:09That's not the downside, because Cloud models are--
00:09:12I like them.
00:09:13They're really good.
00:09:15And there's almost zero extensibility.
00:09:17And you might find this kind of funny, because they
00:09:19have this whole hook system and all of that.
00:09:21But if you compare it to what Pi allows you to do,
00:09:25it's not as deeply integrated.
00:09:28It's also basically based on running a process when
00:09:32the hook event starts, which is very expensive if you
00:09:36have to start up that process over and over again.
00:09:40So eventually, I soured on Claude Code,
00:09:42not because it was terrible.
00:09:44It's just it stopped being a fit for me.
00:09:47It became a fit for a lot more people over that period.
00:09:50So obviously, they are doing things right, but not for me,
00:09:54because I'm old.
00:09:56So then I was looking around for options.
00:09:59And there is Codex CLI, which I really didn't like.
00:10:01In the beginning, both the user interface as well as the model,
00:10:05that has changed, at least with respect to the model.
00:10:08Codex is really pretty good now.
00:10:10Then there's AMP.
00:10:12The team behind that used to work at Sourcegraph.
00:10:15They spun off of Sourcegraph.
00:10:20And they're super good engineers.
00:10:21They managed to build a commercial coding harness where
00:10:25they take away features instead of adding them.
00:10:28And most of their choices make a lot of sense to me.
00:10:33So yeah, if you're looking for a commercial coding harness,
00:10:36I would definitely recommend AMP to you, because it's really good.
00:10:39Factory Droid, kind of a similar spiel, also really good,
00:10:44although they are not as experimental as AMP.
00:10:47And then there's OpenCode, which is the open source
00:10:50coding harness a lot of people use.
00:10:53So I have a history of open source.
00:10:55I've been in open source for, well, 17 years.
00:11:00I've managed big and small open source projects.
00:11:04So that's near and dear to my heart.
00:11:05And so I thought, I give OpenCode a try,
00:11:08because that's close to me.
00:11:12And next to AMP, they have one of the most grounded
00:11:15or pragmatic teams in the space.
00:11:16They don't hype you up with features
00:11:18you probably never use.
00:11:20They try to kind of preserve a happy path that's
00:11:23very stable.
00:11:26And they also have pretty good thoughts
00:11:27on what coding agents mean for us
00:11:29as a profession, which I personally can identify with.
00:11:32The problem with OpenCode is that it's also not very good
00:11:37at managing your context.
00:11:38For example, on each turn, it's calling sessionCompaction.prune,
00:11:44which does the following.
00:11:46It prunes all tool results before the last 40,000 tokens.
00:11:52Now, who here knows what prompt caching is?
00:11:56What does this do to your prompt cache?
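To make the rhetorical question concrete: provider-side prompt caching matches on an exact prefix of the request, so rewriting earlier messages throws the cache away. A minimal sketch of that effect, with all names mine rather than OpenCode's:

```typescript
// Illustrative only: shows why rewriting history busts a prefix-based prompt cache.
type Message = { role: "user" | "assistant" | "toolResult"; content: string };

// Count how many leading messages two requests share; a provider can only
// reuse cached computation for this shared prefix.
function sharedPrefixLength(a: Message[], b: Message[]): number {
  let n = 0;
  while (n < a.length && n < b.length &&
         a[n].role === b[n].role && a[n].content === b[n].content) n++;
  return n;
}

// Naive pruning in the style the talk criticizes: drop old tool results.
function pruneOldToolResults(history: Message[], keepLast: number): Message[] {
  const cutoff = history.length - keepLast;
  return history.filter((m, i) => i >= cutoff || m.role !== "toolResult");
}

const history: Message[] = [
  { role: "user", content: "task" },
  { role: "toolResult", content: "big file dump" },
  { role: "assistant", content: "ok" },
  { role: "user", content: "next step" },
];

// Appending keeps the entire old request as a cache hit...
const appended: Message[] = [...history, { role: "assistant", content: "done" }];
// ...but pruning rewrites message 1, so only the first message is reusable.
const pruned = pruneOldToolResults(history, 2);
console.log(sharedPrefixLength(history, appended)); // 4: full prefix reused
console.log(sharedPrefixLength(history, pruned));   // 1: cache effectively cold
```

Every turn that reshuffles old messages pays the full input-token cost again, which is exactly the kind of infrastructure pressure the next part of the talk is about.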
00:11:58So OpenCode and Anthropic had an interesting history.
00:12:05And eventually, Anthropic, in my opinion, rightly so,
00:12:11said, dudes, that's just not going to happen.
00:12:14And there was never a public kind of thing about this.
00:12:19But Thariq explains it here.
00:12:19If you come to a gym and don't behave and abuse
00:12:22the infrastructure, so to speak, you're going to get banned.
00:12:25And I think--
00:12:27I don't have any evidence for that,
00:12:28but I think that's the reason why
00:12:30there is this animosity between Anthropic and OpenCode.
00:12:33And I can totally agree, or at least I
00:12:36think that Anthropic is clearly in the right here.
00:12:39Don't mess with the infrastructure.
00:12:42Then there's also other stuff, like OpenCode
00:12:44comes with LSP, Language Server Protocol support,
00:12:46out of the box.
00:12:48Coming back to context engineering,
00:12:51let's say you give your agent the task
00:12:53of modifying a bunch of files.
00:12:55What does that mean in practice?
00:12:57It will make a bunch of edits, one after the other,
00:13:02to a bunch of files.
00:13:03How probable is it that after the first edit, out of 10 edits,
00:13:09so to speak, the code will compile?
00:13:12What happens if you modify your code line by line?
00:13:15How long does it take for it to stabilize again
00:13:17and it compiles cleanly?
00:13:19It doesn't.
00:13:20It won't compile after the first edit, probably not
00:13:22after the second edit, and so on and so forth.
00:13:24So if you then turn around and say, hey, dear LSP server,
00:13:28I just edited one line in this file.
00:13:30Is it broken?
00:13:31Then the LSP server will say, yes, it's really broken.
00:13:34And what this feature does is it then
00:13:36injects this error directly after the tool
00:13:39call as a kind of feedback to the model.
00:13:43Oh, what you just did is wrong.
00:13:45And the model is like, what the fuck, dude?
00:13:47I'm not done editing things.
00:13:49Why are you telling me this?
00:13:50Obviously, it's not wrong.
00:13:51But if you do this often enough, the model will just give up.
00:13:54And that leads to very bad outcomes.
00:13:58So I'm not a fan of LSP.
00:13:59I think it's a very terrible idea to have that enabled.
00:14:02There's natural synchronization points
00:14:03where you want to have linting and type checking
00:14:06and all of that.
00:14:07And that is when the agent thinks it's done, only then.
00:14:10This has changed recently.
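That "natural synchronization point" idea can be sketched as a gate that buffers diagnostics during an edit burst and only surfaces them once the agent declares the turn finished. This is my illustrative sketch, not OpenCode's or Pi's actual code:

```typescript
// Sketch: buffer diagnostics while edits are in flight, flush only at turn end.
type Diagnostic = { file: string; message: string };

class DiagnosticsGate {
  private pending: Diagnostic[] = [];
  private editing = false;

  beginEdits() { this.editing = true; }

  // Called by a hypothetical LSP watcher after every file change.
  report(d: Diagnostic) { this.pending.push(d); }

  // What should be fed back to the model right now.
  feedbackForModel(): Diagnostic[] {
    return this.editing ? [] : this.pending; // mid-edit: stay silent
  }

  // The agent says it is done: now type errors are a real signal.
  endEdits(): Diagnostic[] {
    this.editing = false;
    const out = this.pending;
    this.pending = [];
    return out;
  }
}

const gate = new DiagnosticsGate();
gate.beginEdits();
gate.report({ file: "a.ts", message: "cannot find name 'foo'" }); // expected mid-edit noise
console.log(gate.feedbackForModel().length); // 0: don't interrupt the model
console.log(gate.endEdits().length);         // 1: lint/type-check at the sync point
```

The point is only the ordering: diagnostics after the edit burst are signal, diagnostics during it are noise.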
00:14:14This is a single session of OpenCode, where every message
00:14:20becomes its own JSON file.
00:14:22Every single message becomes its own JSON file on disk.
00:14:26That indicates to me that there wasn't a lot of thought put
00:14:29into the architecture of the whole thing.
00:14:31And if I lose trust in that, I don't
00:14:33want to use that tool anymore.
00:14:35Again, I think the team is actually really good.
00:14:37I think they iterated super quickly
00:14:39and built something that's super useful to a lot of people,
00:14:42obviously.
00:14:43It's just, again, decisions that I wouldn't have made that
00:14:46made me decide to build my own.
00:14:50Then there was also this.
00:14:51OpenCode comes with a server by default.
00:14:54So the core architecture is based on a server.
00:14:56And clients connect to it.
00:14:57And the terminal user interface is one of the clients.
00:15:00There's also a desktop interface.
00:15:01And I don't know.
00:15:03That turned out to be a security vulnerability
00:15:05with remote code execution baked in by default.
00:15:09And that's also-- if you are so proud of your server
00:15:12infrastructure or server architecture,
00:15:15then I would assume you're grown-up engineers that
00:15:18thought about security as well.
00:15:20And apparently, that didn't happen.
00:15:21And this was open for a long time.
00:15:23And again, I'm not blaming anyone here.
00:15:25This is stuff that just happens if you're
00:15:27working in an industry that's operating at a breakneck speed
00:15:31that we haven't seen before.
00:15:33It's just I don't want to use that tool if that is a thing.
00:15:36So these were my observations with regards to existing coding
00:15:42harnesses.
00:15:42AMP and Droid would have been something I could have used.
00:15:45But again, no control.
00:15:47In case of AMP, they even decide what models you can use.
00:15:50And it's only a single model for a single type of task.
00:15:53And that's not me.
00:15:55In terms of Droid, I think it's a little bit more open.
00:15:58But at the time when I tried it out,
00:16:00it just didn't--
00:16:02I didn't see a big advantage over Claude Code.
00:16:07And then I looked into benchmarks for entirely different reasons
00:16:10and found Terminal-Bench.
00:16:12Who knows what Terminal-Bench is?
00:16:15OK, basically, it's a coding or an agent evaluation
00:16:20harness, which has a bunch of computer use and programming
00:16:24related--
00:16:24sorry, old and tired because 4-year-old.
00:16:31It has a bunch of computer use and coding related tasks
00:16:35that an agent or the LLM inside an agent harness
00:16:39needs to fulfill.
00:16:40I think it's about 82 or so.
00:16:43And they're very diverse.
00:16:44They're from fix my window setup to code me a Monte Carlo
00:16:48simulation or something like that.
00:16:51And they have a leaderboard.
00:16:52And on that leaderboard, you see the combination
00:16:54of coding agent harness and model.
00:16:57And they have their own coding agent called Terminus.
00:17:03And I think it's brilliant because it's
00:17:06one of the best performing harnesses in the benchmark.
00:17:09We're going to see it later on.
00:17:11What exactly does it do?
00:17:12Well, all the model gets is a TMUX session.
00:17:17And all it can do is send keystrokes to it
00:17:19and read back the VT code sequences that are emitted.
00:17:23So this is like the smallest, most minimal interface
00:17:27a model can have to your computer.
00:17:31And this performs top of the line of the entire leaderboard.
00:17:36So what does this tell us about existing coding agent harnesses?
00:17:39Do we need all these features for the models
00:17:41to actually perform?
00:17:43For me, personally, this is not just about the model actually
00:17:48being good.
00:17:49It's also about me as the user, the human,
00:17:51having a way to interact with my agent with the model.
00:17:54And Terminus is obviously not the user experience or developer
00:17:58experience that I want.
00:18:00But it tells us that all of these features, all of these coding
00:18:03harnesses have might not be necessary to get
00:18:08good results out of agents.
00:18:10So no file tools, no sub-agents, no web search, no nothing.
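A Terminus-style interface really is just two tmux subcommands. A sketch of the argv such a harness would spawn (the wrapper names are mine; the tmux subcommands and flags are standard):

```typescript
// Sketch of the Terminus-style minimal interface: the model's only "tools"
// are sending keys into a tmux session and reading the pane back.
function sendKeysArgv(session: string, keys: string): string[] {
  // "Enter" at the end submits the line; tmux send-keys interprets it as the key.
  return ["tmux", "send-keys", "-t", session, keys, "Enter"];
}

function capturePaneArgv(session: string): string[] {
  // -p prints the captured pane to stdout, which the harness feeds the model.
  return ["tmux", "capture-pane", "-t", session, "-p"];
}

console.log(sendKeysArgv("agent0", "ls -la"));
console.log(capturePaneArgv("agent0"));
```

In a real harness these argv arrays would go through something like `child_process.spawn`; the loop is literally "send keystrokes, capture pane, show it to the model, repeat".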
00:18:13Two theses, based on all of these findings.
00:18:16We are in the messing around and finding out stage.
00:18:18And nobody has any idea what the perfect coding agent should
00:18:21look like or what the perfect coding harness should look like.
00:18:23We're trying both minimalism and going full spaceship swarms
00:18:27and teams of agents and no control and full autonomy
00:18:30and whatever.
00:18:31I think that's not done yet.
00:18:33We haven't answered the question what this
00:18:35should look like ideally and what will become the industry
00:18:37standard.
00:18:38And the second thing is we need better ways
00:18:40to mess around with coding agents.
00:18:42That is, we need them to be able to modify themselves
00:18:47and become malleable.
00:18:48So we can quickly experiment with ideas
00:18:50and see if this is something we can make like an industry
00:18:53standard, a new workflow that we probably all are going to adopt.
00:18:58So the basic idea was--
00:18:59and it's very simple, not rocket science--
00:19:01strip away everything and build a minimal extensible core.
00:19:05There are some creature comforts.
00:19:06It's not a blank slate.
00:19:09So that's pi.
00:19:10And the general motto is adapt your coding agent
00:19:13to your needs instead of the other way around.
00:19:16It comes with four packages, an AI package, which is basically
00:19:21just a simple abstraction over multiple providers, which
00:19:24all speak different transport protocols.
00:19:27So it's very easy to talk to all the providers
00:19:29and switch between them in the same context or same session.
00:19:34The agent core, which is just a generalized agent
00:19:36loop with tool invocations, validation,
00:19:38and so on and so forth.
00:19:39And a TUI package, a terminal user interface
00:19:42that's like 600 lines of code and works really well,
00:19:47surprisingly, because it wasn't written by a clanker.
00:19:51And the coding agent itself, which is both an SDK
00:19:54that you can use in the headless mode
00:19:57or a full terminal user interface coding agent.
00:20:02This is the entire system prompt.
00:20:05There's nothing more there compared to other coding
00:20:08[INAUDIBLE] system prompts.
00:20:10That's in tokens.
00:20:13It turns out frontier models are heavily RL-trained to know
00:20:16what the coding agent is.
00:20:18So why do you keep telling them that they're a coding agent
00:20:21and how they should do coding tasks, right?
00:20:27YOLO by default, why is that?
00:20:30Most coding agent harnesses at the moment have two modes.
00:20:33Either agent can do whatever it wants
00:20:36or agent gets to ask you, do you really
00:20:40want to delete this file?
00:20:41Do you really want to list the files in this directory,
00:20:44and so on and so forth?
00:20:44And there's different shades of gray here.
00:20:47But at the end of the day, it boils down to the user
00:20:49needs to approve an action by the agent.
00:20:52And then we are safe.
00:20:53And I think that's wrong because that leads to fatigue.
00:20:55And people will either turn it off entirely, YOLO mode,
00:20:58or just sit there and type enter without reading anything.
00:21:01So I don't think that's a solution.
00:21:02Containerization is also not a solution
00:21:04if you're worried about exfiltration of data
00:21:06and prompt injections.
00:21:07But I think that's the only thing that you--
00:21:10I think that's the best basis compared to guardrails
00:21:14like approval dialogues.
00:21:17It only has four tools, read a file, write a file,
00:21:19edit a file, and Bash.
00:21:21Bash is all you need.
00:21:22What's not in there?
00:21:23No MCP, no subagents, no plan mode, no background
00:21:25Bash, no built-in to-dos.
00:21:26Here's what you can do instead.
00:21:28For MCP, use CLI tools plus skills,
00:21:30or build an extension, which we will see in a bit.
00:21:34No subagents, why?
00:21:35Because they're not observable.
00:21:36Instead, use tmux and spawn the agent again.
00:21:41You have full control over the agent's outputs and inputs
00:21:44and can see everything that's happening in the subagent.
00:21:48Interestingly enough, Claude Code's
00:21:50team mode now does exactly this, basically, as well.
00:21:55No plan mode, write a plan.md file.
00:21:57You have a persisted artifact instead
00:21:59of some janky UI that doesn't really
00:22:02fit into your terminal viewport.
00:22:04And you can reuse it across multiple sessions.
00:22:07No background Bash, don't need it, we have tmux.
00:22:09It's the same thing.
00:22:11And no built-in to-dos, write a todo.md.
00:22:13Same thing.
00:22:14Or build all of this yourself the way you like it.
00:22:17And this is what Pi allows you to do, by being super extensible.
00:22:21So you can extend tools-- custom tools.
00:22:22You can give the LLM tools that you define.
00:22:26I think no other coding agent harness
00:22:28currently offers that, unless you fork OpenCode.
00:22:31You don't need to here.
00:22:32You just write a simple TypeScript file,
00:22:34and it gets loaded automatically.
00:22:37You can also write custom UI.
00:22:39Skills, obviously, then prompt templates, themes.
00:22:43And you can bundle all of that up, put it on npm or Git,
00:22:46and install it with a single command, which is very nice.
00:22:49And everything hot reloads.
00:22:51So I developed my own extensions that
00:22:53are project- or task-specific in Pi inside the project.
00:22:59And as the agent modifies the extension, I just reload.
00:23:05And it immediately updates all of the running code,
00:23:10which is very nice.
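As a sketch of what such a hot-loadable extension might look like: everything below (ToolDef, ExtensionHost, registerTool) is my illustration of the idea, not Pi's actual extension API.

```typescript
// Hypothetical sketch of a custom tool extension. The shapes here are
// illustrative; Pi's real API surface will differ.
type ToolDef = {
  name: string;
  description: string; // what the LLM sees when deciding to call the tool
  parameters: Record<string, { type: string; description: string }>;
  execute: (args: Record<string, string>) => Promise<string>;
};

class ExtensionHost {
  private tools = new Map<string, ToolDef>();
  registerTool(tool: ToolDef) { this.tools.set(tool.name, tool); }
  async call(name: string, args: Record<string, string>): Promise<string> {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    return tool.execute(args);
  }
}

// A tiny project-specific tool of the kind you'd keep next to the repo.
const todoItems: string[] = [];
const host = new ExtensionHost();
host.registerTool({
  name: "todo_add",
  description: "Append an item to the project's TODO list",
  parameters: { item: { type: "string", description: "the todo text" } },
  execute: async ({ item }) => {
    todoItems.push(item); // in-memory stand-in for appending to todo.md
    return `added: ${item}`;
  },
});

host.call("todo_add", { item: "wire up compaction" }).then(console.log);
```

With hot reload, the agent can edit a file shaped like this and the new tool becomes callable in the same session.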
00:23:11And in practice, that means you can do custom compaction.
00:23:14I think that's one of the things that people should experiment
00:23:16more, because all of the compaction implementations
00:23:19currently are not good.
00:23:21Permission gates, you can easily implement them
00:23:23in 50 lines of code, and kind of cover
00:23:24what all the other agent harnesses do if you want that.
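Fifty lines is plausible because a permission gate is mostly pattern matching on the command before it runs. A sketch of the idea; the patterns and names are mine:

```typescript
// Sketch of a permission gate: intercept bash commands and require
// confirmation for risky ones. Patterns here are illustrative examples.
type Decision = "allow" | "ask";

const riskyPatterns: RegExp[] = [
  /\brm\s+-rf\b/,               // recursive delete
  /\bgit\s+push\s+--force\b/,   // history rewrite on a remote
  /\bcurl\b.*\|\s*(ba)?sh\b/,   // piping the internet into a shell
];

function gateCommand(command: string): Decision {
  return riskyPatterns.some((p) => p.test(command)) ? "ask" : "allow";
}

console.log(gateCommand("ls -la"));              // allow
console.log(gateCommand("rm -rf node_modules")); // ask
```

In a real extension, an "ask" result would pop a confirmation prompt in the TUI before the bash tool proceeds.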
00:23:27Custom providers, register proxies or self-hosted models.
00:23:31Don't care.
00:23:32You don't need me to do this for you.
00:23:33You can do this, and actually, your clanker can do it for you.
00:23:37Or overwrite any built-in tool.
00:23:38Modify how read, write, edit, and bash work.
00:23:41Don't care.
00:23:42I have a version of read, write, edit, and bash
00:23:43that works through SSH on a remote machine.
00:23:47For me, that took five minutes to implement, but it works.
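Making bash run remotely is mostly a matter of wrapping the command before spawning it. A sketch of that wrapping (the function names are mine; Pi's real tool-override hooks will look different):

```typescript
// Sketch: turn a local bash invocation into a remote one over SSH.
function shellQuote(s: string): string {
  // Single-quote for POSIX shells, escaping embedded single quotes.
  return `'${s.replace(/'/g, `'\\''`)}'`;
}

function buildRemoteCommand(host: string, command: string): string[] {
  // argv form, ready for something like child_process.spawn("ssh", ...).
  return ["ssh", host, `bash -lc ${shellQuote(command)}`];
}

// e.g. run "ls -la /tmp" on a machine called devbox
console.log(buildRemoteCommand("devbox", "ls -la /tmp"));
```

The read/write/edit overrides are the same idea with `cat`, `tee`, and a diff applied over the wire.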
00:23:51And you have full TUI access, so you can actually
00:23:54write entirely custom UI in the coding agent.
00:23:58When Claude Code shipped something, by the way, it took five minutes for somebody
00:24:02to replicate that in Pi with more features.
00:24:05Pi Messenger, I have no idea what it's doing,
00:24:07but apparently, it's like a chat room for multiple Pi agents
00:24:10that then communicate, which then has custom UI.
00:24:13We can look what they're doing, and yeah, it just works.
00:24:18Or Pi Mess, if you're bored, just play a game
00:24:23while the agent is running, right?
00:24:24You can do that.
00:24:25Or Pi Annotate, open up the website
00:24:28you're working on currently, and annotate stuff in the front end,
00:24:31and give feedback to the agent directly in line.
00:24:35Feed it back into the context, have it modify the thing.
00:24:39Or something I use is File Switch It.
00:24:42I don't want to switch over to an IDE or editor.
00:24:43I just want to quickly look at the file that's been modified.
00:24:46So all of this is extensions.
00:24:48None of this is built in, and it takes people
00:24:50usually a couple of minutes to an afternoon
00:24:52to build all of this the way they want it to.
00:24:56PyWavic-- also, I don't know what it's doing.
00:25:00Pi also comes with a tree structure.
00:25:01I'm not going to explain that.
00:25:03Just look at pi.dev.
00:25:04Your session is a tree, not a linear list of chats.
00:25:07So you can basically tell the agent:
00:25:09read all the files in the directory,
00:25:11summarize this, go back to the root of the conversation,
00:25:14take the summary with me, and do the actual work.
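The tree idea, sketched as data: branches hold the token-heavy exploration, and the context sent to the model is only the path from the root to the current node. Illustrative, not Pi's internal representation:

```typescript
// Sketch of a session-as-tree: side branches hold noisy work; only the
// distilled summary comes back to the main line of the conversation.
type SessionNode = { content: string; parent: SessionNode | null; children: SessionNode[] };

function newNode(content: string, parent: SessionNode | null = null): SessionNode {
  const node: SessionNode = { content, parent, children: [] };
  if (parent) parent.children.push(node);
  return node;
}

// The model's context is just the root-to-cursor path, so siblings'
// token-heavy exploration never pollutes the current branch.
function contextFor(node: SessionNode): string[] {
  const path: string[] = [];
  for (let n: SessionNode | null = node; n; n = n.parent) path.unshift(n.content);
  return path;
}

const root = newNode("user: refactor the parser");
const branch = newNode("read src/*.ts (40k tokens of file dumps)", root);
newNode("assistant: summary of exploration", branch);
// Back at the root, carry only the distilled summary forward:
const resumed = newNode("summary: parser lives in parse.ts, 3 call sites", root);
console.log(contextFor(resumed).length); // 2: root message + summary, no dumps
```

A linear chat log can't do this: once the file dumps are in the history, every later turn pays for them.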
00:25:19Nothing is injected behind your back.
00:25:22Agents, skills, full cost tracking.
00:25:24A lot of harnesses don't do this.
00:25:26OpenCode doesn't do it well.
00:25:29HTML export, JSON format, headless JSON stream, blah, blah.
00:25:33Does it actually work?
00:25:34Well, Terminal-Bench.
00:25:35Let me zoom in here.
00:25:36I can't.
00:25:37This is amazing.
00:25:38Here's Pi right behind Terminus 2, using Claude Opus 4.5.
00:25:45That was back in October where py didn't even have compaction.
00:25:49Demo time, skipping that, right against the clankers
00:25:51because they're breaking open source.
00:25:54If you're associated with this guy's project,
00:25:56then you will have hundreds of people coming from OpenClaw
00:26:02to your repository and span you with clanker, fill, fence law.
00:26:06So I had to invent a couple of measures.
00:26:09I invented OSS vacation.
00:26:11So I just closed issues and PRs for a couple of weeks
00:26:14and work on things on my own.
00:26:16Anything that's important will be reported later on anyways
00:26:20or in the Discord.
00:26:21And then I also implemented a custom access kind of scheme
00:26:26where I have a markdown file in the repository.
00:26:28If somebody opens a PR without their account name
00:26:32being in that markdown file, the PR gets auto-closed.
00:26:34I don't care.
00:26:35First, introduce yourself in a human voice via an issue.
00:26:39Write an issue that's not longer than the display,
00:26:42because everything else is clanker slop, probably.
00:26:45And once you did that, I'm happy to "looks good to me" you.
00:26:47So you get into that file and can now submit PRs
00:26:50to the repository.
00:26:51All I'm asking is human verification.
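The auto-close scheme is just a membership check against a file in the repo. A sketch of that check; the file format and names here are my assumptions, not the actual implementation:

```typescript
// Sketch of the human-verification gate: a PR stays open only if its author
// appears in an allowlist markdown file kept in the repository.
function parseAllowlist(markdown: string): Set<string> {
  const names = new Set<string>();
  for (const line of markdown.split("\n")) {
    // Expect entries like "- @username"; ignore everything else.
    const m = line.match(/^-\s*@?([A-Za-z0-9-]+)\s*$/);
    if (m) names.add(m[1].toLowerCase());
  }
  return names;
}

function shouldAutoClose(prAuthor: string, allowlist: Set<string>): boolean {
  return !allowlist.has(prAuthor.toLowerCase());
}

const allowlist = parseAllowlist(`# Verified humans
- @alice
- @bob
`);
console.log(shouldAutoClose("alice", allowlist));            // false: PR stays open
console.log(shouldAutoClose("clanker-bot-9000", allowlist)); // true: auto-close
```

Wired into a CI job on the `pull_request` event, this is the whole gate; the hard part is the social step of getting into the file, not the code.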
00:26:53And Mitchell from Ghostty then took this and built
00:26:57a project called Vouch, which is more easily applicable
00:27:00to your own open source repositories.
00:27:02And that is Pi.
00:27:03Go forth and try it.
00:27:05That's it for me.
00:27:06[APPLAUSE]
00:27:07[MUSIC PLAYING]

Key Takeaway

A minimal, malleable coding-agent core (Pi's TUI is roughly 600 lines) can outperform feature-heavy 'spaceship' harnesses by prioritizing context transparency, prompt cache stability, and user-defined tool extensibility.

Highlights

Most coding agents inject hidden text into LLM prompts behind the user's back, causing unpredictable behavior when tool vendors update their internal prompts.

The Terminus agent performs at the top of the Terminal Bench leaderboard using only a TMUX session and raw keystrokes, proving that complex 'spaceship' features are often unnecessary for high performance.

OpenCode's session compaction prunes all tool results before the last 40,000 tokens on every turn, which negatively impacts prompt caching efficiency and infrastructure stability.

Pi consists of a minimal 600-line terminal user interface and a system prompt of only 457 tokens because frontier models are already heavily RL-trained to understand coding tasks.

Custom extensions in Pi, such as SSH-based file tools or custom UI for front-end annotation, can be implemented in as little as five minutes using TypeScript.

The 'Vouch' project and human verification files in repositories prevent 'clanker spam' by auto-closing pull requests from unverified AI-automated accounts.

Timeline

The Evolution and Failure of Coding Agent UX

  • Coding tools evolved from simple ChatGPT copy-pasting in 2023 to agentic search models like Claude Code in 2025.
  • Feature bloat has turned leading tools into 'spaceships' where users only understand 10% of the functionality.
  • The shift toward 'vibe coding' prioritizes automated features over the developer's need for predictable and observable workflows.

Early tools like the original GitHub Copilot often recited GPL code or failed to integrate with complex logic. Claude Code revolutionized the genre by using reinforcement learning to let models explore codebases ad-hoc with bash tools. However, as these tools added more features, they became less predictable, leading to a loss of control for experienced engineers who value simple, stable tools.

Technical Debt and Infrastructure Abuse in Open Source Agents

  • Hidden prompt injections by tool vendors break existing developer workflows during silent updates.
  • Enabling Language Server Protocol (LSP) feedback during active editing sessions confuses LLMs because code rarely compiles mid-edit.
  • Architecture flaws in existing open-source harnesses lead to security vulnerabilities like default remote code execution.

Tools like OpenCode suffer from architectural issues where every message becomes a separate JSON file on disk, indicating a lack of long-term planning. Forcing LSP feedback onto an agent while it is in the middle of a multi-file edit provides 'false' error signals that cause the model to give up. Furthermore, constant context manipulation and inefficient token pruning strategies create friction between tool developers and LLM providers like Anthropic.

The Minimalist Performance of Terminal Bench

  • The Terminus agent achieves top-tier results on the Terminal Bench leaderboard using only a bare-bones TMUX interface.
  • Minimalist interfaces prove that sub-agents, web search, and complex file tools are not requirements for agentic success.
  • Current industry standards for the 'perfect' coding harness remain undefined as the field oscillates between autonomy and minimalism.

Terminal Bench evaluates agents on 82 diverse tasks ranging from window setup to Monte Carlo simulations. The success of the Terminus agent, which only sends keystrokes and reads VT sequences, challenges the necessity of 'spaceship' features. This evidence suggests that the industry is still in a 'messing around and finding out' stage where the ideal balance of human-agent interaction is yet to be established.

Pi: A Malleable and Extensible Coding Architecture

  • Pi utilizes a minimal core consisting of an AI abstraction layer, an agent loop, and a 600-line TUI.
  • The system prompt is stripped to 457 tokens to leverage the model's inherent training rather than redundant instructions.
  • The architecture replaces built-in features with user-defined TypeScript extensions that hot-reload instantly.

Instead of built-in 'plan modes' or 'to-do' trackers, Pi encourages users to use persistent Markdown files or TMUX for sub-agent observation. This approach ensures all inputs and outputs are visible to the human operator. Extensions like PyAnnotate or SSH-based file tools allow the agent to be adapted to specific project environments without forking the entire codebase.

Protecting Open Source from AI Automation Spam

  • Automated 'clanker' accounts frequently spam open-source repositories with AI-generated issues and pull requests.
  • Human verification via 'Vouch' or account whitelisting prevents repository exhaustion.
  • Mandating a human voice in issue descriptions filters out low-effort automated contributions.

The rise of coding agents has led to a surge in 'clanker filth'—automated PRs that overwhelm maintainers. Implementing an 'OSS vacation' or auto-closing PRs from accounts not listed in a verification markdown file restores order. This strategy forces contributors to introduce themselves as humans before their code is considered, preserving the integrity of the development process.
