No Vibes Allowed: Solving Hard Problems in Complex Codebases – Dex Horthy, HumanLayer

AI Engineer

Transcript

00:00:00(upbeat music)
00:00:02- Hi everybody, how y'all doing?
00:00:23It's exciting, I'm Dex.
00:00:25As they did in the great intro,
00:00:27I've been hacking on agents for a while.
00:00:29Our talk, 12-Factor Agents, at AI Engineer in June
00:00:32was one of the top talks of all time.
00:00:34I think top eight or something,
00:00:35one of the best ones from AI Engineer in June.
00:00:38May or may not have said something about context engineering.
00:00:41Why am I here today, what am I here to talk about?
00:00:44I wanna talk about one of my favorite talks
00:00:46from AI Engineer in June,
00:00:47and I know we all got the update from Igor yesterday,
00:00:49but they wouldn't let me change my slides,
00:00:50so this is gonna be about what Igor talked about in June.
00:00:54Basically, that they surveyed 100,000 developers
00:00:56across all company sizes,
00:00:58and they found that most of the time
00:01:00you use AI for software engineering,
00:01:01you're doing a lot of rework, a lot of codebase churn.
00:01:04And it doesn't really work well for complex tasks,
00:01:07brownfield code bases.
00:01:08And you can see in the chart, basically,
00:01:10you are shipping a lot more,
00:01:11but a lot of it is just reworking the slop
00:01:14that you shipped last week.
00:01:15So, and then the other side, right, was that
00:01:18if you're doing greenfield, little Vercel dashboard,
00:01:21something like this, then it's gonna work great.
00:01:25If you're going to go into a 10-year-old Java codebase,
00:01:28maybe not so much.
00:01:29And this matched my experience.
00:01:30Personally, and talking to a lot of founders
00:01:32and great engineers, too much slop,
00:01:35tech debt factories, it's not gonna work for our codebase.
00:01:37Maybe someday, when the models get better.
00:01:40But that's what context engineering is all about.
00:01:42How can we get the most out of today's models?
00:01:44How do we manage our context window?
00:01:46So we talked about this in August.
00:01:48I have to confess something.
00:01:49The first time I used Claude Code, I was not impressed.
00:01:53It was like, okay, this is a little bit better,
00:01:54I get it, I like the UX.
00:01:56But since then, we as a team figured something out
00:01:59that we were actually able to get
00:02:01two to three X more throughput.
00:02:02And we were shipping so much that we had no choice
00:02:06but to change the way we collaborated.
00:02:07We rewired everything about how we build software.
00:02:11It was a team of three, it took eight weeks,
00:02:12it was really frickin' hard.
00:02:14But now that we solved it, we're never going back.
00:02:16This is the whole no slop thing.
00:02:18I think we got somewhere with this.
00:02:20Went super viral on Hacker News in September.
00:02:23We have thousands of folks who have gone onto GitHub
00:02:25and grabbed our research plan implement prompt system.
00:02:28So the goals here, which we kind of backed our way into,
00:02:31we need AI that can work well in brownfield code bases.
00:02:35That can solve complex problems.
00:02:38No slop, right, no more slop.
00:02:40And we had to maintain mental alignment.
00:02:42I'll talk a little bit more about
00:02:43what that means in a minute.
00:02:44And of course, we want to spend, with everything,
00:02:46we want to spend as many tokens as possible.
00:02:47What we can offload meaningfully to the AI
00:02:50is really, really important.
00:02:52Super high leverage.
00:02:53So this is advanced context engineering for coding agents.
00:02:56I'll start with kind of like framing this.
00:02:58The most naive way to use a coding agent
00:03:01is to ask it for something and then tell it why it's wrong
00:03:03and re-steer it and ask and ask and ask
00:03:05until you run out of context or you give up or you cry.
00:03:09We can be a little bit smarter about this.
00:03:11Most people discover this pretty early on
00:03:13in their AI like exploration is that it might be better
00:03:17if you start a conversation and you're off track
00:03:21that you just start a new context window.
00:03:24You say, okay, we went down that path.
00:03:25Let's start again.
00:03:26Same prompt, same task.
00:03:27But this time, we're gonna go down this path.
00:03:29And like don't go over there 'cause that doesn't work.
00:03:31So how do you know when it's time to start over?
00:03:34If you see this,
00:03:37(audience laughing)
00:03:39it's probably time to start over, right?
00:03:41This is what Claude says when you tell it it's screwing up.
00:03:45So we can be even smarter about this.
00:03:47We can do what I call intentional compaction.
00:03:50And this is basically whether you're on track or not,
00:03:53you can take your existing context window
00:03:56and ask the agent to compress it down into a markdown file.
00:03:59You can review this, you can tag it.
00:04:00And then when the new agent starts,
00:04:02it gets straight to work instead of having to do
00:04:04all that searching and codebase understanding
00:04:05and getting caught up.
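[Editor's note: the intentional-compaction loop described here can be sketched roughly as follows. `ask_agent`, the section headings, and the file name are illustrative stand-ins, not the open-source prompts from the talk.]

```python
# A minimal sketch of intentional compaction: rather than letting the
# context window fill up, ask the agent to compress everything learned
# so far into a markdown file that seeds a fresh context window.

COMPACTION_PROMPT = (
    "Compress everything we've learned so far into a markdown file with "
    "these sections: ## Goal, ## Relevant files (with line numbers), "
    "## What we learned, ## Next steps. "
    "Omit raw tool output, test logs, and dead ends."
)

def compact(messages, ask_agent, path="progress.md"):
    """Compress a long conversation into a reviewable file, then return
    a fresh, minimal context seeded from that file."""
    summary = ask_agent(messages + [{"role": "user", "content": COMPACTION_PROMPT}])
    with open(path, "w") as f:
        f.write(summary)  # the human reviews/tags this file before continuing
    # The next session starts from the compact artifact, not the full history.
    return [{"role": "user", "content": summary}]
```

The key design point is the file on disk: the compaction is a reviewable artifact, so a human can catch a wrong turn before it poisons the next context window.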
00:04:07What goes into compaction?
00:04:09The question is like what takes up space
00:04:11in your context window?
00:04:13So it's looking for files, it's understanding code flow,
00:04:17it's editing files, it's test and build output.
00:04:20And if you have one of those MCPs that's dumping JSON
00:04:22and a bunch of UUIDs in your context window,
00:04:25God help you.
00:04:26So what should we compact?
00:04:28I'll get more on the specifics here,
00:04:30but this is a really good compaction.
00:04:31This is exactly what we're working on,
00:04:33the exact files and line numbers
00:04:34that matter to the problem that we're solving.
00:04:37Why are we so obsessed with context?
00:04:39Because, and I actually got roasted on YouTube for this one,
00:04:42LLMs are not pure functions 'cause they're nondeterministic,
00:04:45but they are stateless.
00:04:46And the only way to get better performance out of an LLM
00:04:49is to put better tokens in
00:04:51and then you get better tokens out.
00:04:52And so every turn of the loop,
00:04:53Claude is picking the next tool
00:04:55or any coding agent is picking the next tool.
00:04:56And there could be hundreds of right next steps
00:04:58and hundreds of wrong next steps.
00:05:00But the only thing that influences what comes out next
00:05:03is what is in the conversation so far.
00:05:05So we're gonna optimize this context window
00:05:07for correctness, completeness, size,
00:05:10and a little bit of trajectory.
00:05:11And the trajectory one is interesting
00:05:12because a lot of people say,
00:05:13well, I told the agent to do something
00:05:16and it did something wrong.
00:05:17So I corrected it and I yelled at it
00:05:18and then it did something wrong again.
00:05:20And then I yelled at it.
00:05:21And then the LLM is looking at this conversation,
00:05:23says, okay, cool, I did something wrong
00:05:24and the human yelled at me
00:05:25and I did something wrong and the human yelled at me.
00:05:26So the next most likely token in this conversation
00:05:29is I better do something wrong
00:05:31so the human can yell at me again.
00:05:33So be mindful of your trajectory.
00:05:35If you were gonna invert this,
00:05:36the worst thing you can have is incorrect information,
00:05:39then missing information, and then just too much noise.
00:05:42If you like equations, there's a dumb equation
00:05:44if you wanna think about it this way.
00:05:47Geoff Huntley did a lot of research on coding agents.
00:05:51He put it really well.
00:05:51Just the more you use the context window,
00:05:53the worse outcomes you'll get.
00:05:55This leads to a concept.
00:05:56It's a very, very academic concept I call the dumb zone.
00:05:59So you have your context window.
00:06:01You have 168,000 tokens roughly.
00:06:03Some are reserved for output and compaction.
00:06:05This varies by model,
00:06:07but we use Claude Code as an example here.
00:06:09Around the 40% line is where you're gonna start
00:06:10to see some diminishing returns depending on your task.
00:06:14If you have too many MCPs in your coding agents,
00:06:17you are doing all your work in the dumb zone
00:06:18and you're never gonna get good results.
00:06:21People talked about this.
00:06:21I'm not gonna talk about that one.
00:06:22Your mileage may vary.
00:06:2340% is like, it depends on how complex the task is,
00:06:26but this is kind of a good guideline.
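[Editor's note: as a toy illustration of that guideline, with the rough numbers from the talk:]

```python
# Rough numbers from the talk: ~168k usable tokens in Claude Code's window
# (after output/compaction reserve), with diminishing returns starting
# somewhere around the 40% line. Both vary by model and task complexity.

USABLE_WINDOW = 168_000
DUMB_ZONE_THRESHOLD = 0.40

def in_dumb_zone(tokens_used, window=USABLE_WINDOW, threshold=DUMB_ZONE_THRESHOLD):
    """Past this utilization, consider compacting and starting a fresh
    context instead of pushing on."""
    return tokens_used / window > threshold
```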
00:06:28So back to compaction, or as I will call it from now on,
00:06:31cleverly avoiding the dumb zone.
00:06:33We can do subagents.
00:06:37If you have a front-end subagent and a back-end subagent
00:06:39and a QA subagent and a data scientist subagent, please stop.
00:06:44Subagents are not for anthropomorphizing roles.
00:06:47They are for controlling context.
00:06:49And so what you can do is if you wanna go find
00:06:51how something works in a large code base,
00:06:53you can steer the coding agent to do this
00:06:55if it supports subagents,
00:06:56or you can build your own subagent system,
00:06:58but basically you say, hey, go find how this works.
00:07:00And it can fork out a new context window
00:07:03that is gonna go do all that reading and searching
00:07:05and finding and reading entire files
00:07:07and understanding the code base,
00:07:09and then just return a really, really succinct message
00:07:13back up to the parent agent of just like,
00:07:14hey, the file you want is here.
00:07:17Parent agent can read that one file and get straight to work.
00:07:20And so this is really powerful.
00:07:22If you wield these correctly,
00:07:23you can get good responses like this,
00:07:25and then you can manage your context really, really well.
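[Editor's note: the subagent-for-context-control pattern can be sketched like this; the helper names and the example finding are hypothetical.]

```python
def find_with_subagent(question, ask_agent):
    """Fork a throwaway context window, do the heavy searching and file
    reading there, and return only a short finding."""
    sub_context = [{
        "role": "user",
        "content": question + "\nReply with only the file paths and line "
                              "numbers that matter, nothing else.",
    }]
    return ask_agent(sub_context)

def parent_turn(parent_context, question, ask_agent):
    # The parent's context grows by one short line, not by the thousands
    # of tokens the subagent burned while reading the codebase.
    finding = find_with_subagent(question, ask_agent)
    parent_context.append({"role": "user", "content": f"Finding: {finding}"})
    return parent_context

# Usage with a stubbed agent that answers with a file location:
ctx = parent_turn([], "How does login work?",
                  lambda msgs: "src/auth/session.py:120-180")
```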
00:07:29What works even better than subagents
00:07:30or a layer on top of subagents
00:07:32is a workflow I call frequent intentional compaction.
00:07:35We're gonna talk about research plan implement in a minute,
00:07:37but the point is you're constantly
00:07:39keeping your context window small.
00:07:41You're building your entire workflow around context management
00:07:45so it comes in three phases, research, plan, implement,
00:07:48and we're gonna try to stay in the smart zone the whole time.
00:07:51So the research is all about understanding
00:07:53how the system works, finding the right file,
00:07:55staying objective.
00:07:56Here's a prompt you can use to do research.
00:07:58Here's the output of a research prompt.
00:08:00These are all open source.
00:08:01You can go grab them and play with them yourself.
00:08:04Planning, you're gonna outline the exact steps.
00:08:06You're gonna include file names and line snippets.
00:08:08You're gonna be very explicit about how we're gonna test things
00:08:10after every change.
00:08:11Here's a good planning prompt.
00:08:12Here's one of our plans.
00:08:13It's got actual code snippets in it.
00:08:16And then we're gonna implement.
00:08:17And if you've read one of these plans,
00:08:17you can see very easily how the dumbest model in the world
00:08:20is probably not gonna screw this up.
00:08:23So we just go through and we run the plan
00:08:24and we keep the context low.
00:08:26And the implement prompt, like I said,
00:08:27is the least exciting part of the process.
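[Editor's note: the three-phase workflow can be sketched as a pipeline with a human review gate between phases. `ask_agent` and `human_approves` are illustrative stand-ins; the real prompts are the open-source ones mentioned above.]

```python
def rpi(task, ask_agent, human_approves):
    """Research -> plan -> implement, each phase in a fresh context
    seeded only by the previous phase's compact artifact."""
    research = ask_agent(
        f"Research how the code relevant to '{task}' works. "
        "Cite exact files and line numbers. Stay objective."
    )
    if not human_approves(research):
        raise ValueError("bad research poisons everything downstream")
    plan = ask_agent(
        "Write a plan: exact steps, file names, code snippets, and how to "
        f"test after every change.\n\nResearch:\n{research}"
    )
    if not human_approves(plan):
        raise ValueError("a bad plan step can be 100 bad lines of code")
    return ask_agent(f"Implement this plan step by step:\n{plan}")
```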
00:08:30I wanted to put this into practice.
00:08:31So, is this working for us? I do a podcast with my buddy Vaibhav
00:08:34who's the CEO of a company called Boundary ML.
00:08:37And I said, "Hey, I'm gonna try to one shot a fix
00:08:39"to your 300,000 line Rust code base
00:08:41"for a programming language."
00:08:42And the whole episode goes into it.
00:08:45It's like an hour and a half.
00:08:46I'm not gonna talk through it right now,
00:08:47but we built a bunch of research documents
00:08:48and then we threw them out 'cause they were bad.
00:08:49And then we made a plan and we made a plan without research
00:08:51and with research and compared all the results.
00:08:53It's a fun time.
00:08:54That was Monday night.
00:08:55By Tuesday morning, we were on the show
00:08:57and the CTO had seen the PR
00:08:59and didn't realize I was doing it as a bit for a podcast.
00:09:03And basically was like, "Yeah, this looks good.
00:09:04"We'll get it in the next release."
00:09:05He was a little confused.
00:09:08Here's the plan.
00:09:09But anyways, yeah, confirmed.
00:09:12Works in brownfield codebases, and no slop.
00:09:14But I wanted to see if we could solve complex problems.
00:09:17So Vaibhav was still a little skeptical.
00:09:19I sat down, we sat down for like seven hours on a Saturday
00:09:21and we shipped 35,000 lines of code to BAML.
00:09:24One of the PRs got merged like a week later.
00:09:26I will say some of this is code gen.
00:09:28You update your behavior,
00:09:29all the golden files update and stuff,
00:09:31but we shipped a lot of code that day.
00:09:33He estimates that was about one to two weeks of work in seven hours.
00:09:36And so cool, we can solve complex problems.
00:09:40There are limits to this.
00:09:41I sat down with my buddy Blake.
00:09:42We tried to remove Hadoop dependencies from Parquet Java.
00:09:46If you know what Parquet Java is,
00:09:47I'm sorry for whatever happened to you
00:09:50to get you to this point in your career.
00:09:53It did not go well.
00:09:55Here's the plans, here's the research.
00:09:57At a certain point, we threw everything out
00:09:58and we actually went back to the whiteboard.
00:10:00We had to actually, once we had learned
00:10:01where all the foot guns were,
00:10:03we went back to, okay,
00:10:05how is this actually gonna fit together?
00:10:06And this brings me to a really interesting point
00:10:09that Jake's gonna talk about later.
00:10:11Do not outsource the thinking.
00:10:13AI cannot replace thinking.
00:10:14It can only amplify the thinking you have done
00:10:17or the lack of thinking you have done.
00:10:19So people ask, so Dex,
00:10:21this is spec-driven development, right?
00:10:23No, spec-driven development is broken.
00:10:27Not the idea, but the phrase.
00:10:30It's not well-defined.
00:10:33This is Birgitta from ThoughtWorks.
00:10:35And a lot of people just say spec
00:10:37and they mean a more detailed prompt.
00:10:39Does anyone remember this picture?
00:10:41Does anyone know what this is from?
00:10:43All right, that's a deep cut.
00:10:44There will never be a year of agents
00:10:46because of semantic diffusion.
00:10:47Martin Fowler coined that term in 2006.
00:10:49We come up with a good term with a good definition
00:10:52and then everybody gets excited
00:10:53and everybody starts using it to mean 100 things
00:10:56to 100 different people and it becomes useless.
00:10:59We had an agent as a person, an agent as a microservice,
00:11:02an agent as a chatbot, an agent as a workflow.
00:11:05And thank you, Simon.
00:11:06We're back to the beginning.
00:11:07An agent is just tools in a loop.
00:11:09This is happening to spec-driven dev.
00:11:11I used to have Sean's slide in the beginning of this talk,
00:11:15but it caused a bunch of people
00:11:15to focus on the wrong things.
00:11:17His thing of forget the code, it's like assembly now
00:11:19and you just focus on the markdown.
00:11:21Very cool idea, but people say spec-driven dev
00:11:24is writing a better prompt, a product requirements document.
00:11:26Sometimes it's using verifiable feedback loops
00:11:28and back pressure.
00:11:30Maybe it is treating the code like assembly,
00:11:32like Sean taught us.
00:11:34But for a lot of people, it's just using a bunch of markdown files
00:11:36while you're coding.
00:11:37Or my favorite, I just stumbled upon this last week,
00:11:39a spec is documentation for an open source library.
00:11:43So it's gone.
00:11:44Spec-driven dev is overhyped; it's useless now.
00:11:48It's semantically diffused.
00:11:49So I wanted to talk about four things
00:11:52that actually work today, the tactical and practical steps
00:11:55that we found working internally and with a bunch of users.
00:11:59We do the research, we figure out how the system works.
00:12:02You remember "Memento"?
00:12:03This is the best movie on context engineering,
00:12:05as Peter says.
00:12:07This guy wakes up, he has no memory,
00:12:09he has to read his own tattoos to figure out who he is
00:12:11and what he's up to.
00:12:12If you don't onboard your agents, they will make stuff up.
00:12:17And so this is your team, this is very simplified
00:12:19for most of you.
00:12:19Most of you have much bigger orgs than this.
00:12:21But let's say you want to do some work over here.
00:12:23One thing you could do is you could put onboarding
00:12:26into every repo.
00:12:27You put a bunch of context.
00:12:28Here's the repo, here's how it works.
00:12:29This is a compression of all the context in the code base
00:12:32that the agent can see ahead of time
00:12:34before actually getting to work.
00:12:36This is challenging because sometimes it gets too long.
00:12:39As your code base gets really big,
00:12:41you either have to make this longer
00:12:43or you have to leave information out.
00:12:45And so as you are reading through this,
00:12:48you're gonna read the context
00:12:49of this big five million line mono repo
00:12:52and you're gonna use all the smart zone
00:12:53just to learn how it works and you're not gonna be able
00:12:55to do any good tool calling in the dumb zone.
00:12:57So that's, you can shard this down the stack.
00:13:02This is what people call progressive disclosure.
00:13:04You could split this up, right?
00:13:05You could just put a file in the root of every repo
00:13:08and then at every level you have additional context
00:13:11based on if you're working here,
00:13:13this is what you need to know.
00:13:15We don't document the files themselves
00:13:17'cause they're the source of truth.
00:13:18But then as your agent is working,
00:13:19you pull in the root context
00:13:21and then you pull in the sub context.
00:13:22We won't talk about any specific tool,
00:13:23you could use CLAUDE.md for this,
00:13:24you can use hooks for this, whatever it is.
00:13:26But then you still have plenty of room in the smart zone
00:13:28'cause you're only pulling in what you need to know.
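[Editor's note: a hedged sketch of that hierarchical loading. The `CLAUDE.md` file name follows Claude Code's convention, but the loader itself is a toy, not any tool's actual implementation.]

```python
from pathlib import Path

def load_context(repo_root, working_dir, name="CLAUDE.md"):
    """Concatenate context files from the repo root down to the working
    directory, so deeper, more specific guidance comes last."""
    root = Path(repo_root).resolve()
    chunks = []
    d = Path(working_dir).resolve()
    while True:
        f = d / name
        if f.exists():
            chunks.append(f.read_text())
        if d == root or d == d.parent:  # stop at repo root (or filesystem root)
            break
        d = d.parent
    return "\n\n".join(reversed(chunks))  # root context first
```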
00:13:31The problem with this is that it gets out of date.
00:13:33And so every time you ship a new feature,
00:13:35you need to kind of like cache and validate
00:13:38and rebuild large parts of this internal documentation.
00:13:42And you could use a lot of AI
00:13:43and make it part of your process to update this.
00:13:46Why don't I ask a question?
00:13:48Between the actual code, the function names,
00:13:50the comments and the documentation,
00:13:51does anyone wanna guess what is on the y-axis of this chart?
00:13:57- Slop. - Slop.
00:13:58It's actually the amount of lies you can find
00:14:01in any one part of your code base.
00:14:03So you could make it part of your process to update this,
00:14:06but you probably shouldn't 'cause you probably won't.
00:14:08What we prefer is on-demand compressed context.
00:14:11So if I'm building a feature that relates to SCM providers
00:14:14and JIRA and linear,
00:14:15I would just give it a little bit of steering.
00:14:17I would say, hey, we're going over
00:14:18in like this part of the code base over here
00:14:21and a good research prompt or slash command
00:14:24might take your skill even,
00:14:27launch a bunch of subagents to take these vertical slices
00:14:30through the code base and then build up a research document
00:14:33that is just a snapshot of the actually true
00:14:35based on the code itself, parts of the code base that matter.
00:14:39We are compressing truth.
00:14:41Planning is leverage.
00:14:43Planning is about compression of intent.
00:14:45And in the plan, we're gonna outline the exact steps.
00:14:48Let's take our research and our PRD or our bug ticket
00:14:50or our whatever it is.
00:14:52We create a plan and we create a plan file.
00:14:54So we're compacting again.
00:14:55And I wanna pause and talk about mental alignment.
00:14:58Does anyone know what code review is for?
00:15:00Mental alignment, mental alignment.
00:15:05It is partly about finding bugs, making sure things are correct and stuff.
00:15:08But the most important thing is how do we keep everybody
00:15:10on the team on the same page
00:15:11about how the code base is changing and why?
00:15:14And can I read a thousand lines of Golang every week?
00:15:17Sorry, I can't read a thousand.
00:15:18Well, it's hard. I can do it.
00:15:19I just don't want to.
00:15:20And as our team grows, all the code gets reviewed.
00:15:23We don't not read the code.
00:15:24But I, as a technical leader on the team,
00:15:27I can read the plans and I can keep up to date.
00:15:29And that's enough.
00:15:30I can catch some problems early
00:15:32and I maintain understanding of how the system is evolving.
00:15:35Mitchell had this really good post
00:15:36about how he's been putting his AMP threads
00:15:38on his pull requests so that you can see not just,
00:15:41hey, here's a wall of green text in GitHub,
00:15:43but here's the exact steps, here's the prompts,
00:15:44and hey, I ran the build at the end and it passed.
00:15:46This takes the reviewer on a journey
00:15:49in a way that a GitHub PR just can't.
00:15:51And as you're shipping more and more
00:15:52than two to three times as much code,
00:15:54it's really on you to find ways to keep your team
00:15:57on the same page and show them here's the steps I did
00:16:00and here's how we tested it manually.
00:16:01Your goal is leverage, so you want high confidence
00:16:04that the model will actually do the right thing.
00:16:06I can't read this plan and know what actually
00:16:08is gonna happen and what code changes are gonna happen.
00:16:11So we've, over time, iterated towards our plans include
00:16:14actual code snippets of what's gonna change.
00:16:17So your goal is leverage.
00:16:18You want compression of intent
00:16:19and you want reliable execution.
00:16:22And so, I don't know, I have a physics background.
00:16:23We like to draw lines through the center of peaks and curves.
00:16:28As your plans get longer, reliability goes up,
00:16:30readability goes down.
00:16:31There's a sweet spot for you and your team
00:16:33and your code base, you should try to find it.
00:16:35Because when we review the research and the plans,
00:16:37if they're good, then we can get mental alignment.
00:16:40Don't outsource the thinking.
00:16:42I've said this before, this is not magic.
00:16:44There is no perfect prompt.
00:16:46It still will not work if you do not read the plan.
00:16:50So we built our entire process around you, the builder,
00:16:53are in back and forth with the agent,
00:16:55reading the plans as they're created.
00:16:56And then if you need peer review,
00:16:58you can send it to someone and say,
00:16:58hey, does this plan look right?
00:17:00Is this the right approach?
00:17:00Is this the right order to look at these things?
00:17:03Jake, again, wrote a really good blog post about
00:17:05the thing that makes research-plan-implement valuable
00:17:07is you, the human, in the loop, making sure it's correct.
00:17:11So if you take one thing away from this talk,
00:17:14it should be that a bad line of code is a bad line of code.
00:17:17And a bad part of a plan could be 100 bad lines of code.
00:17:22And a bad line of research, like a misunderstanding
00:17:25of how the system works and where things are,
00:17:27your whole thing's gonna be hosed.
00:17:29You're gonna be sending the model off in the wrong direction.
00:17:31And so when we're working internally and with users,
00:17:34we're constantly trying to move human effort and focus
00:17:36to the highest leverage parts of this pipeline.
00:17:39Don't outsource the thinking.
00:17:41Watch out for tools that just spew out
00:17:43a bunch of markdown files just to make you feel good.
00:17:45I'm not gonna name names here.
00:17:47Sometimes this is overkill.
00:17:49And the way I like to think about this is like,
00:17:51yeah, you don't always need a full research plan implement.
00:17:54Sometimes you need more, sometimes you need less.
00:17:56If you're changing the color of a button,
00:17:57just talk to the agent and tell them what to do.
00:18:00If it's a small feature, do a simple plan.
00:18:04If you're doing medium features across multiple repos,
00:18:07then do the research, then build a plan.
00:18:09Basically, the ceiling on the hardest problem you can solve
00:18:10goes up the more of this context engineering
00:18:13and compaction you're willing to do.
00:18:15And so if you're in the top right corner,
00:18:18you're probably gonna have to do more.
00:18:19A lot of people ask me, how do I know
00:18:21how much context engineering to use?
00:18:23It takes reps.
00:18:24You will get it wrong, you have to get it wrong
00:18:26over and over and over again.
00:18:27Sometimes you'll go too big, sometimes you'll go too small.
00:18:29Pick one tool and get some reps.
00:18:32I recommend against min-maxing across Claude and Codex
00:18:35and all these different tools.
00:18:36So I'm not a big acronym guy.
00:18:40We said spec-driven dev was broken.
00:18:42Research plan and implement I don't think will be the steps.
00:18:44The important part is compaction and context engineering
00:18:47and staying in the smart zone.
00:18:48But people are calling this RPI
00:18:50and there's nothing I can do about it.
00:18:52So just be wary, there is no perfect prompt,
00:18:55there is no silver bullet.
00:18:56If you really want a hype-y word,
00:18:58you can call this harness engineering,
00:19:00which is part of context engineering
00:19:01and it's how you integrate with the integration points
00:19:03on Codex, Claude, Cursor, whatever,
00:19:05how you customize your code base.
00:19:07So what's next?
00:19:11I think the coding agent stuff is actually
00:19:12gonna be commoditized.
00:19:13People are gonna learn how to do this and get better at it.
00:19:15And the hard part is gonna be how do you adapt your team
00:19:17and your workflow in the SDLC to work in a world
00:19:21where 99% of your code is shipped by AI.
00:19:24And if you can't figure this out, you're hosed.
00:19:26Because there's kind of a rift growing
00:19:27where staff engineers don't adopt AI
00:19:29because it doesn't make them that much faster
00:19:31and then junior and mid-level engineers use it a lot
00:19:33'cause it fills in skill gaps
00:19:35and then it also produces some slop
00:19:36and then the senior engineers hate it more and more
00:19:38every week because they're cleaning up slop,
00:19:40that was shipped by Cursor the week before.
00:19:42This is not AI's fault,
00:19:44this is not the mid-level engineer's fault.
00:19:46Cultural change is really hard
00:19:48and it needs to come from the top if it's gonna work.
00:19:50So if you're a technical leader of your company,
00:19:52pick one tool and get some reps.
00:19:54If you wanna help, we are hiring,
00:19:56we're building an agentic IDE to help teams of all sizes
00:19:59speed run the journey to 99% AI-generated code.
00:20:03We'd love to talk if you wanna work with us.
00:20:06Go hit our website, send us an email,
00:20:08come find me in the hallway.
00:20:09Thank you all so much for your energy.
00:20:11(audience applauds)
00:20:13(upbeat electronic music)

Description

It seems pretty well-accepted that AI coding tools struggle with real production codebases. At AI Engineer 2025 in June, the Stanford study on AI's impact on developer productivity found: a lot of the "extra code" shipped by AI tools ends up just reworking the slop that was shipped last week. Coding agents are great for new projects or small changes, but in large established codebases, they can often make developers less productive. The common response is somewhere between the pessimist "this will never work" and the more measured "maybe someday when there are smarter models." After several months of tinkering, we've found that you can get really far with today's models if you embrace core context engineering principles. This isn't another "10x your productivity" pitch. I tend to be pretty measured when it comes to interfacing with the AI hype machine. But we've stumbled into workflows that leave me with considerable optimism for what's possible. We've gotten Claude Code to handle 300k-LOC Rust codebases, ship a week's worth of work in a day, and maintain code quality that passes expert review. We use a family of techniques I call "frequent intentional compaction" - deliberately structuring how you feed context to the AI throughout the development process. In this talk, I'll share what we've learned since first sharing these techniques back in August, and some educated predictions on what's coming in the next 6-12 months for software engineers.
Speaker: twitter.com/dexhorthy

Timestamps:
00:00 intro: complex code
01:40 context engineering
02:53 advanced context
04:38 context obsession
05:55 dumb zone concept
07:26 context management
09:37 complex problem solved
10:45 semantic diffusion
12:14 onboarding agents
13:57 internal docs lies
15:03 mental alignment key
16:12 code snippet plans
17:38 don't outsource think
18:45 rpi: smart zone
19:46 cultural change hard

Hey - I'm Dex, and I'm hacking on getting AI coding agents to solve hard problems in complex codebases at HumanLayer. Before this I was working on APIs for agent orchestration and Human-in-the-Loop, and wrote the April 2025 essay "12 factor agents" that first coined the term Context Engineering. I've been coding since high school, when I built tools for NASA researchers to navigate the south pole of the moon. Enjoyer of tacos and burpees (not necessarily in that order).
