Transcript

00:00:00(upbeat music)
00:00:02- Welcome to the Future of AI Coding panel.
00:00:04Thank you for reading the memo
00:00:05that said you have to wear all black.
00:00:07(laughing)
00:00:09Okay, so I do want to cover a little bit of introductions.
00:00:12I know each of you in different ways,
00:00:15but maybe the audience, hopefully, doesn't quite.
00:00:17Matan, why don't you go first?
00:00:19What is Factory's position
00:00:24relative to the broader world of AI coding?
00:00:26- Yeah, so at Factory,
00:00:28our mission is to bring autonomy to software engineering.
00:00:32And what that means more concretely,
00:00:34we have built end-to-end software development agents
00:00:37called droids.
00:00:38They don't just focus on the coding itself,
00:00:40but really the entire end-to-end
00:00:42software development lifecycle.
00:00:43So things like documentation, testing, review,
00:00:48kind of all the ugly parts so that you can also do
00:00:51the more fun parts like the coding itself.
00:00:52And for the parts of the coding you don't want to do,
00:00:54you can also have the droids do that.
00:00:56So you build droids.
00:00:58You build droids.
00:00:59And OpenAI obviously needs no introduction,
00:01:02but your role on the Codex team,
00:01:05I saw you pop up on the Codex video.
00:01:08That's how I knew it was you working on it.
00:01:10But how do you think about Codex these days
00:01:13since it's expanded a lot?
00:01:14- Yeah, so earlier this year,
00:01:16we launched our first coding agent.
00:01:19I worked on Codex CLI,
00:01:21bringing the power of our reasoning models
00:01:23into people's computers.
00:01:26Then we released Codex cloud where you could actually
00:01:28distribute and delegate those tasks to work in the cloud.
00:01:31And over the last some odd months,
00:01:33we've been unifying these experiences.
00:01:34So they work as seamlessly as possible.
00:01:36So a lot of our focus is around how do we make
00:01:38the fundamentals, the primitives as useful as possible.
00:01:41We just released the Codex SDK at Dev Day.
00:01:43So I think one of the key directions we've been seeing
00:01:46is not just using coding or code executing agents for coding,
00:01:50but also for general purpose tasks.
00:01:52And so whether it was ChatGPT agent,
00:01:54which I worked on earlier this year
00:01:55that actually executes code in the background
00:01:57to accomplish some tasks,
00:01:59but starting to enable our developers to build on top of
00:02:02not just the reasoning models,
00:02:04but also things like sandboxing
00:02:05and all the other primitives that we built into Codex.
00:02:07- Awesome.
00:02:09V0?
00:02:09- Yeah, the goal of V0 is to enable developers
00:02:14to do preview-driven agentic programming.
00:02:16So today when you build web apps,
00:02:19you probably have an agent open,
00:02:21your IDE open, so some kind of code,
00:02:23and then a preview of what you're actually building.
00:02:25Usually you're running dev server.
00:02:26With V0, our goal is to allow you to just have
00:02:28an agent running and directly prompt against your running app.
00:02:32And that's how we think the future of DX is gonna pan out.
00:02:35- Okay, awesome.
00:02:36And everyone has different surface areas
00:02:38in which to access your coding agents.
00:02:40So I think one of the things we kinda wanna kick off with is
00:02:43how important is local versus cloud?
00:02:45You started local with cloud,
00:02:47you started cloud with local, you're cloud only for now.
00:02:50What's the split?
00:02:52Is everyone just gonna merge eventually?
00:02:55- Yeah, so maybe I can start there.
00:02:58So I think at the end of the day,
00:02:59the point of these agents is that
00:03:02they are as helpful as possible
00:03:04and they have a very similar silhouette
00:03:06to that of a human that you might work with.
00:03:08And you don't have local humans and remote humans
00:03:11that are like somehow, you know,
00:03:13this one only works in this environment,
00:03:15this one only works in that environment.
00:03:16Generally, humans can be helpful
00:03:18whether you're in a meeting with them
00:03:19and you come up with an idea
00:03:20or you're sitting like shoulder to shoulder at a computer.
00:03:24So I guess asymptotically, these need to become the same,
00:03:28but I think in the short term,
00:03:29remote is typically, what we're seeing is it's typically
00:03:34more useful for smaller tasks that you're more confident
00:03:37that you can delegate reliably.
00:03:39Whereas local is when you wanna be
00:03:41a little bit closer to the agent,
00:03:43it's maybe some larger task or some more complicated task
00:03:46that you're gonna kind of actively be monitoring.
00:03:49And you want it to be local so that if something goes wrong,
00:03:52you don't need to pull that branch back down
00:03:54and then start working on it,
00:03:55but instead you're right there to guide it.
00:03:57- Yeah, maybe I'm just greedy, but I want both.
00:04:00And I think having a modality to Matan's point
00:04:04where I like to think about what are the primary forms
00:04:07of collaboration that I'm used to
00:04:08and I enjoy with my coworkers.
00:04:11Often that starts something like a whiteboarding session
00:04:13and maybe we're just like jamming on something in a room.
00:04:17When we were building, I think a good example
00:04:19was AGENTS.md, which is our custom instructions file,
00:04:23intended to be generic across different coding agents.
00:04:26The way that it started was Romain and I
00:04:28were just in a room coming up with this idea.
00:04:31Then we just started whiteboarding and then took a photo
00:04:33and then kicked it off in Codex CLI locally,
00:04:36in just a workshop Next.js app that we could work on,
00:04:40went to lunch, came back.
00:04:41It had a good amount of the kind of core structure.
00:04:44And then from there, we were able to iterate
00:04:45a little bit more closely.
00:04:46So having that kind of pairing
00:04:48and kind of brainstorm style experience.
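(Editor's note: for background, the AGENTS.md mentioned here is just a markdown file of repo-specific instructions, checked into the repository, that coding agents read before working. The contents below are a hypothetical sketch; the headings and commands are illustrative, not a required schema.)

```markdown
# AGENTS.md

## Setup
- Install dependencies with `pnpm install`; run the app with `pnpm dev`.

## Conventions
- TypeScript strict mode; avoid `any`.
- Run `pnpm test` before considering a task done.

## Gotchas
- The preview proxy expects the dev server on port 3000.
```

Because it is plain markdown living alongside the code, the same file can be picked up by different coding agents.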
00:04:49And then I think to that second point
00:04:51about what kind of tasks you delegate to,
00:04:54I think historically smaller, narrowly scoped tasks
00:04:57where you're very clear about what the output is,
00:05:00is kind of the right modality
00:05:01if you're doing a fire and forget.
00:05:02But I think what we're starting to see with,
00:05:04we just launched GBD5 codecs about two months ago now.
00:05:08And I think one of the main differences
00:05:09is that it can actually do these longer running,
00:05:11more complex, more ambiguous tasks,
00:05:14as long as you are clear about what you want by the end.
00:05:16So it can work for hours at a time.
00:05:18I think that shift as models increase in capability
00:05:21will start to enable more kind of use cases.
00:05:24- Yeah.
00:05:24Yeah, I think there are three parts of making an agent work.
00:05:27There's the actual agent loop,
00:05:29there are the tool calls it makes,
00:05:30and then the resources upon which the tool calls need to act.
00:05:33Whether you go cloud or local first
00:05:35is based on where those resources are, right?
00:05:37If you're trying to work on a local file system,
00:05:39those are the resources you need to access.
00:05:41It totally makes sense
00:05:42that your agent loop should run locally, right?
00:05:44If you're accessing resources that typically exist in the cloud,
00:05:46you're pulling from GitHub,
00:05:47directly from a third-party repo of some kind,
00:05:51then it makes sense for your agent
00:05:52to start off in the cloud, right?
00:05:54Ultimately though, these resources exist in both places, right?
00:05:57Every developer expects an agent to be able to work
00:06:00both on the local file system,
00:06:02as well as on an open PR that might be hosted on GitHub.
00:06:04And so it doesn't really matter where you start, I think,
00:06:07everyone is converging at the same place,
00:06:08which is that your agent loop needs to be able to run anywhere,
00:06:11your tool calls need to be able to be streamed
00:06:13from the cloud locally or from a local backup to the cloud.
00:06:16And then it all depends on where the resources
00:06:18you actually want to act on are located.
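(Editor's note: the three parts named here, the agent loop, the tool calls, and the resources they act on, can be sketched in a few lines. This is a hypothetical illustration only; the names `ToolCall`, `run_agent_loop`, and `toy_planner` are invented for the sketch, not any vendor's API.)

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class ToolCall:
    name: str
    args: dict

def run_agent_loop(plan_next: Callable[[List[str]], Optional[ToolCall]],
                   tools: Dict[str, Callable[..., str]],
                   max_steps: int = 10) -> List[str]:
    """The loop itself is location-agnostic: the model picks a tool call,
    the harness executes it. Whether `tools` touch a local filesystem or a
    cloud-hosted repo is what decides where the loop should run."""
    transcript: List[str] = []
    for _ in range(max_steps):
        call = plan_next(transcript)
        if call is None:  # the model decides it is done
            break
        result = tools[call.name](**call.args)
        transcript.append(f"{call.name} -> {result}")
    return transcript

# Toy "model": request one file read, then stop.
def toy_planner(history: List[str]) -> Optional[ToolCall]:
    return ToolCall("read_file", {"path": "README.md"}) if not history else None

# Local resource: a filesystem-style tool. A cloud variant would swap in a
# tool that hits a hosted repo instead; the loop itself is unchanged.
local_tools = {"read_file": lambda path: f"contents of {path}"}
log = run_agent_loop(toy_planner, local_tools)
```

Swapping the `tools` dictionary is the whole "local versus cloud" decision in this framing; the tool results can equally be streamed between environments.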
00:06:20- Yeah, awesome.
00:06:22Okay, so we were chatting off stage
00:06:24and we were casting around for spicy questions and stuff.
00:06:27So I really liked this one and I think it's very topical.
00:06:31Do you guys generate slop for a living?
00:06:33Like are we in danger of potentially being in a hype bubble
00:06:40where we believe that this is like a sustainable path to AGI?
00:06:44- I mean, I think to start, you could say that one man's slop
00:06:48is another man's treasure, which to some extent might be true.
00:06:52Like, you know, if for example, you have, I don't know,
00:06:56like let's suppose you had a repo
00:06:58that had no documentation whatsoever.
00:07:00You could use, you know, many of the tools
00:07:04that we've been talking about to go and generate
00:07:06documentation for this repo.
00:07:08Now, is it gonna be the most like finely crafted
00:07:12piece of documentation?
00:07:13No, but is it providing alpha?
00:07:16Yes, in my mind, because having to like sift through
00:07:19some super old legacy code base that has no docs
00:07:22is a lot harder than looking through
00:07:23some somewhat sloppified documentation.
00:07:26And so I think the big thing is it's figuring out
00:07:29where you can use these tools for leverage
00:07:32and the degree to which it's slop,
00:07:35I think also kind of depends on how much guidance you provide.
00:07:38So if you just say like, build me an app that does this,
00:07:40like you're probably gonna get some generic slop app
00:07:43that does--
00:07:44- It's purple.
00:07:44- Yeah, blue, purple like fade, yeah.
00:07:48Whereas if instead you're like very methodical
00:07:50about exactly what it is that you want,
00:07:52you provided the tools to actually run tests
00:07:54to verify some of the capabilities that you're requesting.
00:07:58I think that makes it much more structured
00:08:00to a similar extent that if you were to, you know,
00:08:03hire some junior engineer onto your team
00:08:06and you just say, hey, go do this.
00:08:08Like they're probably gonna yield some like median outcome
00:08:11because they have no other specification to go off of.
00:08:14And it's pretty ambiguous like what you actually want done.
00:08:19- I think the key word there is leverage, right?
00:08:21Like what AI coding agents allow you to do
00:08:23is do 10X more than you would be able to do yourself
00:08:25with a pretty high floor, right?
00:08:27So if you plot skill level against how useful an agent is
00:08:30or how likely it is, you know,
00:08:31how useful it actually is in generating non-slop,
00:08:33there's probably a like pretty low floor
00:08:35if you have no skill.
00:08:36You have a pretty high floor still, right?
00:08:38Agents are pretty good just out of the box.
00:08:39If you don't know anything about development,
00:08:41the agent is gonna do much more than you could possibly do.
00:08:44But as you get to higher and higher skill levels,
00:08:46senior and principal and distinguished engineers
00:08:48actually use agents differently.
00:08:50They're using it to level up
00:08:51the things they could already do.
00:08:53You know, a principal engineer might be able to
00:08:55write manually 5,000 lines of code a day.
00:08:57With agents, they can write like 50,000 lines of code a day.
00:09:00And it really operates at the level of quality of the inputs
00:09:03and the knowledge that you put in there.
00:09:04So I think we're, you know, slowly raising the floor
00:09:07over time by, you know, building better agents.
00:09:11But I do think it's a form of leverage.
00:09:14It's a way for you to accelerate
00:09:16the kinds of things you can already do, do them faster.
00:09:18And for folks who don't have skills, you know,
00:09:20that's when you can actually really raise the floor
00:09:22of what they can do.
00:09:23- Absolutely, and just to add on to both of these points,
00:09:26I think they're tools and amplifiers of craft.
00:09:29If you have it, you can do more of it.
00:09:31If you don't, it is just harder,
00:09:32but it does raise the floor.
00:09:34I think that's really worth calling out.
00:09:36I think for folks who are just trying
00:09:39to build their first prototype,
00:09:40they're trying to iterate on an idea,
00:09:42like the example mentioned earlier.
00:09:44It's not that like I couldn't make a front end
00:09:47that kind of is like a content-driven site,
00:09:50but I just didn't have time.
00:09:51And it was more fun to just draw on a whiteboard,
00:09:53talk, have a conversation, and then kick it off to an agent.
00:09:57But I think one of the interesting examples of this
00:09:58was when we were building much earlier iterations of Codex
00:10:01and well over a year ago.
00:10:03And we were putting in front of two different archetypes,
00:10:05folks who did a lot of product engineering
00:10:08where they're used to using local,
00:10:12in the inner loop style tools
00:10:14where they're used to just chatting and maybe iterating.
00:10:18And then a completely different modality
00:10:20when we talk to folks on the reasoning teams
00:10:23where they would sit for maybe five minutes
00:10:25just defining the task and have an essay length,
00:10:29like word problem for the agent to go off and do,
00:10:32and then it would work for an hour.
00:10:33And that was effectively o1 or earlier kind of versions of it.
00:10:37And I think the interesting part there
00:10:39was just the way that people would approach
00:10:41giving the task to the agent was completely different
00:10:44based on their understanding of what do they think it needs.
00:10:48And so I think really anchoring on specificity,
00:10:52being really clear about what you want the output to be.
00:10:55And I think there's a broader item
00:10:56that is a responsibility on both us as builders of agents
00:11:00and folks training models to really raise that floor
00:11:04and to ensure that the ceiling is high enough
00:11:06that people with high craftsmanship, with high taste
00:11:08are able to exercise that in the way that they see fit.
00:11:11- I think actually something that you've mentioned
00:11:13brought this idea to mind that we've started to notice.
00:11:16So our target audience is the enterprise.
00:11:19And something that we've seen occur time and again
00:11:21is that there's a very interesting bimodality
00:11:24in terms of adoption of agent-native development.
00:11:28And in particular, normally earlier in career developers
00:11:32are more open-minded to start building
00:11:34in an agent native way,
00:11:36but they don't have the experience
00:11:38of managing engineering teams.
00:11:39So they're maybe not the most familiar with delegation
00:11:42in a way that works very well.
00:11:44Meanwhile, more experienced engineers
00:11:46have a lot of experience delegating.
00:11:47They know that, hey, if I don't specify these exact things,
00:11:50it won't get done.
00:11:51And so they're really good at like writing out that paragraph,
00:11:54but they're pretty stubborn
00:11:56and they actually don't wanna change the way that they build
00:11:59and you're gonna have to pry Emacs
00:12:01out of their cold dead hands.
00:12:03So it's an interesting balance there.
00:12:05- So funny you say that.
00:12:06Similar thing we've seen on the enterprise
00:12:08is senior engineers, higher up folks will write tickets.
00:12:12So they'll actually do the work
00:12:13of writing out all the spec of what needs to be done.
00:12:16They'll hand it off to a junior engineer to actually do.
00:12:18The junior engineer takes that super well-written ticket
00:12:20and gives it to the agent to do, right?
00:12:21So you're just arbitraging the idea
00:12:23that the junior engineer will actually do the agent work
00:12:26because they're more comfortable doing that.
00:12:28But the senior engineer is the person
00:12:29who's actually really good at writing the spec,
00:12:31very good at understanding
00:12:32what are the architectural decisions we should be making
00:12:35and putting that into some kind of ticket.
00:12:37- Yeah, for those who don't know,
00:12:40Matan and Factory in general have been writing
00:12:42and advocating about agent-native development.
00:12:44So you can read more on their website.
00:12:45I think one thing, by the way,
00:12:48I do wanna issue maybe like one terminology thing,
00:12:51which is raise the floor for you is a good thing.
00:12:54I think actually other people say lower the floor
00:12:55also mean the same thing.
00:12:57Basically just like it's about skill level
00:12:59and like what they can do
00:13:00and just giving people more resources for that.
00:13:05I think also the other thing is like,
00:13:07a lot of people are thinking about the model layer, right?
00:13:13Obviously you guys own your own models, the two of you don't.
00:13:18And I think there's a hot topic of conversation
00:13:21in the value chain right now.
00:13:22Airbnb's Brian Chesky has said that
00:13:25most of their value apparently relies on Qwen.
00:13:28How important are open models to you guys?
00:13:30And you can chime in as well,
00:13:33but how important are open models
00:13:35as a strategy for both of you?
00:13:37- I'd be curious to hear from you first.
00:13:38- Yeah.
00:13:38Well, love open models.
00:13:42I think one of the important things about,
00:13:44so just being able to talk about models,
00:13:45I think openness is really key
00:13:48to I think a sustainable development lifecycle
00:13:51where with Codex CLI, we open sourced it out the gate
00:13:54and part of the priority was understanding
00:13:57that an open model was coming down the line.
00:13:58We wanted to make sure that we could as best document
00:14:01how to use our reasoning models.
00:14:02We saw a lot of kind of confusion about,
00:14:05what kind of tools to give it,
00:14:06what the environment should be, the resources.
00:14:08And so we want to make sure that that was as clear as possible
00:14:10and then also make sure that it worked well with open models.
00:14:12So I think there are definitely a lot of use cases,
00:14:14especially when you get into kind of embedded use cases
00:14:18or cases where you don't want the data
00:14:22to leave the perimeter.
00:14:23There's a lot of really good reasons
00:14:25for why you would want to do that.
00:14:26And then I think the benefit of kind of cloud-hosted models,
00:14:31and that's what we see with a lot of open models.
00:14:33They end up being, they're not run on device,
00:14:35but they're actually cloud-hosted anyway,
00:14:37maybe for efficiency, maybe for cost,
00:14:39that there's still a lot of value
00:14:42in just the pure intelligence that you get
00:14:44from using a much bigger model.
00:14:46And that's why we see people really gravitate
00:14:48towards models from o3 to GPT-5 to GPT-5-Codex.
00:14:52There's still a lot of value in that.
00:14:53Now we see that that overhang still kind of comes,
00:14:57it resolves itself where every couple of months
00:15:01there's a new, very small, very, very impressive model.
00:15:04And I think that's the magic
00:15:05if we just consider at the beginning of this year,
00:15:06we had o3-mini as kind of the frontier, and where we are now.
00:15:10And so, yeah, I think that there's a ton of value
00:15:13in open models, but still, I think personally,
00:15:17from a usage perspective,
00:15:18more value in using the kind of cloud-hosted ones.
00:15:21- Yeah, I'll just interject a bit.
00:15:23Ford actually cares a lot about privacy,
00:15:25security, agent robustness.
00:15:27And so if you run into him, talk to him more about that.
00:15:30But for both of you guys, maybe you wanna start off with,
00:15:33actually, what's your ballpark
00:15:35of open model token percentage generated
00:15:38in your respective apps?
00:15:39And is it gonna go up or down?
00:15:42- So I guess, so maybe to start,
00:15:44'cause I think what you said is really interesting.
00:15:47So a couple of weeks ago,
00:15:48when we released our factory CLI tool,
00:15:52people were really interested
00:15:53because we also released with it
00:15:54our score on this benchmark called Terminal Bench.
00:15:57And one of the first asks was,
00:15:59can you guys put open source models to the test?
00:16:01'Cause our droid agent is fully model agnostic.
00:16:04So immediately people were like,
00:16:06throw in the open source models and show us how it does.
00:16:09And I think something that was particularly surprising
00:16:12was that the open source models,
00:16:14and in particular GLM, were really, really good.
00:16:17They were in fact obviously less performant
00:16:19than the frontier models,
00:16:21but not by a huge margin.
00:16:24I think, so one thing that was noteworthy though
00:16:26was when we benchmarked the open source models,
00:16:29of the seven that were at the top,
00:16:32one of them was made in the United States
00:16:34by yours truly over here,
00:16:36which I think is kind of a shame.
00:16:37Like, the fact is that among the frontier models,
00:16:41it's the United States across the board.
00:16:43But then when it comes to open source,
00:16:45we're really dropping the ball there.
00:16:47So I think that's one thing that's noteworthy
00:16:49and I think something that, at least when I saw that,
00:16:52I really think there should be like a call to arms there
00:16:54in terms of changing that.
00:16:56Because I think to answer your question,
00:16:59what we found is that since we released support
00:17:02for open source models,
00:17:03the percent of people that are using open source models
00:17:06has dramatically risen.
00:17:08Partially because of cost, and that, you know,
00:17:11it allows you,
00:17:12let's say in that documentation example,
00:17:15maybe you want to generate docs,
00:17:16but you don't want it to be like,
00:17:17you know, on super high reasoning, like to the max,
00:17:19like cost you a thousand dollars,
00:17:21but you just want to get like some initial first pass in.
00:17:24And also people like having a little bit more control.
00:17:28And I feel like they get a lot more of that control
00:17:30with some of these open source models,
00:17:33both control and the cost and just like kind of observability
00:17:36into what's actually happening there.
00:17:39So I think the demand has grown to a point
00:17:42where I actually did not expect a year ago.
00:17:43I think a year ago, I was less bullish on open source models
00:17:47than I am now, open-weight, but yeah.
00:17:49- Yeah, I think we use both open source
00:17:51and closed source models in our overall agent pipeline.
00:17:54And I think the way we think about them
00:17:56is there's two different use cases for an LLM call.
00:17:58One is you want state-of-the-art reasoning.
00:18:01It's a very, very open-ended question.
00:18:02You actually don't know what the answer is.
00:18:04The goal is like,
00:18:05the goal function is not super well-defined.
00:18:07In those cases,
00:18:09closed source models are still state-of-the-art
00:18:11when it comes to reasoning and intelligence.
00:18:13We use closed source models pretty much exclusively
00:18:15for those kinds of use cases.
00:18:16There's a second use case where we have a more niche task
00:18:20with a much clearer goal function.
00:18:22In those cases, we almost always try to fine tune
00:18:25an open source model.
00:18:26We're okay taking a 20% cut hit maybe
00:18:29in terms of reasoning ability
00:18:31so that we can actually fine tune
00:18:33a very, very specific use case.
00:18:35And I think we found that open source models
00:18:37are catching up very, very, very fast.
00:18:39A year and a half ago, it was unthinkable for us
00:18:42to be able to use open source models
00:18:43as part of v0's pipeline.
00:18:45Today, every single part of the pipeline,
00:18:47we're like, okay, can we bring open source models into this?
00:18:49Can we replace what we're doing currently
00:18:52with closed source state-of-the-art frontier models
00:18:55with a fine tune of an open source model?
00:18:57And we've seen a ton of success with Qwen, Kimi K2,
00:19:00other kinds of models like that.
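(Editor's note: the two-use-case split just described can be sketched as a simple router. This is a hypothetical illustration; the model names and the `call_model` stub are invented stand-ins, not v0's actual pipeline.)

```python
def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real inference call to a hosted or fine-tuned model.
    return f"[{model}] {prompt}"

def route(task: str, has_clear_goal: bool) -> str:
    if has_clear_goal:
        # Niche, well-specified task: accept a modest reasoning hit in
        # exchange for cost, control, and the ability to fine-tune.
        return call_model("finetuned-open-model", task)
    # Open-ended task with a fuzzy goal function: pay for frontier reasoning.
    return call_model("frontier-closed-model", task)

print(route("label this UI component", has_clear_goal=True))
print(route("design a new billing architecture", has_clear_goal=False))
```

The design choice is the one stated on stage: the clearer the goal function, the cheaper and smaller the model you can get away with.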
00:19:02- Yeah, I'll call this out as one of the biggest deltas
00:19:05I've seen across everyone,
00:19:07which is at the start of this year,
00:19:08I did a podcast with Ankur from BrainTrust,
00:19:10and he said that open source model usage is roughly 5%
00:19:14across what BrainTrust is seeing, and going down.
00:19:17And now I think reasonably it's gonna go
00:19:19to between the 10 to 20% range for everybody.
00:19:22- I do think it's interesting that even closed source models
00:19:25are investing more heavily into their small class models.
00:19:29The Haikus, GPT-5 Minis, Gemini Flashes of the world,
00:19:33which I think also is that model class
00:19:35is what competes with open source the most.
00:19:38It's the small model class competing against a fine tune
00:19:40of an open source model.
00:19:42- And I also think there's some use cases
00:19:43where it's just, it will just be overkill
00:19:46to use a frontier model, and if it is overkill,
00:19:49you are then just gonna obviously be incentivized
00:19:51to use something that's faster and cheaper.
00:19:53And I think part of that, part of I think this delta
00:19:56in terms of percent usage is there is this threshold
00:19:59of when open models cross the threshold of for most tasks,
00:20:04it's actually enough, and then for some niche tasks,
00:20:06you need like the extra firepower.
00:20:10I think we're really getting there
00:20:11with some of these open models,
00:20:12which is why I would suspect
00:20:13we'll see more usage going forward.
00:20:16- Yeah, awesome, that's very encouraging.
00:20:18So we have a bit of time left to prep you guys
00:20:20with the closing question, which is,
00:20:22what's something that your agents cannot do today
00:20:25that you wish they could do,
00:20:26that they'll probably be able to do next year?
00:20:27- Am I going first?
00:20:31Okay.
00:20:32Yeah, I think that what we've seen over the last year,
00:20:34just maybe starting as a reference point with o1,
00:20:38a little over a year ago, or o1-preview,
00:20:40what we've seen from then,
00:20:42when I was using very early checkpoints of that model,
00:20:47it was great relative to 40,
00:20:49but still had so much left to be desired.
00:20:51I wouldn't put it, I was on the security team at the time,
00:20:55and there was a lot of work and tasks
00:20:57that I just couldn't delegate to that model.
00:21:00And when we compare it to today,
00:21:01where I can take a pretty well-defined task,
00:21:04like maybe it's like two sentences,
00:21:06a few bullet points to your point,
00:21:07like here are the gotchas
00:21:08that I think you'll probably get stuck on,
00:21:10and then come back and 30 minutes later,
00:21:12an hour later, it's done it.
00:21:14We've seen cases where it's running for many hours,
00:21:17maybe even seven to eight hours,
00:21:19effectively a full workday
00:21:20that I spend a lot of my day in meetings,
00:21:22and so don't necessarily have that solid block of time.
00:21:26But that's only half of what engineering is really about.
00:21:30Part of it is coding, part of it is architecting
00:21:32and troubleshooting and debugging.
00:21:34The other half of the problem is writing docs,
00:21:36is understanding the system, convincing people.
00:21:39And so I think what we'll start to see
00:21:41is this super collaborator where what we want to bring,
00:21:45whether it's in Codex or these other interfaces
00:21:48through the Codex model, is the ideal collaborator
00:21:53that you want to work with.
00:21:53The person you first go to, that favorite coworker
00:21:56that you want to jam on ideas with,
00:21:58that's really what we want to see, at least with Codex.
00:22:02I think for us, we've seen a bunch of rapid progression
00:22:05on two different fronts.
00:22:07The first is how many steps can you reasonably expect
00:22:10an agent to be able to do and get reasonably good output?
00:22:14Last year, there's probably one, maybe max three, right?
00:22:17If you wanted reliable output with over 90% success,
00:22:20you're probably running one to three agent steps.
00:22:22Today, most tools run five to 20
00:22:24with really great reliability rates, over 90% success.
00:22:29I think next year, we're gonna add in
00:22:30sort of that like 100 plus, 200 plus,
00:22:32let's run tons of steps all at once,
00:22:34have long running tasks for multiple hours
00:22:36and be confident that you'll get an output
00:22:38at the end that will be useful.
00:22:40The second is in terms of what resources can be consumed.
00:22:42A year ago, it was whatever you are putting
00:22:44into the prompt form, like that was pretty much it.
00:22:47Today, you can now configure external connections via MCP
00:22:51or by making API calls directly in your application.
00:22:55You can kind of do that if you're knowledgeable,
00:22:57you have the ability to configure things.
00:22:58And I think in a year from now, those will just happen.
00:23:00Like it will just work.
00:23:02The goal is like, you should not need to know
00:23:03what sources of context you need to give the agent.
00:23:06The agent will actually go and find
00:23:08those sources of context proactively.
00:23:09We're kind of starting to see that already today,
00:23:12but I'm still not really confident
00:23:14that's very reliable and useful today.
00:23:16I think by next year, that'll be the default mode.
00:23:18- Yeah, I would agree with that.
00:23:19I think agents can do basically everything today,
00:23:23but the degree to which they do so reliably and proactively
00:23:27is I think the slider that is going to change.
00:23:29But that's a slider that's also dependent on the user.
00:23:31Like if you're a user who's like not really like
00:23:33changing your behavior and meeting the agent where it is,
00:23:36then you might get lower reliability and proactivity.
00:23:38Whereas if you kind of set up your harness correctly
00:23:41or set up your environment correctly,
00:23:42it'll be able to do more of that
00:23:44reliably and more proactively.
00:23:45- Yeah, amazing.
00:23:46Well, we're out of time.
00:23:48My contribution is computer vision.
00:23:49Everyone try Atlas.
00:23:51Everyone try like more computer vision use cases,
00:23:53but thank you so much for your time.
00:23:55- Thank you.
00:23:56(audience applauding)
00:23:57(upbeat music)

Key Takeaway

AI coding agents are rapidly evolving from experimental tools into mainstream productivity multipliers, with success determined by user skill, task clarity, and the emerging convergence of local and cloud capabilities.

Highlights

AI coding agents enable developers to achieve 10x productivity multipliers, with the effectiveness varying based on user skill level and specificity of task definition

The debate between local versus cloud deployment is evolving toward convergence, with resource location and agent loop flexibility becoming key architectural considerations

Open-source models are gaining significant traction (projected to grow from 5% to 10-20% usage) as they approach frontier model capabilities while offering better cost efficiency and privacy benefits

Agent reliability and proactivity are expected to advance dramatically, with multi-hour, 100+ step workflows becoming standard within the next year

Successful agent usage requires specific task definition, clear output specifications, and proper delegation practices—treating agents as amplifiers of existing craft rather than magic solutions

The industry is moving toward agent-native development, where junior engineers with strong delegation skills act as intermediaries between senior engineers' specifications and agent execution

Computer vision capabilities for AI agents remain an underdeveloped frontier that presents significant opportunity for future enhancement

Timeline

Introduction and Company Positioning

The panel introduces three major players in AI coding: Factory (focusing on end-to-end software development agents called 'droids'), OpenAI (developing Codex CLI and cloud solutions), and Vercel (building V0 for preview-driven agentic programming). Each organization explains their strategic approach to bringing AI capabilities into the software development lifecycle. Factory emphasizes handling not just coding but documentation, testing, and review. OpenAI highlights the integration of reasoning models with sandboxing primitives. Vercel describes V0's goal of enabling developers to prompt directly against running applications, representing the future of developer experience.

Local vs. Cloud Deployment Strategy

The panelists discuss the strategic differences between local and cloud agent deployment, with Factory starting local, OpenAI moving from cloud to local, and V0 being cloud-only. They reach consensus that the choice depends on resource location—local for file systems, cloud for remote repositories like GitHub. The fundamental insight is that humans work flexibly across both environments, so agents should similarly converge toward supporting both seamlessly. Factory notes that remote is better for smaller, well-defined tasks, while local suits larger or more complex tasks requiring close monitoring. Vercel adds that agent loops, tool calls, and resources must be independently flexible, with resource availability determining the optimal execution location rather than fixed architectural choices.

Quality, Leverage, and the 'Slop' Question

The panel addresses concerns about whether AI agents generate low-quality 'slop' or provide genuine value. The consensus is that perceived sloppiness depends on guidance quality and task specificity—generic prompts produce generic results, while detailed specifications with test verification yield structured output. Matan uses a documentation example, noting that auto-generated docs for legacy code, while imperfect, provide more value than no documentation. The broader insight is that agents operate as leverage tools: a principal engineer can produce 50,000 lines of code daily with agents versus 5,000 manually, maintaining quality proportional to input specification. V0 emphasizes that agents amplify existing craft skills—they raise the floor for beginners but don't guarantee quality without proper guidance and verification mechanisms.

Enterprise Adoption Patterns and Delegation Skills

Factory reveals an interesting bimodality in enterprise adoption: junior developers are philosophically open to agent-native development but lack delegation experience, while senior engineers are excellent at specifying tasks but resistant to changing their workflow. Vercel observes that enterprises are successfully bridging this divide: senior engineers write detailed specifications as tickets, and junior engineers hand them to agents for execution. This creates an unexpected organizational model in which technical leadership becomes specification-writing work, while implementation shifts to agent execution managed by mid-level engineers. The panel emphasizes that successful agent usage mirrors team management practices: clear specifications, well-defined outputs, and understanding what information agents need to succeed are the critical factors.

Open Source Models: Rising Adoption and Strategic Use

The discussion pivots to open-source models, with projections suggesting usage will grow from approximately 5% at the start of the year to 10-20% by year-end. Matan highlights that open-source models like Qwen performed surprisingly well on Factory's Terminal Bench benchmark, though all top frontier models remain US-based. OpenAI explains their balanced strategy: closed-source models for open-ended reasoning tasks with unclear goal functions, while open-source models excel in niche tasks with clear objectives where fine-tuning enables specialized performance. V0 reports successfully replacing closed-source components with fine-tuned open models in their pipeline over the past year. The panel identifies a critical threshold: as open models cross the 'good enough' line for most tasks, they become the default choice due to cost, privacy, control, and observability benefits, with frontier models reserved for genuinely complex reasoning.

Future Capabilities: Multi-Hour Tasks and Proactive Context Gathering

Panelists discuss their vision for the next year of AI agent development, focusing on two primary advancement areas. First, agent step complexity will scale from today's 5-20 reliable steps to 100+ steps with multi-hour execution windows maintaining over 90% success rates, enabling full workday-equivalent tasks. Second, context gathering will shift from manual specification to proactive agent behavior—agents will automatically identify and fetch necessary resources via MCP connections without requiring users to know which sources are relevant. OpenAI emphasizes the need for agents to evolve from task executors into collaborators and 'favorite coworkers,' handling not just coding but also architecture, debugging, documentation, and persuasion. The panel agrees that the reliability and proactivity slider will continue advancing, though user behavior and environment setup significantly influence achievable performance levels.

Closing Remarks and Computer Vision Opportunities

The moderator closes with remarks emphasizing that agents can accomplish virtually everything today, but reliability and proactivity remain the key advancement vectors. The discussion acknowledges that agent effectiveness depends on both technological capabilities and user adaptation—proper environment setup and behavioral alignment with agent constraints yield significantly higher performance. The moderator highlights computer vision as an underdeveloped frontier, encouraging the audience to explore CV use cases with Atlas. The panel thanks the audience as the session concludes with applause, leaving the impression that practical, real-world agent deployment is rapidly maturing while future advances will focus on reliability, autonomy, and capability expansion.
