Transcript

00:00:00(upbeat music)
00:00:02- Welcome to the Future of AI Coding panel.
00:00:04Thank you for reading the memo
00:00:05that said you have to wear all black.
00:00:07(laughing)
00:00:09Okay, so I do want to cover a little bit of introductions.
00:00:12I know each of you in different ways,
00:00:15but maybe the audience, hopefully, doesn't quite.
00:00:17Matan, why don't you go first?
00:00:19What is Factory's position
00:00:24relative to the broader world of AI coding?
00:00:26- Yeah, so at Factory,
00:00:28our mission is to bring autonomy to software engineering.
00:00:32And what that means more concretely,
00:00:34we have built end-to-end software development agents
00:00:37called droids.
00:00:38They don't just focus on the coding itself,
00:00:40but really the entire end-to-end
00:00:42software development lifecycle.
00:00:43So things like documentation, testing, review,
00:00:48kind of all the ugly parts so that you can also do
00:00:51the more fun parts like the coding itself.
00:00:52And for the parts of the coding you don't want to do,
00:00:54you can also have the droids do that.
00:00:56So you build droids.
00:00:58You build droids.
00:00:59And OpenAI obviously needs no introduction,
00:01:02but your role on the Codex team,
00:01:05I saw you pop up on the Codex video.
00:01:08That's how I knew it was you working on it.
00:01:10But how do you think about Codex these days
00:01:13since it's expanded a lot?
00:01:14- Yeah, so earlier this year,
00:01:16we launched our first coding agent.
00:01:19I worked on Codex CLI,
00:01:21bringing the power of our reasoning models
00:01:23into people's computers.
00:01:26Then we released Codex cloud where you could actually
00:01:28distribute and delegate those tasks to work in the cloud.
00:01:31And over the last some odd months,
00:01:33we've been unifying these experiences.
00:01:34So they work as seamlessly as possible.
00:01:36So a lot of our focus is around how do we make
00:01:38the fundamentals, the primitives as useful as possible.
00:01:41We just released the Codex SDK at Dev Day.
00:01:43So I think one of the key directions we've been seeing
00:01:46is not just using coding or code executing agents for coding,
00:01:50but also for general purpose tasks.
00:01:52And so whether it was ChatGPT agent,
00:01:54which I worked on earlier this year
00:01:55that actually executes code in the background
00:01:57to accomplish some tasks,
00:01:59but starting to enable our developers to build on top of
00:02:02not just the reasoning models,
00:02:04but also things like sandboxing
00:02:05and all the other primitives that we built into Codex.
00:02:07- Awesome.
00:02:09V0?
00:02:09- Yeah, the goal of V0 is to enable developers
00:02:14to do preview-driven agentic programming.
00:02:16So today when you build web apps,
00:02:19you probably have an agent open,
00:02:21your IDE open, so some kind of code,
00:02:23and then a preview of what you're actually building.
00:02:25Usually you're running dev server.
00:02:26With V0, our goal is to allow you to just have
00:02:28an agent running and directly prompt against your running app.
00:02:32And that's how we think the future of DX is gonna pan out.
00:02:35- Okay, awesome.
00:02:36And everyone has different surface areas
00:02:38in which to access your coding agents.
00:02:40So I think one of the things we kinda wanna kick off with is
00:02:43how important is local versus cloud?
00:02:45You started local with cloud,
00:02:47you started cloud with local, you're cloud only for now.
00:02:50What's the split?
00:02:52Is everyone just gonna merge eventually?
00:02:55- Yeah, so maybe I can start there.
00:02:58So I think at the end of the day,
00:02:59the point of these agents is that
00:03:02they are as helpful as possible
00:03:04and they have a very similar silhouette
00:03:06to that of a human that you might work with.
00:03:08And you don't have local humans and remote humans
00:03:11that are like somehow, you know,
00:03:13this one only works in this environment,
00:03:15this one only works in that environment.
00:03:16Generally, humans can be helpful
00:03:18whether you're in a meeting with them
00:03:19and you come up with an idea
00:03:20or you're sitting like shoulder to shoulder at a computer.
00:03:24So I guess asymptotically, these need to become the same,
00:03:28but I think in the short term,
00:03:29remote is typically, what we're seeing is it's typically
00:03:34more useful for smaller tasks that you're more confident
00:03:37that you can delegate reliably.
00:03:39Whereas local is when you wanna be
00:03:41a little bit closer to the agent,
00:03:43it's maybe some larger task or some more complicated task
00:03:46that you're gonna kind of actively be monitoring.
00:03:49And you want it to be local so that if something goes wrong,
00:03:52you don't need to pull that branch back down
00:03:54and then start working on it,
00:03:55but instead you're right there to guide it.
00:03:57- Yeah, maybe I'm just greedy, but I want both.
00:04:00And I think having a modality to Matan's point
00:04:04where I like to think about what are the primary forms
00:04:07of collaboration that I'm used to
00:04:08and I enjoy with my coworkers.
00:04:11Often that starts something like a whiteboarding session
00:04:13and maybe we're just like jamming on something in a room.
00:04:17When we were building, I think a good example
00:04:19was AGENTS.md, which is our custom instructions file,
00:04:23intended to be generic across different coding agents.
00:04:26The way that it started was Romain and I
00:04:28were just in a room coming up with this idea.
00:04:31Then we just started whiteboarding and then took a photo
00:04:33and then kicked it off in Codex CLI locally,
00:04:36in just a workshop Next.js app that we could work on,
00:04:40went to lunch, came back.
00:04:41It had a good amount of the kind of core structure.
00:04:44And then from there, we were able to iterate
00:04:45a little bit more closely.
00:04:46So having that kind of pairing
00:04:48and kind of brainstorm style experience.
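(Editor's note: for background, the AGENTS.md mentioned here is just a markdown file of repo-specific instructions, checked into the repository, that coding agents read before working. The contents below are a hypothetical sketch; the headings and commands are illustrative, not a required schema.)

```markdown
# AGENTS.md

## Setup
- Install dependencies with `pnpm install`; run the app with `pnpm dev`.

## Conventions
- TypeScript strict mode; avoid `any`.
- Run `pnpm test` before considering a task done.

## Gotchas
- The preview proxy expects the dev server on port 3000.
```

Because it is plain markdown living alongside the code, the same file can be picked up by different coding agents.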
00:04:49And then I think to that second point
00:04:51about what kind of tasks you delegate to,
00:04:54I think historically smaller, narrowly scoped tasks
00:04:57where you're very clear about what the output is,
00:05:00is kind of the right modality
00:05:01if you're doing a fire and forget.
00:05:02But I think what we're starting to see with,
00:05:04we just launched GBD5 codecs about two months ago now.
00:05:08And I think one of the main differences
00:05:09is that it can actually do these longer running,
00:05:11more complex, more ambiguous tasks,
00:05:14as long as you are clear about what you want by the end.
00:05:16So it can work for hours at a time.
00:05:18I think that shift as models increase in capability
00:05:21will start to enable more kind of use cases.
00:05:24- Yeah.
00:05:24Yeah, I think there are three parts of making an agent work.
00:05:27There's the actual agent loop,
00:05:29there are the tool calls it makes,
00:05:30and then the resources upon which the tool calls need to act.
00:05:33Whether you go cloud or local first
00:05:35is based on where those resources are, right?
00:05:37If you're trying to work on a local file system,
00:05:39those are the resources you need to access.
00:05:41It totally makes sense
00:05:42that your agent loop should run locally, right?
00:05:44If you're accessing resources that typically exist in the cloud,
00:05:46you're pulling from GitHub,
00:05:47directly from a third-party repo of some kind,
00:05:51then it makes sense for your agent
00:05:52to start off in the cloud, right?
00:05:54Ultimately though, these resources exist in both places, right?
00:05:57Every developer expects an agent to be able to work
00:06:00both on the local file system,
00:06:02as well as on an open PR that might be hosted on GitHub.
00:06:04And so it doesn't really matter where you start, I think,
00:06:07everyone is converging at the same place,
00:06:08which is that your agent loop needs to be able to run anywhere,
00:06:11your tool calls need to be able to be streamed
00:06:13from the cloud locally or from a local backup to the cloud.
00:06:16And then it all depends on where the resources
00:06:18you actually want to act on are located.
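(Editor's note: the three parts named here, the agent loop, the tool calls, and the resources they act on, can be sketched in a few lines. This is a hypothetical illustration only; the names `ToolCall`, `run_agent_loop`, and `toy_planner` are invented for the sketch, not any vendor's API.)

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class ToolCall:
    name: str
    args: dict

def run_agent_loop(plan_next: Callable[[List[str]], Optional[ToolCall]],
                   tools: Dict[str, Callable[..., str]],
                   max_steps: int = 10) -> List[str]:
    """The loop itself is location-agnostic: the model picks a tool call,
    the harness executes it. Whether `tools` touch a local filesystem or a
    cloud-hosted repo is what decides where the loop should run."""
    transcript: List[str] = []
    for _ in range(max_steps):
        call = plan_next(transcript)
        if call is None:  # the model decides it is done
            break
        result = tools[call.name](**call.args)
        transcript.append(f"{call.name} -> {result}")
    return transcript

# Toy "model": request one file read, then stop.
def toy_planner(history: List[str]) -> Optional[ToolCall]:
    return ToolCall("read_file", {"path": "README.md"}) if not history else None

# Local resource: a filesystem-style tool. A cloud variant would swap in a
# tool that hits a hosted repo instead; the loop itself is unchanged.
local_tools = {"read_file": lambda path: f"contents of {path}"}
log = run_agent_loop(toy_planner, local_tools)
```

Swapping the `tools` dictionary is the whole "local versus cloud" decision in this framing; the tool results can equally be streamed between environments.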
00:06:20- Yeah, awesome.
00:06:22Okay, so we were chatting off stage
00:06:24and we were casting around for spicy questions and stuff.
00:06:27So I really liked this one and I think it's very topical.
00:06:31Do you guys generate slop for a living?
00:06:33Like are we in danger of potentially being in a hype bubble
00:06:40where we believe that this is like a sustainable path to AGI?
00:06:44- I mean, I think to start, you could say that one man's slop
00:06:48is another man's treasure, which to some extent might be true.
00:06:52Like, you know, if for example, you have, I don't know,
00:06:56like let's suppose you had a repo
00:06:58that had no documentation whatsoever.
00:07:00You could use, you know, many of the tools
00:07:04that we've been talking about to go and generate
00:07:06documentation for this repo.
00:07:08Now, is it gonna be the most like finely crafted
00:07:12piece of documentation?
00:07:13No, but is it providing alpha?
00:07:16Yes, in my mind, because having to like sift through
00:07:19some super old legacy code base that has no docs
00:07:22is a lot harder than looking through
00:07:23some somewhat sloppified documentation.
00:07:26And so I think the big thing is it's figuring out
00:07:29where you can use these tools for leverage
00:07:32and the degree to which it's slop,
00:07:35I think also kind of depends on how much guidance you provide.
00:07:38So if you just say like, build me an app that does this,
00:07:40like you're probably gonna get some generic slop app
00:07:43that does--
00:07:44- It's purple.
00:07:44- Yeah, blue, purple like fade, yeah.
00:07:48Whereas if instead you're like very methodical
00:07:50about exactly what it is that you want,
00:07:52you provided the tools to actually run tests
00:07:54to verify some of the capabilities that you're requesting.
00:07:58I think that makes it much more structured
00:08:00to a similar extent that if you were to, you know,
00:08:03hire some junior engineer onto your team
00:08:06and you just say, hey, go do this.
00:08:08Like they're probably gonna yield some like median outcome
00:08:11because they have no other specification to go off of.
00:08:14And it's pretty ambiguous like what you actually want done.
00:08:19- I think the key word there is leverage, right?
00:08:21Like what AI coding agents allow you to do
00:08:23is do 10X more than you would be able to do yourself
00:08:25with a pretty high floor, right?
00:08:27So if you plot skill level against how useful an agent is
00:08:30or how likely it is, you know,
00:08:31how useful it actually is in generating non-slop,
00:08:33there's probably a like pretty low floor
00:08:35if you have no skill.
00:08:36You have a pretty high floor still, right?
00:08:38Agents are pretty good just out of the box.
00:08:39If you don't know anything about development,
00:08:41the agent is gonna do much more than you could possibly do.
00:08:44But as you get to higher and higher skill levels,
00:08:46senior and principal and distinguished engineers
00:08:48actually use agents differently.
00:08:50They're using it to level up
00:08:51the things they could already do.
00:08:53You know, a principal engineer might be able to
00:08:55write manually 5,000 lines of code a day.
00:08:57With agents, they can write like 50,000 lines of code a day.
00:09:00And it really operates at the level of quality of the inputs
00:09:03and the knowledge that you put in there.
00:09:04So I think we're, you know, slowly raising the floor
00:09:07over time by, you know, building better agents.
00:09:11But I do think it's a form of leverage.
00:09:14It's a way for you to accelerate
00:09:16the kinds of things you can already do, do them faster.
00:09:18And for folks who don't have skills, you know,
00:09:20that's when you can actually really raise the floor
00:09:22of what they can do.
00:09:23- Absolutely, and just to add on to both of these points,
00:09:26I think they're tools and amplifiers of craft.
00:09:29If you have it, you can do more of it.
00:09:31If you don't, it is just harder,
00:09:32but it does raise the floor.
00:09:34I think that's really worth calling out.
00:09:36I think for folks who are just trying
00:09:39to build their first prototype,
00:09:40they're trying to iterate on an idea,
00:09:42like the example mentioned earlier.
00:09:44It's not that like I couldn't make a front end
00:09:47that kind of is like a content-driven site,
00:09:50but I just didn't have time.
00:09:51And it was more fun to just draw on a whiteboard,
00:09:53talk, have a conversation, and then kick it off to an agent.
00:09:57But I think one of the interesting examples of this
00:09:58was when we were building much earlier iterations of Codex
00:10:01and well over a year ago.
00:10:03And we were putting in front of two different archetypes,
00:10:05folks who did a lot of product engineering
00:10:08where they're used to using local,
00:10:12in the inner loop style tools
00:10:14where they're used to just chatting and maybe iterating.
00:10:18And then a completely different modality
00:10:20when we talk to folks on the reasoning teams
00:10:23where they would sit for maybe five minutes
00:10:25just defining the task and have an essay length,
00:10:29like word problem for the agent to go off and do,
00:10:32and then it would work for an hour.
00:10:33And that was effectively o1 or earlier kind of versions of it.
00:10:37And I think the interesting part there
00:10:39was just the way that people would approach
00:10:41giving the task to the agent was completely different
00:10:44based on their understanding of what do they think it needs.
00:10:48And so I think really anchoring on specificity,
00:10:52being really clear about what you want the output to be.
00:10:55And I think there's a broader item
00:10:56that is a responsibility on both us as builders of agents
00:11:00and folks training models to really raise that floor
00:11:04and to ensure that the ceiling is high enough
00:11:06that people with high craftsmanship, with high taste
00:11:08are able to exercise that in the way that they see fit.
00:11:11- I think actually something that you've mentioned
00:11:13brought this idea to mind that we've started to notice.
00:11:16So our target audience is the enterprise.
00:11:19And something that we've seen occur time and again
00:11:21is that there's a very interesting bimodality
00:11:24in terms of adoption of agent-native development.
00:11:28And in particular, normally earlier in career developers
00:11:32are more open-minded to start building
00:11:34in an agent native way,
00:11:36but they don't have the experience
00:11:38of managing engineering teams.
00:11:39So they're maybe not the most familiar with delegation
00:11:42in a way that works very well.
00:11:44Meanwhile, more experienced engineers
00:11:46have a lot of experience delegating.
00:11:47They know that, hey, if I don't specify these exact things,
00:11:50it won't get done.
00:11:51And so they're really good at like writing out that paragraph,
00:11:54but they're pretty stubborn
00:11:56and they actually don't wanna change the way that they build
00:11:59and you're gonna have to pry Emacs
00:12:01out of their cold dead hands.
00:12:03So it's an interesting balance there.
00:12:05- So funny you say that.
00:12:06Similar thing we've seen on the enterprise
00:12:08is senior engineers, higher up folks will write tickets.
00:12:12So they'll actually do the work
00:12:13of writing out all the spec of what needs to be done.
00:12:16They'll hand it off to a junior engineer to actually do.
00:12:18The junior engineer takes that super well-written ticket
00:12:20and gives it to the agent to do, right?
00:12:21So you're just arbitraging the idea
00:12:23that the junior engineer will actually do the agent work
00:12:26because they're more comfortable doing that.
00:12:28But the senior engineer is the person
00:12:29who's actually really good at writing the spec,
00:12:31very good at understanding
00:12:32what are the architectural decisions we should be making
00:12:35and putting that into some kind of ticket.
00:12:37- Yeah, for those who don't know,
00:12:40Matan and Factory in general have been writing
00:12:42and advocating about agent-native development.
00:12:44So you can read more on their website.
00:12:45I think one thing, by the way,
00:12:48I do wanna issue maybe like one terminology thing,
00:12:51which is raise the floor for you is a good thing.
00:12:54I think actually other people say lower the floor
00:12:55also mean the same thing.
00:12:57Basically just like it's about skill level
00:12:59and like what they can do
00:13:00and just giving people more resources for that.
00:13:05I think also the other thing is like,
00:13:07a lot of people are thinking about the model layer, right?
00:13:13Obviously you guys own your own models, the two of you don't.
00:13:18And I think there's a hot topic of conversation
00:13:21in the value chain right now.
00:13:22Airbnb's Brian Chesky has said that
00:13:25most of their value apparently relies on Qwen.
00:13:28How important are open models to you guys?
00:13:30And you can chime in as well,
00:13:33but how important are open models
00:13:35as a strategy for both of you?
00:13:37- I'd be curious to hear from you first.
00:13:38- Yeah.
00:13:38Well, love open models.
00:13:42I think one of the important things about,
00:13:44so just being able to talk about models,
00:13:45I think openness is really key
00:13:48to I think a sustainable development lifecycle
00:13:51where with Codex CLI, we open sourced it out the gate
00:13:54and part of the priority was understanding
00:13:57that an open model was coming down the line.
00:13:58We wanted to make sure that we could as best document
00:14:01how to use our reasoning models.
00:14:02We saw a lot of kind of confusion about,
00:14:05what kind of tools to give it,
00:14:06what the environment should be, the resources.
00:14:08And so we want to make sure that that was as clear as possible
00:14:10and then also make sure that it worked well with open models.
00:14:12So I think there are definitely a lot of use cases,
00:14:14especially when you get into kind of embedded use cases
00:14:18or cases where you don't want the data
00:14:22to leave the perimeter.
00:14:23There's a lot of really good reasons
00:14:25for why you would want to do that.
00:14:26And then I think the benefit of kind of cloud-hosted models,
00:14:31and that's what we see with a lot of open models.
00:14:33They end up being, they're not run on device,
00:14:35but they're actually cloud-hosted anyway,
00:14:37maybe for efficiency, maybe for cost,
00:14:39that there's still a lot of value
00:14:42in just the pure intelligence that you get
00:14:44from using a much bigger model.
00:14:46And that's why we see people really gravitate
00:14:48towards models from o3 to GPT-5 to GPT-5-Codex.
00:14:52There's still a lot of value in that.
00:14:53Now we see that that overhang still kind of comes,
00:14:57it resolves itself where every couple of months
00:15:01there's a new, very small, very, very impressive model.
00:15:04And I think that's the magic
00:15:05if we just consider at the beginning of this year,
00:15:06we had o3-mini as kind of the frontier, and where we are now.
00:15:10And so, yeah, I think that there's a ton of value
00:15:13in open models, but still, I think personally,
00:15:17from a usage perspective,
00:15:18more value in using the kind of cloud-hosted ones.
00:15:21- Yeah, I'll just interject a bit.
00:15:23Ford actually cares a lot about privacy,
00:15:25security, agent robustness.
00:15:27And so if you run into him, talk to him more about that.
00:15:30But for both of you guys, maybe you wanna start off with,
00:15:33actually, what's your ballpark
00:15:35of open model token percentage generated
00:15:38in your respective apps?
00:15:39And is it gonna go up or down?
00:15:42- So I guess, so maybe to start,
00:15:44'cause I think what you said is really interesting.
00:15:47So a couple of weeks ago,
00:15:48when we released our factory CLI tool,
00:15:52people were really interested
00:15:53because we also released with it
00:15:54our score on this benchmark called Terminal Bench.
00:15:57And one of the first asks was,
00:15:59can you guys put open source models to the test?
00:16:01'Cause our droid agent is fully model agnostic.
00:16:04So immediately people were like,
00:16:06throw in the open source models and show us how it does.
00:16:09And I think something that was particularly surprising
00:16:12was that the open source models,
00:16:14and in particular GLM, were really, really good.
00:16:17They were in fact obviously less performant
00:16:19than the frontier models,
00:16:21but not by a huge margin.
00:16:24I think, so one thing that was noteworthy though
00:16:26was when we benchmarked the open source models,
00:16:29of the seven that were at the top,
00:16:32one of them was made in the United States
00:16:34by yours truly over here,
00:16:36which I think is kind of a shame.
00:16:37Like, the fact is that among the frontier models,
00:16:41it's the United States across the board.
00:16:43But then when it comes to open source,
00:16:45we're really dropping the ball there.
00:16:47So I think that's one thing that's noteworthy
00:16:49and I think something that, at least when I saw that,
00:16:52I really think there should be like a call to arms there
00:16:54in terms of changing that.
00:16:56Because I think to answer your question,
00:16:59what we found is that since we released support
00:17:02for open source models,
00:17:03the percent of people that are using open source models
00:17:06has dramatically risen.
00:17:08Partially because of cost, and that, you know,
00:17:11it allows you,
00:17:12let's say in that documentation example,
00:17:15maybe you want to generate docs,
00:17:16but you don't want it to be like,
00:17:17you know, on super high reasoning, like to the max,
00:17:19like cost you a thousand dollars,
00:17:21but you just want to get like some initial first pass in.
00:17:24And also people like having a little bit more control.
00:17:28And I feel like they get a lot more of that control
00:17:30with some of these open source models,
00:17:33both control and the cost and just like kind of observability
00:17:36into what's actually happening there.
00:17:39So I think the demand has grown to a point
00:17:42where I actually did not expect a year ago.
00:17:43I think a year ago, I was less bullish on open source models
00:17:47than I am now, open-weight, but yeah.
00:17:49- Yeah, I think we use both open source
00:17:51and closed source models in our overall agent pipeline.
00:17:54And I think the way we think about them
00:17:56is there's two different use cases for an LLM call.
00:17:58One is you want state-of-the-art reasoning.
00:18:01It's a very, very open-ended question.
00:18:02You actually don't know what the answer is.
00:18:04The goal is like,
00:18:05the goal function is not super well-defined.
00:18:07In those cases,
00:18:09closed source models are still state-of-the-art
00:18:11when it comes to reasoning and intelligence.
00:18:13We use closed source models pretty much exclusively
00:18:15for those kinds of use cases.
00:18:16There's a second use case where we have a more niche task
00:18:20with a much clearer goal function.
00:18:22In those cases, we almost always try to fine tune
00:18:25an open source model.
00:18:26We're okay taking a 20% cut hit maybe
00:18:29in terms of reasoning ability
00:18:31so that we can actually fine tune
00:18:33a very, very specific use case.
00:18:35And I think we found that open source models
00:18:37are catching up very, very, very fast.
00:18:39A year and a half ago, it was unthinkable for us
00:18:42to be able to use open source models
00:18:43as part of v0's pipeline.
00:18:45Today, every single part of the pipeline,
00:18:47we're like, okay, can we bring open source models into this?
00:18:49Can we replace what we're doing currently
00:18:52with closed source state-of-the-art frontier models
00:18:55with a fine tune of an open source model?
00:18:57And we've seen a ton of success with Qwen, Kimi K2,
00:19:00other kinds of models like that.
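(Editor's note: the two-use-case split just described can be sketched as a simple router. This is a hypothetical illustration; the model names and the `call_model` stub are invented stand-ins, not v0's actual pipeline.)

```python
def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real inference call to a hosted or fine-tuned model.
    return f"[{model}] {prompt}"

def route(task: str, has_clear_goal: bool) -> str:
    if has_clear_goal:
        # Niche, well-specified task: accept a modest reasoning hit in
        # exchange for cost, control, and the ability to fine-tune.
        return call_model("finetuned-open-model", task)
    # Open-ended task with a fuzzy goal function: pay for frontier reasoning.
    return call_model("frontier-closed-model", task)

print(route("label this UI component", has_clear_goal=True))
print(route("design a new billing architecture", has_clear_goal=False))
```

The design choice is the one stated on stage: the clearer the goal function, the cheaper and smaller the model you can get away with.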
00:19:02- Yeah, I'll call this out as one of the biggest deltas
00:19:05I've seen across everyone,
00:19:07which is at the start of this year,
00:19:08I did a podcast with Ankur from BrainTrust,
00:19:10and he said that open source model usage is roughly 5%
00:19:14across what BrainTrust is seeing, and going down.
00:19:17And now I think reasonably it's gonna go
00:19:19to between the 10 to 20% range for everybody.
00:19:22- I do think it's interesting that even closed source models
00:19:25are investing more heavily into their small class models.
00:19:29The Haikus, GPT-5 Minis, Gemini Flashes of the world,
00:19:33which I think also is that model class
00:19:35is what competes with open source the most.
00:19:38It's the small model class competing against a fine tune
00:19:40of an open source model.
00:19:42- And I also think there's some use cases
00:19:43where it's just, it will just be overkill
00:19:46to use a frontier model, and if it is overkill,
00:19:49you are then just gonna obviously be incentivized
00:19:51to use something that's faster and cheaper.
00:19:53And I think part of that, part of I think this delta
00:19:56in terms of percent usage is there is this threshold
00:19:59of when open models cross the threshold of for most tasks,
00:20:04it's actually enough, and then for some niche tasks,
00:20:06you need like the extra firepower.
00:20:10I think we're really getting there
00:20:11with some of these open models,
00:20:12which is why I would suspect
00:20:13we'll see more usage going forward.
00:20:16- Yeah, awesome, that's very encouraging.
00:20:18So we have a bit of time left to prep you guys
00:20:20with the closing question, which is,
00:20:22what's something that your agents cannot do today
00:20:25that you wish they could do,
00:20:26that they'll probably be able to do next year?
00:20:27- Am I going first?
00:20:31Okay.
00:20:32Yeah, I think that what we've seen over the last year,
00:20:34just maybe starting as a reference point with o1,
00:20:38a little over a year ago, or o1-preview,
00:20:40what we've seen from then,
00:20:42when I was using very early checkpoints of that model,
00:20:47it was great relative to 40,
00:20:49but still had so much left to be desired.
00:20:51I wouldn't put it, I was on the security team at the time,
00:20:55and there was a lot of work and tasks
00:20:57that I just couldn't delegate to that model.
00:21:00And when we compare it to today,
00:21:01where I can take a pretty well-defined task,
00:21:04like maybe it's like two sentences,
00:21:06a few bullet points to your point,
00:21:07like here are the gotchas
00:21:08that I think you'll probably get stuck on,
00:21:10and then come back and 30 minutes later,
00:21:12an hour later, it's done it.
00:21:14We've seen cases where it's running for many hours,
00:21:17maybe even seven to eight hours,
00:21:19effectively a full workday
00:21:20that I spend a lot of my day in meetings,
00:21:22and so don't necessarily have that solid block of time.
00:21:26But that's only half of what engineering is really about.
00:21:30Part of it is coding, part of it is architecting
00:21:32and troubleshooting and debugging.
00:21:34The other half of the problem is writing docs,
00:21:36is understanding the system, convincing people.
00:21:39And so I think what we'll start to see
00:21:41is this super collaborator where what we want to bring,
00:21:45whether it's in Codex or these other interfaces
00:21:48through the Codex model, is the ideal collaborator
00:21:53that you want to work with.
00:21:53The person you first go to, that favorite coworker
00:21:56that you want to jam on ideas with,
00:21:58that's really what we want to see, at least with Codex.
00:22:02I think for us, we've seen a bunch of rapid progression
00:22:05on two different fronts.
00:22:07The first is how many steps can you reasonably expect
00:22:10an agent to be able to do and get reasonably good output?
00:22:14Last year, there's probably one, maybe max three, right?
00:22:17If you wanted reliable output with over 90% success,
00:22:20you're probably running one to three agent steps.
00:22:22Today, most tools run five to 20
00:22:24with really great reliability rates, over 90% success.
00:22:29I think next year, we're gonna add in
00:22:30sort of that like 100 plus, 200 plus,
00:22:32let's run tons of steps all at once,
00:22:34have long running tasks for multiple hours
00:22:36and be confident that you'll get an output
00:22:38at the end that will be useful.
00:22:40The second is in terms of what resources can be consumed.
00:22:42A year ago, it was whatever you are putting
00:22:44into the prompt form, like that was pretty much it.
00:22:47Today, you can now configure external connections via MCP
00:22:51or by making API calls directly in your application.
00:22:55You can kind of do that if you're knowledgeable,
00:22:57you have the ability to configure things.
00:22:58And I think in a year from now, those will just happen.
00:23:00Like it will just work.
00:23:02The goal is like, you should not need to know
00:23:03what sources of context you need to give the agent.
00:23:06The agent will actually go and find
00:23:08those sources of context proactively.
00:23:09We're kind of starting to see that already today,
00:23:12but I'm still not really confident
00:23:14that's very reliable and useful today.
00:23:16I think by next year, that'll be the default mode.
00:23:18- Yeah, I would agree with that.
00:23:19I think agents can do basically everything today,
00:23:23but the degree to which they do so reliably and proactively
00:23:27is I think the slider that is going to change.
00:23:29But that's a slider that's also dependent on the user.
00:23:31Like if you're a user who's like not really like
00:23:33changing your behavior and meeting the agent where it is,
00:23:36then you might get lower reliability and proactivity.
00:23:38Whereas if you kind of set up your harness correctly
00:23:41or set up your environment correctly,
00:23:42it'll be able to do more of that
00:23:44reliably and more proactively.
00:23:45- Yeah, amazing.
00:23:46Well, we're out of time.
00:23:48My contribution is computer vision.
00:23:49Everyone try Atlas.
00:23:51Everyone try like more computer vision use cases,
00:23:53but thank you so much for your time.
00:23:55- Thank you.
00:23:56(audience applauding)
00:23:57(upbeat music)

Key Takeaway

AI coding agents are rapidly evolving from experimental tools into mainstream productivity multipliers, with success determined by user skill, task clarity, and the emerging convergence of local and cloud capabilities.

Highlights

AI coding agents enable developers to achieve 10x productivity multipliers, with the effectiveness varying based on user skill level and specificity of task definition

The debate between local versus cloud deployment is evolving toward convergence, with resource location and agent loop flexibility becoming key architectural considerations

Open-source models are gaining significant traction (projected to grow from 5% to 10-20% usage) as they approach frontier model capabilities while offering better cost efficiency and privacy benefits

Agent reliability and proactivity are expected to advance dramatically, with multi-hour, 100+ step workflows becoming standard within the next year

Successful agent usage requires specific task definition, clear output specifications, and proper delegation practices—treating agents as amplifiers of existing craft rather than magic solutions

The industry is moving toward agent-native development, where junior engineers with strong delegation skills act as intermediaries between senior engineers' specifications and agent execution

Computer vision capabilities for AI agents remain an underdeveloped frontier that presents significant opportunity for future enhancement

Timeline

Introduction and Company Positioning

The panel introduces three major players in AI coding: Factory (focusing on end-to-end software development agents called 'droids'), OpenAI (developing Codex CLI and cloud solutions), and Vercel (building V0 for preview-driven agentic programming). Each organization explains their strategic approach to bringing AI capabilities into the software development lifecycle. Factory emphasizes handling not just coding but documentation, testing, and review. OpenAI highlights the integration of reasoning models with sandboxing primitives. Vercel describes V0's goal of enabling developers to prompt directly against running applications, representing the future of developer experience.

Local vs. Cloud Deployment Strategy

The panelists discuss the strategic differences between local and cloud agent deployment, with Factory starting local, OpenAI moving from cloud to local, and V0 being cloud-only. They reach consensus that the choice depends on resource location—local for file systems, cloud for remote repositories like GitHub. The fundamental insight is that humans work flexibly across both environments, so agents should similarly converge toward supporting both seamlessly. Factory notes that remote is better for smaller, well-defined tasks, while local suits larger or more complex tasks requiring close monitoring. Vercel adds that agent loops, tool calls, and resources must be independently flexible, with resource availability determining the optimal execution location rather than fixed architectural choices.

Quality, Leverage, and the 'Slop' Question

The panel addresses concerns about whether AI agents generate low-quality 'slop' or provide genuine value. The consensus is that perceived sloppiness depends on guidance quality and task specificity—generic prompts produce generic results, while detailed specifications with test verification yield structured output. Matan uses a documentation example, noting that auto-generated docs for legacy code, while imperfect, provide more value than no documentation. The broader insight is that agents operate as leverage tools: a principal engineer can produce 50,000 lines of code daily with agents versus 5,000 manually, maintaining quality proportional to input specification. V0 emphasizes that agents amplify existing craft skills—they raise the floor for beginners but don't guarantee quality without proper guidance and verification mechanisms.

Enterprise Adoption Patterns and Delegation Skills

Factory reveals an interesting bimodality in enterprise adoption: junior developers are philosophically open to agent-native development but lack delegation experience, while senior engineers are excellent at specifying tasks but resistant to changing their workflow. Vercel observes that enterprises are successfully bridging this divide: senior engineers write detailed specifications as tickets, and junior engineers hand them to agents for execution. This creates an unexpected organizational model in which technical leadership becomes specification-writing work, while implementation shifts to agent execution managed by mid-level engineers. The panel emphasizes that successful agent usage mirrors team management practices: clear specifications, well-defined outputs, and understanding what information agents need to succeed are the critical factors.

Open Source Models: Rising Adoption and Strategic Use

The discussion pivots to open-source models, with projections suggesting usage will grow from approximately 5% at the start of the year to 10-20% by year-end. Matan highlights that open-source models like Qwen performed surprisingly well on Factory's Terminal Bench benchmark, though all top frontier models remain US-based. OpenAI explains their balanced strategy: closed-source models for open-ended reasoning tasks with unclear goal functions, while open-source models excel in niche tasks with clear objectives where fine-tuning enables specialized performance. V0 reports successfully replacing closed-source components with fine-tuned open models in their pipeline over the past year. The panel identifies a critical threshold: as open models cross the 'good enough' line for most tasks, they become the default choice due to cost, privacy, control, and observability benefits, with frontier models reserved for genuinely complex reasoning.

Future Capabilities: Multi-Hour Tasks and Proactive Context Gathering

Panelists discuss their vision for the next year of AI agent development, focusing on two primary advancement areas. First, agent step complexity will scale from today's 5-20 reliable steps to 100+ steps with multi-hour execution windows maintaining over 90% success rates, enabling full workday-equivalent tasks. Second, context gathering will shift from manual specification to proactive agent behavior—agents will automatically identify and fetch necessary resources via MCP connections without requiring users to know which sources are relevant. OpenAI emphasizes the need for agents to evolve from task executors into collaborators and 'favorite coworkers,' handling not just coding but also architecture, debugging, documentation, and persuasion. The panel agrees that the reliability and proactivity slider will continue advancing, though user behavior and environment setup significantly influence achievable performance levels.

Closing Remarks and Computer Vision Opportunities

The moderator closes with remarks emphasizing that agents can accomplish virtually everything today, but reliability and proactivity remain the key advancement vectors. The discussion acknowledges that agent effectiveness depends on both technological capabilities and user adaptation—proper environment setup and behavioral alignment with agent constraints yield significantly higher performance. The moderator highlights computer vision as an underdeveloped frontier, encouraging the audience to explore CV use cases with Atlas. The panel thanks the audience as the session concludes with applause, leaving the impression that practical, real-world agent deployment is rapidly maturing while future advances will focus on reliability, autonomy, and capability expansion.
