Forget Codex vs Claude Code, Goal Buddy Finally Fixes Both
AAI LABS
Transcript
00:00:00this is gary the snail and he's identified a market gap to build a dating platform for snails
00:00:04but since he's super slow he wants claude code to autonomously handle his long-running tasks
00:00:09fortunately for him agents have gotten really good at long-running tasks claude code has a
00:00:13goal command that just keeps the agent running until the task gets completed but during our
00:00:18testing we found out a lot of issues with the goal command since gary recently went through a
00:00:22divorce and we want him to be happy we found this open source tool that actually fixes the problem
00:00:28and it doesn't only work with claude code but codex as well spreading love just like your mom who i'm
00:00:32sure loves you just as much as your employed sibling claude code previously released a command called
00:00:38goal which keeps the agent working until a certain condition is met we didn't cover this one on our
00:00:42channel but you probably already know about it before this there was a plugin called ralph wiggum
00:00:47that gained a lot of traction which essentially did the same thing it used hooks to feed the prompt
00:00:52back to claude code until the condition was actually met but the thing is these conditions need to be an
00:00:57exact match because the ralph loop uses a shell script to check for the condition literally like
00:01:02the airport guard who doesn't let you through because your manly body spray is over the baggage
00:01:06limit the goal command works differently it takes the condition and the conversation so far and gives
00:01:11it to a small model which is haiku and this model intelligently evaluates if the task is done or not
00:01:17it returns a yes or no decision and a no tells claude to keep iterating on the same task like when your boss
00:01:22tells you to improve the user experience because he just can't find a button on the page so this makes
00:01:27the evaluation subjective and for things that we cannot quantify on their own that's a real improvement
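The difference between the two completion checks described above can be sketched in a few lines. This is an illustration under assumptions, not how either tool is actually implemented: `exact_match_done` stands in for the strict script-style check, `judged_done` stands in for handing the condition and conversation to a small model, and `stub_judge` is a hypothetical placeholder for that model.

```python
from typing import Callable, List


def exact_match_done(output: str, marker: str) -> bool:
    # Strict check: passes only if the marker text appears verbatim.
    return marker in output


def judged_done(condition: str, transcript: List[str],
                judge: Callable[[str], str]) -> bool:
    # Model-judged check: pass the condition and the conversation so far
    # to an evaluator and read back a yes/no answer.
    prompt = "Condition: " + condition + "\nConversation:\n" + "\n".join(transcript)
    return judge(prompt).strip().lower().startswith("yes")


def stub_judge(prompt: str) -> str:
    # Hypothetical stand-in for the small model. A real evaluator would be
    # an LLM call; this one only inspects the last line of the prompt.
    last_line = prompt.splitlines()[-1].lower()
    return "yes" if "tests pass" in last_line else "no"
```

With a transcript ending in "All 12 tests pass now", the strict check for a literal `DONE` marker fails while the judged check succeeds, which is the flexibility the model-based evaluation buys.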
00:01:32the goal does work well for a lot of tasks but it still has a lot of issues the first issue is that
00:01:37it does not use any knowledge base or file system that tracks the progress of the task and since it's
00:01:42not doing that the only source of truth for the agent becomes the chat context this might trigger
00:01:47you since it was your dad who wrote the crypto fortune on a sticky note that fell off the fridge back in
00:01:522017. once the session ends for any reason and the goal wasn't completed you sure can resume it using
00:01:58the claude resume command the goal will not be lost but the only way it knows where it left off is the
00:02:03chat context and since this command is meant for long-running tasks not simple ones things can get
00:02:08messed up in between and of course with the goal running for hours context bloat and hitting compaction
00:02:13is bound to become a real problem at some point after compaction the agent's output gets worse
00:02:18it's going to start behaving like my grandma who because of her dementia is starting to forget this
00:02:22channel's name i need you guys to watch the last video for her another problem is that it doesn't
00:02:27break tasks down into smaller ones instead it just uses the main agent and does the task breakdown
00:02:32on its own the way claude code normally does so there's no structured plan and the agent may lose track
00:02:37of what's left to do and even though this might work well for some cases an unclear
00:02:42definition of what done looks like for agents is never the right thing the goal relies entirely on
00:02:47the model to evaluate completion so it might not be as effective in some cases it is better than
00:02:52ralph wiggum being completely strict by using scripts but at least there should be some metric
00:02:56that tells the agent what done might look like just like your wedding photographer that kept saying
00:03:01one more shot until the whole event was over so this is where the goal falls short and these things
00:03:05might not look like much but when put into real heavy workflows they can bring some serious issues
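To make the first shortcoming concrete, here is a minimal sketch of progress tracked on disk instead of in chat context. The file name `progress.json`, the task ids, and the `status` field are all assumptions made for illustration and are not taken from any tool's real format.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_state(path: Path) -> dict:
    """Read the progress file, or start empty if none exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"tasks": []}


def save_state(path: Path, state: dict) -> None:
    """Write the progress file so a later session can pick it up."""
    path.write_text(json.dumps(state, indent=2))


def next_pending(state: dict) -> Optional[dict]:
    """Return the first task not marked done, or None if all are finished."""
    for task in state["tasks"]:
        if task["status"] != "done":
            return task
    return None


def demo_resume() -> str:
    """Simulate two sessions that share only the file, not chat history."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "progress.json"
        # Session 1: record two tasks, finish the first, then stop.
        save_state(path, {"tasks": [
            {"id": "T001", "status": "done"},
            {"id": "T002", "status": "todo"},
        ]})
        # Session 2: nothing in memory, only the file on disk.
        resumed = load_state(path)
        return next_pending(resumed)["id"]
```

Because the second session reads the file, it resumes at `T002` without needing anything from the earlier conversation.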
00:03:10now goal buddy is a tool that was built with one purpose to make the goal command actually work the
00:03:16way it should it solves all the problems we just talked about but it's not really getting as much
00:03:20attention as it should given how useful it is it's like the hot babysitter except instead of flirting
00:03:25with you she's just babysitting your long-running tasks goal doesn't preserve the state of the work
00:03:30locally so this tool fixes that and actually forces the goal to read and update local state instead of relying on
00:03:36chat history and it also finishes with proof so the agent actually knows what done looks like before
00:03:42it starts in order to track progress it also includes a whole dashboard where you can watch
00:03:46your agent work while it's working and to handle all this it's built upon three agents which are the
00:03:51scout the worker and the judge basically a y combinator startup team where one does all the work one
00:03:56watches him do it and one judges both of them on twitter the installation is pretty straightforward just
00:04:01copy the install command and paste it into your project folder it will be installed as a plugin
00:04:06available for both claude code and codex once you start a new session you can see the command
00:04:10available for use so these three agents each have a strictly defined role and access level since this
00:04:16tool is built for codex as well the agents are defined in toml instead of the standard markdown the
00:04:21first agent is the judge which only has read access it skeptically analyzes hard decisions like risky
00:04:26scope contradictory sources and other patterns to make sure the task is completed safely its
00:04:31instructions forbid editing because it exists only for making judgments nothing else and since its
00:04:36task is highly critical this agent's reasoning is set to the highest so that decisions are made properly
00:04:42it's exactly like when you've been composing that one text to your crush for four hours straight in
00:04:47the middle of the night after it finishes working it returns a json structure with the approved and
00:04:52rejected decisions along with the rationale the scout is another read-only agent that maps an active task
00:04:57and creates a compact evidence receipt for it since its job is just to check the state of the task
00:05:02its reasoning effort is kept low just like your favorite strip club's bouncer it doesn't actually care
00:05:07that much and then there's the worker agent the only one with edit access it does the actual work and
00:05:12it's only allowed to execute one task at a time there's also the pm role which is the main thread that
00:05:17coordinates the workflow it behaves like an actual project manager doing the minimal work possible
00:05:22it's the only authority that can actually mark the task as done the core workflow starts by expressing
00:05:27the intent of the task in proper words not vaguely the way us homo sapiens usually do but in a way the
00:05:33agent can properly understand and then the oracle is defined the oracle is basically an observable
00:05:38signal that identifies the outcome it is what the system iterates against to see if the task can be
00:05:43marked as done or not it could be anything a test suite a browser rundown any artifact benchmarks or the code
00:05:49that turns my microwave into a time machine because why not ai agents are doing anything at this point
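An oracle of the test-suite kind can be sketched as a command whose exit code is the observable signal. This is a rough illustration under assumptions: `oracle_passes` and `rounds_until_green` are hypothetical helper names, and the command passed in is whatever runs the project's checks.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def oracle_passes(command: List[str]) -> bool:
    # The oracle here is a process exit code: 0 means the outcome was observed.
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


def rounds_until_green(make_attempt: Callable[[int], None],
                       check: Callable[[], bool],
                       max_rounds: int = 5) -> Optional[int]:
    # Iterate against the oracle: do one round of work, then re-check.
    # Returns the round that satisfied the oracle, or None if none did.
    for round_number in range(1, max_rounds + 1):
        make_attempt(round_number)
        if check():
            return round_number
    return None
```

For example, `oracle_passes([sys.executable, "-m", "pytest", "-q"])` would treat a passing pytest run as done, assuming pytest is installed in that project.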
00:05:54then the next step is surface it breaks down the task into actionable steps creates the dashboard and maps
00:06:00the tasks into a visual format the last piece is the pm he's the manager in this case and keeps the goal running
00:06:06until the final audit marks the goal as met to use goal buddy you just run the goal prep command
00:06:11this is the one that initializes the workflow and you define the goal that you want it to achieve it
00:06:16first ensures the agents are installed and ready to be used it then initiates the workflow but unlike
00:06:21the native goal command it's extremely self-conscious and it first removes its own ambiguities by asking
00:06:27you questions so that you can clearly define the implementation and just like your suspicious wife
00:06:32it will keep asking questions until it has understood the first step focuses on creating the goal files it places
00:06:38the original request along with our answers and then maps it to the proper objective in agent
00:06:43understandable language it contains a summary of all the information and then defines the oracle
00:06:48which is the most important part the oracle for this task is straightforward all tests must pass with
00:06:53proper behavior this kind of goal is specific because it can be programmatically
00:06:57evaluated unlike your cover story last night that your wife is totally not buying goal buddy breaks down the whole workflow
00:07:03into small doable tasks these are called slices but unlike the real world size doesn't matter here
00:07:08because a small slice doesn't mean a small task it means something that is safe can be verified easily
00:07:14and can be run individually it explicitly defines the safe slicing size in the document as well it creates
00:07:19the state.yaml which tracks the project and tasks and defines how the pm loop would look the state.yaml consists of
00:07:26all the goals and rules with all the tasks broken down by their ids and the assigned agent it contains
00:07:31a field for tracking the active task too it also mentions the linked dashboard it lists all the to-do
00:07:36tasks and the in progress tasks in our case the scout agent is currently in progress and is mapping all
00:07:42the files and endpoints so to start the loop you just copy this command and run it it instructs claude to
00:07:47set the goal of doing everything in the goal.md file from there it will pick up the first active
00:07:52task like a king and then call out its subordinate agents to perform it once the scout has completed
00:07:58the work it updates the progress file with all its findings and documents them in a separate directory
00:08:03it also updates the board from active to completed then the loop picks up the next task marks it as
00:08:08active and starts the judge agent the judge critically reviews the findings and sequences the report
00:08:13into the fewest possible vertical slices which is the task breakdown for the worker to carry out
00:08:18independently it then updates the slice count and updates the state file accordingly each task
00:08:22explicitly lists the allowed files how to verify them and when to stop this is how it defines each slice
00:08:28so that agents have a clear expected output checks and all the necessary details then one by one it
00:08:33initializes the worker agent and begins with the first slice the progress of each agent can be tracked
00:08:39using the dashboard you'll know what each task is doing which agent is active what tasks are queued and
00:08:44which ones are completed so you don't have to monitor things yourself and can actually give your kids
00:08:48the time that they need once all the tasks have been completed it performs the last audit as pm
00:08:53making sure that all the tests have been properly conducted once the audit is done it marks the judge
00:08:58agent's final audit task as done and then marks the goal as completed after this you have to start
00:09:03the prayers and hope that those agents didn't hallucinate overall this worked considerably well given the
00:09:09complexity and the scale of the app we gave it but we think more effective parallelization could be
00:09:13added because it did everything sequentially it handled one task at a time and didn't make use of
00:09:18claude code's parallelization capabilities at all dario would have been actually disappointed to see this
00:09:23but given how well it planned the workflow it did work pretty well also if you are enjoying our content
00:09:28consider pressing the hype button because it helps us create more content like this and reach out to more
00:09:33people we also wanted to test goal buddy on something more generic like designing a ui to see how it
00:09:38handles tasks that can't be evaluated programmatically the previous test was on a specific workflow with
00:09:44clear pass and fail criteria but just like you getting that fresh cut from your barber some tasks
00:09:49just don't have that so we first gave the usual goal command a vague prompt it initialized the goal
00:09:54tasks consulted the advisor and gave a website in no time being lazy it just created a simple html page
00:10:00and didn't go for any framework but the landing page didn't look bad so we gave the same exact prompt to
00:10:05goal buddy as well once it started it followed the same workflow and gave a similar questioning session
00:10:10to clarify the intent with us here goal buddy actually asked for the tech stack as well normally
00:10:14i'd call this kissing but since i take my ai agent seriously i'll call it being thorough similarly it
00:10:20created the board and the goal.md file and translated our original request into a proper objective it also
00:10:26properly identified the oracle but the oracle in the previous task was simple it just needed to pass all the
00:10:31tests this one had different goals it defined the task as complete when the dev server would be up and
00:10:36running and browser walkthroughs confirm all the sections work as defined this is how it turned a
00:10:41non-quantifiable task into something quantifiable it also created the state.yaml again with the oracle
00:10:47rules agents and all the tasks listed out and then started working in the same way it took a longer
00:10:52time than the normal goal command but it ended up implementing the app properly this won't be a
00:10:57problem for gary the snail but you should do some push-ups in the meantime i can see you've gotten fat
00:11:02comparatively the whole website performed significantly better than what the simple goal command created
00:11:07if you actually want to be an ai b2b saas founder who likes to build instead of just watching tutorials
00:11:12then you should be an ai labs pro you'll actually get like-minded nerds like our team in there with
00:11:17resources from the videos and lots of other goodies as well the link's going to be in the description and
00:11:22you can check that out that brings us to the end of this video if you'd like to support the channel
00:11:27and help us keep making videos like this you can do so by using the super thanks button below as always
00:11:32thank you for watching and i'll see you in the next one