Forget Codex vs Claude Code, Goal Buddy Finally Fixes Both

AI LABS

Transcript

00:00:00this is gary the snail and he's identified a market gap to build a dating platform for snails
00:00:04but since he's super slow he wants claude code to autonomously handle his long-running tasks
00:00:09fortunately for him agents have gotten really good at long-running tasks claude code has a
00:00:13goal command that just keeps the agent running until the task gets completed but during our
00:00:18testing we found out a lot of issues with the goal command since gary recently went through a
00:00:22divorce and we want him to be happy we found this open source tool that actually fixes the problem
00:00:28and it doesn't only work with claude code but codex as well spreading love just like your mom who i'm
00:00:32sure loves you just as much as your employed sibling claude code previously released a command called
00:00:38goal which keeps the agent working until a certain condition is met we didn't cover this one on our
00:00:42channel but you probably already know about it before this there was a plugin called ralph wiggum
00:00:47that gained a lot of traction which essentially did the same thing it used hooks to feed the prompt
00:00:52back to claude code until the condition was actually met but the thing is these conditions need to be an
00:00:57exact match because the ralph loop uses a shell script to check for the condition literally like
00:01:02the airport guard who doesn't let you through because your manly body spray is over the baggage
00:01:06limit the goal command works differently it takes the condition and the conversation so far and gives
00:01:11it to a small model which is haiku and this model intelligently evaluates if the task is done or not
00:01:17it returns a yes or no decision and a no tells claude to keep iterating on the same task like when your boss
00:01:22tells you to improve the user experience because he just can't find a button on the page so this makes
00:01:27the evaluation subjective and for things that we cannot quantify on their own that's a real improvement
00:01:32the goal does work well for a lot of tasks but it still has a lot of issues the first issue is that
00:01:37it does not use any knowledge base or file system that tracks the progress of the task and since it's
00:01:42not doing that the only source of truth for the agent becomes the chat context this might trigger
00:01:47you since it was your dad who wrote the crypto fortune on a sticky note that fell off the fridge back in
00:01:522017. once the session ends for any reason and the goal wasn't completed you sure can resume it using
00:01:58the claude resume command the goal will not be lost but the only way it knows where it left off is the
00:02:03chat context and since this command is meant for long-running tasks not simple ones things can get
00:02:08messed up in between and of course with the goal running for hours context bloat and hitting compaction
00:02:13is bound to become a real problem at some point after compaction the agent's output gets worse
00:02:18it's going to start behaving like my grandma who because of her dementia is starting to forget this
00:02:22channel's name i need you guys to watch the last video for her another problem is that it doesn't
00:02:27break tasks down into smaller ones instead it just uses the main agent and does the task breakdown
00:02:32on its own the way claude code normally does so there's no structured plan and the agent may lose track
00:02:37of what's left to do and even though this might work well for some cases an unclear
00:02:42definition of what done looks like for agents is never the right thing the goal relies entirely on
00:02:47the model to evaluate completion so it might not be as effective in some cases it is better than
00:02:52ralph wiggum being completely strict by using scripts but at least there should be some metric
00:02:56that tells the agent what done might look like just like your wedding photographer that kept saying
00:03:01one more shot until the whole event was over so this is where the goal falls short and these things
00:03:05might not look like much but when put into real heavy workflows they can bring some serious issues
00:03:10now goal buddy is a tool that was built with one purpose to make the goal command actually work the
00:03:16way it should it solves all the problems we just talked about but it's not really getting as much
00:03:20attention as it should given how useful it is it's like the hot babysitter except instead of flirting
00:03:25with you she's just babysitting your long-running tasks goal doesn't preserve the state of the work
00:03:30locally so this tool fixes that and actually forces the goal to read and update local state instead of relying on
00:03:36chat history and it also finishes with proof so the agent actually knows what done looks like before
00:03:42it starts in order to track progress it also includes a whole dashboard where you can watch
00:03:46your agent work while it's working and to handle all this it's built upon three agents which are the
00:03:51scout the worker and the judge basically a y combinator startup team where one does all the work one
00:03:56watches him do it and one judges both of them on twitter the installation is pretty straightforward just
00:04:01copy the install command and paste it into your project folder it will be installed as a plugin
00:04:06available for both claude code and codex once you start a new session you can see the command
00:04:10available for use so these three agents each have a strictly defined role and access level since this
00:04:16tool is built for codex as well the agents are defined in toml instead of the standard markdown the
00:04:21first agent is the judge which only has read access it skeptically analyzes hard decisions like risky
00:04:26scope contradictory sources and other patterns to make sure the task is completed safely its
00:04:31instructions forbid editing because it exists only for making judgments nothing else and since its
00:04:36task is highly critical this agent's reasoning is set to the highest so that decisions are made properly
00:04:42it's exactly like when you've been composing that one text to your crush for four hours straight in
00:04:42the middle of the night after it finishes working it returns a json structure with the approved and
00:04:52rejected decisions along with the rationale the scout is another read-only agent that maps an active task
00:04:57and creates a compact evidence receipt for it since its job is just to check the state of the task
00:05:02its reasoning effort is kept low just like your favorite strip club's bouncer it doesn't actually care
00:05:07that much and then there's the worker agent the only one with edit access it does the actual work and
00:05:12it's only allowed to execute one task at a time there's also the pm role which is the main thread that
00:05:17coordinates the workflow it behaves like an actual project manager doing the minimal work possible
00:05:22it's the only authority that can actually mark the task as done the core workflow starts by expressing
00:05:27the intent of the task in proper words not vaguely the way us homo sapiens usually do but in a way the
00:05:33agent can properly understand and then the oracle is defined the oracle is basically an observable
00:05:38signal that identifies the outcome it is what the system iterates against to see if the task can be
00:05:43marked as done or not it could be anything a test suite a browser rundown any artifact benchmarks or the code
00:05:49that turns my microwave into a time machine because why not ai agents are doing anything at this point
00:05:54then the next step is surface it breaks down the task into actionable steps creates the dashboard and maps
00:06:00the tasks into a visual format the last piece is the pm he's the manager in this case and keeps the goal running
00:06:06until the final audit marks the goal as met to use goal buddy you just run the goal prep command
00:06:11this is the one that initializes the workflow and you define the goal that you want it to achieve it
00:06:16first ensures the agents are installed and ready to be used it then initiates the workflow but unlike
00:06:21the native goal command it's extremely self-conscious and it first removes its own ambiguities by asking
00:06:27you questions so that you can clearly define the implementation and just like your suspicious wife
00:06:32it will keep asking questions until it has understood the first step focuses on creating the goal files it places
00:06:38the original request along with our answers and then maps it to the proper objective in agent
00:06:43understandable language it contains a summary of all the information and then defines the oracle
00:06:48which is the most important part the oracle for this task is straightforward all tests must pass with
00:06:53proper behavior this kind of goal is specific because it can be programmatically
00:06:57evaluated unlike your cover story last night that your wife is totally not buying goal buddy breaks down the whole workflow
00:07:03into small doable tasks these are called slices but unlike the real world size doesn't matter here
00:07:08because a small slice doesn't mean a small task it means something that is safe can be verified easily
00:07:14and can be run individually it explicitly defines the safe slicing size in the document as well it creates
00:07:19the state.yaml which tracks the project and tasks and defines how the pm loop would look the state.yaml consists of
00:07:26all the goals and rules with all the tasks broken down by their ids and the assigned agent it contains
00:07:31a field for tracking the active task too it also mentions the linked dashboard it lists all the to-do
00:07:36tasks and the in progress tasks in our case the scout agent is currently in progress and is mapping all
00:07:42the files and endpoints so to start the loop you just copy this command and run it it instructs claude to
00:07:47set the goal of doing everything in the goal.md file from there it will pick up the first active
00:07:52task like a king and then call out its subordinate agents to perform it once the scout has completed
00:07:58the work it updates the progress file with all its findings and documents them in a separate directory
00:08:03it also updates the board from active to completed then the loop picks up the next task marks it as
00:08:08active and starts the judge agent the judge critically reviews the findings and sequences the report
00:08:13into the fewest possible vertical slices which is the task breakdown for the worker to carry out
00:08:18independently it then updates the slice count and updates the state file accordingly each task
00:08:22explicitly lists the allowed files how to verify them and when to stop this is how it defines each slice
00:08:28so that agents have a clear expected output checks and all the necessary details then one by one it
00:08:33initializes the worker agent and begins with the first slice the progress of each agent can be tracked
00:08:39using the dashboard you'll know what each task is doing which agent is active what tasks are queued and
00:08:44which ones are completed so you don't have to monitor things yourself and can actually give your kids
00:08:48the time that they need once all the tasks have been completed it performs the last audit as pm
00:08:53making sure that all the tests have been properly conducted once the audit is done it marks the judge
00:08:58agent's final audit task as done and then marks the goal as completed after this you have to start
00:09:03the prayers and hope that those agents didn't hallucinate overall this worked considerably well given the
00:09:09complexity and the scale of the app we gave it but we think more effective parallelization could be
00:09:13added because it did everything sequentially it handled one task at a time and didn't make use of
00:09:18claude code's parallelization capabilities at all dario would have been actually disappointed to see this
00:09:23but given how well it planned the workflow it did work pretty well also if you are enjoying our content
00:09:28consider pressing the hype button because it helps us create more content like this and reach out to more
00:09:33people we also wanted to test goal buddy on something more generic like designing a ui to see how it
00:09:38handles tasks that can't be evaluated programmatically the previous test was on a specific workflow with
00:09:44clear pass and fail criteria but just like you getting that fresh cut from your barber some tasks
00:09:49just don't have that so we first gave the usual goal command a vague prompt it initialized the goal
00:09:54tasks consulted the advisor and gave a website in no time being lazy it just created a simple html page
00:10:00and didn't go for any framework but the landing page didn't look bad so we gave the same exact prompt to
00:10:05goal buddy as well once it started it followed the same workflow and gave a similar questioning session
00:10:10to clarify the intent with us here goal buddy actually asked for the tech stack as well normally
00:10:14i'd call this kissing but since i take my ai agent seriously i'll call it being thorough similarly it
00:10:20created the board and the goal.md file and translated our original request into a proper objective it also
00:10:26properly identified the oracle but the oracle in the previous task was simple it just needed to pass all the
00:10:31tests this one had different goals it defined the task as complete when the dev server would be up and
00:10:36running and browser walkthroughs confirm all the sections work as defined this is how it turned a
00:10:41non-quantifiable task into something quantifiable it also created the state.yaml again with the oracle
00:10:47rules agents and all the tasks listed out and then started working in the same way it took a longer
00:10:52time than the normal goal command but it ended up implementing the app properly this won't be a
00:10:57problem for gary the snail but you should do some push-ups in the meantime i can see you've gotten fat
00:11:02comparatively the whole website performed significantly better than what the simple goal command created
00:11:07if you actually want to be an ai b2b saas founder who likes to build instead of just watching tutorials
00:11:12then you should be an ai labs pro you'll actually get like-minded nerds like our team in there with
00:11:17resources from the videos and lots of other goodies as well the link's going to be in the description and
00:11:22you can check that out that brings us to the end of this video if you'd like to support the channel
00:11:27and help us keep making videos like this you can do so by using the super thanks button below as always
00:11:32thank you for watching and i'll see you in the next one

Key Takeaway

Goal Buddy fixes the state persistence and evaluation ambiguity of Claude Code's native goal command by introducing a structured, multi-agent framework that maps tasks to local YAML-based state and quantifiable outcome oracles.

Highlights

  • Claude Code's native goal command relies solely on chat context and small model evaluation, leading to context bloat and potential loss of task state during long-running operations.

  • Goal Buddy functions as a plugin for Claude Code and Codex, forcing agents to read and update local state via state.yaml to resolve state persistence issues.

  • The Goal Buddy system architecture utilizes three specialized agents: a judge for skeptical analysis, a scout for state mapping, and a worker for task execution.

  • Goal Buddy implements an 'oracle' mechanism—an observable signal like test suites or browser walkthroughs—to programmatically define completion criteria.

  • Unlike the standard goal command, Goal Buddy decomposes tasks into 'slices' that are verified individually, increasing safety and observability in complex workflows.

  • Installation requires adding Goal Buddy as a plugin and using the 'goal prep' command to initialize the project-specific agent workflow.
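
The 'oracle' idea in the highlights above can be illustrated with a minimal Python sketch. It assumes an oracle is simply a command whose exit code signals success. The `Oracle` class and its `is_met` method are hypothetical names used for illustration and are not part of Goal Buddy, and the commands below are stand-ins for a real test suite.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Oracle:
    """Hypothetical oracle: an observable signal that decides whether a goal is met."""
    description: str
    command: list[str]  # assumption: exit code 0 means the goal is met

    def is_met(self) -> bool:
        # Run the check and treat a zero exit code as success.
        result = subprocess.run(self.command, capture_output=True)
        return result.returncode == 0


# Stand-in checks; a real oracle might run a project's test suite instead.
passing = Oracle("all tests pass", [sys.executable, "-c", "raise SystemExit(0)"])
failing = Oracle("all tests pass", [sys.executable, "-c", "raise SystemExit(1)"])
print(passing.is_met(), failing.is_met())
```

Because the check is a plain pass or fail signal, a loop can iterate against it without depending on chat history.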

Timeline

Limitations of Native Goal Commands

  • Claude Code's goal command struggles with context bloat during extended operations.
  • The lack of local state tracking makes the chat history the only source of truth for the agent.
  • Relying on a small model to judge task completion introduces subjective, unquantifiable evaluation metrics.

The native goal command passes the condition and the conversation so far to a small model, Haiku, which returns a yes-or-no verdict on completion. That is more flexible than an exact-match script, but it leaves long-running, complex tasks without a concrete metric for 'done'. With no local file or knowledge base tracking progress, the chat context is the only source of truth, so output quality degrades once a long session hits context compaction. The command also produces no structured task breakdown, so the agent can lose track of what remains.
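
The difference between a strict exact-match check and a model-judged check can be sketched in Python. This is an illustration only: `exact_match_done`, `judged_done`, and `keyword_evaluator` are hypothetical names, and the keyword evaluator is a toy stand-in for a small model, not how the real evaluator works.

```python
from typing import Callable, List


def exact_match_done(output: str, marker: str) -> bool:
    # Strict check: only an exact marker string counts as done.
    return output.strip() == marker


def judged_done(transcript: List[str], condition: str,
                evaluator: Callable[[str, List[str]], bool]) -> bool:
    # Delegated check: an evaluator sees the condition and the conversation so far.
    return evaluator(condition, transcript)


def keyword_evaluator(condition: str, transcript: List[str]) -> bool:
    # Toy stand-in for a small model; it ignores the condition entirely.
    return any("tests pass" in turn.lower() for turn in transcript)


transcript = ["Implemented login", "All tests pass now."]
print(exact_match_done(transcript[-1], "DONE"))                       # False
print(judged_done(transcript, "tests are green", keyword_evaluator))  # True
```

The strict check rejects any output that is not the literal marker, while the delegated check accepts whatever its evaluator accepts, which is where the subjectivity comes from.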

Goal Buddy Architecture and Agents

  • Goal Buddy coordinates three specialized roles: a judge, a scout, and a worker.
  • The system persists work state locally using YAML files instead of relying on chat history.
  • An oracle signal provides a clear, verifiable definition of task success before execution begins.

Goal Buddy replaces unstructured workflows with a three-agent team. The judge has read-only access and runs at the highest reasoning effort to review hard decisions critically; the scout, also read-only, maps the active task into a compact evidence receipt at low reasoning effort; and the worker, the only agent with edit access, executes one task at a time. A PM role, the main thread, coordinates these agents and is the only authority that can mark a task as done. Progress is checked against the oracle, which could be a test suite, a browser walkthrough, a benchmark, or another artifact.
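
The role separation described in the video can be modeled with a short Python sketch. The video says the real agent definitions are written in TOML, and their schema is not shown, so `AgentRole`, `ROLES`, `authorize_edit`, and `editors` are hypothetical names. The worker's reasoning effort is not stated in the video and is marked as unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    name: str
    can_edit: bool
    reasoning_effort: str  # illustrative label only


# Access levels as described in the video; the structure itself is an assumption.
ROLES = {
    "judge": AgentRole("judge", can_edit=False, reasoning_effort="high"),
    "scout": AgentRole("scout", can_edit=False, reasoning_effort="low"),
    "worker": AgentRole("worker", can_edit=True, reasoning_effort="unspecified"),
}


def authorize_edit(role_name: str) -> None:
    # Reject edits from any role that is read-only.
    role = ROLES[role_name]
    if not role.can_edit:
        raise PermissionError(f"{role.name} is read-only")


def editors() -> list[str]:
    # List the roles that are allowed to modify files.
    return [role.name for role in ROLES.values() if role.can_edit]


print(editors())
```

Keeping edit rights on a single role means every file change can be traced to one agent.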

Operational Workflow and Implementation

  • The 'goal prep' command initializes the workflow and removes ambiguity by querying the user for specifics.
  • Workflow progress is managed through 'slices': units of work that are safe, easily verifiable, and runnable on their own (not necessarily small), tracked in a state.yaml file.
  • In the video's UI design test, Goal Buddy produced a better result than the standard goal command on a vague, non-quantifiable task by forcing precise requirement gathering.

To start, Goal Buddy creates a goal.md file and a state.yaml file; state.yaml tracks the tasks by ID, their assigned agents, the active task, and the rules of the PM loop. Tasks run sequentially, and a dashboard shows which tasks are queued, active, and completed. In the video's tests, Goal Buddy took longer than the native goal command but produced a more complete result by forcing the agent to work against specific, checkable success criteria.
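
The sequential PM loop can be sketched in Python under stated assumptions. The real state.yaml schema is not shown in the video, so the in-memory `state` dictionary, its field names, and the `next_task` and `run_pm_loop` functions are hypothetical. The agent and audit callbacks are stubs that always succeed.

```python
from typing import Callable, List, Optional

# Hypothetical in-memory stand-in for state.yaml.
state = {
    "active_task": None,
    "goal_met": False,
    "tasks": [
        {"id": "T1", "agent": "scout", "status": "todo"},
        {"id": "T2", "agent": "judge", "status": "todo"},
        {"id": "T3", "agent": "worker", "status": "todo"},
    ],
}


def next_task(state: dict) -> Optional[dict]:
    # Tasks run one at a time, in listed order.
    for task in state["tasks"]:
        if task["status"] == "todo":
            return task
    return None


def run_pm_loop(state: dict, run_agent: Callable[[dict], bool],
                final_audit: Callable[[dict], bool]) -> List[str]:
    completed = []
    while (task := next_task(state)) is not None:
        task["status"] = "in_progress"
        state["active_task"] = task["id"]
        if not run_agent(task):
            task["status"] = "blocked"  # stop rather than skip a failed task
            break
        task["status"] = "done"
        completed.append(task["id"])
    state["active_task"] = None
    # The goal is marked as met only after every task is done and the audit passes.
    all_done = all(t["status"] == "done" for t in state["tasks"])
    state["goal_met"] = all_done and final_audit(state)
    return completed


completed = run_pm_loop(state, run_agent=lambda task: True, final_audit=lambda s: True)
print(completed, state["goal_met"])
```

Since progress lives in the state structure rather than in the conversation, a resumed session could pick up at the first task still marked as to-do.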
