Forget Codex vs Claude Code, Goal Buddy Finally Fixes Both



Transcript

00:00:00this is gary the snail and he's identified a market gap to build a dating platform for snails

00:00:04but since he's super slow he wants claude code to autonomously handle his long-running tasks

00:00:09fortunately for him agents have gotten really good at long-running tasks claude code has a

00:00:13goal command that just keeps the agent running until the task gets completed but during our

00:00:18testing we found out a lot of issues with the goal command since gary recently went through a

00:00:22divorce and we want him to be happy we found this open source tool that actually fixes the problem

00:00:28and it doesn't only work with claude code but codex as well spreading love just like your mom who i'm

00:00:32sure loves you just as much as your employed sibling claude code previously released a command called

00:00:38goal which keeps the agent working until a certain condition is met we didn't cover this one on our

00:00:42channel but you probably already know about it before this there was a plugin called ralph wiggum

00:00:47that gained a lot of traction which essentially did the same thing it used hooks to feed the prompt

00:00:52back to claude code until the condition was actually met but the thing is these conditions need to be an

00:00:57exact match because the ralph loop uses a shell script to check for the condition literally like

00:01:02the airport guard who doesn't let you through because your manly body spray is over the baggage

00:01:06limit the goal command works differently it takes the condition and the conversation so far and gives

00:01:11it to a small model which is haiku and this model intelligently evaluates if the task is done or not

00:01:17it returns a yes or no decision and a no tells claude to keep iterating on the same task like when your boss

00:01:22tells you to improve the user experience because he just can't find a button on the page so this makes

00:01:27the evaluation subjective and for things that we cannot quantify on their own that's a real improvement

00:01:32the goal does work well for a lot of tasks but it still has a lot of issues the first issue is that

00:01:37it does not use any knowledge base or file system that tracks the progress of the task and since it's

00:01:42not doing that the only source of truth for the agent becomes the chat context this might trigger

00:01:47you since it was your dad who wrote the crypto fortune on a sticky note that fell off the fridge back in

00:01:522017. once the session ends for any reason and the goal wasn't completed you sure can resume it using

00:01:58the claude resume command the goal will not be lost but the only way it knows where it left off is the

00:02:03chat context and since this command is meant for long-running tasks not simple ones things can get

00:02:08messed up in between and of course with the goal running for hours context bloat and hitting compaction

00:02:13is bound to become a real problem at some point after compaction the agent's output gets worse

00:02:18it's going to start behaving like my grandma who because of her dementia is starting to forget this

00:02:22channel's name i need you guys to watch the last video for her another problem is that it doesn't

00:02:27break tasks down into smaller ones instead it just uses the main agent and does the task breakdown

00:02:32on its own the way claude code normally does so there's no structured plan and the agent may lose track

00:02:37of what's left to do and even though this might work well for some cases an unclear

00:02:42definition of what done looks like for agents is never the right thing the goal relies entirely on

00:02:47the model to evaluate completion so it might not be as effective in some cases it is better than

00:02:52ralph wiggum being completely strict by using scripts but at least there should be some metric

00:02:56that tells the agent what done might look like just like your wedding photographer that kept saying

00:03:01one more shot until the whole event was over so this is where the goal falls short and these things

00:03:05might not look like much but when put into real heavy workflows they can bring some serious issues

00:03:10now goal buddy is a tool that was built with one purpose to make the goal command actually work the

00:03:16way it should it solves all the problems we just talked about but it's not really getting as much

00:03:20attention as it should given how useful it is it's like the hot babysitter except instead of flirting

00:03:25with you she's just babysitting your long-running tasks goal doesn't preserve the state of the work

00:03:30locally so this tool fixes that and actually forces the goal to read and update local state instead of relying on

00:03:36chat history and it also finishes with proof so the agent actually knows what done looks like before

00:03:42it starts in order to track progress it also includes a whole dashboard where you can watch

00:03:46your agent work while it's working and to handle all this it's built upon three agents which are the

00:03:51scout the worker and the judge basically a y combinator startup team where one does all the work one

00:03:56watches him do it and one judges both of them on twitter the installation is pretty straightforward just

00:04:01copy the install command and paste it into your project folder it will be installed as a plugin

00:04:06available for both claude code and codex once you start a new session you can see the command

00:04:10available for use so these three agents each have a strictly defined role and access level since this

00:04:16tool is built for codex as well the agents are defined in toml instead of the standard markdown the

00:04:21first agent is the judge which only has read access it skeptically analyzes hard decisions like risky

00:04:26scope contradictory sources and other patterns to make sure the task is completed safely its

00:04:31instructions forbid editing because it exists only for making judgments nothing else and since its

00:04:36task is highly critical this agent's reasoning is set to the highest so that decisions are made properly

00:04:42it's exactly like when you've been composing that one text to your crush for four hours straight in

00:04:47the middle of the night after it finishes working it returns a json structure with the approved and

00:04:52rejected decisions along with the rationale the scout is another read-only agent that maps an active task

00:04:57and creates a compact evidence receipt for it since its job is just to check the state of the task

00:05:02its reasoning effort is kept low just like your favorite strip club's bouncer it doesn't actually care

00:05:07that much and then there's the worker agent the only one with edit access it does the actual work and

00:05:12it's only allowed to execute one task at a time there's also the pm role which is the main thread that

00:05:17coordinates the workflow it behaves like an actual project manager doing the minimal work possible

00:05:22it's the only authority that can actually mark the task as done the core workflow starts by expressing

00:05:27the intent of the task in proper words not vaguely the way us homo sapiens usually do but in a way the

00:05:33agent can properly understand and then the oracle is defined the oracle is basically an observable

00:05:38signal that identifies the outcome it is what the system iterates against to see if the task can be

00:05:43marked as done or not it could be anything a test suite a browser rundown any artifact benchmarks or the code

00:05:49that turns my microwave into a time machine because why not ai agents are doing anything at this point

00:05:54then the next step is surface it breaks down the task into actionable steps creates the dashboard and maps

00:06:00the tasks into a visual format the last piece is the pm he's the manager in this case and keeps the goal running

00:06:06until the final audit marks the goal as met to use goal buddy you just run the goal prep command

00:06:11this is the one that initializes the workflow and you define the goal that you want it to achieve it

00:06:16first ensures the agents are installed and ready to be used it then initiates the workflow but unlike

00:06:21the native goal command it's extremely self-conscious and it first removes its own ambiguities by asking

00:06:27you questions so that you can clearly define the implementation and just like your suspicious wife

00:06:32it will keep asking questions until it has understood the first step focuses on creating the goal files it places

00:06:38the original request along with our answers and then maps it to the proper objective in agent

00:06:43understandable language it contains a summary of all the information and then defines the oracle

00:06:48which is the most important part the oracle for this task is straightforward all tests must pass with

00:06:53proper behavior this kind of goal is specific because it can be programmatically

00:06:57evaluated unlike your cover story last night that your wife is totally not buying goal buddy breaks down the whole workflow

00:07:03into small doable tasks these are called slices but unlike the real world size doesn't matter here

00:07:08because a small slice doesn't mean a small task it means something that is safe can be verified easily

00:07:14and can be run individually it explicitly defines the safe slicing size in the document as well it creates

00:07:19the state.yaml which tracks the project and tasks and defines how the pm loop would look the state.yaml consists of

00:07:26all the goals and rules with all the tasks broken down by their ids and the assigned agent it contains

00:07:31a field for tracking the active task too it also mentions the linked dashboard it lists all the to-do

00:07:36tasks and the in progress tasks in our case the scout agent is currently in progress and is mapping all

00:07:42the files and endpoints so to start the loop you just copy this command and run it it instructs claude to

00:07:47set the goal of doing everything in the goal.md file from there it will pick up the first active

00:07:52task like a king and then call out its subordinate agents to perform it once the scout has completed

00:07:58the work it updates the progress file with all its findings and documents them in a separate directory

00:08:03it also updates the board from active to completed then the loop picks up the next task marks it as

00:08:08active and starts the judge agent the judge critically reviews the findings and sequences the report

00:08:13into the fewest possible vertical slices which is the task breakdown for the worker to carry out

00:08:18independently it then updates the slice count and updates the state file accordingly each task

00:08:22explicitly lists the allowed files how to verify them and when to stop this is how it defines each slice

00:08:28so that agents have a clear expected output checks and all the necessary details then one by one it

00:08:33initializes the worker agent and begins with the first slice the progress of each agent can be tracked

00:08:39using the dashboard you'll know what each task is doing which agent is active what tasks are queued and

00:08:44which ones are completed so you don't have to monitor things yourself and can actually give your kids

00:08:48the time that they need once all the tasks have been completed it performs the last audit as pm

00:08:53making sure that all the tests have been properly conducted once the audit is done it marks the judge

00:08:58agent's final audit task as done and then marks the goal as completed after this you have to start

00:09:03the prayers and hope that those agents didn't hallucinate overall this worked considerably well given the

00:09:09complexity and the scale of the app we gave it but we think more effective parallelization could be

00:09:13added because it did everything sequentially it handled one task at a time and didn't make use of

00:09:18claude code's parallelization capabilities at all dario would have been actually disappointed to see this

00:09:23but given how well it planned the workflow it did work pretty well also if you are enjoying our content

00:09:28consider pressing the hype button because it helps us create more content like this and reach out to more

00:09:33people we also wanted to test goal buddy on something more generic like designing a ui to see how it

00:09:38handles tasks that can't be evaluated programmatically the previous test was on a specific workflow with

00:09:44clear pass and fail criteria but just like you getting that fresh cut from your barber some tasks

00:09:49just don't have that so we first gave the usual goal command a vague prompt it initialized the goal

00:09:54tasks consulted the advisor and gave a website in no time being lazy it just created a simple html page

00:10:00and didn't go for any framework but the landing page didn't look bad so we gave the same exact prompt to

00:10:05goal buddy as well once it started it followed the same workflow and gave a similar questioning session

00:10:10to clarify the intent with us here goal buddy actually asked for the tech stack as well normally

00:10:14i'd call this kissing but since i take my ai agent seriously i'll call it being thorough similarly it

00:10:20created the board and the goal.md file and translated our original request into a proper objective it also

00:10:26properly identified the oracle but the oracle in the previous task was simple it just needed to pass all the

00:10:31tests this one had different goals it defined the task as complete when the dev server would be up and

00:10:36running and browser walkthroughs confirm all the sections work as defined this is how it turned a

00:10:41non-quantifiable task into something quantifiable it also created the state.yaml again with the oracle

00:10:47rules agents and all the tasks listed out and then started working in the same way it took a longer

00:10:52time than the normal goal command but it ended up implementing the app properly this won't be a

00:10:57problem for gary the snail but you should do some push-ups in the meantime i can see you've gotten fat

00:11:02comparatively the whole website performed significantly better than what the simple goal command created

00:11:07if you actually want to be an ai b2b saas founder who likes to build instead of just watching tutorials

00:11:12then you should be an ai labs pro you'll actually get like-minded nerds like our team in there with

00:11:17resources from the videos and lots of other goodies as well the link's going to be in the description and

00:11:22you can check that out that brings us to the end of this video if you'd like to support the channel

00:11:27and help us keep making videos like this you can do so by using the super thanks button below as always

00:11:32thank you for watching and i'll see you in the next one

Key Takeaway

Goal Buddy fixes the state persistence and evaluation ambiguity of Claude Code's native goal command by introducing a structured, multi-agent framework that maps tasks to local YAML-based state and quantifiable outcome oracles.

Highlights

Claude Code's native goal command relies solely on chat context and small model evaluation, leading to context bloat and potential loss of task state during long-running operations.
Goal Buddy functions as a plugin for Claude Code and Codex, forcing agents to read and update local state via state.yaml to resolve state persistence issues.
The Goal Buddy system architecture utilizes three specialized agents: a judge for skeptical analysis, a scout for state mapping, and a worker for task execution.
Goal Buddy implements an 'oracle' mechanism—an observable signal like test suites or browser walkthroughs—to programmatically define completion criteria.
Unlike the standard goal command, Goal Buddy decomposes tasks into 'slices' that are verified individually, increasing safety and observability in complex workflows.
Installation requires adding Goal Buddy as a plugin and using the 'goal prep' command to initialize the project-specific agent workflow.
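
The oracle idea above can be illustrated with a small sketch. This is not Goal Buddy's code: `command_oracle` is a hypothetical helper, and the two commands are stand-ins for a real test-suite command. It assumes only that an oracle can be reduced to a command whose exit status says pass or fail.

```python
import subprocess
import sys
from typing import List


def command_oracle(command: List[str]) -> bool:
    """Hypothetical oracle: the goal counts as met only if the command exits with 0."""
    completed = subprocess.run(command, capture_output=True)
    return completed.returncode == 0


# Stand-ins for a real test command; a project would use its own test runner here.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(command_oracle(passing), command_oracle(failing))
```

Because the signal comes from an exit status rather than from a model's opinion of the chat history, the same check gives the same answer every time it is run against the same code.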

Timeline

Limitations of Native Goal Commands

Claude Code's goal command struggles with context bloat during extended operations.
The lack of local state tracking makes the chat history the only source of truth for the agent.
Relying on a small model to judge task completion introduces subjective, unquantifiable evaluation metrics.

The native goal command uses a small model, Haiku, to evaluate completion status, which is often insufficient for long-running, complex tasks. Without a persistent local file system or knowledge base to track progress, agents suffer from context limits and performance degradation following chat compaction. Furthermore, the absence of task decomposition leads to unclear definitions of what 'done' looks like for the agent.
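
A rough sketch of the two completion checks described in the transcript may help. It is an illustration under assumptions, not Claude Code's implementation: `run_goal_loop`, `fake_agent`, and `fake_evaluator` are hypothetical names, and the evaluator is a stub standing in for the small model that returns a yes or no decision.

```python
from typing import Callable, List


def exact_match_done(last_output: str, marker: str) -> bool:
    """Ralph-loop-style check: done only if the exact marker string appears."""
    return marker in last_output


def run_goal_loop(
    agent_step: Callable[[List[str]], str],
    is_done: Callable[[List[str]], bool],
    max_turns: int = 10,
) -> List[str]:
    """Keep iterating until the evaluator says yes; the transcript is the only state."""
    transcript: List[str] = []
    for _ in range(max_turns):
        transcript.append(agent_step(transcript))
        if is_done(transcript):
            break
    return transcript


def fake_agent(transcript: List[str]) -> str:
    """Stub agent that finishes on its third turn."""
    turn = len(transcript) + 1
    return "all tests pass, DONE" if turn == 3 else f"working, turn {turn}"


def fake_evaluator(transcript: List[str]) -> bool:
    """Stub for the small-model judge: reads the conversation and answers yes or no."""
    return "tests pass" in transcript[-1]


result = run_goal_loop(fake_agent, fake_evaluator)
print(len(result), exact_match_done(result[-1], "DONE"))
```

Note that `transcript` is the only record of progress in this sketch. If it is truncated or summarized, nothing else remembers what was already done, which is the weakness the section describes.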

Goal Buddy Architecture and Agents

Goal Buddy coordinates three specialized roles: a judge, a scout, and a worker.
The system persists work state locally using YAML files instead of relying on chat history.
An oracle signal provides a clear, verifiable definition of task success before execution begins.

Goal Buddy replaces unstructured workflows with a three-agent team. The judge agent has read-only access and critically reviews hard decisions, the scout (also read-only) maps the active task into a compact evidence receipt, and the worker, the only agent with edit access, executes one task at a time. The PM role, which is the main thread, coordinates these agents, is the only authority that can mark a task as done, and keeps the workflow aligned with the objective defined by the oracle, which could be a test suite, a browser walkthrough, a benchmark, or another observable artifact.
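
The role split can be pictured as a small permission table. The video says the real agents are defined in TOML; the Python below is only a sketch of the access rules it describes, and `AgentRole`, `ROLES`, `authorize_edit`, and `can_mark_done` are hypothetical names. The worker's reasoning effort is left as `None` because the video does not state it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AgentRole:
    name: str
    can_edit: bool
    reasoning_effort: Optional[str]  # None where the video gives no value


# Access levels and reasoning effort as described in the video.
ROLES: Dict[str, AgentRole] = {
    "judge": AgentRole("judge", can_edit=False, reasoning_effort="high"),
    "scout": AgentRole("scout", can_edit=False, reasoning_effort="low"),
    "worker": AgentRole("worker", can_edit=True, reasoning_effort=None),
}


def authorize_edit(role_name: str) -> None:
    """Raise PermissionError if a read-only role tries to edit files."""
    role = ROLES[role_name]
    if not role.can_edit:
        raise PermissionError(f"{role.name} is read-only")


def can_mark_done(actor: str) -> bool:
    """Only the PM (the main thread) may mark a task as done."""
    return actor == "pm"


authorize_edit("worker")  # allowed, returns silently
print(can_mark_done("pm"), can_mark_done("judge"))
```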

Operational Workflow and Implementation

The 'goal prep' command initializes the workflow and removes ambiguity by querying the user for specifics.
Workflow progress is managed through 'slices,' which are small, easily verifiable tasks defined in a state.yaml file.
In the video's UI-design test, Goal Buddy produced a noticeably better result than the standard goal command on a vague, non-quantifiable task by forcing precise requirement gathering first.
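
According to the transcript, each slice lists its allowed files, how to verify it, and when to stop. A minimal sketch of that shape follows, with a check that a change stays inside the slice. The `Slice` fields, the helper name, and the file paths are all hypothetical; Goal Buddy's real schema is not shown in the video.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slice:
    slice_id: str
    allowed_files: List[str]  # files the worker may touch
    verify: str               # how the result is checked
    stop_when: str            # condition that ends the slice


def files_outside_slice(slice_: Slice, touched_files: List[str]) -> List[str]:
    """Return the touched files that the slice does not allow."""
    allowed = set(slice_.allowed_files)
    return [path for path in touched_files if path not in allowed]


example = Slice(
    slice_id="S1",
    allowed_files=["app/profile.py", "tests/test_profile.py"],
    verify="run the profile tests",
    stop_when="profile tests pass",
)

violations = files_outside_slice(example, ["app/profile.py", "app/billing.py"])
print(violations)
```

A slice that is "small" in this sense is one whose file list and verification step are narrow enough to check on their own, regardless of how much code it changes.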

To start, Goal Buddy creates a goal.md file and a state.yaml file that records the goal, the rules, the tasks with their IDs and assigned agents, and the currently active task. It runs tasks sequentially, one at a time, and updates a dashboard so users can monitor progress. In the video's two tests, Goal Buddy took longer than the native goal command but produced a better result, which the presenters attribute to forcing the agent to work against specific, verifiable success criteria. They also note that it did not use Claude Code's parallelization capabilities.
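
The sequential loop over a local state file can be sketched as follows. A plain dictionary stands in for state.yaml so the example needs no YAML library; the field names, task IDs, and functions (`new_state`, `pick_next`, `run_pm_loop`) are hypothetical and only mirror what the transcript describes: tasks with IDs and assigned agents, a field for the active task, and a final check against the oracle.

```python
from typing import Any, Callable, Dict, List, Optional

Task = Dict[str, str]
State = Dict[str, Any]


def new_state() -> State:
    """Hypothetical stand-in for the contents of state.yaml."""
    return {
        "oracle": "all tests pass",
        "active_task": None,
        "tasks": [
            {"id": "T1", "agent": "scout", "status": "todo"},
            {"id": "T2", "agent": "judge", "status": "todo"},
            {"id": "T3", "agent": "worker", "status": "todo"},
        ],
    }


def pick_next(state: State) -> Optional[Task]:
    """Mark the first to-do task as active and record it in the state."""
    for task in state["tasks"]:
        if task["status"] == "todo":
            task["status"] = "active"
            state["active_task"] = task["id"]
            return task
    return None


def run_pm_loop(
    state: State,
    run_agent: Callable[[Task], None],
    oracle_passes: Callable[[], bool],
) -> bool:
    """Run tasks one at a time, then let the oracle decide whether the goal is met."""
    while (task := pick_next(state)) is not None:
        run_agent(task)
        task["status"] = "done"
        state["active_task"] = None
    all_done = all(t["status"] == "done" for t in state["tasks"])
    return all_done and oracle_passes()


log: List[str] = []
state = new_state()
met = run_pm_loop(state, lambda task: log.append(task["id"]), lambda: True)
print(log, met)
```

Since progress lives in `state` rather than in the conversation, a resumed session could reload it and continue from the first task still marked to-do, which is the persistence benefit the summary describes.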
