Tools like Cursor or Devin are convenient. However, it's difficult to know exactly what's happening under the hood, and they sometimes butcher code in ways you didn't intend. For a backend developer, building your own optimized agent using the Python standard library and LLM APIs is far more economical and reliable.
To make an agent go beyond just writing code and actually execute terminal commands, you need to handle the subprocess module with precision. Blindly using the shell=True option exposes you to shell injection attacks, and a sloppy timeout handler leaves behind "zombie" processes that never die.
When implementing this yourself, set shell=False in your subprocess.run() calls and pass commands as a list. Keep timeouts short—around 30 seconds—and if a TimeoutExpired exception occurs, immediately call process.kill() to reclaim resources. You also don't need to send the entire execution result back to the model. If the text exceeds 1,000 characters, truncate it to return only the last 20 lines. This saves tokens while providing enough information for the model to identify the cause of an error.
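Put together, the execution side might look like the minimal sketch below. The names run_command, MAX_CHARS, and TAIL_LINES are illustrative, not from any particular library; the 1,000-character threshold and 20-line tail follow the figures above.

```python
import subprocess

MAX_CHARS = 1_000   # past this, send the model only the tail of the output
TAIL_LINES = 20     # how many trailing lines to keep when truncating

def run_command(cmd: list[str], timeout: int = 30) -> str:
    """Run an argv-list command without a shell, with a hard timeout."""
    # No shell=True: cmd is a list, so there is nothing to inject into.
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()          # reclaim the process immediately
        proc.communicate()   # drain pipes so no zombie is left behind
        return f"[timeout after {timeout}s]"
    if len(output) > MAX_CHARS:
        # the traceback or error message almost always lives at the end
        output = "\n".join(output.splitlines()[-TAIL_LINES:])
    return output
```

Because cmd is always a list, a malicious filename like `"; rm -rf ~"` stays an inert argument instead of becoming a shell command.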
As conversations grow longer, the data accumulating in the context window becomes a cost bomb. According to Anthropic, using cache_control markers with Claude 3.5 Sonnet can reduce the cost of reading cached data by 90%, bringing it down to roughly $0.30 per million input tokens.
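In the Anthropic Messages API, a cache breakpoint is just a cache_control field on a system block. The sketch below only assembles a request payload; build_request is a hypothetical helper, no API call is made, and the model name is an assumption.

```python
# Sketch: a Messages API payload whose static prefix is marked for caching.
def build_request(file_tree: str, history: list[dict]) -> dict:
    return {
        "model": "claude-3-5-sonnet-latest",
        "max_tokens": 4096,
        "system": [
            {
                "type": "text",
                "text": "You are a coding agent.\n\nProject file tree:\n" + file_tree,
                # Everything up to and including this block is cached;
                # later reads of this prefix cost ~10% of the base input price.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": history,
    }
```

The key discipline is keeping the cached block byte-identical across turns: any change to the prefix invalidates the cache, so volatile data must never leak above the breakpoint.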
To reduce costs, strictly separate system messages from user input. Fix static information, such as the project's entire file tree structure, at the top of the system prompt and set it as a cache point. When conversation history overflows, calculate the volume with tiktoken and use a hierarchical summarization approach where old messages are summarized via a separate LLM call. By maintaining only this summarized context at the top and applying a sliding window for recent messages, you can defend the model's reasoning accuracy while cutting costs by more than 40% in long development sessions.
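The summarize-then-slide scheme might be sketched as follows. Here count_tokens stands in for a tiktoken encoder (e.g. `tiktoken.get_encoding("cl100k_base")`) and summarize for the separate LLM call; both are injected so the windowing logic itself stays plain, testable Python. compact_history and the 8,000-token budget are illustrative.

```python
def compact_history(messages: list[dict], count_tokens, summarize, budget: int = 8000) -> list[dict]:
    """Keep a running summary at the top plus recent messages verbatim."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= budget:
        return messages  # still under budget, nothing to do

    # Walk backwards, keeping as many recent messages as fit in half the budget.
    recent, used = [], 0
    for m in reversed(messages):
        used += count_tokens(m["content"])
        if used > budget // 2:
            break
        recent.append(m)
    recent.reverse()

    # Everything older gets collapsed into one summary via a separate LLM call.
    old = messages[: len(messages) - len(recent)]
    summary = summarize("\n".join(m["content"] for m in old))
    return [{"role": "user", "content": f"[summary of earlier session]\n{summary}"}] + recent
```

Splitting the budget between summary headroom and the verbatim tail is a judgment call; the point is that old turns degrade gracefully into a summary instead of being silently dropped.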
Making an agent output an entire file again is crude and slow. The more output tokens there are, the higher the probability that the model will omit middle sections or hallucinate. The so-called "Edit Trick" guides the model to find and replace only the text (anchors) surrounding the necessary changes. This technique can reduce the volume of output tokens by up to 86% in real-world data.
Use Python's re module to build a function that extracts modifications from specific XML tags in the model's output and applies only those to local files. Furthermore, instead of stuffing every technical manual into the prompt, design the system to fetch only the necessary document snippets by attaching a lightweight vector DB like LanceDB. This architecture can make file modifications over 79% faster in practice and solves the chronic problem of models getting confused when working with large files.
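A minimal version of the edit applier could look like this. The `<edit>`/`<search>`/`<replace>` tag names are illustrative (any unambiguous delimiter scheme works), and the vector-DB retrieval side is omitted.

```python
import re

# The model emits only the anchor text and its replacement, wrapped in tags.
EDIT_RE = re.compile(
    r"<edit>\s*<search>(.*?)</search>\s*<replace>(.*?)</replace>\s*</edit>",
    re.DOTALL,  # anchors may span multiple lines
)

def apply_edits(source: str, model_output: str) -> str:
    """Apply every <edit> block in the model output to the source text."""
    for search, replace in EDIT_RE.findall(model_output):
        if search not in source:
            # Fail loudly: a missing anchor means the model hallucinated context.
            raise ValueError(f"anchor not found: {search[:40]!r}")
        source = source.replace(search, replace, 1)  # first occurrence only
    return source
```

Raising on a missing anchor is deliberate: it turns a hallucinated edit into an error the feedback loop can report, rather than a silent no-op or a corrupted file.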
Stop the manual labor of copying and pasting debugging messages for the model. Make the agent write pytest-based test code before it ever writes the actual implementation.
Simply build a feedback loop where the full traceback generated by a failed test is fed back to the model as-is for self-correction. However, for dangerous commands like rm or deploy, you must insert guardrails that require user approval using Python's input() function. Once this cycle is complete, a developer only needs to check the git diff summary provided by the agent and hit the commit button.
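The guardrail and the loop might be sketched like this. confirm_if_dangerous and self_correct_loop are hypothetical names; the ask parameter defaults to input() but is injectable purely so the approval gate can be exercised without a terminal.

```python
import subprocess
import sys

DANGEROUS = ("rm", "deploy")  # command prefixes that demand human sign-off

def confirm_if_dangerous(cmd: list[str], ask=input) -> bool:
    """Gate destructive commands behind explicit user approval."""
    if cmd and cmd[0] in DANGEROUS:
        answer = ask(f"About to run: {' '.join(cmd)!r}. Proceed? [y/N] ")
        return answer.strip().lower() == "y"
    return True  # safe commands pass through silently

def self_correct_loop(run_fix, max_rounds: int = 5) -> bool:
    """Run pytest; feed the full traceback back to the model until green."""
    for _ in range(max_rounds):
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "--tb=long", "-q"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return True  # all tests pass, hand the diff back to the human
        run_fix(result.stdout + result.stderr)  # model proposes and applies a fix
    return False  # escalate to the developer after max_rounds failures
```

The bounded max_rounds matters: a model stuck on the same failure will otherwise burn tokens forever, and five attempts is usually enough to tell a fixable bug from one that needs a human.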
Ultimately, the core of building an agent lies not in flashy frameworks, but in how sophisticatedly you refine and cache data between the terminal and the LLM. 600 lines of Python code you wrote yourself will reflect your intentions much better than a black-box tool with tens of thousands of lines.