What Basic Toolset Does an Agent Need


Everyone seems to be debating Agent toolsets lately: does giving the Agent a shell solve everything? After working on holon, I have come to realize it is not that simple.
Reading: Why abandon Read/Glob and switch entirely to shell
holon’s toolset has gone through several versions. It ultimately dropped specialized tools of the kind Claude Code provides, such as Read (file reading) and Glob (file-name pattern matching), and now relies entirely on shell for reading and searching. This matches Codex’s approach: its ExecCommand tool simply reads files with cat and searches code with rg, without defining a separate tool for each kind of "read" operation.
The reason is simple: shell is the "programming language" LLMs know best. Rather than making the model learn the parameter semantics of a custom Read tool, it is better to let it generate shell commands it has seen countless times in training. Every additional dedicated tool adds to the model’s cognitive load, whereas the shell interface is one the model has already mastered.
But relying solely on shell has a cost: output truncation. To keep long shell output from flooding the context, the framework caps the output of each command. For example, if the Agent uses cat on a large file, it may get only the first part, with the rest saved to an artifact file that takes further commands to read. Claude Code’s Read tool has a much higher truncation threshold than a general-purpose shell tool, so it can read a large file in one step with fewer round trips. In essence it is a trade-off: fewer tools mean a lower cognitive burden, but dedicated tools are more efficient in edge cases.
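The truncation behavior described above can be sketched roughly as follows. This is a minimal illustration, not holon’s or Codex’s actual implementation: the function name run_shell, the MAX_OUTPUT_CHARS limit, and the shape of the returned dictionary are all assumptions made for the example.

```python
import subprocess
import tempfile

MAX_OUTPUT_CHARS = 200  # hypothetical cap; real frameworks pick their own limits


def run_shell(command: str, limit: int = MAX_OUTPUT_CHARS) -> dict:
    """Run a shell command and cap how much output is returned to the model."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    output = proc.stdout + proc.stderr  # stdout first, then stderr
    if len(output) <= limit:
        return {"exit_code": proc.returncode, "output": output, "artifact": None}

    # Spill the full output to an artifact file so the agent can read the rest later.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".log", prefix="shell-output-", delete=False
    ) as artifact:
        artifact.write(output)
    notice = f"\n[truncated: {len(output) - limit} more chars in {artifact.name}]"
    return {
        "exit_code": proc.returncode,
        "output": output[:limit] + notice,
        "artifact": artifact.name,
    }
```

With a cap like this, reading a large file costs the Agent at least one extra command to fetch the remainder from the artifact, which is the round-trip overhead a dedicated Read tool with a higher threshold avoids.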
Writing: From sed to ApplyPatch, and the challenges of free grammar tools
But write operations can’t be fully handled with shell alone.
If you make the Agent edit with sed alone, complex multi-line matches become hard to get right: a single mistake in line breaks, escaping, or indentation makes the edit fail. That is why many systems provide a tool such as Replace String, where the Agent passes a large old_string to match exactly and a new_string to replace it with. It is clumsy, but more stable than sed.
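A Replace String style tool can be sketched in a few lines. The parameter names old_string and new_string come from the text above; the function name, the uniqueness check, and the error messages are assumptions for illustration and do not reproduce any particular framework’s behavior.

```python
from pathlib import Path


def replace_string(path: str, old_string: str, new_string: str) -> None:
    """Replace exactly one occurrence of old_string in the file at path."""
    file = Path(path)
    text = file.read_text()
    count = text.count(old_string)
    if count == 0:
        # Typical cause: the model misremembered whitespace or indentation.
        raise ValueError("old_string not found; re-read the file and retry")
    if count > 1:
        # An ambiguous match is refused rather than guessed at.
        raise ValueError(f"old_string matches {count} places; add more context")
    file.write_text(text.replace(old_string, new_string, 1))
```

Because the match is a literal substring rather than a regular expression, multi-line blocks need no escaping, which is where this approach is steadier than sed.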
Codex went further, inventing its own ApplyPatch tool, enabling the Agent to generate patches directly for batch editing. holon borrowed this idea.
However, putting this into practice hit a snag: Codex uses a simplified patch format defined by OpenAI itself, combined with a special mechanism called free grammar tool to pass the patch to the tool.
Why create a new mechanism? Because standard LLM tool definitions take the form tool(args) with JSON parameters. Passing a patch as a JSON string involves heavy escaping: newlines become \n, quotes need backslashes, and indentation must be preserved exactly. Writing patches is already error-prone for an Agent; a layer of JSON escaping on top adds a second source of errors. The idea behind a free grammar tool is to pass the raw patch text directly as the tool input, with no JSON encoding, so what the model writes is what the tool receives. This greatly reduces the error rate when generating patches.
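To make the escaping overhead concrete, here is a small sketch that wraps a patch in JSON arguments the way a standard tool call would. The patch content, the file name greet.py, and the argument name "patch" are invented for the example.

```python
import json

# A small patch as a model would naturally write it (file name is made up).
raw_patch = "\n".join([
    "--- a/greet.py",
    "+++ b/greet.py",
    "@@ -1,2 +1,2 @@",
    " def greet(name):",
    '-    print("hello", name)',
    '+    print("hello,\\t" + name)',  # the patch text contains a literal backslash
    "",
])

# Standard tool calling: the patch travels as a string inside JSON arguments.
json_arguments = json.dumps({"patch": raw_patch})

# Every newline, quote, and backslash now needs an escape the model must emit correctly.
print(json_arguments)
print(f"backslashes in raw patch: {raw_patch.count(chr(92))}")
print(f"backslashes in JSON form: {json_arguments.count(chr(92))}")

# Free-form input skips this layer: the tool would receive raw_patch unchanged.
assert json.loads(json_arguments)["patch"] == raw_patch
```

The JSON form is a single line in which every line break of the patch has become an escape sequence; one wrong escape yields a patch that either fails to parse or applies to the wrong text.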
Currently, only OpenAI’s Codex interface supports this mechanism. holon aims to be compatible with multiple model providers, so it can’t rely solely on this.
Therefore, holon’s approach is to inject a different ApplyPatch definition depending on the model. For models that support free grammar, it uses the raw patch format; for the others, it accepts the standard git diff format. I believe that, having been trained on vast numbers of diffs from GitHub, LLMs are quite proficient with the git diff format. In practice this works reasonably well: errors still happen, but most of the time the output is correct, and with more training data this ability will only improve. Still, I recommend that all model providers support free grammar tools, as they are a real necessity for code-writing scenarios in Agents.
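One way to picture the per-model injection is a function that picks a tool definition from a capability flag. Everything here is hypothetical: ModelProfile, supports_freeform_tools, and the dictionary fields are illustrative names, not holon’s code and not any provider’s wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    supports_freeform_tools: bool  # hypothetical capability flag


def apply_patch_definition(model: ModelProfile) -> dict:
    """Return the ApplyPatch tool definition suited to this model (illustrative schema)."""
    if model.supports_freeform_tools:
        # Raw text input: the model writes the simplified patch directly.
        return {
            "name": "ApplyPatch",
            "input": "freeform",
            "patch_format": "simplified-patch",
            "description": "Send the patch as raw text, with no JSON wrapping.",
        }
    # Fallback: an ordinary JSON tool whose single argument is a git diff.
    return {
        "name": "ApplyPatch",
        "input": "json",
        "patch_format": "git-diff",
        "description": "Send a standard git diff in the 'patch' argument.",
        "parameters": {
            "type": "object",
            "properties": {"patch": {"type": "string"}},
            "required": ["patch"],
        },
    }
```

The tool keeps the same name in both cases, so the rest of the Agent loop does not need to know which variant a given model received; only the patch parser behind it differs.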
Scheduling: Long-running commands and task abstraction
The third issue is that shell commands run by the Agent may not finish quickly: starting a dev server, running tests, or building a project can take a long time, or never exit at all. Early Agent frameworks handled this crudely: they either blocked synchronously, which froze the Agent, or sent every command to the background, which led the Agent to run the same command again and again.
Now the industry is gradually converging on a basic consensus: do not expose a "foreground/background" choice to the Agent, because the Agent cannot make that judgment reliably. A better approach is to set a timeout threshold; if a command exceeds it, the runtime moves it to the background automatically, completely transparent to the Agent. The Agent does not need to decide in advance whether to background a command; the runtime handles it.
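The timeout-then-background idea can be sketched as below. This is a simplified illustration under assumed names (run_with_auto_background, timeout_s, the result dictionary); a real runtime would also register the process somewhere so it can be managed later.

```python
import subprocess
import tempfile
from pathlib import Path


def run_with_auto_background(argv: list[str], timeout_s: float = 2.0) -> dict:
    """Run a command; if it outlives timeout_s, leave it running and return a handle."""
    # Output goes to a log file, so a long-running process never blocks on a full pipe.
    with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as log:
        proc = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
    log_path = Path(log.name)
    try:
        exit_code = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Still running: hand back a handle instead of blocking the agent.
        return {"status": "backgrounded", "process": proc, "log_path": str(log_path)}
    return {"status": "finished", "exit_code": exit_code, "output": log_path.read_text()}
```

The Agent issues the same call for a quick ls and for a dev server; only the returned status tells it which case it hit, so no foreground/background decision is asked of the model.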
But automatically backgrounding is just the first step. Once in the background, real engineering problems surface—and currently, there’s no standard solution.
First, how to read output. A background task may still be running or may already have finished, and its output can be large. API semantics also differ: some systems use polling, others push events.
Second, how to stop tasks. Different systems have cancel mechanisms, but should cancellation be immediate kill or graceful exit? Should partial output already generated be preserved?
Third, who wakes up the Agent. After the Agent sends a task to the background and goes to sleep, who wakes it when the task ends? This requires deep integration between the runtime and Agent scheduling, and cannot be solved purely at the tool level.
These three issues—reading output, stopping tasks, waking the Agent—together form the complete lifecycle management of background tasks. While many implementations support "background execution," there’s no standardized management scheme yet. This could be a key point for the next stage of Agent toolchain evolution.
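Since no standard scheme exists, the following is only one possible shape for the three lifecycle operations, written as a sketch. The class name BackgroundTask, the offset-based polling read, the grace period before a hard kill, and the on_exit callback standing in for "wake the Agent" are all design assumptions, not a description of holon, Codex, or Claude Code.

```python
import subprocess
import tempfile
import threading
from pathlib import Path
from typing import Callable, Optional


class BackgroundTask:
    """One backgrounded command: readable, cancellable, and able to notify on exit."""

    def __init__(self, argv: list[str], on_exit: Callable[["BackgroundTask"], None]):
        with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as log:
            self.proc = subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
        self.log_path = Path(log.name)
        self.exit_code: Optional[int] = None
        self.done = threading.Event()
        # Wake-up: a watcher thread waits for the process and then calls the scheduler.
        self._watcher = threading.Thread(target=self._watch, args=(on_exit,), daemon=True)
        self._watcher.start()

    def _watch(self, on_exit: Callable[["BackgroundTask"], None]) -> None:
        self.exit_code = self.proc.wait()
        self.done.set()
        on_exit(self)

    def read_output(self, offset: int = 0, limit: int = 4096) -> tuple[str, int]:
        """Polling read: return up to `limit` bytes from `offset`, plus the next offset."""
        with self.log_path.open("rb") as f:
            f.seek(offset)
            chunk = f.read(limit)
        return chunk.decode(errors="replace"), offset + len(chunk)

    def cancel(self, grace_s: float = 2.0) -> None:
        """Ask the process to exit (SIGTERM on POSIX); hard-kill it after grace_s."""
        if self.done.is_set():
            return
        self.proc.terminate()
        if not self.done.wait(grace_s):
            self.proc.kill()
            self.done.wait()
        # The log file is left in place, so partial output survives cancellation.
```

In this sketch the callback runs on the watcher thread; a real runtime would more likely enqueue an event for the Agent’s scheduler there, which is the runtime-level integration the third question is about.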
It is too early to blindly adopt a ready-made pattern
So, returning to the initial question: shell can handle 80% of the work, but the remaining 20% (precise editing, matching patch formats to model capabilities, scheduling long tasks) is precisely what determines whether an Agent can go from a demo to a truly usable system.
The choice of toolset is far more than just "wrap a shell," and it’s far from the point where everyone can blindly apply a ready-made pattern. That’s why Codex and Claude Code have different answers to these fundamental issues, and holon makes different trade-offs based on its own scenarios. There’s still much room for exploration and improvement.