The Tool Abstraction Problem: Why Your APIs Make Terrible LLM Tools
If you are building AI agents, you’ve likely run into the wall. You give your agent a list of neatly formatted tools, ask it to perform a moderately complex task, and watch as it hallucinates arguments, calls the wrong endpoints, or gets stuck in an infinite loop.
I have built north of 10,000 tools, dating back to the days before the Model Context Protocol (MCP) even existed, when we relied on a custom protocol (the Open Execution Protocol) to wrangle unpredictable GPT-3 JSON outputs. I can tell you that the root of the problem isn’t the model. It’s the abstraction.
We are treating large language models like traditional software, and it is breaking our agents. Here is how to architect tools that models can actually use, moving from User Experience to a true Machine Experience (MX).
1. The Fallacy of Auto-Generated Tools: APIs vs. Tools
There is a massive misconception in the agentic engineering space: “I have an API. I will just run a generic ‘API-to-Tool’ generator to turn all my endpoints into tools.” Do not do this. It will fail.
The abstraction of an API and the abstraction of an LLM tool serve two entirely different audiences.
- APIs are for code-to-code interfaces; you design them for another programmer or a strictly defined system interface.
- Tools are for LLM-to-code interfaces; you design them for a probabilistic engine.
Sometimes the necessary abstraction is higher, sometimes it’s lower, but it is always fundamentally different.
2. Composition Over Chaining: The Hardest Problem in AI
If you look at the Berkeley Function Calling Leaderboard, Nestful, or almost any recent paper on tool composition, they all point to the same glaring issue: The Chaining Problem.
Asking an LLM to sequence multiple granular tools to complete a standard task (e.g., get_user_id → find_calendar → get_event_id → submit_ticket) has a greater than 50% failure rate.
The Solution: Move the composition logic inside the tool.
Instead of forcing the LLM to orchestrate a fragile sequence, provide a single, composite, intent-based tool, such as find_calendar_and_submit_complaint, that handles the sequencing internally. While this might sound overly specific, both practical application and academic work (such as Apple’s Tool Sandbox paper) indicate that the accuracy of tool selection and execution rises sharply when a higher-level abstraction reduces the sequencing burden on the LLM.
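As a rough illustration of moving the chain inside the tool, here is a minimal Python sketch. The in-memory dictionaries, the ToolResult type, and the parameter names are assumptions made up for this example; a real tool would call your own backend at each step.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-ins for backend services (assumptions for
# this sketch); a real tool would call your own APIs here.
USERS = {"ada@example.com": "u-1"}
CALENDARS = {"u-1": "cal-9"}
EVENTS = {("cal-9", "Quarterly review"): "evt-42"}


@dataclass
class ToolResult:
    ok: bool
    message: str


def find_calendar_and_submit_complaint(
    email: str, event_title: str, complaint: str
) -> ToolResult:
    """Submit a complaint ticket about a calendar event for the given user."""
    # Each lookup below would otherwise be a separate tool call that the
    # model has to sequence correctly on its own.
    user_id = USERS.get(email)
    if user_id is None:
        return ToolResult(False, f"No user found for {email}.")

    calendar_id = CALENDARS.get(user_id)
    if calendar_id is None:
        return ToolResult(False, f"No calendar found for {email}.")

    event_id = EVENTS.get((calendar_id, event_title))
    if event_id is None:
        return ToolResult(False, f"No event titled {event_title!r} was found.")

    # Stand-in for the real ticket submission call.
    ticket_id = f"ticket-{event_id}"
    return ToolResult(True, f"Submitted {ticket_id}: {complaint}")
```

The model now makes one call with three plain arguments, and every failure comes back as a readable message it can act on instead of a broken chain.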
3. Task and Intent-Oriented Design
Agents operate best when they process tasks like a to-do list. Anthropic introduced much of the ecosystem to this style of reasoning, which resembles Gherkin’s Given/When/Then scenarios.
When you name and design your tools, model them around intents. A generic get_user tool is often too vague for an agent to confidently select. A tool named track_order_and_return_report gives the agent an explicit task to match against its current objective. Rely on context-based customer or entity identification within the tool’s backend logic, rather than forcing the agent to extract and provide unnecessary rigid fields just to satisfy an API schema.
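One way to read the advice on context-based identification is sketched below in Python. SessionContext, ORDERS, and the order_hint parameter are hypothetical names invented for this example; the point is only that the application, not the model, supplies the customer identity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionContext:
    # Assumption: filled in by the application from the authenticated
    # session, never extracted or supplied by the model.
    customer_id: str


# Hypothetical order store standing in for a real backend.
ORDERS = {
    "c-7": [
        {"order_id": "o-100", "status": "shipped"},
        {"order_id": "o-101", "status": "processing"},
    ]
}


def track_order_and_return_report(ctx: SessionContext, order_hint: str = "") -> str:
    """Track the customer's orders and return a short status report."""
    orders = ORDERS.get(ctx.customer_id, [])
    if order_hint:
        # The model may pass a loose hint; an exact ID is not required.
        orders = [o for o in orders if order_hint in o["order_id"]]
    if not orders:
        return "No matching orders found."
    return "\n".join(f"{o['order_id']}: {o['status']}" for o in orders)
```

The only argument exposed to the model is an optional free-text hint, so there is no rigid field for it to hallucinate.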
4. The 10x Lever: Description Quality
If you want a 10x improvement in your agent’s reliability, stop tweaking the system prompt and start optimizing your tool descriptions.
Across thousands of evaluations, the quality of the tool description is the single heaviest influence on whether an LLM will select the right tool. Why? It comes down to context window positioning. Frameworks typically place schemas and tool definitions at the very bottom of the context window. As “needle in a haystack” tests suggest, where text sits in the context affects how much weight it gets. This placement makes the descriptions among the last things the model reads before generating an output, so the LLM relies heavily on them.
Rules for Writing Tool Descriptions:
- Keep it brief: Do not exceed 600 words.
- Structure matters: Start with a strong action verb, followed by a short, task-enabled, intent-focused sentence.
- Iterate: Research (such as the Hesh et al. paper) shows that models can jump from failing at basic zero-shot tasks to successfully navigating complex activities solely through optimized descriptions.
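The first two rules are mechanical enough to check automatically. The sketch below is one possible lint, not an established tool: the ACTION_VERBS set is an assumed, deliberately small list you would replace with your own, and the word count is a simple whitespace split.

```python
MAX_WORDS = 600  # the upper bound from the rules above

# Assumed starter list of action verbs; extend it for your own domain.
ACTION_VERBS = {
    "track", "find", "submit", "create", "cancel", "send", "search", "update",
}


def lint_description(description: str) -> list[str]:
    """Return a list of rule violations for a tool description."""
    words = description.split()
    if not words:
        return ["Description is empty."]

    problems = []
    if len(words) > MAX_WORDS:
        problems.append(f"Description has {len(words)} words; the limit is {MAX_WORDS}.")

    # Compare the first word, minus trailing punctuation, with the verb list.
    first = words[0].strip(".,:;!?").lower()
    if first not in ACTION_VERBS:
        problems.append(f"Description starts with {words[0]!r}, not an action verb.")
    return problems
```

Running a check like this in CI keeps descriptions from drifting as tools are added, though it cannot judge whether a description is actually clear to a model; that still needs evaluations.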
5. Surviving the Tool Count Cliff
Agents suffer from cognitive overload. If you hand an agent more than ~20 tools, you hit the “Tool Count Cliff” (referenced in Apple’s Tool Sandbox paper), and its selection accuracy will plummet.
How do you build complex systems with this limitation?
- Progressive Discovery / Dynamic Selection: Introduce context and tools dynamically at runtime as the agent progresses through a task. This helps manage the token limit and keeps the context window focused (similar to the Redis paper on token reduction). Note that this helps manage limits but doesn’t fix poor abstractions!
- Handling High Tool Counts / Enumeration: What if you need variations of a tool? If you absolutely must, enumerate the variants and give each a distinct name. However, if an agent needs more than 40 tools, its scope is too broad. Break the architecture down into specialized sub-agents or sub-tasks (as seen in the Block/Square paper on self-discovering flows), each armed with a focused, task-specific toolkit.
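To make dynamic selection concrete, here is a deliberately naive Python sketch that exposes only the tools whose descriptions overlap with the current task, capped at the roughly 20-tool limit discussed above. The TOOLKITS registry, the stopword list, and the keyword-overlap scoring are all assumptions for illustration; production systems often use embedding search or a routing model instead.

```python
import re

MAX_TOOLS = 20  # roughly the cliff described above

# Hypothetical registry: focused toolkits, each mapping tool name to description.
TOOLKITS = {
    "orders": {
        "track_order_and_return_report": "Track an order and return a shipping status report.",
        "cancel_order_and_refund": "Cancel an order and refund the payment.",
    },
    "calendar": {
        "find_calendar_and_submit_complaint": "Submit a complaint about a calendar event.",
    },
}

# Assumed minimal stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "and", "the", "is", "my", "i", "to", "about"}


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def select_tools(task: str, limit: int = MAX_TOOLS) -> list[str]:
    """Return the names of the tools most relevant to the task, best first."""
    task_tokens = _tokens(task)
    scored = []
    for toolkit in TOOLKITS.values():
        for name, description in toolkit.items():
            overlap = len(task_tokens & _tokens(description))
            if overlap:
                scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]
```

For example, select_tools("Where is my order? I need a shipping status") ranks the order-tracking tool first and leaves the calendar tool out of the context entirely.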
6. Testing and Determinism
Finally, if you are introducing non-determinism (an LLM) into your system, the deterministic part (the tool execution) must be 100% reliable.
Far too many developers skip standard unit testing for their tools. Use pytest, Jest, or your testing framework of choice to verify that when the LLM outputs the correct arguments, the tool executes correctly every time.
Pair this with multi-model evaluations (using suites like Arcade) to ensure your abstractions hold up regardless of the underlying foundation model.
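A unit test for a tool can be as plain as the following Python sketch, written so that pytest would collect the test_ functions without any extra setup. The submit_ticket function and its validation rules are hypothetical stand-ins for one of your own tools.

```python
def submit_ticket(event_id: str, complaint: str) -> dict:
    """Submit a complaint ticket for an event (hypothetical tool body)."""
    if not event_id.strip():
        raise ValueError("event_id must be a non-empty string")
    if not complaint.strip():
        raise ValueError("complaint must be a non-empty string")
    return {"ticket_id": f"ticket-{event_id}", "status": "submitted"}


def test_valid_arguments_succeed():
    result = submit_ticket("evt-42", "Room was double-booked")
    assert result == {"ticket_id": "ticket-evt-42", "status": "submitted"}


def test_same_arguments_give_same_result():
    # The deterministic half of the system should be repeatable.
    first = submit_ticket("evt-42", "Room was double-booked")
    second = submit_ticket("evt-42", "Room was double-booked")
    assert first == second


def test_blank_arguments_are_rejected():
    for args in [("", "complaint"), ("evt-42", "   ")]:
        try:
            submit_ticket(*args)
        except ValueError:
            continue
        raise AssertionError(f"Expected ValueError for arguments {args!r}")
```

Tests like these cover the case the section describes: given correct arguments, the tool must behave the same way every time, and given malformed ones, it must fail loudly with an error the agent can read.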
Conclusion
Building tools for LLMs isn’t about exposing your backend; it’s about translating human intent into machine-executable actions. Focus on the description, abstract the chains away, test the deterministic parts thoroughly, and build for the machine experience.
Referenced Research & Resources
- Apple Tool Sandbox: Insights on the Tool Count Cliff and composite tools.
- Nestful / Berkeley Function Calling Leaderboard: Data on chaining failure rates.
- Block/Square paper: Research on self-discovering flows and handling sub-agents.
- Redis paper: Strategies for token reduction and dynamic context.
- Hesh et al.: Research on description optimization jumping models from zero-shot failure to success.
- Arcade tool patterns (by Guru): Practical repository for modeling domain-specific tasks and multi-model evaluations.