
Why AI Agent Design Is Still Hard: Key Challenges & Solutions


Have you ever wondered why building AI agents feels like navigating a maze? Even with all the tools and models available today, putting together an effective agent system involves a lot of trial and error. In this post, I’ll share some practical insights from my recent experiences working on agents, focusing on the challenges and lessons learned. We’ll cover everything from choosing the right SDK to handling caching, reinforcement, and more. If you’re a developer or someone with a technical background looking to build or improve agents, this should give you a solid starting point.

Let’s start with the basics: what makes agent design tricky? At its core, an agent is a loop that processes inputs, calls tools, and generates outputs. But as you dive deeper, differences in models, tools, and workflows start to complicate things. I’ll break it down section by section, answering common questions along the way.

Choosing the Right SDK for Your Agent

When you’re starting to build an agent, one of the first decisions is which software development kit (SDK) to use. You might be asking, “Should I go with a low-level SDK like OpenAI’s or Anthropic’s, or something higher-level like the Vercel AI SDK or Pydantic AI?” Based on what I’ve seen, it’s not always straightforward.

We initially chose the Vercel AI SDK, but only for its provider abstractions, and handled the agent loop manually. Looking back, that might not have been the best call. Why? Higher-level SDKs try to unify things across models, but when you get into real agent work—especially with tools—the differences between models become too big to ignore. For example, abstractions can break when dealing with provider-specific tools, like Anthropic’s web search tool, which can mess up message history.

Here’s a quick comparison of approaches in a table to make it clearer:

| Approach | Pros | Cons |
| --- | --- | --- |
| Low-level SDK (e.g., OpenAI or Anthropic) | Full control over the agent loop; clearer error messages; easier cache management | More manual work to set up |
| Higher-level SDK (e.g., Vercel AI SDK) | Simplifies provider switching; good for basic setups | Abstractions don’t fit complex agents; issues with tool unification |

If you’re wondering how to decide, think about your tools. If they’re provider-specific, stick to the native SDK. We found that building our own abstractions on top of native SDKs gave us better flexibility. And if you’ve had success with a different setup, I’d love to hear about it—maybe there’s a better way out there.
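
To make that concrete, here is a minimal sketch of the kind of thin wrapper meant here: the native Anthropic SDK makes the model call, and your own code owns the loop, so provider-specific behaviour stays visible. The model id and the `run_turn` helper are illustrative assumptions, not a prescribed API.

```python
# Minimal sketch: a thin wrapper over the native Anthropic SDK.
# The wrapper only hides boilerplate; the caller still owns the agent
# loop, so provider-specific details (server-side tools, cache control,
# stop reasons) remain visible instead of being abstracted away.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def run_turn(system_prompt: str, messages: list[dict], tools: list[dict]):
    """Run one model turn; the caller decides how to loop and handle tools."""
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; substitute your own
        max_tokens=2048,
        system=system_prompt,
        messages=messages,
        tools=tools,
    )
```

Because the wrapper is yours, adding a provider-specific feature later is a local change rather than a fight with someone else’s abstraction.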

Lessons on Caching in Agents

Caching is one of those things that sounds simple but can make or break your agent’s efficiency. You might ask, “How do different platforms handle caching, and which is better for agents?” Platforms vary a lot. Some, like Anthropic, require you to manage cache points explicitly and even charge for it. At first, that seemed cumbersome—why not let the platform handle it automatically?

But after working with it, explicit caching turned out to be a game-changer. It gives you predictability in costs and utilization. For instance, you can branch conversations in different directions or edit context without guesswork. In contrast, automatic caching on other platforms can be inconsistent.

Here’s how we implement caching with Anthropic:

  1. Set a cache point right after the system prompt.
  2. Add two more at the conversation’s start, with one advancing as the conversation grows.
  3. Optimize by feeding dynamic info (like current time) in later messages to avoid invalidating the cache.

This approach keeps things static where possible, reducing cache thrash. If you’re dealing with high costs, explicit management lets you reason about when you’ll get cache hits instead of hoping the platform guesses right. Have you tried splitting a conversation into branches? It’s a powerful technique, and this level of control is exactly what enables it.
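
Here is a minimal sketch of those cache points using Anthropic’s prompt caching, assuming the current Python SDK; the model id is a placeholder. The static parts carry `cache_control` markers, and the dynamic detail (the current time) comes after them so the cached prefix never changes.

```python
# Sketch of explicit cache breakpoints with Anthropic prompt caching.
# Static content is marked with cache_control; dynamic info (the current
# time) is placed after the cache points so it never invalidates them.
import datetime

import anthropic

client = anthropic.Anthropic()

SYSTEM = [{
    "type": "text",
    "text": "You are a task agent. Follow the workflow exactly.",
    "cache_control": {"type": "ephemeral"},  # cache point right after the system prompt
}]

messages = [{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Here is the task description and the reference material...",
            "cache_control": {"type": "ephemeral"},  # cache point at the conversation start
        },
        {
            # Dynamic detail goes after the cache points, so the cached
            # prefix stays byte-for-byte identical across turns.
            "type": "text",
            "text": f"Current time: {datetime.datetime.now().isoformat()}",
        },
    ],
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model id
    max_tokens=1024,
    system=SYSTEM,
    messages=messages,
)
```

The moving breakpoint from step 2 would be attached to a later message and re-attached as the conversation grows; it is omitted here to keep the sketch short.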

Using Reinforcement in the Agent Loop

Reinforcement might sound advanced, but it’s essentially feeding extra guidance back into the loop after tool calls. A common question is, “How can I keep my agent on track during long tasks?” That’s where reinforcement shines.

Every tool call is a chance to remind the agent of the overall goal, update task status, or provide hints on failures. For example, if a tool fails, inject a message suggesting alternatives. Or, if background state changes (like in parallel processing), notify the agent.

Sometimes, the agent can reinforce itself. Take a “todo write” tool: it just echoes back a list of tasks the agent thinks it needs. It doesn’t do much, but it helps maintain focus as the context grows.

We also use it for recovery: if a retry starts from bad data, reinforce with a message to backtrack. Here’s a step-by-step way to add reinforcement:

  1. After a tool call, check for relevant updates (e.g., state changes or failures).
  2. Craft a concise message summarizing them.
  3. Inject it into the next loop iteration.

This keeps the agent adaptive without overwhelming the context.
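
As a rough illustration, here is what that injection step can look like in plain Python. The helper names are made up for this example rather than taken from any SDK.

```python
# Sketch of reinforcement in an agent loop: after each tool call, fold
# any relevant update (failure hint, background state change, goal
# reminder) into the next message the model sees.

def build_reinforcement(
    goal: str, tool_name: str, tool_ok: bool, state_changes: list[str]
) -> str | None:
    """Collect anything worth reminding the agent about after a tool call."""
    notes: list[str] = []
    if not tool_ok:
        notes.append(f"The {tool_name} call failed; consider an alternative approach.")
    notes.extend(state_changes)  # e.g. "background job #3 finished"
    if not notes:
        return None  # nothing to reinforce this iteration
    notes.append(f"Remember the overall goal: {goal}")
    return "\n".join(notes)


def inject_reinforcement(messages: list[dict], reinforcement: str | None) -> None:
    """Append the reminder so it rides along with the next loop iteration."""
    if reinforcement:
        messages.append({"role": "user", "content": reinforcement})
```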

Isolating Failures to Keep Agents Running Smoothly

Failures are inevitable in agents, especially with code execution. You might wonder, “What if a failure derails the whole process?” The key is isolation.

One way is subagents: run error-prone tasks in isolation until they succeed, then report back only the success and a summary of what didn’t work. This prevents cluttering the main context while still sharing lessons.

Another option, if your model supports it (like Anthropic), is context editing. Remove failed attempts that don’t contribute to progress, saving tokens for later steps. But be cautious—it invalidates caches, so weigh the cost.

Pros and cons of isolation methods:

  • Subagents: Keeps main loop clean; allows learning from summaries.
  • Context Editing: Preserves tokens for later steps, but invalidates caches and requires careful selection of what to remove.

In practice, knowing what didn’t work helps avoid repeats, but you don’t always need the full details.
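
Here is a sketch of the subagent variant, assuming a hypothetical `run_subagent` callable that stands in for your own sub-inference loop and returns a success flag plus its output.

```python
# Sketch of failure isolation: the error-prone work runs in its own loop,
# and only the final result plus one-line summaries of failed attempts
# are reported back to the main context.
from dataclasses import dataclass, field


@dataclass
class SubagentReport:
    result: str                                                # the successful output
    failed_attempts: list[str] = field(default_factory=list)   # short summaries only


def run_isolated(task: str, run_subagent, max_attempts: int = 3) -> SubagentReport:
    failures: list[str] = []
    for attempt in range(max_attempts):
        ok, output = run_subagent(task, hints=failures)  # earlier failures passed as hints
        if ok:
            return SubagentReport(result=output, failed_attempts=failures)
        failures.append(f"attempt {attempt + 1}: {output[:120]}")  # keep the summary terse
    raise RuntimeError(f"subagent gave up after {max_attempts} attempts")
```

The main loop sees `report.result` and the short failure notes, not the full transcripts of every attempt.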

Working with Subagents and Shared State

As agents get more complex, you often need subagents or sub-inference loops. A frequent question: “How do I handle data sharing between tools?” Our solution is a shared virtual file system.

This acts like a common storage where tools can read and write. For code-based agents, it’s crucial. Imagine generating an image in one tool, then zipping it in another—without shared state, you’re stuck.

Tools need to support file paths:

  • An ExecuteCode tool accesses the file system.
  • A RunInference tool does the same, taking paths as inputs.

Avoid dead ends: make sure no tool locks data away where the others can’t reach it. If you unpack a zip with the code tool, the inference tool should be able to read those files next.
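
Here is a toy, in-memory version of that idea; it illustrates the pattern rather than reproducing an actual implementation, and the two tool functions are simplified stand-ins.

```python
# Sketch of a shared virtual file system: every tool reads and writes by
# path, so the output of one tool (say, a generated file) is directly
# usable as the input of the next.
class VirtualFileSystem:
    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read(self, path: str) -> bytes:
        return self._files[path]

    def listdir(self, prefix: str = "") -> list[str]:
        return [p for p in self._files if p.startswith(prefix)]


# Tools exchange paths instead of passing blobs through the model context.
def execute_code_tool(vfs: VirtualFileSystem, code: str, output_path: str) -> str:
    vfs.write(output_path, f"result of: {code}".encode())  # stand-in for a real sandbox
    return output_path


def run_inference_tool(vfs: VirtualFileSystem, input_path: str) -> str:
    data = vfs.read(input_path)  # the downstream tool sees the same file
    return f"summarized {len(data)} bytes from {input_path}"
```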

The Role of Output Tools in Agents

How does an agent communicate final results? In our setup, it’s not a chat—it’s task-oriented. We use an “output tool” that the agent calls explicitly, like sending an email.

But steering its tone is harder than expected. Why? Models might not handle it as naturally as direct text output. We tried a sub-LLM (like Gemini 2.5 Flash) for tone adjustment, but it added latency and sometimes leaked unwanted info.

If the tool isn’t called, we reinforce with a message to encourage it. Steps for implementing an output tool:

  1. Define the tool in your prompt, specifying when to use it.
  2. Monitor if it’s called at loop end.
  3. If not, inject reinforcement and retry.

This ensures completion, but tone remains a challenge.
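
As a sketch, the tool definition below follows the standard Anthropic tool schema, while the `send_output` name and the fallback check are assumptions made for this example.

```python
# Sketch of an explicit output tool plus the reinforcement fallback used
# when the agent ends a turn without calling it.
OUTPUT_TOOL = {
    "name": "send_output",
    "description": "Send the final result to the user. Call exactly once when the task is done.",
    "input_schema": {
        "type": "object",
        "properties": {
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["subject", "body"],
    },
}


def ensure_output_called(messages: list[dict], called_output_tool: bool) -> bool:
    """At the end of the loop, nudge the agent if it never produced output."""
    if called_output_tool:
        return True
    messages.append({
        "role": "user",
        "content": "You have not sent the final result yet. Call send_output with the finished answer.",
    })
    return False  # caller runs one more iteration of the loop
```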

Selecting Models for Agents

Model choice depends on the task. You might ask, “Which models are best for tool calling in agents?” Haiku and Sonnet stand out: they call tools reliably and respond well to reinforcement messages.

For sub-tools like summarizing documents or extracting data from images, Gemini 2.5 handles large inputs well and is less likely to trip safety filters than some alternatives.

Remember, cheaper tokens don’t always mean cheaper agents—a better caller uses fewer overall. Our picks haven’t shifted much recently.

Challenges in Testing and Evaluations

Testing agents is tough because of their dynamic nature. “How do I eval an agent effectively?” You can’t just plug into external systems; you need to instrument real runs or use observability data.

We’ve tried various solutions, but none fully satisfy. It’s frustrating, as evals are key to improvement. If you have tips, share them.

Updates on Coding Agents

For coding tools, I’m exploring Amp more. Why? Its subagent interactions, such as the way the Oracle works with the main loop, are well designed. It feels like a tool built by people who use it, unlike some others.

No major changes otherwise—still focused on execution and generation.

Screenshot: a terminal session running Tmux alongside Claude Code.

In the image above, you can see an example of using Tmux to run Python interactively in a coding agent setup. It’s a practical way to give agents interactive skills.

Other Insights and Resources

Here are some additional points worth considering:

  • Simplifying tools for browser agents: Use minimal CLI commands like start, navigate, evaluate JS, and screenshot via Bash. This keeps context small. I built a skill from it for web browsing.
  • The shift in open source: Small libraries might fade as AI generates utilities on demand.
  • Tmux for interactive systems: Great for agents—give them Tmux skills for better control.
  • LLM APIs as a synchronization point: a separate topic, but relevant to how agents communicate.

FAQ

What is an AI agent in simple terms?

An AI agent is a system that loops through processing inputs, using tools, and producing outputs to complete tasks autonomously.

Why is building agents still hard?

Differences in models, tool integrations, caching, and failure handling create complexities that abstractions don’t fully solve.

How do I choose an SDK for my agent?

If tools are provider-specific, use native SDKs for control. Higher-level ones work for basics but falter in complex loops.

What’s the best way to handle caching?

Prefer explicit management for predictability. Set points after prompts and optimize with dynamic messages.

How can reinforcement improve my agent?

Use it after tool calls to update goals, hint on failures, or inject state changes, keeping the agent focused.

What if my agent keeps failing?

Isolate failures with subagents or context editing, and report back only summaries so the main loop learns from mistakes without the clutter.

Why use a shared file system?

It allows tools to share data, avoiding dead ends and enabling workflows like code to inference and back.

How do output tools work?

They’re explicit tool calls used for final communication. If the agent finishes without calling one, inject a reinforcement message and retry; steering the tone of the output remains the tricky part.

Which models should I use?

Haiku/Sonnet for main loops; Gemini 2.5 for sub-tasks like summaries.

How do I test agents?

Instrument real runs with observability; external evals don’t capture the dynamics well.

How to Build a Basic Agent Loop

If you’re ready to start, here’s a step-by-step guide based on these lessons:

  1. Select SDK: Choose native like Anthropic for control.
  2. Set Up Prompt: Make system prompt static; add dynamic info later.
  3. Implement Loop: Process input, call tools, reinforce as needed.
  4. Add Caching: Explicit points after prompt and conversation start.
  5. Handle Failures: Use subagents for isolation.
  6. Share State: Integrate a virtual file system.
  7. Define Output: Create a tool for final results, with reinforcement fallback.
  8. Test: Run evals on instrumented sessions.

This should get you a functional agent. Experiment and adjust.
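
To tie the checklist together, here is a compact sketch of such a loop against the native Anthropic SDK. The model id, the single `todo_write` tool, and the dispatch function are placeholders; caching, subagents, the shared file system, and the output tool from the earlier sections would slot in around this skeleton.

```python
# Compact sketch of a basic agent loop with the Anthropic Messages API:
# call the model, execute any requested tools, feed the results back,
# and stop when the model answers without asking for a tool.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "todo_write",
    "description": "Record the agent's current task list.",
    "input_schema": {
        "type": "object",
        "properties": {"items": {"type": "array", "items": {"type": "string"}}},
        "required": ["items"],
    },
}]


def dispatch(name: str, args: dict) -> str:
    """Execute a tool call; here the todo list is simply echoed back."""
    if name == "todo_write":
        return "\n".join(args.get("items", []))
    return f"unknown tool: {name}"


def agent_loop(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model id
            max_tokens=2048,
            system="You are a task agent. Use tools, then report the result.",
            messages=messages,
            tools=TOOLS,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # No more tool calls: treat the text blocks as the final answer.
            return "".join(b.text for b in response.content if b.type == "text")
        results = [
            {"type": "tool_result", "tool_use_id": b.id, "content": dispatch(b.name, b.input)}
            for b in response.content
            if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "stopped after max_steps without a final answer"
```

From here, adding explicit cache points, a reinforcement step after `dispatch`, and an output tool is mostly a matter of extending this loop rather than rearchitecting it.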

In wrapping up, agent design demands patience and iteration. These insights come from hands-on work, and while the challenges persist, understanding them helps you build better systems. If something here resonates or raises questions, think about how it applies to your own projects. Agents are evolving, but for now they’re still hard, and that’s part of what makes the work interesting.
