Agent Harness: The Core Infrastructure for Building Production-Grade AI Agents
Have you ever faced this frustrating situation? You spend a lot of time building an AI chatbot that responds accurately and uses tools to complete simple tasks in demos. But when you try to turn it into a production-grade application, everything falls apart—the model forgets what it did a few steps ago, tool calls fail silently, and the context window gets filled with useless information, causing the entire agent’s performance to plummet.
If you blame the large language model (LLM) itself, you’re probably looking in the wrong direction. The real core issue lies in the complete software infrastructure wrapping the LLM—this is the Agent Harness, an increasingly critical component in the AI field today. From Anthropic and OpenAI to LangChain, leading companies and frameworks are deeply investing in this area, because it is the key to turning a “stateless LLM” into a capable agent.
Based on industry best practices, this article breaks down the core logic, components, operating mechanisms, and key design decisions of the Agent Harness, helping you understand why the harness is the soul of a production-grade agent.
From Demo to Production: The Gap in AI Agents Lies in the “Harness”
Many people mistakenly believe that the performance ceiling of an AI agent is only determined by the LLM’s parameters and training data. But real-world cases completely overturn this misconception:
LangChain once only adjusted the infrastructure wrapping its LLM (with the exact same model and weights) and jumped from outside the top 30 to rank 5 on TerminalBench 2.0. Another research project let an LLM optimize this infrastructure autonomously, achieving a 76.4% pass rate—surpassing manually designed systems.
This clearly proves that what determines the production-grade performance of an AI agent is never the model itself, but the entire “support system” around it: the Agent Harness.
What Is an Agent Harness?
Although the term Agent Harness was formally defined in early 2026, its core concept has existed for a long time. Simply put, the Agent Harness is the complete software infrastructure that wraps an LLM, covering the orchestration loop, tooling system, memory system, context management, state persistence, error handling, guardrails, and all other modules.
Anthropic directly states in its Claude Code documentation: its SDK is “the agent harness that powers Claude Code.” OpenAI’s Codex team also equates “Agent” and “Harness” to refer to the non-model infrastructure that makes LLMs useful.
A critical distinction often confuses people:
- ❀
Agent: The emergent behavior—the goal-directed, tool-using, self-correcting entity that users interact with. - ❀
Harness: The underlying machinery that produces this behavior.
When someone says “I built an agent,” they essentially built a harness and connected it to an LLM.
To make this concept easier to understand, Beren Millidge drew a precise analogy in his 2023 essay:
- ❀
A raw LLM is like a CPU with no RAM, no disk storage, and no I/O. - ❀
The context window acts as RAM—fast but limited in capacity. - ❀
External databases function as disk storage—large in capacity but slow. - ❀
Tool integrations act as device drivers. - ❀
The Agent Harness is the entire operating system.
This analogy reflects a simple logic: we are essentially reinventing the Von Neumann architecture, as it is a natural abstraction for any computing system—including AI agents.
Three Levels of Engineering Surrounding the LLM
To understand the role of the Agent Harness, you first need to grasp the three concentric engineering levels around the LLM (from the inside out):
-
Prompt Engineering: Refines the instructions received by the model, focusing on making the model “understand what to do.” -
Context Engineering: Manages what the model sees and when it sees it, focusing on controlling information supply. -
Harness Engineering: Encompasses the first two levels plus the entire application infrastructure—tool orchestration, state persistence, error recovery, verification loops, safety enforcement, lifecycle management, etc. Its core is to enable autonomous agent behavior.
A key note: The Agent Harness is not a wrapper around prompts. It is a complete system that makes autonomous agent behavior possible.
11 Core Components of a Production-Grade Agent Harness
Synthesizing practices from Anthropic, OpenAI, LangChain, and the broader practitioner community, a deployable production-grade Agent Harness consists of 11 core components. Each component performs its own duties to support the stable operation of the agent.
1. The Orchestration Loop: The “Heartbeat” of the Agent
This is the core of the Agent Harness, implementing the Thought-Action-Observation (TAO) cycle (also known as the ReAct loop). The workflow is straightforward:
Assemble prompt → Call LLM → Parse output → Execute tool calls (if any) → Feed results back to context → Repeat until task completion.
Technically, it is essentially a while loop. The complexity lies not in the loop itself, but in everything the loop manages. Anthropic describes its runtime as a “dumb loop”—all intelligent decisions are made by the model, and the harness only manages the “turns” to ensure the loop proceeds by the rules.
2. Tools: The “Hands and Feet” of the Agent
Tools are the carrier for the agent to interact with the external world. The core is to define the tool’s schema (name, description, parameter types) and inject it into the LLM’s context so the model knows “what tools it can use.”
The core responsibilities of the tool layer include:
- ❀
Tool registration: Adding available tools to harness management - ❀
Schema validation: Ensuring tool parameters and formats comply with specifications - ❀
Argument extraction: Parsing parameters required for tool calls from model output - ❀
Sandboxed execution: Running tools in an isolated environment to avoid risks - ❀
Result capture and formatting: Converting tool execution results into model-readable “observations”
For example, Claude Code provides tools across six categories: file operations, search, code execution, web access, code intelligence, and subagent spawning. OpenAI’s Agents SDK supports function tools, hosted tools (WebSearch, CodeInterpreter, FileSearch), and MCP server tools.
3. Memory: The “Brain Storage” of the Agent
The memory system supports continuous agent interaction through two timescales:
- ❀
Short-term memory: Conversation history within a single session, invalid when the session ends. - ❀
Long-term memory: Cross-session persistent information. For example, Anthropic uses project files and auto-generated log files; LangGraph uses namespace-organized JSON stores; OpenAI supports Sessions backed by SQLite or Redis.
Claude Code also features a three-tier memory system that balances efficiency and completeness:
-
Lightweight index (~150 characters per entry): Always loaded for fast retrieval. -
Detailed topic files: Loaded on demand to supplement key information. -
Raw transcripts: Accessed only via search to avoid occupying context.
A critical design principle: The agent treats its own memory as a “hint” and verifies consistency between memory and actual state before acting to avoid decisions based on incorrect memory.
4. Context Management: The “Information Filter” of the Agent
This is the biggest pain point for many agents in production—known as context rot: Model performance drops by over 30% when key content falls in the middle of the context window (confirmed by Chroma research and Stanford’s “Lost in the Middle” study). Even million-token windows suffer from degraded instruction-following ability as context grows.
Four common context management strategies in production:
- ❀
Compaction: Summarize conversation history when approaching the window limit. Claude Code retains architectural decisions and unresolved bugs while discarding redundant tool outputs. - ❀
Observation masking: Hide old tool outputs while keeping tool calls visible (e.g., JetBrains’ Junie). - ❀
Just-in-time retrieval: Maintain lightweight identifiers and load data dynamically. Claude Code uses commands like grep,glob,head,tailto read file fragments instead of full files. - ❀
Sub-agent delegation: Let subagents explore tasks in depth but return only 1,000–2,000 token condensed summaries.
Anthropic’s context engineering guide defines the core goal: Find the smallest set of high-signal tokens that maximize the likelihood of the desired outcome.
5. Prompt Construction: The “Instruction Set Assembly” of the Agent
This step assembles the actual input the model receives in each round, following a hierarchical structure:
System prompt → Tool definitions → Memory files → Conversation history → Current user message.
Different vendors adopt different priority rules. For example, OpenAI’s Codex uses a strict priority stack:
-
Server-controlled system message (highest priority) -
Tool definitions -
Developer instructions -
User instructions (cascading config files, 32 KiB limit) -
Conversation history
6. Output Parsing: The “Result Translator” of the Agent
Modern harnesses rely on native tool calling—the model returns a structured tool_calls object instead of free text that requires parsing. The harness’s core tasks:
- ❀
Check if the output contains tool call requests. - ❀
Execute tools and enter the next loop if requests exist. - ❀
Judge as the final answer and end the loop if no requests exist.
For structured outputs, both OpenAI and LangChain support schema-constrained responses via Pydantic models. Legacy solutions (e.g., RetryWithErrorOutputParser) are retained for edge cases—feeding the original prompt, failed completion, and parsing error back to the model for self-correction.
7. State Management: The “Progress Archive” of the Agent
State management ensures the agent’s running status is trackable and resumable. Different frameworks implement it differently:
- ❀
LangGraph: Models state as typed dictionaries, merges updates via reducers, and creates checkpoints at super-step boundaries to support post-interruption resumption and time-travel debugging. - ❀
OpenAI: Offers four mutually exclusive strategies—application memory, SDK sessions, server-side Conversations API, lightweight previous_response_idchaining. - ❀
Claude Code: Uses git commits as checkpoints and progress files as structured scratchpads.
8. Error Handling: The “Fault-Tolerance Mechanism” of the Agent
Why is error handling critical? A 10-step process with 99% per-step success only has an ~90.4% end-to-end success rate—errors compound rapidly.
LangGraph classifies errors into four types for targeted handling:
- ❀
Transient errors: Retry with backoff. - ❀
LLM-recoverable errors: Return errors as ToolMessagefor the model to adjust strategies. - ❀
User-fixable errors: Interrupt the process and wait for human input. - ❀
Unexpected errors: Bubble up for debugging.
Anthropic catches failures within tool handlers and returns them as error results to keep the loop running. Stripe’s production harness caps retry attempts at 2 to avoid infinite retries.
9. Guardrails and Safety: The “Behavior Boundary” of the Agent
OpenAI’s SDK implements three levels of safety controls:
- ❀
Input guardrails: Validate input before the first agent runs. - ❀
Output guardrails: Validate the final output before return. - ❀
Tool guardrails: Validate every tool invocation.
It also features a “tripwire” mechanism that halts the agent immediately when triggered.
Anthropic architecturally separates permission enforcement from model reasoning: The model decides “what to attempt,” and the tool system decides “what is allowed.” Claude Code gates ~40 discrete tool capabilities independently with three stages: trust establishment at project load, permission checks before each tool call, and explicit user confirmation for high-risk operations.
10. Verification Loops: The “Self-Check Mechanism” of the Agent
This is the key difference between toy demos and production agents. Anthropic recommends three verification methods:
- ❀
Rules-based feedback: Validate results via tests, linters, and type checkers. - ❀
Visual feedback: Verify UI task results via screenshots with tools like Playwright. - ❀
LLM-as-judge: Use an independent subagent to evaluate output compliance.
Boris Cherny, creator of Claude Code, noted that adding self-verification capabilities improves output quality by 2–3x.
11. Subagent Orchestration: The “Team Collaboration” of the Agent
Complex tasks often require division of labor among multiple agents, with different frameworks implementing this in varied ways:
- ❀
Claude Code: Supports three execution modes—Fork (byte-identical copy of parent context), Teammate (separate terminal pane with file-based mailbox communication), Worktree (dedicated git worktree, isolated branch per agent). - ❀
OpenAI SDK: Supports “agents-as-tools” (specialists handle bounded subtasks) and “handoffs” (specialists take full control). - ❀
LangGraph: Models subagents as nested state graphs.
The Agent Harness Runtime Loop: Step-by-Step Workflow
After understanding the core components, we can trace how they work together in a complete cycle.
Step 1: Prompt Assembly
The harness splices the model’s full input hierarchically: System prompt + Tool schema + Memory files + Conversation history + Current user message. Critical content is placed at the beginning and end of the prompt to avoid the “Lost in the Middle” problem and maximize the model’s capture efficiency of key information.
Step 2: LLM Inference
The assembled prompt is sent to the model API, and the model generates output tokens—pure text, tool call requests, or both.
Step 3: Output Classification
The harness parses the model output and makes three judgments:
- ❀
Only text with no tool calls: End the loop and return text as the final result. - ❀
Tool call requests: Proceed to tool execution. - ❀
Agent handoff requests: Update the current running agent and restart the loop.
Step 4: Tool Execution
For each tool call request, the harness completes the sequence: Argument validation → Permission check → Sandboxed execution → Capture execution results. A key detail: Read-only operations run concurrently; mutating operations run serially to avoid data conflicts.
Step 5: Result Packaging
Tool execution results are formatted into model-readable messages. Errors are encapsulated as “error results” and returned to give the model a chance to self-correct.
Step 6: Context Update
Append tool results (or errors) to the conversation history. Trigger context compaction if the context window approaches its limit.
Step 7: Loop
Return to Step 1 and repeat the process until a termination condition is met.
Termination Conditions (Multi-Layer Design)
The loop does not run indefinitely; it terminates when any of these conditions are triggered:
- ❀
The model returns a text response with no tool calls. - ❀
The maximum turn limit is exceeded. - ❀
The token budget is exhausted. - ❀
The safety guardrail tripwire is triggered. - ❀
The user manually interrupts. - ❀
The model returns a safety refusal.
Turn counts vary widely by task: Simple questions take 1–2 turns; complex code refactoring may chain dozens of tool calls across many rounds.
A real-world example: Claude Code’s workflow follows an “Initializer Agent sets up the environment (init script, progress file, feature list, initial git commit) → Coding Agent reads git logs and progress files → Selects the highest-priority incomplete feature → Develops → Commits code → Writes summaries” pattern. The file system acts as a “continuity carrier” across context windows.
How Mainstream Frameworks Implement the Agent Harness Pattern
Different AI frameworks have unique implementations of the Agent Harness, but all follow the core components and loops described above.
1. Anthropic Claude Agent SDK
Anthropic exposes harness capabilities through a single query() function, which creates the agentic loop and returns an async iterator for streaming messages. Its runtime is the “dumb loop” mentioned earlier, with all intelligent decisions delegated to the model.
Claude Code adds a Gather-Act-Verify cycle:
- ❀
Gather: Search files, read code to obtain context. - ❀
Act: Edit files, run commands to advance the task. - ❀
Verify: Run tests, check output to confirm results. - ❀
Repeat until task completion.
2. OpenAI Agents SDK
OpenAI implements the harness via the Runner class, supporting three modes: async, sync, and streamed. Its core feature is “code-first”—workflow logic is written in native Python instead of graphical domain-specific languages (DSLs).
OpenAI’s Codex harness extends this into a three-layer architecture:
- ❀
Codex Core: Agent code + runtime. - ❀
App Server: Bidirectional JSON-RPC API. - ❀
Client surfaces: CLI, VS Code extension, web app.
All clients share the same harness, which is why “Codex models perform better on Codex surfaces than in generic chat windows.”
3. LangGraph
LangGraph models the harness as an explicit state graph, with two core nodes (llm_call, tool_node) and one conditional edge:
- ❀
Tool calls present → Route to tool_node. - ❀
No tool calls → Route to END.
LangGraph evolved from LangChain’s AgentExecutor, which was deprecated in v0.2 due to poor extensibility and lack of multi-agent support. LangChain’s Deep Agents explicitly use the term “agent harness,” with built-in tools, planning capabilities, file-system context management, subagent spawning, and persistent memory.
4. CrewAI
CrewAI focuses on a role-based multi-agent architecture with three core components:
- ❀
Agent: The harness around the LLM, defined by role, goal, backstory, and tools. - ❀
Task: The smallest unit of work. - ❀
Crew: A collection of agents.
Its Flows layer adds a “deterministic backbone with intelligence where it matters”—the harness manages routing and validation, while crews handle autonomous collaboration.
5. AutoGen (Microsoft Agent Framework)
AutoGen pioneered conversation-driven orchestration. Its three-layer architecture (Core, AgentChat, Extensions) supports five orchestration patterns:
- ❀
Sequential - ❀
Concurrent (fan-out/fan-in) - ❀
Group chat - ❀
Handoff - ❀
Magentic (a manager agent maintains a dynamic task ledger to coordinate specialist agents)
The Scaffolding Metaphor: The Present and Future of the Agent Harness
The “scaffolding” metaphor for the Agent Harness is precise, not decorative: Construction scaffolding is temporary infrastructure that lets workers reach heights they couldn’t otherwise. It doesn’t build the structure itself, but workers can’t complete high-rise work without it.
This metaphor also reveals a key trend: When the building (model) is complete, the scaffolding (harness) is removed. As LLM capabilities improve, the complexity of the Agent Harness should gradually decrease. For example, Anthropic’s Manus project was rebuilt five times in six months, with each rewrite reducing complexity—complex tool definitions simplified to general shell execution, “management agents” reduced to structured handoffs.
This reflects the co-evolution principle: Modern models are post-trained with specific harnesses in the loop. Claude Code’s model learned to use its dedicated harness; modifying tool implementations may degrade performance due to this tight coupling.
There is a “future-proofing test” for harness design: A sound design allows performance to scale with more powerful models without adding harness complexity.
7 Core Decisions That Define Every Harness
Every harness architect faces seven critical choices when designing an Agent Harness—these decisions directly determine the harness’s performance, complexity, and adaptability.
1. Single-Agent vs. Multi-Agent
Anthropic and OpenAI share the same recommendation: Maximize a single agent first. Multi-agent systems add overhead—extra LLM calls for routing, context loss during handoffs. Split into multi-agent only when tool overload exceeds ~10 overlapping tools or task domains are clearly separated.
2. ReAct vs. Plan-and-Execute
- ❀
ReAct: Interleaves reasoning and action at every step—flexible but high per-step cost. - ❀
Plan-and-Execute: Separates planning from execution. LLMCompiler shows this mode is 3.6x faster than sequential ReAct.
3. Context Window Management Strategy
Five mainstream production strategies: Time-based clearing, conversation summarization, observation masking, structured note-taking, sub-agent delegation. ACON research shows prioritizing reasoning traces over raw tool outputs reduces token usage by 26–54% while retaining 95%+ accuracy.
4. Verification Loop Design
Verification falls into two categories:
- ❀
Computational verification (tests, linters): Provides deterministic ground truth but only validates formal compliance. - ❀
Inferential verification (LLM-as-judge): Catches semantic issues but adds latency.
Martin Fowler’s Thoughtworks team frames this as guides (feedforward, steer before action) vs. sensors (feedback, observe after action).
5. Permission and Safety Architecture
Two extreme approaches:
- ❀
Permissive: Fast but risky; auto-approves most actions. - ❀
Restrictive: Safe but slow; requires approval for every action.
The choice depends on the deployment scenario (e.g., more permissive for internal use, stricter for public services).
6. Tool Scoping Strategy
More tools often lead to worse model performance. Vercel removed 80% of tools from v0 and achieved better results; Claude Code achieves 95% context reduction via lazy loading. The core principle: Expose only the minimum tool set needed for the current step.
7. Harness Thickness
This refers to how much logic lives in the harness vs. the model:
- ❀
Anthropic: Bets on “thin harnesses”—the harness only handles basic management, delegating logic to the model. It regularly deletes planning steps from Claude Code’s harness as new model versions internalize these capabilities. - ❀
Graph-based frameworks: Bet on “thick harnesses” for explicit control and precise management.
FAQ: Common Questions About the Agent Harness
Q1: Why does my AI agent keep failing in production?
The core cause is not the model, but flawed Agent Harness design: Poor context management causing “information loss,” missing error handling leading to compounding errors, absent verification loops resulting in unreliable outputs, oversized tool scopes confusing model decisions, etc. Check the completeness of your harness components first instead of replacing the model.
Q2: What’s the difference between an Agent Harness and a regular LLM wrapper?
A regular LLM wrapper only handles basic “call model + return result” functions, often just wrapping prompts. The Agent Harness is a complete system covering orchestration loops, memory, context management, error handling, safety controls, etc., supporting autonomous and stable agent operation.
Q3: Are multi-agent systems always better than single-agent systems?
No. Multi-agent systems add complexity like context loss, extra LLM call costs, and coordination overhead. Anthropic and OpenAI both advise: Maximize single-agent capabilities first, and only adopt multi-agent when tool overload (≥10 overlapping tools) or clearly separate task domains exist.
Q4: Does a larger context window reduce Agent Harness management pressure?
No. Even million-token windows suffer from degraded instruction-following ability as context grows (confirmed by Stanford’s “Lost in the Middle” study). The core is to use strategies like compaction and just-in-time retrieval to let the model only access “high-signal” information, not simply expand the window.
Q5: How do I know if my Agent Harness design is future-proof?
The core test: When LLM capabilities improve, your agent’s performance improves without adding harness complexity. If every model upgrade requires harness refactoring, your design is flawed.
Conclusion
The Agent Harness is the core of production-grade AI agents—it is not an accessory to the model, but a complete infrastructure that transforms a stateless LLM into a goal-directed, actionable, self-correcting agent.
From LangChain’s ranking surge to the deployment of Claude Code and OpenAI Codex, one conclusion is clear: Two products using identical models can have wildly different performance based solely on harness design.
The Agent Harness is evolving rapidly toward a trend of thinner harnesses + stronger models, but the harness itself will never disappear. Even the most capable LLM needs a harness to manage its context window, execute tool calls, persist state, and verify outputs.
Next time your AI agent underperforms, don’t blame the model—check if your Agent Harness is built well enough.

