Why AI Coding Agents Fail Without a Harness: Claude Code vs. Codex Engineering Philosophy

Anyone who has used AI to write code knows the typical cycle: initial awe at its capabilities, followed by a cold sweat when it silently deletes files, alters configurations, or wrecks the Git history.

At this point, you might think the solution is simply to use a smarter model.

Two detailed engineering analyses on AI coding systems—specifically looking at Claude Code and Codex—suggest the exact opposite. The core problem is not model intelligence. The real issue is whether you have strapped a strong enough restraint system onto it. In engineering, this restraint system is called a Harness. Just like the harness on a draft horse, it is a set of continuous control structures used to dictate boundaries. For AI coding systems, without a harness, a more capable model simply means a larger blast radius when things go wrong.

Both Claude Code and Codex take harness design seriously, but their approaches to taming the model are fundamentally different. By comparing them, we uncover a fascinating dynamic: both “distrust” the model, but one chooses to supervise it on-site, while the other chooses to write the rules into a formal contract.

## Control Plane Design: Live Director vs. Strict Bureaucrat

A common misconception is that controlling an AI is just about writing a good system prompt that says, “You are a rigorous engineer.” Treating the control plane as a matter of prose style is like mistaking a legal system for a tone of voice.

Claude Code and Codex both treat instructions as behavioral control, but their implementations are worlds apart.

Claude Code uses a dynamic assembly line. Its system prompt is not a static block of text; it is layered. At the bottom is the default identity; above that are hard system rules (e.g., “do not mechanically retry after user refusal,” “context will be auto-compacted”); at the top are engineering directives (e.g., “do not over-optimize,” “do not hide verification failures to look good”).

Crucially, it enforces a strict priority chain. User overrides trump everything, followed by coordinator prompts, then task-specific agent prompts, with the default prompt at the very bottom. Think of it like corporate policy: you can add specific requirements for a role, but you cannot revoke the baseline safety regulations.
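A minimal sketch of what such a priority chain might look like in practice. The layer names and assembly function are invented for illustration, not taken from Claude Code’s internals; the point is that the ordering is fixed in code, not left to prompt wording:

```python
# Hypothetical sketch of a layered system prompt with a fixed priority chain.
# Layer names and ordering are illustrative, not Claude Code's actual internals.
PRIORITY = ["default_identity", "agent_prompt", "coordinator_prompt", "user_override"]

def assemble_prompt(layers: dict[str, str]) -> str:
    """Concatenate layers from lowest to highest priority.

    Higher-priority layers appear last so their instructions can refine
    earlier ones -- but the baseline identity is always present.
    """
    ordered = [layers[name] for name in PRIORITY if name in layers]
    return "\n\n".join(ordered)

prompt = assemble_prompt({
    "default_identity": "You are a coding agent.",
    "user_override": "Never run git push.",
})
```

Because the chain lives in code, a role-specific prompt can add constraints but can never displace the baseline, which is exactly the corporate-policy analogy above.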

Codex operates like a numbered filing system. It refuses to treat instructions as casually strung-together natural language. Instead, an instruction is a “structured fragment” with a clear start, end, and source type. A local rule isn’t just read; it is categorized and archived within the system.

This difference shines in how they handle local rule files. Claude Code uses CLAUDE.md, acting like a bulletin board at a job site telling the system about local taboos. Codex uses AGENTS.md and goes further by defining the hierarchy and inheritance of these rules, ensuring the system explicitly understands the scope of a rule.
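To make the “structured fragment” idea concrete, here is a hedged sketch of what categorizing a rule with an explicit source and scope could look like. The field names and the subtree-scoping rule are assumptions for illustration, not Codex’s actual schema:

```python
# Illustrative model of a "structured fragment": every instruction carries
# a source type and an explicit scope, so the system can always answer
# "where did this rule come from, and where does it apply?"
from dataclasses import dataclass
from pathlib import PurePosixPath

@dataclass(frozen=True)
class InstructionFragment:
    text: str
    source: str   # e.g. "AGENTS.md", "system", "user"
    scope: str    # directory subtree the rule governs

    def applies_to(self, path: str) -> bool:
        # A rule applies to files at or under its scope directory.
        p = PurePosixPath(path)
        return p == PurePosixPath(self.scope) or PurePosixPath(self.scope) in p.parents

rule = InstructionFragment("Use tabs.", source="AGENTS.md", scope="/repo/backend")
```

With an explicit scope field, jurisdiction is a property you can query, not a convention you have to remember.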

How to choose practically: If your biggest fear is rapid context drift in long sessions, Claude Code’s dynamic assembly is more agile. If your biggest fear is scattered rule origins where no one knows the exact jurisdiction of a constraint, Codex’s structured filing system is the safer bet.

## System Heartbeat: Session Engine vs. Dispatch Archive

Treating an agent system as a “multi-turn chat” severely underestimates it. The true challenge of an agent is continuity: how to pick up from the last turn, feed tool results back in, and clean up after an interruption.

Claude Code pins continuity entirely to its Query Loop. Inside this core loop, a bundle of states is tied together: current message sequences, tool context, compact tracking, output recovery counts, pending summaries, and turn counts. As long as the loop spins, these states update.

It operates like a session engine. Crucially, before it even lets the model think, it does a long list of chores: prefetching relevant memory, discovering available skills, slicing valid messages after the compact boundary, applying tool result budgets, and snipping history. It cleans the job site before allowing the model to start working.

Codex breaks continuity into explicit infrastructure: Threads, Rollouts, and State Bridges. A Thread is a first-class concept you can interact with directly. It holds an ID and binds execution conditions—like approval policies, working directories, and sandbox modes—explicitly to its lifecycle.

This creates a massive difference in recovery and auditing. Claude Code acts like an on-site fire brigade, fixing problems inside the loop right where they happen. Codex acts like a dispatch center with a filing cabinet; because threads have IDs and states are recorded, it is much easier to answer “what exactly happened in the last turn” rather than relying on runtime guesswork.

When building a system, you must ask: Who owns continuity? If it is the loop, you must focus on runtime grit. If it is thread state, you must build the filing cabinet first.

## Tool Orchestration: The Strict Foreman vs. The Legal Department

A model generating wrong text wastes time. A model executing the wrong command can destroy directories, processes, and repositories simultaneously. What separates good systems is who has the final say before a tool is executed.

Claude Code’s tool system acts like a strict, nagging foreman. It never assumes a tool call is a natural extension of model intelligence. Before execution, it checks tool schemas and runs isConcurrencySafe(). Safe tools are batched; unsafe tools are forced into serial execution. Even when concurrent execution is allowed, it caches the results and replays them in the original order so the fastest tool doesn’t hijack the context.

Its treatment of the Bash tool borders on obsessive. It writes massive rule sets: don’t mess with git config, don’t skip hooks, never run a blind git add ., and never default to push. High-risk interfaces demand high-density constraints. Anyone who is casual around a terminal simply hasn’t seen enough production accidents.

For permissions, it enforces three states: Allow, Deny, and Ask. The system refuses to make decisions for the user. “Knowing how to do it” does not equal “having the authorization to do it.”
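The three-state model fits in a few lines. This is a minimal sketch with invented rule patterns; the key design point is that the unmatched default is Ask, not Allow:

```python
# Minimal sketch of a three-state permission check. The rule patterns
# are invented for illustration; note the default is ASK, never ALLOW.
from enum import Enum
import fnmatch

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"

RULES = [
    ("git push*", Decision.DENY),
    ("rm -rf*", Decision.ASK),
    ("git status", Decision.ALLOW),
]

def check(command: str) -> Decision:
    for pattern, decision in RULES:
        if fnmatch.fnmatch(command, pattern):
            return decision
    return Decision.ASK   # unknown commands go back to the user
```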

Codex operates like a construction firm with a legal department. It turns tools into typed interfaces first. A command execution tool doesn’t take a raw string; it demands structured fields: the command, working directory, shell type, timeout limits, max output tokens, and approval parameters. Correct usage is baked into the tool’s definition.

Furthermore, Codex extracts approvals into an independent Policy Engine featuring Policies, Rules, Evaluations, and Decisions. Execution boundaries here become a micro-policy language, not just a few if/else statements.
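A toy version of such a micro-policy, combining the typed request shape with rule evaluation. The field names and rule format are assumptions for illustration, not Codex’s actual policy language:

```python
# Illustrative micro-policy: rules map structured command requests to
# decisions. Field names and the rule format are invented, not Codex's.
from dataclasses import dataclass

@dataclass
class ExecRequest:
    command: list[str]
    cwd: str
    timeout_ms: int
    sandbox: str   # e.g. "read-only" | "workspace-write"

def evaluate(policy: list[dict], req: ExecRequest) -> str:
    """Return the first matching rule's decision; default to 'ask'."""
    for rule in policy:
        if req.command[:len(rule["prefix"])] == rule["prefix"]:
            return rule["decision"]
    return "ask"

policy = [
    {"prefix": ["git", "push"], "decision": "deny"},
    {"prefix": ["pytest"], "decision": "allow"},
]
```

Because the policy is data rather than inline conditionals, it can be versioned, reviewed, and shared across projects like any other configuration.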

In practice: If you need to make snap judgments based on live context, Claude Code’s runtime approval is more responsive. If you need team-level rules that are readable and portable across projects, Codex’s policy language wins.

## Context Management: A Budget, Not a Storage Unit

There is a persistent illusion that stuffing more context into an AI makes it smarter. In reality, context is not a warehouse; it is an expensive, easily bloated, self-polluting budget. Claude Code’s approach to context management reads like a textbook on budget control.

First, it strictly layers long-term instructions (managed, user, project, local memory). Rules closer to the current working directory get higher priority. This prevents stable collaboration rules from mixing with temporary chat messages.

Second, memory entry points must be short. MEMORY.md is defined as an index, not a diary. Saving memory requires two steps: write the content to a separate file, then add a one-line pointer in the index. It hard-codes a limit (e.g., 200 lines or 25,000 bytes). If an index file acts as both a directory and the main text, it inevitably becomes a bloated, unreadable mess.

Third, short-term continuity uses fixed templates. Session memory isn’t a chat log; it is an operational manual with sections for Current State, Task Specs, Errors & Corrections, etc. It has strict token budgets. If you exceed the budget, the system demands aggressive cuts, prioritizing “Current State” and “Errors” because those are what keep the work moving.

Finally, auto-compacting operates like risk management. It doesn’t wait until the context window is gasping for air. It deducts the summary output budget and various buffer zones upfront. Compacting is a tracked state; if it fails three times consecutively, it triggers a circuit breaker and stops trying, acknowledging the harsh reality that retrying a doomed compact just wastes API calls.

When a compact succeeds, it doesn’t just leave a summary. It wipes old file-read states, regenerates file attachments, restores plan attachments, and fires post-compact hooks. The goal of compacting is to rebuild a functional runtime environment, not to write a beautiful summary.

## Error Recovery: How to Clean Up the Mess

The most worthless phrase in software engineering is “under normal circumstances.” Once an AI agent runs in production, errors are a stable constant: models hit token limits, prompts get too long, and hooks create infinite loops.

Claude Code treats errors with cold pragmatism: errors belong to the main path, and recovery must be pre-designed.

When it hits a “prompt too long” error, it doesn’t just throw the error at the user. It executes layered recovery, starting with the cheapest route: committing any context collapses it has already staged. If that fails, it triggers a heavier reactive full-text compact. It never swings the heaviest hammer first.

When it hits “max output tokens” truncation, it absolutely refuses to let the model apologize and summarize. It first tries bumping the token cap and rerunning. If that fails, it appends a hardcore directive: “Continue directly, do not apologize, do not recap, if interrupted mid-sentence, finish the sentence.” In long tasks, every post-truncation recap consumes budget and causes semantic drift, until the system is no longer doing the task, but just reviewing itself doing the task.

Even the compaction tool itself can trigger a “prompt too long” error. To handle this deadlock, Claude Code slices off the earliest API rounds from the head of the history and retries. This loses data, but it prioritizes unblocking the user over preserving a perfect record.

Interrupts are also treated as error recovery. If the user hits stop during a stream, the system consumes the remaining results and generates “synthetic results” for tools that were dispatched but never finished. It refuses to pretend an action never happened just because the user pressed Escape.
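Synthesizing results for orphaned tool calls is simple to sketch. The message shape below is invented for illustration; the invariant is that every dispatched call gets exactly one result, real or synthetic:

```python
# Sketch: on interrupt, every tool call that was dispatched but never
# resolved gets a synthetic result, so the transcript never contains a
# dangling call. The message shape is illustrative.
def close_out_interrupt(dispatched: list[str], resolved: dict[str, str]) -> list[dict]:
    """Return one result message per dispatched tool call, real or synthetic."""
    messages = []
    for call_id in dispatched:
        if call_id in resolved:
            messages.append({"call_id": call_id, "result": resolved[call_id]})
        else:
            messages.append({
                "call_id": call_id,
                "result": "[interrupted by user before completion]",
                "synthetic": True,
            })
    return messages
```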

Rules to live by: Recovery paths must be layered by cost; recovery logic must prevent infinite loops; automatic recovery must have a circuit breaker; and post-truncation, always resume, never summarize.

## Multi-Agent Orchestration: Partitioning Uncertainty

When a single agent hits its limits, and research, implementation, and verification all fight for the same context window, the obvious answer seems to be multi-agent. But multiple agents don’t naturally bring order; they often just parallelize the chaos.

Claude Code’s first principle for sub-agents is cache safety. When forking, it must share cache-critical parameters to ensure prompt cache hits. If a sub-agent re-burns the parent context every time, it looks like parallel efficiency but is just parallelized waste.

The second principle is state isolation. By default, all mutable states are isolated: file-read states are cloned, and abort controllers spawn child controllers rather than sharing the parent’s. Sharing must be explicitly opted into; isolation is the default ethic. The true value of a sub-agent is that it prevents its local mess from polluting the main thread.

In coordinator mode, the rules are ironclad: the coordinator must read the research results itself and write prompts containing specific files and line numbers. It is forbidden to say “based on your findings” and outsource comprehension. Research can be distributed, but understanding must be re-centralized.

Verification is treated as a lifeline. Tasks are split into Research, Synthesis, Implementation, and Verification. Verification exists to prove the code works, not to confirm the code exists. The implementer self-reports, and the verifier acts as a second layer of QA. In agent systems, there is a wide river between “I changed the code” and “the code is correct.” Models are experts at building paper bridges over that river.

## Team Deployment: From Personal Tricks to Institutional Rules

An individual might use an AI flawlessly because they carry countless invisible assumptions in their head: knowing which commands are dangerous, when to intervene, and what one-liner will keep the model in line. This cannot be copied to a team.

The first step in team deployment is writing stable, indisputable rules into CLAUDE.md (or AGENTS.md). Put codebase hard constraints, unified verification standards, and output discipline here. Do not put frequently changing process details here, or it will become an unmaintainable encyclopedia that teaches the system to follow expired laws.

The second step is treating Skills as “institutional slices,” not long prompts. Claude Code often runs skills in forked sub-agents with their own budgets and tool boundaries. Teams must define a skill’s boundaries, allowed tools, outputs, and verification methods. Otherwise, skills degrade into well-named but unpredictable slogans.

The third step is using approvals to draw liability boundaries. Who approved an action, why was it auto-approved, and what requires a manual ask? Approvals shouldn’t be split by tool type, but by the irreversibility of the consequences. Reading files can be relaxed; pushing to Git or touching production must be strict.

The final step is using hooks to pin rules to the lifecycle. Fire logs when instructions load, add organizational notes when compacting, and archive when sessions end. Making rules trigger at the right moment is far more effective than stuffing them all into a static file.

## The Final Checklist: Which Philosophy Fits Your System?

If you are building an AI coding system or deploying one across a team, do not rush to pick a side. Ask yourself these six questions first:

  1. Who holds the final control: the model, or your harness structure?
  2. Does your continuity live in the main loop, or in thread/state infrastructure?
  3. Who throws the last blockade before a tool executes?
  4. How do local rules enter the system, and how are they layered?
  5. Who handles verification, and how is it kept independent?
  6. When things break, what does the team use to trace the root cause?

If you already have a prototype, but long sessions constantly derail, context gets messy, and sub-agents are left orphaned, study Claude Code’s runtime heartbeat first. Learn how to survive the job site.

If you already have rules, but their origins are scattered, permission boundaries are blurry, and adding new tools creates chaos, study Codex’s explicit control layer first. Build the filing cabinet.

If you are starting from zero, resist the urge to borrow from both equally. Ask yourself what your primary uncertainty is today: the model acting erratically, or the team losing institutional control. Build the bone structure for that primary problem first. Engineering systems are not built by buying the best tools; they are built by erecting breakwaters in the exact order the floods tend to arrive.