The Three Paradigm Shifts in AI Engineering: From Prompts to Context to Harness

AI Engineering Paradigm Evolution

The core question this article answers: What fundamental changes have occurred in AI engineering practices over the past three years, and why are the methods that used to work no longer sufficient?

AI engineering practice has undergone a clear, three-generation evolution from 2023 to the present. Each generation solves a fundamentally different core problem, yet each successive generation encompasses the capabilities of the previous one. Understanding the distinctions between these three paradigms is the prerequisite for grasping the current frontier of Agent engineering practices.

What Problem Does Each Generation of Engineering Solve?

The core question this section answers: What specific problems do Prompt Engineering, Context Engineering, and Harness Engineering solve, and what is the exact relationship between them?

Three generations of paradigms correspond to three distinct core problems. Figuring out exactly “what is being optimized” in each generation makes it clear why engineering methods had to upgrade.

Generation 1: Prompt Engineering (2023–2024)

The core problem of this generation was: How do you talk to the model to get a better answer?

All techniques revolved around a single conversational turn. The fine-tuning of wording, the constraining of output formats, the selection of few-shot examples, and the guiding of Chain of Thought—these methods shared a common trait. The engineer’s entire body of work was confined to the “input box.” You wrote a prompt, sent it, received the result, and the interaction ended.

Application scenario: Suppose you need a model to write a Python script for data cleaning. Under the Prompt Engineering paradigm, your job is to describe the requirements clearly enough—what format the input is in, what fields the output needs, how to handle outliers, and whether logging is required. You write this prompt, send it, get the script, manually inspect it, and use it. Throughout this process, the engineer’s scope of control is exactly that prompt itself.
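The scenario above can be reduced to a few lines of code: the engineer's entire output is one prompt string. Here is a minimal sketch; the requirement fields and the wording are illustrative, and the actual model call is omitted since under this paradigm it is a single request-response exchange:

```python
# A sketch of the Prompt Engineering paradigm: the engineer's entire scope
# of control is the prompt string itself. In practice this string would be
# sent as one chat-completion request.

def build_cleaning_prompt(input_format, output_fields, outlier_rule, logging=True):
    """Compose a single-turn prompt for a data-cleaning script."""
    lines = [
        "Write a Python script that cleans a dataset.",
        f"Input format: {input_format}.",
        f"Output must contain the fields: {', '.join(output_fields)}.",
        f"Outlier handling: {outlier_rule}.",
        "Include logging." if logging else "No logging needed.",
        "Return only the code, no explanation.",  # output-format constraint
    ]
    return "\n".join(lines)

prompt = build_cleaning_prompt(
    input_format="CSV with a header row",
    output_fields=["user_id", "event_time", "amount"],
    outlier_rule="drop rows where amount is negative",
)
```

Everything the engineer can tune (wording, format constraints, examples) lives inside this one function; nothing outside the string is under their control.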

This generation remains effective today, but it has a hard ceiling: it can only optimize the quality of a single interaction.

Generation 2: Context Engineering (2025)

The core problem shifted: A prompt alone is no longer enough; the entire context window must be treated as an engineering object.

Shopify CEO Tobi Lütke’s widely circulated quote, “Context engineering is the new skill,” precisely captures the signal of this paradigm shift. Engineers no longer just write a single prompt; they design all the information the model “sees” during a specific invocation.

A RAG retrieval system pulls relevant document fragments from a knowledge base. Long context management determines which pieces of information stay in the window and which get truncated. Tool use orchestration lets the model know which tools it can call and in what order. Memory systems let the model remember past interaction history—all of these fall under the umbrella of Context Engineering.

Application scenario: Returning to the data cleaning scenario, but this time it is not a single script; it is a continuously running pipeline. Under the Context Engineering paradigm, you need to design: Which field definitions should be retrieved from the database schema documentation each time the model is called? Which past precedents should be extracted from historical cleaning logs? Should the tool list include the calling interface for a data profiling tool? Should the feedback from the previous round of cleaning be injected into the context? You are optimizing all the information the model sees, not just a single prompt.
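The assembly step described above can be sketched as a function that merges every information source into one window under a size budget. All the inputs (`schema_docs`, `past_fixes`, the tool list) and the crude character-based budget are illustrative stand-ins for a real retrieval and token-management pipeline:

```python
# A sketch of Context Engineering: the unit being optimized is no longer a
# single prompt but everything the model sees in one invocation.

def assemble_context(task, schema_docs, past_fixes, tools, last_feedback, budget=4000):
    """Build the full input for one model call, trimming to a size budget.

    Higher-priority sections go in first; lower-priority retrieved material
    is truncated when the budget runs out (a stand-in for real context
    management, which would count tokens rather than characters).
    """
    sections = [
        ("TASK", task),
        ("FEEDBACK FROM LAST ROUND", last_feedback or "none"),
        ("AVAILABLE TOOLS", "\n".join(tools)),
        ("RELEVANT SCHEMA", "\n".join(schema_docs)),
        ("PAST CLEANING PRECEDENTS", "\n".join(past_fixes)),
    ]
    out, used = [], 0
    for title, body in sections:
        chunk = f"## {title}\n{body}\n"
        if used + len(chunk) > budget:
            chunk = chunk[: budget - used]   # truncate the lowest-priority tail
        out.append(chunk)
        used += len(chunk)
        if used >= budget:
            break
    return "".join(out)

ctx = assemble_context(
    task="Clean today's transaction batch",
    schema_docs=["transactions(user_id INT, amount DECIMAL, ts TIMESTAMP)"],
    past_fixes=["2024-03-01: negative amounts were refunds, keep them"],
    tools=["profile_table(name) -> column stats"],
    last_feedback="3 rows failed the ts parser last run",
)
```

The prompt from Generation 1 survives as the `TASK` section, but it is now just one slot among several that the engineer designs.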

Generation 3: Harness Engineering (Early 2026)

The core problem escalated again: Agents can now run autonomously for hours or even days, and optimizing a single context window is nowhere near enough.

What you must design is no longer “the input for a single call,” but the Agent’s entire runtime environment. Multi-Agent collaboration architecture determines who is responsible for what and how information flows. Evaluation feedback loops determine how an Agent’s output is verified and corrected. The mechanized enforcement of architectural constraints ensures Agents do not lose control as scale increases. The governance and verification mechanisms of memory ensure the reliability of knowledge during long-term operation.

Application scenario: Suppose you want an Agent to autonomously build a complete data platform, from schema design to ETL pipelines to monitoring and alerting, a process that might take days. Under the Harness Engineering paradigm, you need to design: How does the planner Agent break requirements into subtasks? What architectural constraints bind the generator Agent when writing code? How does the evaluator Agent verify the correctness of each subsystem? How is shared memory among multiple Agents prevented from being polluted? None of these can be solved by “writing a good prompt”; they are the engineering design of a runtime environment.

The Relationship Between the Three Generations

Each generation encompasses the previous one. Harness contains Context, and Context contains Prompt. But the core problem each generation solves is completely different.

Author’s Reflection: A common misconception is to view these three generations as a “replacement relationship,” assuming that once Context Engineering arrives, you no longer need to learn Prompt techniques. But looking at the containment relationship, it is more akin to an “operating system kernel” upgrade. You still need to write application-layer code (prompts), but what determines the system’s ceiling is the underlying architecture (the harness system). Ignoring this hierarchical difference is the root cause of many teams stumbling in their Agent projects.

Comparison of the Three Paradigms

Anthropic’s Approach: Making Agents Evaluate Each Other

The core question this section answers: When an Agent needs to complete complex tasks autonomously, how do you guarantee the quality of its output, and why does self-evaluation fail?

An experiment by Anthropic engineer Prithvi Rajasekaran revealed a counter-intuitive fact: having an Agent evaluate its own work is essentially useless.

The Failure of Self-Evaluation

Regardless of the output quality, an Agent’s evaluation of its own work is invariably positive. This is not hard to understand—the model’s generative tendency is to “continue along the lines of what it just said.” Asking it to negate its own output is equivalent to asking it to fight its own generative logic.

This finding directly invalidates the design philosophy of many early Agent frameworks: generate code → self-review → revise → self-review → … This loop seems logical in theory, but in practice, the review step is nothing more than a facade.

Separating the Generator and the Evaluator

Splitting generation and evaluation into two independent Agents produces completely different results.

The evaluator does not read code and assign a score. Instead, it uses Playwright to actually operate the page—clicking buttons, filling out forms, and verifying whether features work correctly. It then scores the output across four dimensions: design quality, originality, craftsmanship, and functional completeness.

Application scenario: Suppose a generator Agent creates a login page. The evaluator does not look at whether the HTML is beautifully written; it directly launches a browser, opens the page, attempts to enter a username and password, clicks the login button, checks if the redirect is correct, and verifies if error messages display properly. This is the difference between “actual operation” and “code review.”
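The generator/evaluator split can be sketched in a few lines: the evaluator below never sees the implementation's source, it only exercises behavior, mirroring the Playwright approach described above. The login logic, the checks, and the scoring dimension are hypothetical stand-ins, not Anthropic's actual harness:

```python
# Sketch of evaluation by actual operation rather than code review.

def generated_login(username, password):
    """Stand-in for the generator Agent's output (normally generated code)."""
    if not username or not password:
        return {"ok": False, "error": "missing credentials"}
    if password == "hunter2":
        return {"ok": True, "redirect": "/dashboard"}
    return {"ok": False, "error": "invalid password"}

def evaluate_by_operation(login):
    """Score by operating the feature; never inspect its source."""
    checks = {
        "valid login redirects": login("alice", "hunter2").get("redirect") == "/dashboard",
        "bad password rejected": login("alice", "wrong")["ok"] is False,
        "empty input has error": "error" in login("", ""),
    }
    score = sum(checks.values()) / len(checks)
    issues = [name for name, passed in checks.items() if not passed]
    return {"functional_completeness": score, "issues": issues}

report = evaluate_by_operation(generated_login)
```

Note the evaluator receives only a callable surface, exactly as a browser-driving evaluator receives only a URL; deceptive but broken code scores zero either way.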

The Frontend Design Experiment

In the frontend design experiment, the generator went through 5 to 15 rounds of back-and-forth iteration with the evaluator. By the tenth round, it had produced a 3D spatial navigation solution. This result demonstrates that the evaluator’s feedback is not a simple binary “good/bad” judgment, but rather something capable of pushing the generator to explore higher-quality solutions.

The Full-Stack Development Experiment

The architecture for the full-stack development experiment was more complex, with three Agents dividing the labor:

| Agent | Responsibility | Input | Output |
|---|---|---|---|
| Planner | Expands a one-sentence requirement into a complete product spec | A one-sentence user description | A complete product specification document |
| Generator | Incrementally implements the product | Product specification document | React + FastAPI + PostgreSQL code |
| Evaluator | Performs QA testing | A runnable product | Four-dimension scores + specific issue list |

The Comparative Data

The contrast between the two approaches is highly illustrative:

  • Single Agent approach: 20 minutes, costing $9, producing an unusable output
  • Full Harness approach: 6 hours, costing $200, delivering a complete game with sprite animations, AI integration, and export functionality

The cost increased by 22 times, and the time increased by 18 times, but the output went from “unusable” to “a complete, deliverable product.” This level of cost-performance ratio is almost impossible to achieve in traditional software engineering.

The Most Valuable Discovery

As the capabilities of Opus 4.6 improved, sprint decomposition could be removed, but the evaluator could not be removed.

The implications of this finding are profound. Every component in a Harness encodes assumptions about the model’s limitations. As the model gets stronger, some assumptions no longer hold true—such as “the model is not capable enough, so tasks must be broken down very finely.” But some assumptions will always hold true—such as “the model cannot objectively evaluate its own output.”

Identifying which components to keep and which to remove is the core skill of Harness Engineering.

Author’s Reflection: This finding reminds me of an analogy: when a car engine upgrades from a carburetor to electronic fuel injection, many mechanical components are eliminated, but the cooling system and lubrication system are never eliminated—because the laws of thermodynamics do not change with technological progress. Similarly, model capabilities will improve, but the characteristic that “self-evaluation is unreliable” may approach a structural limitation that will not disappear with an increase in parameter count. When designing a Harness, distinguishing between “assumptions that will become obsolete” and “assumptions that will never become obsolete” is the most critical form of judgment.

Anthropic Experiment Architecture

OpenAI’s Approach: One Million Lines of Code with Zero Hand-Writing

The core question this section answers: When Agent capabilities are strong enough, what exactly is the human engineer’s role, and how do you maintain architectural consistency at the scale of a million lines of code?

OpenAI’s experiment was more radical. In five months, a small team used the Codex Agent to build a production system comprising approximately one million lines of code. Zero hand-writing. Application logic, documentation, CI configuration, observability infrastructure, and toolchains were all generated by the Agent.

The Radical Transformation of the Engineer’s Role

Engineers no longer write code. They do three things:

  1. Design the development environment—deciding the conditions under which the Agent works
  2. Express intent using structured prompts—telling the Agent what to do
  3. Provide the Agent with feedback loops—letting the Agent know whether it did things correctly

None of these three tasks involve “writing code,” yet each one requires deep engineering experience. Designing a development environment requires an understanding of dependency management and toolchains. Expressing intent requires structuring vague requirements. Providing feedback loops requires knowing what constitutes a “good output.”

Depth-First Working

OpenAI calls this methodology depth-first working: break a large goal into small components, have the Agent build each component, and then use these components to unlock more complex tasks.

Application scenario: Suppose you are building a data analytics platform. Instead of having the Agent “start writing from the homepage,” you first have it build the lowest-level type definitions and configuration systems. Then, you use these type definitions to build the data model layer, use the data model layer to build the API layer, and finally use the API layer to build the UI. The output of each layer serves as the “building material” for the next. This sequence is not arbitrary; it is dictated by architectural constraints.
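The build sequence in that scenario is, under the stated assumptions, just a topological sort over declared dependencies: each component is built only after its "building materials" exist. The component names and dependency map below mirror the scenario and are purely illustrative:

```python
# Sketch of depth-first working as a dependency-driven build order.
from graphlib import TopologicalSorter

# Each component maps to the components it depends on (its "materials").
deps = {
    "types": set(),
    "config": {"types"},
    "data_model": {"types", "config"},
    "api": {"data_model"},
    "ui": {"api"},
}

# static_order() yields every node after all of its dependencies.
build_order = list(TopologicalSorter(deps).static_order())
```

The point of encoding the order this way is that it becomes machine-checkable: an Agent asked to "start from the homepage" would violate the sort, and the harness can refuse.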

Six-Layer Architectural Governance

Architectural governance is the key that makes this system function. The dependency hierarchy is strictly divided into six layers:

| Layer | Name | Responsibility | Constraint Rule |
|---|---|---|---|
| Layer 1 | Types | Type definitions | Cannot depend on any other layer |
| Layer 2 | Config | Configuration management | Can only depend on Types |
| Layer 3 | Repo | Data access layer | Can only depend on Types and Config |
| Layer 4 | Service | Business logic layer | Can only depend on lower layers |
| Layer 5 | Runtime | Runtime environment | Can only depend on lower layers |
| Layer 6 | UI | User interface | Can only depend on lower layers |

The boundaries of each layer are not enforced by documentation conventions, but are mechanistically executed by linters and CI. Pull Requests from Agents that violate architectural constraints are automatically rejected. This means the Agent cannot “bypass” the rules—even if it is capable of writing code that calls across layers, the CI will block it.

Application scenario: Suppose the generator Agent, while writing a UI component, directly calls a Service layer function without going through the Runtime layer encapsulation. This code is syntactically perfectly correct and will not throw errors at runtime, but the linter in the CI will detect the cross-layer dependency and automatically reject the PR. The Agent must modify the code to call through the correct layer path. This is what “mechanistic enforcement” means.
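A minimal sketch of what such a layer linter might look like. The allowed-dependency map below encodes the stricter rule from this scenario (UI must go through Runtime rather than calling Service directly); it is an illustrative rule set, and a real linter would extract imports from the AST rather than accept them as a list:

```python
# Sketch of mechanistic layer enforcement: a CI linter that rejects
# cross-layer imports. An empty violation list means the PR may merge.

ALLOWED = {  # layer -> layers it may import from (illustrative rule set)
    "types":   set(),
    "config":  {"types"},
    "repo":    {"types", "config"},
    "service": {"types", "config", "repo"},
    "runtime": {"types", "config", "service"},
    "ui":      {"types", "config", "runtime"},  # UI may NOT reach Service directly
}

def lint_imports(module_layer, imported_layers):
    """Return the list of forbidden imports for a module in module_layer."""
    return [imp for imp in imported_layers if imp not in ALLOWED[module_layer]]

# The cross-layer call from the scenario: a UI component importing Service.
violations = lint_imports("ui", ["types", "service"])
```

Because the rule lives in a data structure the CI executes, the Agent's "self-discipline" never enters the picture: `violations` being non-empty is what rejects the PR.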

Martin Fowler’s Assessment

Martin Fowler’s assessment was spot on: Harness Engineering encodes context engineering, architectural constraints, and garbage collection into machine-readable artifacts that Agents can systematically execute.

“Machine-readable artifacts” is the key phrase. Architectural constraints are not documents sitting on a Wiki; they are code living in linter rules. The Agent does not need to “understand” why cross-layer calls are prohibited; it only needs to know that “cross-layer calls will be rejected by CI.” This follows the exact same logic as a compiler preventing a programmer from writing type-incorrect code.

Author’s Reflection: The most striking aspect of OpenAI’s experiment is the “zero hand-writing.” But this does not mean “engineers are no longer needed”—quite the opposite. It means the value of engineers has shifted from “writing code” to “designing constraints.” This mirrors a pattern that has repeatedly appeared throughout software engineering history: in the assembly language era, programmers directly manipulated registers; in the high-level language era, programmers described logic while compilers manipulated registers; in the Agent era, engineers describe constraints while Agents write code. Each layer of abstraction has not eliminated the need for engineering capability; it has transferred that need to a higher level of abstraction.

OpenAI Architectural Governance

The Memory System: The Most Easily Overlooked Layer in a Harness

The core question this section answers: When multiple Agents collaborate over long periods, how do you ensure the reliability of shared memory, and do Agent systems with memory actually get better over time?

Anthropic discussed evaluation loops, and OpenAI discussed architectural constraints, but neither went deep into memory. This happens to be the gap filled by two academic papers.

The (S)AGE Paper: Byzantine Fault-Tolerant Multi-Agent Memory

The core problem: When multiple Agents share a knowledge base, how do you guarantee that the written knowledge is trustworthy?

An Agent might write incorrect information due to hallucination, or it might be injected with false memories through adversarial attacks. In a single-Agent scenario, a hallucination only affects a single output; but in a shared memory scenario, a single erroneous memory will be repeatedly read by all Agents, making the impact systemic.

Application scenario: Suppose three Agents collaboratively maintain an API documentation knowledge base. Agent A, due to a hallucination, records the return format of a certain interface incorrectly and writes it into shared memory. Agents B and C, when generating code subsequently, will both refer to this erroneous memory, resulting in the interface calls of the entire system being completely wrong. This is the cascading effect of “memory pollution.”

The Proof of Experience Consensus Mechanism

The solution proposed by the (S)AGE paper is the Proof of Experience consensus mechanism. Each Agent has a reputation weight, and this weight is determined by four factors:

| Factor | Description | Function |
|---|---|---|
| Historical Accuracy | The proportion of the Agent’s past memory writes verified as correct | Filters Agents that consistently produce low-quality information |
| Domain Relevance | The Agent’s expertise weight regarding the topic to be written | Prevents cross-domain writing into unfamiliar areas |
| Activity Level | The Agent’s participation frequency | Prevents zombie accounts from manipulating votes |
| Independent Verification Count | The number of times other Agents have independently verified the memory entry | Raises the barrier to entry for writing |

A memory submitted by an Agent must go through a weighted voting verification process before it can be written to the knowledge base.
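One way such weighted voting might be wired up, as a sketch. The weight formula and the 0.5 acceptance threshold below are illustrative choices, not the (S)AGE paper's actual aggregation; what matters is the shape: reputation derived from the four factors, and a write that lands only when weighted approval clears the bar:

```python
# Sketch of reputation-weighted write verification for shared Agent memory.

def reputation(agent):
    """Combine the four factors into one weight (illustrative formula)."""
    return (agent["historical_accuracy"]
            * agent["domain_relevance"]
            * min(1.0, agent["activity"])          # cap activity's influence
            * (1 + 0.1 * agent["verifications"]))  # verified history adds weight

def accept_write(votes, threshold=0.5):
    """Write a memory entry only if weighted approval beats the threshold."""
    total = sum(reputation(agent) for agent, _ in votes)
    approve = sum(reputation(agent) for agent, ok in votes if ok)
    return total > 0 and approve / total > threshold

agents = [
    {"historical_accuracy": 0.9,  "domain_relevance": 0.8, "activity": 1.0, "verifications": 5},
    {"historical_accuracy": 0.4,  "domain_relevance": 0.3, "activity": 0.2, "verifications": 0},
    {"historical_accuracy": 0.85, "domain_relevance": 0.9, "activity": 0.8, "verifications": 3},
]
# Two high-reputation agents approve; one low-reputation agent objects.
decision = accept_write([(agents[0], True), (agents[1], False), (agents[2], True)])
```

Under this scheme, the low-reputation objector carries almost no weight, so the write goes through; a hallucination-prone Agent loses the ability to pollute shared memory on its own.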

Performance Data

This system was deployed on a 4-node BFT (Byzantine Fault-Tolerant) network, achieving a write throughput of 956 req/s and a P95 query latency of 21.6ms. Agents with this memory system demonstrated a calibration accuracy twice that of the memory-less baseline.

The Longitudinal Learning Paper: Does Memory Actually Make Systems Better?

The second paper answered an even more fundamental question: Do Agent systems with memory actually get better over time?

Experimental Design

The experimental design was exceptionally clever, directly comparing two strategies:

  • Treatment Group: 3 lines of prompt + the (S)AGE memory system, able to query all knowledge accumulated in previous rounds each turn
  • Control Group: 50 to 200 lines of expertly crafted prompts, but no memory, starting from scratch every round

After running for 10 rounds, the results were as follows:

  • The treatment group’s red-team evaluation difficulty grew from 0.8 to 3.0 (Spearman rho=0.716, p=0.020), indicating the system was genuinely getting stronger
  • The control group showed absolutely no growth trend (rho=0.040, p=0.901), demonstrating that even the most carefully crafted prompts cannot bring about longitudinal improvement

The Most Critical Finding

There was no statistical difference in the absolute performance levels between the two groups (Cohen’s d = -0.07). A 3-line prompt plus memory tied with a 200-line expert prompt.

The difference lay in the learning trajectory: the system with memory kept getting better the longer it ran, while the system without memory always stayed on the same level.

Application scenario: Imagine a security auditing Agent team. In the first week, they discover 10 vulnerability patterns and write them into shared memory. In the second week, when new tasks arrive, they can directly reference last week’s experience to discover more complex variants. By the tenth week, their auditing capability has far surpassed that of the first week. Another team, taking the same expert prompts and starting from scratch every week, is at the exact same level in week ten as in week one. This is what “longitudinal learning capacity” means.
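The two trajectories in that scenario can be illustrated with a toy model: a memory-backed run carries discovered patterns across rounds, while the memoryless run restarts every round. The numbers here are fabricated purely to show the shape of the curves, not the paper's data:

```python
# Toy model of longitudinal learning: memory compounds, statelessness doesn't.

def run_rounds(n_rounds, with_memory):
    known, found = set(), []
    for r in range(n_rounds):
        if not with_memory:
            known = set()              # no memory: restart from scratch
        discovered = 1 + len(known)    # prior knowledge unlocks extra variants
        found.append(discovered)
        known.add(r)                   # memory retains this round's pattern
    return found

with_mem = run_rounds(5, with_memory=True)    # grows round over round
without  = run_rounds(5, with_memory=False)   # flat: same level every round
```

Both runs start at the same point in round 1, which mirrors the paper's finding of no difference in absolute initial performance; the divergence is entirely in the slope.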

Author’s Reflection: This experimental result made me rethink the value ceiling of “prompt engineering.” Two hundred lines of expert prompts tying with 3 lines of prompts plus memory—this means the entire effort of prompt engineering is easily leveled on the longitudinal dimension by a memory system. But this does not mean prompt engineering is useless; it means prompt engineering solves the problem of “initial performance,” while memory solves the problem of “growth potential.” An organization that only invests in prompt optimization without building memory infrastructure is like a company that only does training but does not build a knowledge base—everyone’s starting point might be a bit higher, but it will never become an “organization that gets better with every project.”

The memory layer does not bring higher initial performance to an Agent system; it brings organizational-level longitudinal learning capacity. A human organization’s 100th project is usually better than its 1st, because of process documentation, post-mortem reviews, and the accumulation of a knowledge base. Now, Agent systems are beginning to exhibit the exact same characteristics.

Memory System Architecture

The Essential Differences Between the Three Paradigms

The core question this section answers: If you could only use one sentence to summarize the differences between the three paradigms, what should it be?

The differences between the three generations of engineering paradigms can be summarized in three sentences:

  • Prompt Engineering optimizes the interface between humans and the model
  • Context Engineering optimizes the model’s input space
  • Harness Engineering optimizes the Agent’s entire runtime environment

Anthropic’s experiment proved that an evaluation loop is orders of magnitude more effective than self-evaluation. OpenAI’s experiment proved that architectural constraints can allow Agents to maintain consistency at the million-line code level. The two academic papers proved that a consensus-verified memory system can endow Agent organizations with longitudinal learning capacity.

These three layers added together form the complete Harness:

| Layer | Source | Problem Solved | Consequence if Missing |
|---|---|---|---|
| Evaluation Mechanism | Anthropic | Output quality cannot be guaranteed | Agent output is unusable |
| Architectural Constraints | OpenAI | Consistency collapses as scale increases | Million lines of code become spaghetti |
| Memory Governance | Academic Papers | Unable to accumulate experience over long-term runs | System remains permanently at initial performance |

Without any single layer, the Agent system will lose control in one dimension or another.

Author’s Reflection: Looking back at these three layers, I notice an interesting pattern: the evaluation mechanism solves the problem of “is it good right now,” architectural constraints solve the problem of “will it fall apart when it gets big,” and memory governance solves the problem of “will it be better tomorrow.” These correspond precisely to the three mountains of traditional software engineering: quality control, scalability, and sustainability. Harness Engineering did not invent new engineering problems; it simply re-answered those old problems in the Agent era, and the form of the answers is completely different.



Practical Summary and Action Checklist

Below is a checklist you can immediately reference if you are building an Agent system.

Evaluation Mechanism Checklist

  • [ ] In your Agent system, are generation and evaluation performed by different Agents?
  • [ ] Does the evaluator verify through “actual operation” (such as browser automation, API calls) rather than “reading code”?
  • [ ] Are the evaluation dimensions explicit and quantifiable (such as design quality, functional completeness, etc.)?
  • [ ] Is the evaluation feedback passed back to the generator in a structured format, rather than a simple “pass/fail”?
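For the last checklist item, one possible shape for structured evaluator feedback, as a hypothetical schema (all field names are illustrative, not a standard):

```python
# Sketch of structured feedback a generator can act on, as opposed to a
# bare "pass/fail" signal.
feedback = {
    "scores": {"design": 7, "originality": 6, "craftsmanship": 8, "functionality": 5},
    "issues": [
        {
            "severity": "high",
            "where": "login form",
            "observed": "submit button does nothing on empty password",
            "expected": "inline validation error",
        },
    ],
    "verdict": "revise",  # revise / accept, not a bare boolean
}
```

The point of the structure is that each issue carries a location, an observed behavior, and an expectation, giving the generator a concrete diff to close rather than a verdict to interpret.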

Architectural Constraints Checklist

  • [ ] Are your dependency layers explicitly divided (e.g., Types → Config → Repo → Service → Runtime → UI)?
  • [ ] Are layer boundaries mechanistically enforced through linters and CI, rather than relying solely on documentation conventions?
  • [ ] Are submissions from Agents that violate architectural constraints automatically rejected?
  • [ ] Are constraint rules stored in a machine-readable format (such as config files, rule scripts), rather than in natural language documents?

Memory Governance Checklist

  • [ ] Is there a write-verification mechanism for memories shared among multiple Agents?
  • [ ] Is there a reputation weight or similar trust model to distinguish the writing credibility of different Agents?
  • [ ] Are there mechanisms to prevent hallucinatory writes and adversarial injections?
  • [ ] Does the system possess longitudinal learning capacity—meaning is performance in round N better than in round 1?

Paradigm Positioning Checklist

  • [ ] Which paradigm is your team primarily working in right now? (Prompt / Context / Harness)
  • [ ] If an Agent runs for more than 1 hour, is your current Context Engineering approach sufficient?
  • [ ] If the codebase exceeds 10,000 lines, are your current architectural constraints sufficient?
  • [ ] If the project needs to run for more than 10 iterations, is there memory infrastructure to support longitudinal learning?

One-Page Summary

| Dimension | Prompt Engineering | Context Engineering | Harness Engineering |
|---|---|---|---|
| Era | 2023–2024 | 2025 | Early 2026 |
| Optimization Target | Interface between human and model | Model’s input space | Agent’s entire runtime environment |
| Core Techniques | Wording, few-shot, CoT | RAG, long context, tool use, memory | Multi-Agent collaboration, evaluation loops, architectural constraints, memory governance |
| Ideal Scenario | Single-turn conversation | Single complex invocation | Long-duration autonomous operation |
| Limitation | Cannot exceed the quality ceiling of a single interaction | Cannot handle accumulation issues during long-duration runs | Highest design complexity |
| Key Assumption | “Better phrasing → better output” | “Better context → better output” | “Better runtime environment → more reliable autonomous behavior” |
| Containment Relationship | Base layer | Contains Prompt | Contains Context |

Frequently Asked Questions

Q1: Is Prompt Engineering obsolete now?

No. Harness Engineering contains Prompt Engineering. You still need to write good prompts, but a prompt is now just one component within the entire Harness, no longer the sole object of engineering.

Q2: Why can’t an Agent evaluate its own output?

Because the model’s generative tendency is to “continue along the lines of what it just said.” Asking it to negate its own output is equivalent to forcing it to fight its own generative logic. Experimental data shows that no matter the output quality, an Agent’s self-evaluation is invariably positive.

Q3: What is the fundamental difference between an evaluator using Playwright to operate a page and reading code to assign a score?

Reading code to assign a score evaluates “whether the code looks correct,” while operating a page with Playwright evaluates “whether the product actually works.” The former can be deceived by superficial code quality, while the latter only cares about actual behavioral outcomes.

Q4: Why must the six-layer architecture in the OpenAI experiment be mechanistically enforced by CI?

Because the Agent is capable of writing syntactically correct code that calls across layers. If enforcement relies only on documentation conventions, the Agent can choose to comply or choose not to comply. Mechanistic enforcement via CI ensures that code violating architectural constraints cannot be merged at all, removing any dependence on the Agent’s “self-discipline.”

Q5: Why can a 3-line prompt plus memory tie with a 200-line expert prompt?

Because the memory system provides “longitudinal learning” capacity, allowing it to accumulate experience over multiple iterations to make up for the shortcomings of the initial prompt. A 200-line expert prompt, while having higher initial performance, lacks a memory system and cannot accumulate across rounds, remaining stuck at the same level permanently.

Q6: Which of the four factors in the Proof of Experience reputation weight is the most important?

The paper does not provide a singular weight ranking for the factors, but each has a distinct function: historical accuracy filters low-quality Agents, domain relevance prevents cross-domain writing, activity level prevents zombie manipulation, and independent verification counts raise the barrier to entry. Together, they form a complete trust model.

Q7: What exactly is the core skill of Harness Engineering?

It is not writing prompts, and it is not building RAG. It is “identifying which components to keep and which to remove.” Every component in a Harness encodes an assumption about the model’s limitations. As the model gets stronger, some assumptions no longer hold, but some will always hold. Distinguishing between these two types of assumptions is the ultimate judgment call.

Q8: Our team is just starting with Agent projects; which paradigm should we begin with?

Starting with Context Engineering is the most pragmatic choice. It has a higher ceiling than Prompt Engineering but a lower implementation complexity than a full Harness Engineering setup. However, when designing your Context solution, you should reserve interfaces for evaluation mechanisms and architectural constraints to prepare for the eventual upgrade to a Harness.