Self-Improving Skills for AI Agents: Building Systems That Learn from Failure

Core Question: How can AI agent skills automatically adapt and improve when environments change, instead of relying on manual maintenance of static prompts that quickly become outdated?

In AI agent development, we face a fundamental challenge: skills are typically static, but the environment around them constantly changes. A skill that worked perfectly a few weeks ago can silently begin failing when the codebase updates, when model behavior shifts, or when the types of tasks users request evolve over time. In most systems, these failures remain invisible until someone notices the output quality has degraded or tasks start failing completely.

This is why we need to fundamentally rethink what skills are—they shouldn’t be fixed prompt files, but living system components capable of self-improvement over time.


Why Traditional Skill Management Fails in Production

Core Question: Why does the traditional “write prompt → save to folder → call when needed” approach only work for demos and fail at scale?

Until now, most AI agent systems have managed skills through a simple cycle: write a prompt, save it in a folder, and call it whenever needed. This approach works surprisingly well for demonstrations, but once systems reach a certain scale, we inevitably hit the same wall.

The Four Fatal Flaws of Static Skills

1. Some Skills Get Selected Too Often

When multiple skills are available, routing mechanisms may favor certain skills, causing them to be called excessively while others are neglected. This imbalance worsens over time.

2. Skills That Look Good But Fail in Practice

Some skills appear perfect during design but consistently fail in actual operation. Issues may lie in trigger conditions, execution steps, or output formatting.

3. Individual Instructions That Continuously Fail

Specific instructions within a skill may repeatedly fail, but without systematic observation and recording, these problems are difficult to locate and fix.

4. Tool Calls Break Due to Environmental Changes

This is the most common and insidious problem. When underlying codebases, API interfaces, or dependent services change, tool calls within skills suddenly fail, yet the skill itself has no warning mechanism.

The worst part is that when problems occur, no one knows whether the issue lies with routing, instructions, or the tool call itself. This leads to endless manual maintenance and inspection work, trapping teams in constant firefighting mode.

Real-World Scenario: The Cascading Failure of Skills

Imagine you’ve built a skill for code review. Initially, it perfectly identifies potential issues in code. But three months later, your team introduces new coding standards, and the codebase structure has changed. The skill begins giving outdated advice or even false positives.

Without an observation mechanism, you don’t discover this problem until users complain. Even worse, you can’t quickly pinpoint where the issue lies: Was the skill incorrectly called? Do the review standards need updating? Or has the underlying code analysis tool changed?

This is the dilemma of static skill systems—they cannot perceive their own performance or adapt to change.

The Closed-Loop System for Self-Improving Skills

Core Question: How do we build a complete closed-loop system that enables skills to automatically learn and improve when they fail or underperform?

The key to solving the static skills dilemma lies in establishing a complete self-improvement cycle. This cycle isn’t simply “observe and modify,” but a rigorous five-step process: Observe → Inspect → Amend → Evaluate → Update.


The Complete Improvement Cycle: More Than Just Modification

A true self-improving system should never be trusted simply because it can modify itself. Any amendment must be rigorously evaluated: Did the new version actually improve outcomes? Did it reduce failures? Did it introduce errors elsewhere?

Therefore, the cycle cannot simply be:

  • Observe → Inspect → Amend

Instead, it must follow a more disciplined cycle:

  • Observe → Inspect → Amend → Evaluate → Update

If an amendment does not produce a measurable improvement, the system should be able to roll it back. Because every change is tracked with its rationale and results, the original instructions are never lost. This transforms self-improvement into a structured, auditable process rather than uncontrolled modification. When evaluation confirms improvement, the amendment becomes the next version of the skill.
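As a minimal sketch, the disciplined cycle above can be expressed as a control loop. All function and field names here are hypothetical placeholders, not an actual API:

```python
# Sketch of the Observe → Inspect → Amend → Evaluate → Update cycle.
# Function names and data shapes are illustrative assumptions.

def inspect_failures(failures):
    """Group failed runs by error type to surface recurring patterns."""
    evidence = {}
    for f in failures:
        evidence.setdefault(f.get("error_type", "unknown"), []).append(f)
    return evidence


def improvement_cycle(skill, observations, evaluate, propose_amendment):
    """Run one pass of the self-improvement loop for a single skill."""
    failures = [o for o in observations if not o["success"]]   # Observe
    if not failures:
        return skill                                           # nothing to fix
    evidence = inspect_failures(failures)                      # Inspect
    candidate = propose_amendment(skill, evidence)             # Amend
    if evaluate(candidate) > evaluate(skill):                  # Evaluate
        return candidate                                       # Update: promote
    return skill                                               # else: roll back
```

Note that the candidate version is only promoted when evaluation shows a measurable improvement; otherwise the original skill remains current, which is exactly the rollback guarantee described above.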

Step 1: Skill Ingestion—Giving Skills Structured Life

Core Question: How do we transform traditional skill folders into structured data with semantic meaning that systems can intelligently understand and route?

Traditional skill folder structures might look like this:

my_skills/
  summarize/
  bug-triage/
  code-review/

This simple folder structure is intuitive, but it gives the system no way to truly understand what each skill does. We need to give skills clearer structure—not just to look neater, but to make searching and routing more effective.

From Folders to Knowledge Graphs

By introducing structured data points (Custom DataPoints), we can add the following dimensions of information to skills:

  • Semantic Meaning: What problem the skill truly solves
  • Task Patterns: What types of tasks are suitable for calling this skill
  • Summary Information: Quick overview of the skill
  • Relationship Network: How this skill relates to other skills

Imagine your skills are no longer just Markdown files in folders, but nodes in a knowledge graph with rich metadata and connections. Through this graph, the system can understand:

  • Which skills are suitable for handling code review tasks
  • What relationships exist between code review skills and bug classification skills
  • Under what circumstances which skill should be prioritized

Practical Case: Structuring a Code Review Skill

A structured code review skill might include:

Basic Information:

  • Skill Name: code-review
  • Applicable Scenarios: Pull Request review, code quality checks
  • Trigger Conditions: Code commit detected, user explicitly requests review

Semantic Tags:

  • Code quality, best practices, security vulnerabilities, performance optimization

Related Skills:

  • bug-triage (bug classification)
  • test-generation (test generation)
  • documentation-update (documentation updates)

Historical Performance:

  • Success rate: 92%
  • Average execution time: 3.5 seconds
  • Common failure reasons: Missing context, codebase structure changes

This structuring enables the system to make smarter decisions rather than simply relying on keyword matching to select skills.
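The structured record above can be sketched as a simple dataclass. The field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a structured skill record for ingestion.
@dataclass
class SkillRecord:
    name: str
    scenarios: list = field(default_factory=list)   # applicable scenarios
    triggers: list = field(default_factory=list)    # trigger conditions
    tags: list = field(default_factory=list)        # semantic tags
    related: list = field(default_factory=list)     # related skill names
    success_rate: float = 0.0
    avg_exec_seconds: float = 0.0

code_review = SkillRecord(
    name="code-review",
    scenarios=["Pull Request review", "code quality checks"],
    triggers=["code commit detected", "user explicitly requests review"],
    tags=["code quality", "best practices", "security", "performance"],
    related=["bug-triage", "test-generation", "documentation-update"],
    success_rate=0.92,
    avg_exec_seconds=3.5,
)
```

In a graph store, each `related` entry would become an edge to another skill node, which is what lets routing reason over relationships rather than keywords.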

Step 2: Observe—Making Failures Traceable

Core Question: Why is the system’s lack of memory of each execution result the fundamental reason skills cannot improve, and how do we establish effective observation mechanisms?

A skill cannot improve if the system has no memory of what happened when it ran. This is why we need to systematically store key data after each skill execution.

Five Core Data Points That Must Be Recorded

1. What Task Was Attempted

Record the specific task content requested by the user or system. This includes task input parameters, expected output format, and task context information.

2. Which Skill Was Selected

Record why the system selected this particular skill instead of other possible options. This is crucial for subsequent analysis of routing decision accuracy.

3. Whether It Succeeded

Clearly record the skill’s execution result: complete success, partial success, or complete failure. This judgment should be based on predefined success criteria, not vague subjective assessment.

4. What Error Occurred

If execution failed, record the error thoroughly: error type, error location, stack trace, and any relevant context information.

5. User Feedback (If Any)

Collect explicit user feedback on skill output (ratings, comments) as well as implicit feedback (whether the user modified the output or re-executed the task).

Observation Data Structure Design

Through structured data points (Custom DataPoints), we can create an observation node for each skill execution containing the following fields:

ExecutionObservation:
  - task_id: Unique task identifier
  - skill_id: Unique skill identifier
  - timestamp: Execution timestamp
  - input_parameters: Input parameters
  - output_result: Output result
  - success_status: Success/failure status
  - error_details: Error details (if failed)
  - execution_time: Execution duration
  - user_feedback: User feedback
  - environment_snapshot: Environment snapshot (codebase version, model version, etc.)
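The field listing above maps directly onto a small dataclass. This is a sketch mirroring those fields, not a prescribed implementation:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Sketch of the ExecutionObservation fields as a Python dataclass.
@dataclass
class ExecutionObservation:
    task_id: str
    skill_id: str
    timestamp: float
    input_parameters: dict
    output_result: Optional[str]
    success_status: bool
    error_details: Optional[str]
    execution_time: float
    user_feedback: Optional[str]
    environment_snapshot: dict

obs = ExecutionObservation(
    task_id="task-123",
    skill_id="bug-triage",
    timestamp=1718000000.0,
    input_parameters={"title": "Login page crashes on Safari browser"},
    output_result="Severity: High",
    success_status=True,
    error_details=None,
    execution_time=2.3,
    user_feedback="accepted",
    environment_snapshot={"model": "gpt-4-turbo-2024-04-09"},
)
record = asdict(obs)  # ready to persist as a graph node or JSON document
```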

Scenario Example: Observation Records for Bug Classification Skills

Suppose you have a bug-triage skill for automatically classifying and prioritizing bug reports. An execution observation record might look like this:

Task: Classify newly submitted bug report
Skill: bug-triage
Input:

Title: Login page crashes on Safari browser
Description: Application crashes when user clicks login button on iOS Safari
Reproduction steps: 1. Open Safari 2. Visit login page 3. Click login

Output:

Severity: High
Category: Frontend crash
Priority: P1
Assigned to: Frontend team

Execution Result: Success
Execution Time: 2.3 seconds
User Feedback: User accepted the classification result and fixed the bug within 2 hours

Environment Snapshot:

  • Codebase version: v2.3.1
  • Model version: gpt-4-turbo-2024-04-09
  • Dependent service: bug-tracking-api v1.5.0

With such observation records, the system can analyze in the future: under what circumstances does this skill perform well, under what circumstances might it fail, and how to improve it.

Step 3: Inspect—Mining Improvement Clues from Failure History

Core Question: Once skills have accumulated enough failure records, how do we systematically inspect and analyze this historical data to identify repetitive factors causing poor outcomes?

Once enough failed runs accumulate (or even after a single important failure), one can inspect the connected history around that skill: past runs, feedback, tool failures, and related task patterns.

The Power of Graph Structures: Tracing Root Causes of Failure

Because all this data is stored as a graph, the system can trace recurring factors behind bad outcomes and use that evidence to propose a better version of the skill.

Inspection Process:

  1. Run Record Analysis → Identify repeated weak outcome patterns
  2. Failure Clustering → Group similar failures together
  3. Factor Correlation → Find correlations between failures and specific conditions
  4. Evidence Aggregation → Generate the basis for improvement suggestions
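The first three inspection steps can be sketched as a small clustering routine. Field names like `input_length` are illustrative stand-ins for whatever conditions your observations record:

```python
from collections import defaultdict

# Sketch of inspection: cluster failed runs by error type, then correlate
# each cluster with a recorded condition (here, input length).
def cluster_failures(observations):
    clusters = defaultdict(list)
    for o in observations:
        if not o["success"]:
            clusters[o["error_type"]].append(o)
    report = {}
    for error_type, runs in clusters.items():
        lengths = [o["input_length"] for o in runs]
        report[error_type] = {
            "count": len(runs),
            "avg_input_length": sum(lengths) / len(lengths),
        }
    return report

observations = [
    {"success": False, "error_type": "context_limit", "input_length": 12000},
    {"success": False, "error_type": "context_limit", "input_length": 15000},
    {"success": True,  "error_type": None,            "input_length": 800},
]
print(cluster_failures(observations))
# {'context_limit': {'count': 2, 'avg_input_length': 13500.0}}
```

A report like this is the evidence base for the failure patterns discussed next: repeated `context_limit` errors concentrated in long inputs point directly at a concrete fix.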

Practical Application of Inspection: Pattern Recognition

Suppose your summarize skill has the following observation records over the past month:

Failure Pattern 1: Long Document Summarization Fails

  • Number of failures: 15
  • Common characteristics: Document length exceeds 10,000 words
  • Error type: Context limit exceeded
  • Frequency: 3-4 times per week

Failure Pattern 2: Poor Quality Technical Document Summarization

  • Number of failures: 23
  • Common characteristics: Technical documents containing large amounts of professional terminology
  • Error type: User feedback “missing key information”
  • Frequency: 5-6 times per week

Failure Pattern 3: Multi-language Document Processing Fails

  • Number of failures: 8
  • Common characteristics: Documents mixing Chinese and English
  • Error type: Output garbled or interrupted
  • Frequency: 1-2 times per week

Through this systematic inspection, you’re no longer blindly guessing where the problem lies, but understanding skill performance based on data-driven evidence.

From Inspection to Insight: A Real Case

Let’s look at a more specific example. Your code-review skill has experienced frequent failures when reviewing Python code. By inspecting the historical records, you discover:

Observed Patterns:

  • Failures concentrated in the past two weeks
  • All failures occurred when reviewing code using new decorators
  • Error messages all say “cannot recognize decorator syntax”
  • Environment snapshots show: codebase introduced a new decorator library two weeks ago

Root Cause:
The skill’s prompt contains outdated Python syntax examples that don’t cover new decorator usage.

Improvement Direction:
Update the syntax examples in the skill, adding usage instructions for new decorators.

This evidence-based inspection transforms skill improvement from a guessing game into an optimization process with clear objectives.

Step 4: Amend Skills—Intelligent Patching Based on Evidence

Core Question: When the system has sufficient evidence that a skill is underperforming, how do we automatically generate targeted modification suggestions instead of having humans blindly search for and fix broken prompts?

When the system has accumulated enough evidence that a skill is underperforming, it can propose amendments to the instructions. That proposal can be reviewed by a human or applied automatically. The goal is simple: reduce the friction of maintaining skills as systems grow.

From Manual Maintenance to Intelligent Suggestions

Instead of manually searching through your codebase for broken prompts, the system can look at the execution history of a skill, including past runs, failures, feedback, and tool errors, and suggest a targeted change.

Amendment Suggestions May Include:

1. Tighten Trigger Conditions

If a skill is being incorrectly called too often, the system might suggest:

  • Adding more specific trigger keywords
  • Increasing precondition checks
  • Limiting applicable task types

2. Add Missing Conditions

If a skill fails under certain circumstances, the system might suggest:

  • Adding boundary case handling
  • Supplementing exception handling logic
  • Adding input validation steps

3. Reorder Execution Steps

If the skill’s execution order causes problems, the system might suggest:

  • Adjusting the sequence of steps
  • Merging or splitting certain steps
  • Adding intermediate validation points

4. Change Output Format

If the skill’s output doesn’t meet downstream needs, the system might suggest:

  • Adjusting the output data structure
  • Adding or removing output fields
  • Changing the level of output detail

Mechanism for Generating Amendment Suggestions

How does the system generate these amendment suggestions? The key lies in transforming execution history into actionable insights:

Step 1: Failure Clustering

Group similar failures together to identify common characteristics.

Step 2: Root Cause Analysis

For each type of failure, analyze possible root causes:

  • Are trigger conditions too broad?
  • Are instructions unclear?
  • Is necessary context missing?
  • Does the output format not meet expectations?

Step 3: Generate Modification Plans

Based on root cause analysis, generate specific modification suggestions:

  • If trigger conditions are too broad → tighten trigger conditions
  • If instructions are unclear → rewrite or supplement instructions
  • If context is missing → add context acquisition steps
  • If output format is incorrect → adjust output template

Step 4: Priority Ranking

Prioritize modification suggestions based on failure frequency and impact.
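Steps 3 and 4 can be sketched as a rule table mapping diagnosed root causes to amendment templates, ranked by failure count. The rule table and cause labels are illustrative assumptions:

```python
# Hypothetical mapping from diagnosed root causes to amendment templates.
AMENDMENT_RULES = {
    "trigger_too_broad": "tighten trigger conditions",
    "unclear_instructions": "rewrite or supplement instructions",
    "missing_context": "add a context-acquisition step",
    "wrong_output_format": "adjust the output template",
}

def propose_amendments(diagnoses):
    """diagnoses: list of (root_cause, failure_count) pairs."""
    proposals = [
        {"root_cause": cause,
         "suggestion": AMENDMENT_RULES[cause],
         "failures": count}
        for cause, count in diagnoses
        if cause in AMENDMENT_RULES
    ]
    # Priority ranking: address the most frequent failures first.
    return sorted(proposals, key=lambda p: p["failures"], reverse=True)

ranked = propose_amendments([("missing_context", 23), ("trigger_too_broad", 15)])
```

In practice the "suggestion" would be a concrete prompt patch generated from the evidence, but the prioritization logic is the same.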

Practical Case: Intelligent Modification of Bug Classification Skills

Suppose your bug-triage skill has experienced the following problems:

Observed Problems:

  • Over the past two weeks, 30% of bugs have been incorrectly classified as “low priority”
  • All these bugs actually contain the keywords “crash” or “data loss”
  • User feedback shows these bugs should be fixed within 24 hours but were actually delayed

System-Generated Amendment Suggestion:

Original Instruction:

Determine priority based on bug description and reproduction difficulty:
- P1: Affects core functionality, requires immediate fix
- P2: Affects secondary functionality, requires fix within this week
- P3: Minor issue, can be fixed in next version

Suggested Amendment:

Determine priority based on bug description and reproduction difficulty:
- P1: Affects core functionality, or contains keywords "crash", "data loss", "security vulnerability", requires immediate fix
- P2: Affects secondary functionality, requires fix within this week
- P3: Minor issue, can be fixed in next version

Special note: Any bug mentioning "crash", "data loss", or "security" is automatically upgraded to P1

This amendment suggestion is based on actual failure data, specifically addressing the problem.

From Static Files to Evolving Components

This is the key moment when skills transform from static prompt files into evolving components. You no longer need to open a SKILL.md file and guess what to change. The system can propose well-founded patch suggestions based on evidence of how the skill actually performed.

Step 5: Evaluate & Update—Ensuring Improvements Are Real

Core Question: How do we verify whether skill amendments truly bring improvement rather than introducing new problems, and how do we establish safe rollback mechanisms?

This is the most critical yet most easily overlooked step in self-improving systems. Any amendment must undergo rigorous evaluation to ensure it truly brings improvement rather than making things worse.

Core Evaluation Metrics

1. Did Results Improve?

Does the modified skill version show improvement in key metrics?

  • Has the success rate increased?
  • Has output quality improved?
  • Has user satisfaction increased?

2. Did Failures Decrease?

Did the modification reduce specific types of failures?

  • Do previously frequent errors no longer occur?
  • Has failure frequency decreased?
  • Has error severity decreased?

3. Did It Introduce New Errors?

Did the modification produce side effects elsewhere?

  • Do previously normal functions still work normally?
  • Have new failure patterns emerged?
  • Has performance been affected?

A/B Testing: Scientific Evaluation Method

To accurately evaluate the effect of modifications, A/B testing methods should be employed:

Test Design:

  • Control Group: Use the original version of the skill
  • Experimental Group: Use the modified skill
  • Test Sample: Select representative task sets
  • Evaluation Period: Sufficient time to collect statistical data

Evaluation Metrics:

  • Success rate comparison
  • Average execution time comparison
  • User feedback score comparison
  • Frequency comparison of specific error types

Decision Criteria:

  • If experimental group significantly outperforms control group → accept modification
  • If no significant difference between experimental and control groups → continue observation or abandon modification
  • If experimental group underperforms control group → reject modification and analyze reasons

Rollback Mechanism: Safety Guarantee

If modifications don’t produce expected improvements or even make things worse, the system should be able to immediately roll back to the previous version.

Key Elements of Rollback:

1. Version Tracking

Every modification should have complete version records:

  • Version number
  • Modification time
  • Modification content
  • Modification rationale
  • Performance comparison before and after modification

2. Fast Rollback

When problems are discovered, you should be able to:

  • One-click rollback to any historical version
  • Automatic rollback (when serious problems are detected)
  • Gradual rollback (gradually reduce the proportion of new version usage)

3. Change Audit

All modifications should be traceable:

  • Who (or which system) proposed the modification
  • Based on what evidence
  • What were the evaluation results
  • What was the final decision
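The three rollback elements above can be sketched together as a version history that records each amendment's rationale and can revert it. Class and field names are hypothetical:

```python
# Illustrative sketch combining version tracking, fast rollback, and an
# audit trail for a single skill's prompt.
class SkillVersionHistory:
    def __init__(self, initial_prompt):
        # Each entry keeps the content plus rationale and evaluation result,
        # so every change is auditable and nothing is ever lost.
        self.versions = [{"prompt": initial_prompt,
                          "rationale": "initial version",
                          "evaluation": None}]

    @property
    def current(self):
        return self.versions[-1]["prompt"]

    def amend(self, new_prompt, rationale):
        self.versions.append({"prompt": new_prompt,
                              "rationale": rationale,
                              "evaluation": "pending"})

    def rollback(self):
        """Discard the latest version; earlier versions remain intact."""
        if len(self.versions) > 1:
            self.versions.pop()

history = SkillVersionHistory("Summarize the document in 3 bullet points.")
history.amend("Summarize in chunks, then merge.", "long-document failures")
history.rollback()   # evaluation showed worse coherence, so revert
```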

Practical Case: Lessons from Evaluation Failure

Suppose your summarize skill underwent the following modification:

Modification Content:
To improve long document processing capability, process documents in chunks and then merge results.

Evaluation Results:

  • Success rate: Increased from 85% to 92% ✓
  • Execution time: Increased from 3 seconds to 8 seconds ✗
  • Output coherence: User feedback “lack of logical connection between paragraphs” ✗

Decision:
Although the success rate improved, execution time and output quality both deteriorated. The system decides to roll back this modification and redesign the solution.

Lessons Learned:
You cannot look at just a single metric; you must comprehensively evaluate the impact across multiple dimensions.

Author Reflections: Insights from Practice

In researching and implementing this self-improving system, I’ve gained several key insights:

1. Failure Is Not the Enemy, But the Teacher

In traditional thinking, we fear failure and try to avoid it through stricter testing and review. But in this system, failure becomes fuel for improvement. Every failure is a learning opportunity, as long as we have mechanisms to observe, record, and analyze it.

2. Automation Doesn’t Mean Giving Up Control

Some may worry that letting the system automatically modify skills will lose control. But in reality, this system precisely enhances control. Through structured observation, inspection, and evaluation processes, we have more control over skill quality than with manual maintenance. Automation merely reduces repetitive labor, not supervision.

3. Data-Driven Decisions Are Superior to Intuition

Before this system, we often modified skills based on intuition: “I think this instruction isn’t clear enough,” “I guess the problem might be here.” Now, we can base decisions on data: “This error occurred 47 times, all under X circumstances, for Y reason.” This transformation turns skill maintenance from art into science.

4. Incremental Improvement Is Better Than One-Time Refactoring

Don’t try to solve all problems at once. Let the system take small, fast steps, with each modification undergoing rigorous evaluation. This way, even if a modification fails, the impact is limited and can be quickly rolled back.

Action Checklist: Implementing Self-Improving Skills

Phase 1: Establish Observation Mechanisms

  • [ ] Define core data fields that need to be recorded
  • [ ] Implement logging for skill execution
  • [ ] Establish user feedback collection channels
  • [ ] Design data storage structure (graph database recommended)

Phase 2: Build Inspection Capabilities

  • [ ] Implement clustering analysis for failure records
  • [ ] Develop root cause analysis tools
  • [ ] Establish failure pattern recognition algorithms
  • [ ] Create visual inspection interface

Phase 3: Implement Intelligent Modification

  • [ ] Design modification suggestion generation rules
  • [ ] Implement evidence-based modification algorithms
  • [ ] Establish priority ranking for modification suggestions
  • [ ] Develop human review interface (if needed)

Phase 4: Perfect Evaluation System

  • [ ] Define evaluation metric system
  • [ ] Implement A/B testing framework
  • [ ] Establish version management and rollback mechanisms
  • [ ] Develop change audit logs

Phase 5: Continuous Optimization

  • [ ] Monitor overall system performance
  • [ ] Collect user feedback
  • [ ] Regularly review and improve processes
  • [ ] Expand to other skill types

One-Page Summary

Core Problem: Static skills cannot adapt to changing environments; self-improvement mechanisms are needed.

Solution: Five-step closed-loop system

  1. Ingestion: Structure skills with semantic meaning
  2. Observe: Record detailed information for each execution
  3. Inspect: Analyze failure patterns to find root causes
  4. Amend: Generate targeted modification suggestions based on evidence
  5. Evaluate: Verify modification effectiveness for safe updates

Key Value:

  • Reduce manual maintenance costs
  • Improve skill adaptability
  • Accelerate problem response speed
  • Ensure improvement traceability

Implementation Essentials:

  • Data-driven, not intuition-driven
  • Incremental improvement, not one-time refactoring
  • Automation doesn’t mean giving up control
  • Failure is fuel for improvement

Frequently Asked Questions (FAQ)

Q1: Won’t self-improving systems make skills increasingly complex?

Not necessarily. The system optimizes based on actual performance, and sometimes simplifying instructions actually improves success rates. The key is having comprehensive evaluation metrics, looking not just at success rates but also maintainability.

Q2: What if the system proposes incorrect modification suggestions?

This is exactly why evaluation and rollback mechanisms are necessary. Any modification must undergo rigorous A/B testing, and if the effect isn’t good, you can immediately roll back to the previous version.

Q3: Is this system suitable for all types of skills?

Theoretically, it’s suitable for all skills that change over time. But for very stable, rarely changing skills, the ROI may not be high. Prioritize application to frequently used skills in rapidly changing environments.

Q4: How many failure records are needed before improvement can begin?

There’s no fixed number. For high-frequency failures, a few records may be sufficient; for low-frequency but important failures, more data may be needed. The key is whether the failure pattern is clear.

Q5: How do I choose between human review and automatic application?

A hybrid approach is recommended: low-risk modifications (such as format adjustments) can be automatically applied; high-risk modifications (such as core logic changes) require human review. As the system matures, automation levels can be gradually increased.

Q6: How do I evaluate the ROI of skill improvements?

Measure from three dimensions: reduced manual maintenance time, improved skill success rates, and reduced user complaints. Establish baseline data and regularly compare improvement effects.

Q7: Won’t this system introduce new bugs?

Any system can introduce bugs, but through rigorous evaluation and rollback mechanisms, risks can be minimized. The key is not to skip evaluation steps and ensure every modification is verified.

Q8: How long does it take to implement this system?

It depends on existing infrastructure. If you already have logging systems and data storage, a basic version can be built in 2-4 weeks. If starting from scratch, it may take 2-3 months. Phased implementation is recommended—first implement observation and inspection, then gradually add modification and evaluation capabilities.


Ready to build self-improving AI agents? Start by implementing observation mechanisms today, and watch your skills evolve from static prompts into living, learning system components.