
DoVer Auto-Debugging: How to Fix 27.5% of LLM Multi-Agent Failures

DoVer (Do-then-Verify) is an intervention-driven auto-debugging framework for LLM Multi-Agent Systems. It employs a “hypothesize-intervene-verify” closed-loop to overcome the limitations of log analysis, which often suffers from inaccurate attribution and lack of validation. Experiments show DoVer successfully fixes 17.6% to 27.5% of failed tasks on AssistantBench and GAIA within the Magentic-One framework, and achieves a 49.0% fix rate on the GSMPlus dataset using AutoGen2. It validates or refutes 30% to 60% of fault hypotheses, offering a quantifiable path to enhancing AI system reliability.


DoVer Framework Explained: How to Automatically Debug and Repair Failures in LLM Multi-Agent Systems

The evolution of Large Language Models (LLMs) has transitioned rapidly from simple conversational agents to sophisticated Multi-Agent Systems (MAS). Frameworks like Microsoft’s Magentic-One and AutoGen2 allow a collaborative team of AI roles—such as a planner, a web browser, and a code executor—to tackle complex, multi-step problems.

However, this complexity introduces a critical challenge for developers and researchers: How do you debug when the system fails?

Unlike conventional software, which usually crashes with a clear error code, an AI agent’s failure is often silent. The system may produce a final output that is factually incorrect or incomplete, without any clear sign of which agent or which step was responsible. Sifting through hundreds of lines of interaction logs manually is inefficient, often impossible, and rarely leads to a verified root cause.

This deep-dive technical article explores DoVer (Do-then-Verify), a novel, intervention-driven auto-debugging framework designed specifically for LLM Multi-Agent Systems. Drawing exclusively from original research, we will detail how DoVer leverages active intervention to test fault hypotheses, quantify debugging success, and effectively revive previously failed tasks. This approach shifts debugging from a guessing game to a verifiable, engineering-focused process.


The Fatal Flaw of Traditional Log-Based Debugging in MAS

Before diving into DoVer, we must recognize why the current industry standard—Log-based Failure Attribution—is fundamentally insufficient for multi-agent environments. This method involves feeding the full conversation log to another LLM and asking it to pinpoint the specific erroneous step or responsible agent.

While conceptually simple, this approach is marred by two core deficiencies that undermine its utility:

1. Hypotheses Without Verification are Just Guesses

When an LLM reviews a log and suggests a “bad step,” it is generating a hypothesis, not a confirmed fact. Without a practical, executable test to confirm the finding, there is no way to know if modifying that step will genuinely fix the problem or if the error lies elsewhere in the system’s execution chain. Debugging, in this context, becomes an endless cycle of trial and error based on speculation.

2. The Core Problem: Uncertainty in Ground Truth

Crucial findings within the research reveal a significant issue: even human annotators struggle to agree on “what went wrong” and “where the mistake occurred.” This inherent uncertainty in the Ground Truth is the single greatest impediment to log-based attribution.

Analysis of the Who&When (WW) dataset shows that in 14 out of 29 cases (nearly 50%) from the GAIA dataset, human experts found the ground truth to be uncertain. This ambiguity is primarily driven by three factors inherent to complex MAS architectures:

A. Complexity of Multiple Trials and Re-planning

Modern agent systems, often employing architectures like ReAct, engage in multiple “plan-execute” loops within a single session. For example, an agent might first attempt a direct web scroll (Trial 1), and upon failure, re-plan to use a calendar tool (Trial 2). Each trial represents a distinct, sequential causal chain of actions. Attempting to assign a single failure point to an entire, multi-trial session is often an oversimplification that masks the true complexity of the error.

B. Inter-Agent Misalignment and Responsibility Blurring

Failures often occur at the seams where agents collaborate. If the “Orchestrator” agent issues a vague or incorrect instruction, leading the “WebSurfer” agent to click the wrong link, where does the fault truly lie? Is it the Orchestrator’s poor instruction generation, or the WebSurfer’s weak execution capability? This blurring of responsibility makes singular attribution highly problematic and unreliable.

C. Inconsistent Human Annotations

Even after extensive discussion, human experts could not reach full consensus on the root cause in 7 of the uncertain cases. This evidence suggests that the very definition of the “error step” in an MAS context is subjective and dynamic, rendering automated log-based attribution—which relies on pattern recognition—highly unreliable.

Experimental data reinforces this finding: for cases where the ground truth was deemed uncertain, the attribution accuracy of even a powerful LLM like GPT-5 dropped precipitously, from 53% on cases with certain ground truth to a mere 7%. This quantitative gap confirms that relying solely on log analysis is a deeply flawed strategy for complex multi-agent debugging.


How-To: The Four Stages of the DoVer Intervention Framework

DoVer shifts the paradigm by introducing an Intervention-Driven debugging methodology. The guiding principle is: stop guessing the error from passive logs; actively modify the system’s behavior in situ and verify the outcome. If the system succeeds after the intervention, the initial hypothesis is confirmed.

The Do-then-Verify framework operates through four rigorous, measurable stages:

Stage 1: Trial Segmentation

The long, complex interaction log from a failed session is first broken down into smaller, manageable “Trials.”

  • Mechanism: An LLM analyzes the session log and identifies “re-planning” nodes or major shifts in the agent’s strategy, using these as natural breakpoints.
  • Benefit: This drastically reduces the context length for the subsequent attribution phase, allowing the LLM to focus on a single, clean causal trajectory. It also enables parallel intervention attempts on different failure paths within the same session.
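
To make Stage 1 concrete, here is a minimal sketch of how trial segmentation could be implemented: prompt an LLM for the indices of re-planning steps, then cut the log at those indices. The prompt wording, the `call_llm` helper, and the JSON response format are assumptions for illustration, not DoVer’s actual code.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat-completion client is in use (assumed helper)."""
    raise NotImplementedError

def segment_into_trials(log_steps: list[str]) -> list[list[str]]:
    """Split a failed session log into trials at re-planning / strategy-shift steps."""
    numbered = "\n".join(f"[{i}] {step}" for i, step in enumerate(log_steps))
    prompt = (
        "Below is a numbered multi-agent session log. Return a JSON list of the "
        "step indices where the orchestrator re-plans or clearly switches strategy.\n\n"
        + numbered
    )
    breakpoints = sorted(set(json.loads(call_llm(prompt))))
    # Cut the log at each breakpoint so every trial is a single causal trajectory.
    trials, start = [], 0
    for b in breakpoints:
        if b > start:
            trials.append(log_steps[start:b])
            start = b
    trials.append(log_steps[start:])
    return trials
```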

Stage 2: Failure Attribution (Hypothesis Generation)

For each segmented “Trial,” DoVer generates a specific fault hypothesis.

  • Output Components: The hypothesis includes the suspected erroneous step index, the responsible agent, and a natural-language description of the error’s nature.
  • Key Distinction: Unlike traditional attribution, this step is merely a proposal. Its accuracy is secondary to its ability to propose a testable fix.
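
As an illustration of what a Stage 2 output might look like in code, the structure below captures the three components above; the class and field names are our own, not identifiers from the paper.

```python
from dataclasses import dataclass

@dataclass
class FaultHypothesis:
    """A testable proposal about where a trial went wrong (illustrative fields)."""
    trial_id: int           # which segmented trial this hypothesis refers to
    error_step: int         # suspected erroneous step index within the trial
    responsible_agent: str  # e.g. "Orchestrator" or "WebSurfer"
    description: str        # natural-language account of the error's nature
```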

Stage 3: Intervention Generation

This is the core intellectual function of DoVer, where the abstract fault hypothesis is converted into a concrete, executable edit command. Interventions primarily target the Orchestrator level—modifying the messages that direct sub-agent behavior—and are categorized into two types:

1. Modified Instructions

This involves correcting the specific command sent to a child agent (e.g., WebSurfer).

  • Goal: To make the intent clearer, correct incorrect parameters, or supplement essential context that the Orchestrator failed to include originally.

2. Plan Updates

This constitutes a higher-level change to the task execution flow.

  • Goal: To re-order steps, explicitly break down a complex sub-task into simpler units, or completely bypass a known problematic pathway identified by the hypothesis.
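
A plausible way to encode these two intervention types as an executable edit command is sketched below; the `Intervention` class and its fields are assumptions made for illustration, and the example text paraphrases the Case 1 fix discussed later.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Intervention:
    """An Orchestrator-level edit derived from a fault hypothesis (illustrative)."""
    kind: Literal["modified_instruction", "plan_update"]
    target_step: int   # step at which the edit is injected (hypothetical value below)
    new_content: str   # corrected instruction text or revised plan

# Example: replace a vague WebSurfer instruction with a targeted one.
fix = Intervention(
    kind="modified_instruction",
    target_step=12,
    new_content=(
        "Stop undirected scrolling. Search the APOD archive directly, restrict the "
        "dates to August 1-7, 2015, and scan for keywords like 'city lights' or 'horizon'."
    ),
)
```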

Stage 4: Intervention Execution and Verification

The system utilizes a specialized Checkpoint mechanism to revert the system state back to the moment before the proposed error occurred.

  • Execution Flow: All preceding conversation history is maintained. The modified instruction or updated plan is injected in situ at the error step, and the system then resumes running from this new point.
  • Verification: This process generates a Counterfactual Trace.
    • If the new trace successfully completes the task, the fault hypothesis is “Validated.”
    • If the new trace still fails, the hypothesis is potentially “Refuted.”
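
A rough sketch of this execute-and-verify loop follows; the checkpoint, resume, and success-check hooks are passed in as callables because they are framework-specific, and none of the names below are DoVer APIs.

```python
from typing import Any, Callable

def execute_and_verify(
    load_checkpoint: Callable[[str, int], Any],  # (session_id, step) -> restorable state
    resume_from: Callable[[Any], Any],           # state -> counterfactual trace
    task_succeeded: Callable[[Any], bool],       # trace -> did the task complete?
    session_id: str,
    error_step: int,
    new_message: str,
    n_runs: int = 3,
) -> list[bool]:
    """Replay a failed session from the error step with the intervention injected."""
    outcomes = []
    for _ in range(n_runs):  # repeated runs mitigate LLM sampling randomness
        # 1. Restore the full state (history, configs) just before the suspect step.
        state = load_checkpoint(session_id, error_step)
        # 2. Inject the modified instruction or updated plan in situ.
        state.history[error_step] = new_message
        # 3. Resume execution to produce a counterfactual trace and check the outcome.
        counterfactual_trace = resume_from(state)
        outcomes.append(task_succeeded(counterfactual_trace))
    return outcomes
```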

Quantifying Success: DoVer’s Advanced Evaluation Metrics

In the world of MAS, a simple Pass/Fail binary is insufficient. DoVer introduces a nuanced set of metrics to provide high-resolution insight into debugging effectiveness, offering developers measurable data rather than anecdotal results.

1. The Success Rate of Task Repair

To mitigate the impact of LLM randomness, each intervention within DoVer is executed three times.

  • Trial Success Rate: The percentage of interventions that lead to the successful completion of the previously failed task. This is the most direct measure of the debugging framework’s utility.

2. Progress Made (Measuring Incremental Improvement)

Even if an intervention doesn’t lead to full success, did the agent get closer to the final goal? DoVer introduces the Progress Made metric to quantify substantial incremental improvement.

  • Milestone Extraction: An LLM is used to automatically extract tool-agnostic key milestones for any given task (e.g., “Find file,” “Download document,” “Read data”).
  • Quantification: The metric calculates the gain in completed milestones after the intervention, normalized by the total number of milestones for the task:

$$\text{Progress Made}(I) = \frac{M(\tau_I) - M(\tau_0)}{N}$$

Where:

  • $\tau_0$: the original, failed execution trace.
  • $\tau_I$: the new execution trace after intervention $I$.
  • $M(\cdot)$: the number of milestones successfully achieved in a trace.
  • $N$: the total number of required milestones for the task.

This formula provides an accurate, quantifiable measure of the substantive progress—for example, a +15.7% progress made means the agent completed an average of one and a half more key steps towards the goal after the fix.
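
Translated directly into code, with the simplifying assumption that milestone completion per trace is available as a list of booleans (e.g. from an LLM judge), the metric might be computed like this:

```python
def progress_made(milestones_before: list[bool], milestones_after: list[bool]) -> float:
    """Gain in completed milestones after an intervention, normalized by the total count."""
    assert len(milestones_before) == len(milestones_after), "same milestone list for both traces"
    total = len(milestones_before)
    gain = sum(milestones_after) - sum(milestones_before)
    return gain / total

# Example: the original trace hit 2 of 6 milestones, the counterfactual trace hit 4.
print(progress_made([True, True, False, False, False, False],
                    [True, True, True, True, False, False]))  # 0.333...
```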

3. Hypothesis Validation: The Diagnostic Breakdown

DoVer’s most valuable diagnostic output is the classification of the intervention’s result, offering four definitive labels for the original fault hypothesis:

| Validation Category | Criteria (based on 3 runs) | Diagnostic Meaning for Developers |
| --- | --- | --- |
| Validated | At least 2 of 3 runs succeed. | The hypothesis was correct. The attributed error was the root cause, and the proposed fix/intervention is effective. |
| Partially Validated | Runs fail, but at least 2 runs faithfully execute the intervention and show substantial milestone progress. | The hypothesis was directionally correct, but execution was constrained by a lack of underlying agent capability or by tool limitations. |
| Refuted | The intervention was executed, but progress was insignificant. | The hypothesis was incorrect. The error lies elsewhere, saving developers from pursuing a false lead. |
| Inconclusive | The agent fails to execute the intervention instruction, or other unresolved issues occur. | Points to a system weakness, such as a tool malfunction or a severe instruction-following inability. |
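
The four labels can be reproduced with a simple decision rule over the three runs, as sketched below; the milestone-progress threshold is a placeholder, since the paper’s exact cutoff is not reproduced in this article.

```python
def classify_hypothesis(
    succeeded: list[bool],            # per-run task success (3 runs)
    executed_faithfully: list[bool],  # per-run: did the agent actually apply the intervention?
    progress: list[float],            # per-run Progress Made
    progress_threshold: float = 0.1,  # placeholder cutoff, not the paper's exact value
) -> str:
    """Map three intervention runs onto the four diagnostic labels (sketch)."""
    if sum(succeeded) >= 2:
        return "Validated"
    faithful_with_progress = sum(
        faithful and gain >= progress_threshold
        for faithful, gain in zip(executed_faithfully, progress)
    )
    if faithful_with_progress >= 2:
        return "Partially Validated"
    if any(executed_faithfully):
        return "Refuted"        # intervention ran, but progress stayed insignificant
    return "Inconclusive"       # the agent never executed the intervention as instructed
```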

Empirical Evidence: Quantified Performance Across Frameworks

DoVer’s effectiveness is not merely theoretical; it has been rigorously tested across distinct agent frameworks (Magentic-One and AutoGen2’s MathChat) and demanding, high-difficulty benchmarks.

Key Experimental Setup

  • Agent Architectures: Magentic-One (M1) and the MathChat system based on AutoGen2 (AG2).
  • Benchmarks:
    • WW-AB (AssistantBench): General-purpose complex tasks.
    • WW-GAIA (GAIA benchmark): High-difficulty tasks involving file processing, web browsing, and code execution.
    • GAIA-Level-1: A validation set of GAIA tasks.
    • GSMPlus: A challenging mathematical reasoning dataset.
  • Primary Models: GPT-4o for trace generation and intervention, with GPT-5 used as the evaluation judge. The study also included open-source alternatives.

Table 1: Failure to Success Rate (The “Fixing” Power)

| Dataset | Intervention Trials | Trial Success Rate (Fixing Failures) | Average Progress Made |
| --- | --- | --- | --- |
| WW-AB (Magentic-One) | 72 | 17.6% | +0% |
| WW-GAIA (Magentic-One) | 99 | 17.6% | +8.8% |
| GAIA-Level-1 (Magentic-One) | 63 | 27.5% | +15.7% |
| GSMPlus (AutoGen2) | 198 | 49.0% | N/A |

Data Analysis:

  1. High-Confidence Repair: The GSMPlus dataset results are the most striking, demonstrating an outstanding 49.0% failure repair rate for mathematical reasoning tasks. This proves DoVer’s efficiency in correcting logical or step-by-step planning errors.
  2. Challenging Domains: Despite the high complexity of tasks in GAIA and AssistantBench—which involve external tool dependencies (web/files)—DoVer consistently repaired between 17.6% and 27.5% of failed cases.
  3. Measurable Value: On the GAIA-Level-1 set, the +15.7% increment in “Progress Made” is highly significant. This translates to the agent achieving an average of over one and a half additional critical milestones after the intervention, providing substantive value even when full success isn’t achieved.

Table 2: The Diagnostic Power (Hypothesis Validation)

This table shows the distribution of outcomes for the original fault hypothesis, demonstrating DoVer’s strength as a diagnostic tool:

| Category | WW-AB | WW-GAIA | GAIA-Level-1 |
| --- | --- | --- | --- |
| Validated | 15.3% | 16.2% | 34.9% |
| Partially Validated | 4.2% | 5.1% | 12.7% |
| Refuted | 13.9% | 21.2% | 23.8% |
| Inconclusive | 66.7% | 57.6% | 28.6% |

Deep Interpretation of Diagnostics:

  1. Effective Validation: DoVer successfully Validated up to 34.9% of hypotheses, confirming a significant portion of the LLM’s initial attribution was correct and that the intervention fixed the issue.
  2. Error Correction: The framework Refuted between 13.9% and 23.8% of the hypotheses. This function is arguably as important as the ‘Validated’ result, as it steers developers away from hours spent debugging the wrong component or step.
  3. Pinpointing Capability Gaps: The high rate of Inconclusive outcomes in the WW-AB (66.7%) and WW-GAIA (57.6%) datasets is crucial. Analysis revealed that this category primarily exposed Sub-agent Capability Bottlenecks. For instance, the WebSurfer agent frequently failed to execute the correct intervention because it lacked a foundational tool (e.g., a “scroll to bottom” function or a proper PDF parsing library). The Inconclusive result transforms an abstract “failure” into a concrete, actionable requirement: The agent needs a new tool or better internal logic.

Case Studies: DoVer in Action (Experience-Driven Content)

To truly understand the value of an intervention-driven approach, let’s examine two real-world failure scenarios and how DoVer handled them.

Case 1: Validated Success—Fixing a Web Navigation Loop

This example illustrates a successful fix, confirming the initial hypothesis and leading to task completion.

  • Task: Identify the construction company for a Chicago landmark indirectly mentioned in the NASA APOD (Astronomy Picture of the Day) entries between August 1 and August 7, 2015.
  • Original Failure: The WebSurfer agent got stuck in an ineffective scrolling loop on the APOD archive page, failing to locate the required date entries.
  • DoVer’s Diagnosis (Hypothesis): The WebSurfer was using a poor, undirected scrolling strategy that prevented it from finding the target.
  • Generated Intervention: The Orchestrator’s instruction to the WebSurfer was modified. The new directive explicitly commanded the agent to stop undirected scrolling and instead “directly search the APOD archive, restricting the date to August 1–7, 2015, and scanning for keywords like ‘city lights’ or ‘horizon’.”
  • Result (Counterfactual Trace): The agent executed the new, targeted search strategy, quickly locating the relevant APOD entry. It successfully identified the “Marquette Building” and its constructor, “Holabird & Roche.” The task was completed.
  • Conclusion: The case was marked as Validated. The original strategy was the root cause of the failure, and a targeted instruction modification was the successful corrective action.

Case 2: Refuted Hypothesis—Identifying Tool Deficiencies

This case is equally valuable, demonstrating how DoVer prevents developers from debugging the wrong component.

  • Task: Find the arrival time for the Tri-Rail train with the highest passenger count arriving at Pompano Beach on May 27, 2019.
  • Original Failure: The agent failed to extract the necessary information and output the final result.
  • DoVer’s Diagnosis (Hypothesis): The debugging LLM hypothesized that the failure stemmed from the WebSurfer not successfully opening a specific data file, even though the log suggested an attempt was made.
  • Generated Intervention: The system was forced to intervene with an explicit instruction to the WebSurfer: “Mandatorily open the specific data file again and attempt to parse it.”
  • Result (Counterfactual Trace): The agent diligently executed the forced file-opening operation. However, despite successfully performing the command, it still could not locate the required information, and the task failed again.
  • Conclusion: The case was marked as Refuted. This is a high-value diagnostic. It immediately informed the developer that the problem was not that the agent failed to open the file. Instead, the true failure root cause must lie in:
    • The file contents being missing or corrupted.
    • The agent’s internal logic for extracting and synthesizing information from that file being flawed.

Technical Integration and Open-Source Compatibility (FAQ/Schema)

The technical implementation of DoVer requires careful engineering to support the intervention and verification loop.

How-To: Integrating DoVer into AutoGen2 (AG2)

Implementing DoVer’s core functionality—Checkpointing and Replay—in an agent framework like AutoGen2 (AG2) requires specific technical modifications:

  1. Lightweight Checkpoint Layer: A wrapping layer is added around AG2’s Conversation Manager. At every critical step, this layer serializes the entire system state, including all conversation history, agent configurations, and LLM parameters. This creates the necessary Checkpoint.
  2. State Restoration and Injection: During the intervention phase, the system loads the target step’s checkpoint, meticulously reconstructs all objects in memory, and then directly injects the modified message object (the Intervention) into the conversation history.
  3. Counterfactual Execution: The system then allows the agents to resume execution from this altered state, generating the new counterfactual trace. A developer-facing Web Debugging UI was even built to visualize this process, allowing users to select a step, edit the message, and observe the resulting counterfactual run.
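
A minimal, framework-agnostic sketch of such a checkpoint layer is shown below; the real AG2 integration also serializes agent configurations and LLM parameters, which this toy version omits.

```python
import copy
import pickle
from pathlib import Path
from typing import Any

class CheckpointLayer:
    """Per-step state snapshots wrapped around a conversation manager (minimal sketch)."""

    def __init__(self, checkpoint_dir: str = "checkpoints") -> None:
        self.dir = Path(checkpoint_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def save(self, session_id: str, step: int, state: dict[str, Any]) -> None:
        # Serialize a deep copy so later mutation of live objects cannot corrupt it.
        path = self.dir / f"{session_id}_{step:04d}.pkl"
        path.write_bytes(pickle.dumps(copy.deepcopy(state)))

    def restore(self, session_id: str, step: int) -> dict[str, Any]:
        path = self.dir / f"{session_id}_{step:04d}.pkl"
        return pickle.loads(path.read_bytes())

    def inject(self, state: dict[str, Any], step: int, new_message: dict[str, Any]) -> dict[str, Any]:
        # Replace the message at the error step with the intervention, in situ.
        state["history"][step] = new_message
        return state
```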

FAQ: Is DoVer restricted to proprietary models?

Q: Can DoVer operate using open-source Large Language Models (LLMs)?

A: Yes, DoVer exhibits excellent compatibility with open-source models. The research specifically tested the Qwen3-8B and Qwen3-32B models. The results were highly positive:

  • The Qwen3-32B model achieved a failure recovery rate of 16.9% on the WW-GAIA benchmark. This figure is exceptionally close to the performance of GPT-4o, which achieved 17.6% on the same set.
  • The smaller Qwen3-8B model, when provided with a 3-shot prompting setup (three examples), saw its success rate improve from 11.3% to 14.3%.

This data confirms that the core intervention logic of DoVer is robust and can be effectively deployed on locally hosted or less-resource-intensive open-source models.

Q: Does DoVer intervene by modifying the agent’s actual code?

A: No, in the current framework, DoVer’s interventions are limited to textual modifications at the Orchestrator level (i.e., changing instructions and plans). It does not automatically modify the sub-agent’s underlying Python code, nor does it create new tools. This limitation explains why some cases fall into the “Inconclusive” category; if the fix requires a tool the agent simply doesn’t possess, modifying the instruction alone is insufficient.

Q: How does DoVer compare to methods like “Self-Refine” or “CRITIC” that make the model reflect on its own output?

A: DoVer offers a superior solution in multi-agent environments. Comparative testing on the WW-GAIA dataset showed that self-refinement methods like CRITIC achieved a 0% repair rate. The reason for this failure is that multi-agent traces are long and noisy. Attempting to make a model “reflect” and simply rewrite the final answer at the very end of a long, complex execution trajectory is highly unlikely to correct fundamental execution deviations that occurred dozens of steps earlier. DoVer’s strength is its in-situ intervention, which corrects the error exactly at the point it occurred in the sequence.

Q: Is DoVer suitable for all Multi-Agent Systems?

A: DoVer’s core principles (segmentation, attribution, verification) are universally applicable. However, for practical integration, the target agent system must meet two non-negotiable engineering requirements:

  1. Complete Log Availability: The system must generate detailed, comprehensive logs of all agent-to-agent and agent-to-tool interactions.
  2. Robust Checkpointing: The system must natively support the ability to save the complete execution state and restore execution from a specific step. Black-box or purely asynchronous systems may require significant architectural modification to adopt DoVer.
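
As a practical checklist, these two requirements can be expressed as a minimal interface that a target system would need to expose; the Protocol below is our own illustration, not an API defined by the paper.

```python
from typing import Any, Protocol

class DebuggableMAS(Protocol):
    """Capabilities a multi-agent system must expose for intervention-driven debugging."""

    def full_interaction_log(self, session_id: str) -> list[dict[str, Any]]:
        """Complete agent-to-agent and agent-to-tool messages for a session."""
        ...

    def save_checkpoint(self, session_id: str, step: int) -> None:
        """Persist the complete execution state at a given step."""
        ...

    def resume_from_checkpoint(self, session_id: str, step: int) -> list[dict[str, Any]]:
        """Restore the state saved at `step` and continue execution, returning the new trace."""
        ...
```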

Conclusion: Elevating Debugging from Guesswork to Engineering

The DoVer framework represents a pivotal moment in the development of robust LLM Multi-Agent Systems. By replacing passive, log-based speculation with an active, intervention-driven closed loop, it provides developers with a quantifiable, verifiable, and highly efficient path to diagnose and repair failures.

The data confirms its value: successfully repairing a significant percentage of previously failed complex tasks and, crucially, validating or refuting a large portion of potential fault hypotheses. This ability to transform an abstract “system failed” notification into a concrete diagnostic output—such as “Hypothesis Refuted, investigate tool extraction logic” or “Inconclusive, agent lacks the required scroll tool”—is the true long-term value.

DoVer is a practical engineering solution that anchors the reliability of advanced AI systems in the rigorous methodology of scientific experimentation. It moves the discipline of LLM debugging out of the realm of mysticism and into the predictable domain of computer science.
