DS-STAR: Google’s Multi-Agent Breakthrough That Teaches AI to Think Like a Data Scientist

How a new framework transforms messy CSVs, JSON files, and text documents into reliable Python code without human intervention

Imagine walking into your office to find a zip file containing seven different data formats—CSV tables, nested JSON files, markdown documents, and unstructured text logs. Your boss asks you to “find insights” from this data jumble. A typical data scientist would spend hours manually inspecting files, writing exploratory code, debugging errors, and iterating on their analysis plan. Now, Google Cloud and KAIST researchers have developed DS-STAR, an AI system that automates this entire process by mimicking how human experts actually work: through iterative planning, execution, and self-verification.

The Hidden Complexity Behind Data Science Automation

Data science fundamentally transforms raw information into actionable business intelligence. Companies rely on these insights for strategic decisions that shape their future. Yet the workflow remains notoriously complex, demanding deep expertise across computer science, statistics, and domain knowledge. The process involves time-consuming tasks: interpreting scattered documents, performing intricate data transformations, and conducting statistical analyses that often span 220+ lines of code divided into multiple sequential steps.

Existing AI tools, despite impressive advances, hit three critical walls:

First, they obsess over well-structured data like relational CSV tables, ignoring the reality of enterprise environments where information lives across heterogeneous formats—JSON configurations, markdown documentation, and unstructured logs. When faced with a JSON array containing nested merchant data or a markdown file with embedded tables, most agents simply fail.

Second, verification is nearly impossible. Data science tasks are open-ended questions without ground-truth labels. When an agent like Data Interpreter successfully executes code, it stops—even though runnable code doesn’t guarantee correct answers. This lack of robust verification leads to sub-optimal analysis plans that miss crucial steps or make flawed assumptions.

Third, error handling is primitive. A single missing column or schema mismatch crashes the entire pipeline, requiring manual debugging that defeats the purpose of automation.

DS-STAR breaks through these barriers by reframing data science automation from a code generation problem into an iterative search for verifiable solutions.

Introducing DS-STAR: The Data Science Agent That Checks Its Own Work

DS-STAR (Data Science Agent via Iterative Planning and Verification) introduces three fundamental innovations that distinguish it from existing approaches:

1. Universal Data File Analysis

Rather than treating data as a monolithic block, DS-STAR generates a “digital fingerprint” for every file. For each document in your dataset, a specialized Analyzer agent creates a Python script that extracts structural metadata: column names and types for CSVs, key hierarchies for JSON, text summaries for unstructured files, and sheet structures for Excel workbooks. These descriptions become shared context for all subsequent agents, enabling them to understand the data landscape without loading every byte into memory.
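To make this concrete, here is a minimal sketch of the kind of per-file fingerprint such an Analyzer-generated script could extract. The describe_file helper, its dispatch logic, and the use of pandas are illustrative assumptions, not the paper’s actual implementation.

```python
# Illustrative sketch of a per-file "fingerprint" extractor (not the paper's code).
import json
from pathlib import Path

import pandas as pd


def describe_file(path: str, sample_rows: int = 3) -> str:
    """Return a short structural description of one data file."""
    p = Path(path)
    if p.suffix == ".csv":
        df = pd.read_csv(p, nrows=100)  # peek at the head, not the whole file
        return (
            f"{p.name}: CSV with columns {list(df.columns)}, "
            f"dtypes {df.dtypes.astype(str).to_dict()}, sample:\n"
            f"{df.head(sample_rows).to_string(index=False)}"
        )
    if p.suffix == ".json":
        data = json.loads(p.read_text())
        if isinstance(data, list):
            keys = sorted(data[0]) if data and isinstance(data[0], dict) else []
            return f"{p.name}: JSON array with {len(data)} elements, element keys {keys}"
        if isinstance(data, dict):
            return f"{p.name}: JSON object with top-level keys {sorted(data)}"
        return f"{p.name}: JSON scalar of type {type(data).__name__}"
    # Markdown, logs, and other text: fall back to a truncated preview.
    text = p.read_text(errors="ignore")
    return f"{p.name}: text file, {len(text)} characters, preview: {text[:200]!r}"
```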

2. LLM-Based Verification

At each stage, a Verifier agent—acting as an impartial judge—evaluates whether the current analysis plan actually solves the user’s question. This isn’t a syntax checker; it’s a reasoning evaluator that examines the cumulative plan, generated code, and execution results to render a binary verdict: sufficient or insufficient. This judgment grounds the entire process, preventing the “it runs, ship it” trap.
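As a rough illustration of what this judge does, the sketch below builds a verification prompt from the question, the cumulative plan, the code, and its output, then reads off a binary verdict. The llm callable and the prompt wording are placeholders, not the paper’s actual prompt.

```python
# Hedged sketch of an LLM-as-judge verifier; `llm` is any text-in/text-out call.
def verify(llm, question: str, plan: list[str], code: str, result: str) -> bool:
    """Return True if the model judges the current analysis sufficient."""
    prompt = (
        "You are reviewing a data-analysis plan.\n"
        f"Question: {question}\n"
        "Plan so far:\n- " + "\n- ".join(plan) + "\n"
        f"Code:\n{code}\n"
        f"Execution result:\n{result}\n"
        "Is this sufficient to fully answer the question? Reply YES or NO."
    )
    verdict = llm(prompt)  # the model's free-text reply
    return verdict.strip().upper().startswith("YES")
```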

3. Sequential Iterative Refinement

Instead of generating massive plans upfront, DS-STAR starts with a simple, executable first step (like loading a specific CSV). After each execution, it reviews intermediate results before deciding the next move. A Router agent determines whether to append a new step or correct a flawed previous one, enabling graceful backtracking when errors emerge. This mirrors how senior analysts use Jupyter notebooks—inspect outputs, adjust strategy, repeat until confident.

The Two-Stage Architecture: From Understanding to Solution

Stage 1: Automated Data Discovery

The process begins with parallel file analysis. For each data file Dᵢ, the Analyzer agent generates and executes a descriptive script:

dᵢ = execute(Analyzer(Dᵢ))

These scripts are self-contained Python programs that load the file and print essential information. For a merchant_data.json file, the script reports the top-level type, element count, column names, and sample rows; for an Excel workbook, it lists sheet names, dimensions, column headers, and data types. The collection {dᵢ} becomes the foundation for all planning decisions.
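As an illustration, a generated description script for an Excel workbook might look like the following; the file name is hypothetical and the exact printed fields are an assumption based on the paper’s description.

```python
# Illustrative Analyzer-style description script for an Excel workbook
# (hypothetical file name; reading .xlsx with pandas requires openpyxl).
import pandas as pd

xls = pd.ExcelFile("payments_2023.xlsx")
for sheet in xls.sheet_names:
    df = xls.parse(sheet)
    print(f"Sheet '{sheet}': {df.shape[0]} rows x {df.shape[1]} columns")
    print("  columns:", list(df.columns))
    print("  dtypes :", df.dtypes.astype(str).to_dict())
```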

Stage 2: The Iterative Execution Cycle

The core loop orchestrates five specialized agents into a coherent workflow:

Planner: Given the user’s question and current data descriptions, proposes the next logical step. Initially, this might be “Load February 2023 transactions for merchant Rafa_AI.” After seeing results, it might suggest “Enrich transaction data with merchant category codes from JSON.”

Coder: Translates the plan into executable Python code. Critically, it receives not just the current step but the entire plan history and previous code implementation, enabling incremental development rather than rewriting from scratch each time.

Executor: Runs the generated script in a sandboxed environment, capturing both outputs and error tracebacks.

Verifier: The LLM-based judge reviews the cumulative plan, code, and execution results against the original question. Its prompt is simple: “Is this sufficient?” A “No” answer triggers refinement.

Router: When verification fails, Router decides the correction strategy:

  • “Add Step”: The plan is incomplete but correct; proceed sequentially
  • “Step N is wrong”: Truncate the plan at step N-1 and regenerate from there

Finalyzer: Once the Verifier approves, this agent ensures conformance to output format requirements (CSV rounding, JSON structure, etc.).

The loop continues for up to 20 rounds, with each iteration grounded in actual execution feedback rather than theoretical correctness.

Deep Dive: The Algorithm That Makes It Work

The paper’s Algorithm 1 reveals an elegantly simple loop:

  1. Input: User query q, data files D, maximum rounds M
  2. File Analysis: Generate descriptions dᵢ for all N files
  3. Initialize: Plan p₀, code s₀, result r₀
  4. Iterate: For k from 0 to M-1:

    • Verdict = Verifier(pₖ, q, sₖ, rₖ)
    • If “sufficient”: break and return solution
    • Else: Router decides to add or truncate
    • Planner generates next step pₖ₊₁
    • Coder implements updated plan → sₖ₊₁
    • Execute → rₖ₊₁
  5. Output: Final verified solution

This transforms data science into a search problem over executable script space, where each iteration prunes invalid paths based on empirical validation.
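Rendered as plain Python control flow, the loop might look like the sketch below. Each agent argument is a placeholder callable wrapping an LLM prompt; only the control structure follows Algorithm 1, while the signatures and the router’s verdict format are illustrative assumptions.

```python
# Hedged sketch of Algorithm 1's control flow (agent callables are placeholders).
def ds_star_loop(question, descriptions,
                 planner, coder, executor, verifier, router, finalyzer,
                 max_rounds=20):
    # Initialize with a simple, executable first step (p0, s0, r0).
    plan = [planner(question, descriptions, [], "")]
    code = coder(question, descriptions, plan, prev_code="")
    result = executor(code)

    for _ in range(max_rounds):
        if verifier(question, plan, code, result):      # "sufficient" -> stop
            break
        verdict = router(question, plan, code, result)  # e.g. "add" or "wrong:3"
        if verdict.startswith("wrong:"):
            n = int(verdict.split(":")[1])
            plan = plan[: n - 1]                        # truncate at step N-1
        plan.append(planner(question, descriptions, plan, result))
        code = coder(question, descriptions, plan, prev_code=code)
        result = executor(code)

    # Conform the approved solution to output-format requirements.
    return finalyzer(question, plan, code, result)
```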

Benchmark Wars: How DS-STAR Stacks Up

Three challenging benchmarks validate DS-STAR’s effectiveness:

DABStep: Multi-Step Data Reasoning

The DABStep benchmark contains 450 realistic tasks using seven heterogeneous files. Hard tasks require analyzing multiple sources and applying domain knowledge, averaging 220 lines of code across four sequential steps.

Framework            Model              Easy Tasks   Hard Tasks
None (standalone)    Gemini-2.5-Pro     66.67%       12.70%
ReAct                Claude-4-Sonnet    81.94%       19.84%
Data Interpreter     Gemini-2.5-Pro     72.22%        3.44%
DA-Agent             Gemini-2.5-Pro     68.06%       22.49%
DS-STAR              Gemini-2.5-Pro     87.50%       45.24%

The 32.54-point jump over standalone Gemini-2.5-Pro on hard tasks demonstrates DS-STAR’s superiority in handling multi-source complexity. It currently ranks #1 on the public DABStep leaderboard.

KramaBench: Finding Needles in Data Lakes

This benchmark tests autonomous data discovery across six domains with up to 1,556 files per domain. DS-STAR integrates a Retriever that selects the top 100 most relevant files via embedding similarity.

Setting                      Framework   Total Score
With Retrieval               DA-Agent    39.79%
With Retrieval               DS-STAR     44.69%
Oracle (perfect retrieval)   DA-Agent    48.61%
Oracle (perfect retrieval)   DS-STAR     52.55%

The 8-point gap between retrieval and oracle settings shows room for improvement in file selection, but DS-STAR’s absolute performance remains state-of-the-art.

DA-Code: General Data Science Tasks

DA-Code spans 500 tasks across data wrangling, machine learning, and exploratory data analysis. In the hard category, DS-STAR achieves 37.1% accuracy versus DA-Agent’s 32.0%—a significant edge when both use identical Gemini-2.5-Pro backends.

What Makes It Tick: Component Ablation Studies

The Data File Analyzer: Context is King

Removing the Analyzer causes hard-task accuracy on DABStep to plummet from 45.24% to 26.98%. This validates that rich file descriptions are critical for planning. Without understanding column structures, data types, and content samples, the Planner agent operates blind.

In a qualitative case study, ReAct failed to filter “NextPay” transactions because it couldn’t locate the relevant column. DS-STAR’s Analyzer correctly identified the card_scheme column, enabling accurate filtering.

The Router: Correction Beats Accumulation

Eliminating the Router forces sequential step addition without error correction. Performance degrades on both easy and hard tasks because building on flawed steps creates cascading failures. The Router’s ability to truncate and regenerate from failure points is more effective than blindly accumulating steps.

The Cost of Intelligence: Token Consumption Reality Check

Superior performance comes at a price:

Method     Input Tokens   Cost Per Task
ReAct      44,691         $0.09
DA-Agent   39,999         $0.09
DS-STAR    154,669        $0.23

The roughly 3.5x increase in input tokens stems from the comprehensive file descriptions fed into each iteration. At Gemini-2.5-Pro pricing ($10.00 per million output tokens), DS-STAR remains economically viable for professional use—$0.23 for analyst-grade work is negligible compared to human hourly rates.
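As a back-of-envelope check on that figure, assume the published list price of $1.25 per million input tokens alongside the $10.00-per-million output rate above; the input rate is an assumption, not a number reported in the article.

```python
# Rough decomposition of the reported ~$0.23/task (input rate is an assumption).
input_tokens = 154_669
input_cost = input_tokens * 1.25 / 1_000_000            # ~= $0.19
implied_output_tokens = (0.23 - input_cost) * 1_000_000 / 10.00
print(f"input cost ~ ${input_cost:.2f}, implied output ~ {implied_output_tokens:,.0f} tokens")
# -> input cost ~ $0.19, implied output ~ 3,666 tokens
```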

Real-World Execution: A Five-Round Case Study

The paper’s Appendix F provides a full execution log for this question: “In February 2023, if the relative fee ID=17 changes its rate to 99, what delta would merchant Rafa_AI pay?”

Round 0: Load February transactions → Insufficient

Round 1: Calculate original fees by enriching data with merchant profiles and matching fee rules → Insufficient

Round 2: Attempt to calculate delta by modifying fee rule 17 → Step 3 is wrong

Round 3: Refine delta calculation logic → Step 3 is wrong

Round 4: Correctly identify that only transactions matching all conditions of fee ID=17 should be considered → Insufficient

Round 5: Calculate original vs new fees for affected transactions, derive delta → Sufficient

Final answer: 2.6777 EUR increase. This showcases DS-STAR’s ability to self-correct and incrementally deepen its analysis based on verification feedback.

Robustness Features for Production Deployment

Auto-Debugger: Fixing Data-Centric Errors

When scripts fail, the Debugger agent receives:

  1. The broken code
  2. Error traceback
  3. File descriptions {dᵢ}

Traditional debuggers rely on stack traces alone. DS-STAR’s Debugger uses column headers, sheet names, and schema information to generate fixes. For example, if code references a non-existent column, Debugger can suggest the correct name from the Analyzer’s description.
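A minimal sketch of how such a data-aware debugging prompt could be assembled is shown below; the prompt wording and the llm callable are illustrative, and the key point is simply that the file descriptions travel alongside the traceback.

```python
# Hedged sketch of a schema-aware debug step (prompt wording is illustrative).
def debug(llm, broken_code: str, traceback_text: str, descriptions: dict[str, str]) -> str:
    """Ask the model for a fix, grounded in the Analyzer's file descriptions."""
    context = "\n".join(f"- {name}: {desc}" for name, desc in descriptions.items())
    prompt = (
        "The following analysis script failed.\n"
        f"Code:\n{broken_code}\n"
        f"Traceback:\n{traceback_text}\n"
        f"Data file descriptions:\n{context}\n"
        "Fix the script so it matches the actual schemas above "
        "(for example, correct a misspelled column name). Return only the fixed code."
    )
    return llm(prompt)
```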

Smart Retriever: Scaling to Massive Data Lakes

For domains with >100 files, the Retriever uses Gemini-Embedding-001 to compute similarity between the user’s query and each file description. Only the top-K (K=100) most relevant files enter the agent’s context window. This enables handling KramaBench’s astronomy domain (1,556 files) without overwhelming the LLM.
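The retrieval step itself reduces to cosine similarity over embedded file descriptions, roughly as sketched below; embed stands in for an embedding API call (the paper uses Gemini-Embedding-001), and the function name is illustrative.

```python
# Hedged sketch of top-K file retrieval by embedding similarity (K=100 in the paper).
import numpy as np


def top_k_files(embed, query: str, descriptions: dict[str, str], k: int = 100) -> list[str]:
    """Rank files by cosine similarity between the query and each file description."""
    q = np.asarray(embed(query), dtype=float)
    scores = {}
    for name, desc in descriptions.items():
        v = np.asarray(embed(desc), dtype=float)
        scores[name] = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```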

Cross-Model Generalizability

DS-STAR’s architecture isn’t tied to a single LLM backbone. Testing with GPT-5 reveals:

Model            Easy Tasks   Hard Tasks
Gemini-2.5-Pro   87.50%       45.24%
GPT-5            88.89%       43.12%

The framework adapts gracefully, though the two models show task-specific strengths: GPT-5 edges ahead on straightforward tasks, while Gemini-2.5-Pro holds the advantage on complex multi-step problems.

Current Boundaries and Future Horizons

Limitations

  1. Full Automation: The current system operates independently. Integrating human intuition and domain expertise remains an open challenge.
  2. Cost-Performance Tradeoff: While $0.23/task is reasonable, large-scale deployments must budget for token consumption.
  3. Retrieval Precision: In massive data lakes, missing relevant files during retrieval limits ultimate performance.

Promising Directions

Human-in-the-Loop Integration: A compelling next step involves synergistically combining DS-STAR’s automation with expert guidance. Imagine a workflow where the Agent handles 80% of routine analysis, flagging ambiguous cases for human review. This hybrid model could dramatically boost accuracy while retaining efficiency.

Advanced Retrieval: Semantic search improvements and metadata enrichment could shrink the gap between retrieval and oracle performance on KramaBench.

Domain Specialization: Fine-tuning verifiers and planners for specific industries (finance, healthcare, retail) could elevate accuracy further by encoding domain constraints.

Frequently Asked Questions

Q: How does DS-STAR differ from asking ChatGPT to write analysis code?

A: Standard LLMs generate one-shot solutions without self-verification. DS-STAR’s Verifier agent acts as an internal critic that checks if the plan actually answers the question, while the Router enables course correction. It’s the difference between a student submitting first-draft homework versus a professional who reviews and revises their work.

Q: Can DS-STAR handle SQL databases too?

A: The framework is designed for Text-to-Python over heterogeneous files. While it could generate SQL code, its strength lies in processing JSON, Markdown, Excel, and unstructured text—formats that traditional Text-to-SQL systems cannot handle.

Q: What happens when data contains sensitive information?

A: The paper doesn’t address privacy explicitly. However, since DS-STAR can be deployed locally with open-weight models, organizations can process confidential data within secure environments without transmitting it to external APIs.

Q: How many iterations does a typical analysis require?

A: Easy tasks average 3.0 rounds, with over 50% completing in one iteration. Hard tasks need 5.6 rounds on average. The 20-round maximum is rarely reached, suggesting most problems resolve within single-digit iterations.

Q: Is DS-STAR ready for production use?

A: For enterprise scenarios where 45% accuracy on hard tasks is valuable (e.g., exploratory analysis, prototyping), yes. For mission-critical decisions requiring near-perfect accuracy, it currently serves best as a productivity amplifier rather than a full replacement.

Q: What makes the Verifier agent reliable?

A: The Verifier bases judgments on actual execution results, not just plan descriptions. By conditioning on both code and output, it can detect logical gaps that static analysis would miss. However, like any LLM judge, it’s not infallible and benefits from the iterative refinement loop.

Industry Implications: Toward Democratized Data Science

DS-STAR’s practical impact extends beyond benchmark scores:

For Small Businesses: Organizations without dedicated data science teams can extract value from disparate data sources without hiring specialists. A $0.23 per-task cost makes sophisticated analysis accessible.

For Enterprise Teams: Senior analysts can delegate routine data wrangling to DS-STAR, focusing their expertise on interpretation, stakeholder communication, and strategic recommendations.

For Education: Students learning data science can study DS-STAR’s execution logs to understand professional workflows, accelerating skill development.

For AI Research: The “plan-verify-iterate” paradigm offers a template for tackling other open-ended problems where ground truth is unavailable, from scientific research to creative writing.

Final Verdict: A Paradigm Shift in Data Science Automation

DS-STAR doesn’t just generate code—it orchestrates a thoughtful analysis process. By combining file analysis, iterative planning, execution verification, and error correction, it approximates the cognitive loop of human experts:

  1. Understand the data landscape through structured descriptions
  2. Plan a sequence of steps toward the goal
  3. Execute and observe intermediate results
  4. Verify progress against the original question
  5. Refine by correcting errors or extending the plan

The 45.24% hard-task accuracy on DABStep, while imperfect, represents a massive leap over standalone LLMs and existing agents. More importantly, it establishes a framework where performance scales with iteration depth and component quality.

As the research team notes, future human-in-the-loop integrations could unlock even greater potential, blending AI automation with human intuition to create systems that are greater than the sum of their parts.

For now, DS-STAR stands as the state-of-the-art in autonomous data science—a versatile, verifiable, and iteratively intelligent agent that turns data chaos into structured insight.


Technical foundation: Based exclusively on “DS-STAR: Data Science Agent via Iterative Planning and Verification” (arXiv:2509.21825v3), Google Cloud Research Blog, and Marktechpost analysis. All performance metrics, case studies, and architectural details derived from these primary sources without external supplementation.