🌱 VitaBench: Redefining How We Evaluate Real-World AI Agents
When even the most powerful AI models succeed on barely 30% of complex real-world tasks, how do we measure and advance the next generation of intelligent agents?
The Problem: Why Current AI Benchmarks Fall Short
Large Language Models (LLMs) have made impressive strides in tool usage, reasoning, and multi-turn conversations. From OpenAI’s GPT series to Anthropic’s Claude and Google’s Gemini, every major model claims breakthrough capabilities as “intelligent assistants.” However, when we deploy these models in actual business scenarios, we discover a troubling reality:
Lab performance ≠ Real-world effectiveness
Existing agent benchmarks like ToolTalk, MINT, and τ-Bench provide valuable assessments of basic tool-calling accuracy, but they miss three critical dimensions of real-world applications:
- **Information complexity**: tasks require integrating spatial, temporal, and commonsense knowledge
- **Tool dependencies**: real business APIs form complex dependency graphs
- **User uncertainty**: users have ambiguous intentions, changing behaviors, and evolving needs
To bridge this gap, Meituan’s LongCat Team introduces VitaBench—a comprehensive benchmark focused on life-serving applications. The name “Vita” (Latin for “life”) reflects its deep connection to real-world service scenarios.
What Makes VitaBench Different?
VitaBench creates the most sophisticated life-service simulation environment to date, with several groundbreaking features:
Three Core Business Domains
- Food & Product Delivery
- In-store Consumption
- Online Travel Services
Unprecedented Scale & Complexity
- **66 integrated tools** covering read, write, and general operations
- **400 carefully designed tasks** (100 cross-domain + 300 single-domain)
- **Massive databases** with 1,324 service providers, 6,946 products, and 447 transactions
Real-World Data Foundation
Each task derives from multiple authentic user requests, manually reviewed and refined to preserve real-world ambiguity while maintaining multiple valid solution paths.
The Three-Dimensional Complexity Framework
VitaBench’s core theoretical contribution is a comprehensive framework for understanding agent task complexity:
1. Reasoning Complexity (𝒞_reason)
Quantifies the cognitive demands of processing information in partially observable environments:
```
# Example metrics
η = 1 - |𝒪|/|𝒮|   # partial observability degree
H(𝒪)              # observation space entropy
```
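As a rough illustration, both quantities can be computed from a count of observable versus total states and from the observation distribution. The following is a minimal sketch under those assumptions; the paper's exact state and observation definitions may differ.

```python
import math

def partial_observability(num_observable: int, num_states: int) -> float:
    """η = 1 - |O|/|S|: the fraction of the state space hidden from the agent."""
    return 1.0 - num_observable / num_states

def observation_entropy(obs_probs: list[float]) -> float:
    """H(O) = -Σ p·log2(p) over the observation distribution."""
    return -sum(p * math.log2(p) for p in obs_probs if p > 0)

# A toy environment where the agent can observe 120 of 1,000 states:
print(partial_observability(120, 1_000))       # 0.88
print(observation_entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```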
Real-world manifestations include:
- **Multi-constraint integration**: simultaneously addressing time, location, budget, and preference constraints
- **Implicit requirement inference**: identifying unstated preferences from user dialogue
- **Long-horizon planning**: coordinating multiple interdependent sub-tasks
2. Tool Complexity (𝒞_tool)
Models tool sets as directed graphs to quantify structural complexity:
```
G = (V, E)             # tool dependency graph
|V| = 66               # tool count
|E| = 512              # dependency edges
ρ = |E|/(|V|(|V|-1))   # edge density
```
This graph-based design naturally encodes domain rules without verbose policy documents. For example, the `modify_order` tool requires prior execution of `get_order_detail`, reflecting authentic workflow dependencies.
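A minimal sketch of this idea using `networkx`; the edges shown are illustrative, not VitaBench's actual tool graph:

```python
import networkx as nx

# A directed edge u -> v encodes "u must be executed before v".
G = nx.DiGraph()
G.add_edge("get_order_detail", "modify_order")
G.add_edge("search_restaurants", "get_restaurant_detail")

def edge_density(g: nx.DiGraph) -> float:
    """ρ = |E| / (|V|(|V|-1)) for a directed graph."""
    n = g.number_of_nodes()
    return g.number_of_edges() / (n * (n - 1))

# Prerequisites of modify_order, recovered from the graph
# rather than from a policy document:
print(list(G.predecessors("modify_order")))  # ['get_order_detail']
print(f"{edge_density(G):.3f}")
```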
3. Interaction Complexity (𝒞_interact)
Captures dynamic challenges in multi-turn conversations:
- **User profile modeling**: incorporating age, gender, and dietary restrictions
- **Behavioral attribute variation**: including cooperation levels, goal ambiguity, and emotional expression
- **Dynamic state evolution**: tracking changing user preferences and intentions throughout the dialogue
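One hypothetical way to represent such a simulated user; the field names are illustrative, not VitaBench's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    age: int
    gender: str
    dietary_restrictions: list[str] = field(default_factory=list)
    cooperation: float = 0.5       # 0 = uncooperative, 1 = fully cooperative
    goal_ambiguity: float = 0.5    # how vaguely the user states their goals
    emotional_expression: str = "neutral"

# The user simulator conditions its replies on a profile like this one:
user = UserProfile(
    age=45,
    gender="male",
    dietary_restrictions=["no high-purine foods", "no fried foods"],
    cooperation=0.3,
    emotional_expression="cold",
)
```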
Innovative Evaluation: Rubric-Based Sliding Window Assessment
Evaluating long-horizon agent behavior presents significant challenges. VitaBench introduces a novel assessment approach:
Atomic Rubric Design
Each task comes with manually crafted atomic evaluation criteria. Example rubrics:
- Restaurant within 500 m of the port
- User only eats vegetarian food
- Order must arrive by 11:30 AM
Sliding Window Processing
- Segments long trajectories into overlapping dialogue windows (sketched below)
- Maintains consistent rubric state tracking across windows
- Overcomes model context length limitations
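A minimal sketch of the segmentation step, assuming a fixed window size and stride (the actual window parameters are not specified here):

```python
def sliding_windows(turns: list[str], window: int = 8, stride: int = 4):
    """Yield overlapping windows of dialogue turns."""
    for start in range(0, max(len(turns) - window, 0) + 1, stride):
        yield turns[start:start + window]

# A 20-turn trajectory becomes four overlapping 8-turn windows,
# so the evaluator never has to hold the full dialogue in context.
turns = [f"turn {i}" for i in range(20)]
for w in sliding_windows(turns):
    print(w[0], "...", w[-1])
```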
Strict Scoring Protocol
VitaBench employs all-or-nothing scoring: a task succeeds only when every rubric is satisfied.
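In code, the protocol reduces to a conjunction over per-rubric verdicts (a trivial but clarifying sketch):

```python
def task_success(rubric_verdicts: dict[str, bool]) -> bool:
    """All-or-nothing: one unmet rubric fails the whole task."""
    return all(rubric_verdicts.values())

print(task_success({
    "restaurant_within_500m_of_port": True,
    "user_only_eats_vegetarian": True,
    "order_arrives_by_1130": False,
}))  # False
```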
This approach achieves high inter-rater agreement with human evaluators (Cohen’s κ ≥ 0.81), ensuring assessment reliability.
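Agreement of this kind can be checked with scikit-learn's `cohen_kappa_score`; the sketch below uses made-up paired verdicts purely for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Paired pass/fail judgments on the same rubrics (illustrative data only).
llm_judge   = [1, 1, 0, 1, 0, 1, 1, 0]
human_judge = [1, 1, 0, 1, 1, 1, 1, 0]

print(cohen_kappa_score(llm_judge, human_judge))  # ~0.71 on this toy data
```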
Key Findings: The Current Limits of AI Capability
VitaBench evaluation results reveal significant limitations in today’s most advanced models:
Overall Performance Landscape
In cross-domain tasks, even the best-performing model (o3-high) achieves only a 30.0% success rate. On single-domain tasks, the top model (Claude-4.1-Opus) reaches just 48.3%.
Thinking vs. Non-Thinking Models
Thinking models consistently demonstrate advantages:
- Claude-4.1-Opus improves from 21.8% (non-thinking) to 29.0% (thinking)
- GLM-4.5 advances from 20.0% to 22.8%
More importantly, thinking models achieve better performance in fewer dialogue turns, demonstrating superior efficiency.
Critical Stability Issues
Pass@k versus Pass^k metrics reveal fundamental stability challenges:
- **Pass@4** measures the probability of at least one success across four attempts (rewarding exploration)
- **Pass^4** measures the probability that all four attempts succeed (assessing consistency)

Even top models with strong Pass@4 performance see their Pass^4 scores plummet toward zero, indicating that output consistency remains a critical unsolved problem.
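Both quantities have standard unbiased estimators given n trials per task with c successes; a sketch under that assumption (with n = k = 4, Pass@4 asks for any success and Pass^4 for all four):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one success among k trials drawn from n trials with c successes)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k drawn trials succeed): the consistency metric."""
    return comb(c, k) / comb(n, k)

# A task solved on 2 of 4 attempts counts fully for Pass@4 but not Pass^4:
print(pass_at_k(4, 2, 4))   # 1.0
print(pass_hat_k(4, 2, 4))  # 0.0
```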
Deep Analysis: Error Patterns and Improvement Opportunities
Detailed analysis of 76 failed rubrics identifies three primary error categories:
Reasoning Failures (61.8%)
- **Spatio-temporal reasoning breakdowns**: failing to coordinate geographic and temporal constraints
- **Commonsense reasoning gaps**: overlooking obvious business logic or physical limitations
- **Multi-constraint integration difficulties**: struggling with conflicting requirements
Tool Usage Errors (21.1%)
- **Incorrect tool selection**: choosing inappropriate APIs in complex tool graphs
- **Parameter-passing mistakes**: format errors or missing required parameters
- **Poor failure recovery**: lacking alternative strategies after tool invocation failures
Interaction Management Issues (7.9%)
- **Insufficient proactive clarification**: not querying ambiguous requirements
- **Preference tracking loss**: forgetting user preferences expressed earlier in the conversation
- **Poor strategy adaptation**: failing to adjust interaction approaches based on user feedback
Getting Started with VitaBench
Installation & Setup
```bash
git clone https://github.com/meituan/vitabench.git
cd vitabench
pip install -e .

# Configure model parameters
export VITA_MODEL_CONFIG_PATH=/path/to/your/models.yaml
```
Sample Model Configuration
```yaml
models:
  - name: gpt-4.1
    max_tokens: 4096
    temperature: 0.0
    thinking:
      type: "enabled"
      budget_tokens: 4000
    cost_1m_token_dollar:
      prompt_price: 10.0
      completion_price: 30.0
```
Running Evaluations
```bash
# Cross-domain evaluation
vita run --domain delivery,instore,ota \
  --user-llm gpt-4.1 \
  --agent-llm claude-3.7-sonnet \
  --enable-think \
  --evaluator-llm claude-3.7-sonnet \
  --num-tasks 10 \
  --max-steps 300 \
  --csv-output results.csv
```
Analysis & Visualization
```bash
# View detailed trajectories
vita view --file data/simulations/simulation_001.json

# Re-evaluate specific trajectories
vita run --re-evaluate-file data/simulations/simulation_001.json \
  --evaluation-type strict \
  --save-to reevaluated_simulation.json
```
Real Task Example: Cross-Domain Family Trip Planning
This representative VitaBench task demonstrates the benchmark’s complexity:
User Profile:
- Occupation: blue-collar worker
- Personality: cold and concise in expression, with little emotional communication
- Dietary restrictions: avoids high-purine and fried foods
Task Instruction:
“This summer, your three-generation family is taking a cruise trip with final preparations needed. On the 27th at 3 PM, you’ll board the ship in Dalian. Find a restaurant near the port for a family gathering first—suitable for three generations, with accessibility facilities and elderly/child-friendly dishes. After selection, book a table for 6 people at 12 noon that day. Prepare special travel items for elderly family members including a walking cane and adult diapers, but bringing them would be troublesome, so arrange delivery to the restaurant around 12 noon to take directly onto the ship. Your aunt is coming from Beijing via high-speed train to meet up. Help purchase a suitable morning train ticket for that day—she prefers first class and arrival before 11 AM.”
Complexity Analysis:
- **Multi-domain coordination**: restaurant booking + product delivery + ticket purchase
- **Spatio-temporal constraints**: boarding time, delivery timing, train arrival
- **Special requirements**: accessibility facilities, elderly/child-friendly items, travel necessities
- **User characteristics**: a cold personality that shapes the interaction strategy
Key Takeaways and Future Directions
VitaBench represents a significant shift in agent evaluation paradigms—from isolated tool-calling accuracy to comprehensive assessment of real-world application complexity. Our findings indicate that even state-of-the-art models face substantial challenges when confronting real-world complexity.
Core Insights
- **Cross-domain coordination remains a bottleneck**: models perform adequately in single domains, but success rates plummet on cross-scenario tasks
- **Reasoning capabilities need improvement**: over 60% of failures stem from complex multi-constraint reasoning
- **Stability requires attention**: model output consistency lags far behind single-attempt success rates
- **Thinking mechanisms show promise**: explicit reasoning processes improve both effectiveness and efficiency
Research Opportunities
VitaBench provides a rich testing ground for:
- **Reinforcement learning**: improving agent strategies through environmental feedback
- **Planning algorithms**: enhancing long-horizon task decomposition and coordination
- **Error recovery**: developing better failure recovery and strategy adjustment capabilities
- **Personalized interaction**: dynamically adapting interaction styles based on user profiles
We believe VitaBench will become an essential resource for advancing the next generation of practical AI agents, helping the research community bridge the gap between laboratory performance and real-world application.
Access VitaBench Resources
- Paper: arXiv:2509.26490
- Website: https://vitabench.github.io/
- Codebase: https://github.com/meituan/vitabench
- Dataset: HuggingFace
- Leaderboard: Live Updates
```bibtex
@article{he2025vitabench,
  title={VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications},
  author={He, Wei and Sun, Yueqing and Hao, Hongyan and Hao, Xueyuan and Xia, Zhikang and Gu, Qi and Han, Chengcheng and Zhao, Dengchang and Su, Hui and Zhang, Kefeng and Gao, Man and Su, Xi and Cai, Xiaodong and Cai, Xunliang and Yang, Yu and Zhao, Yunke},
  journal={arXiv preprint arXiv:2509.26490},
  year={2025}
}
```
