🌱 VitaBench: Redefining How We Evaluate Real-World AI Agents
When even the most powerful AI models succeed on barely 30% of complex real-world tasks, how do we measure and advance the next generation of intelligent agents?
The Problem: Why Current AI Benchmarks Fall Short
Large Language Models (LLMs) have made impressive strides in tool usage, reasoning, and multi-turn conversations. From OpenAI’s GPT series to Anthropic’s Claude and Google’s Gemini, every major model claims breakthrough capabilities as “intelligent assistants.” However, when we deploy these models in actual business scenarios, we discover a troubling reality:
Lab performance ≠ Real-world effectiveness
Existing agent benchmarks like ToolTalk, MINT, and τ-Bench provide valuable assessments of basic tool-calling accuracy, but they miss three critical dimensions of real-world applications:
- **Information complexity**: tasks require integrating spatial, temporal, and commonsense knowledge
- **Tool dependencies**: real business APIs form complex dependency graphs
- **User uncertainty**: users have ambiguous intentions, changing behaviors, and evolving needs
To bridge this gap, Meituan’s LongCat Team introduces VitaBench—a comprehensive benchmark focused on life-serving applications. The name “Vita” (Latin for “life”) reflects its deep connection to real-world service scenarios.
What Makes VitaBench Different?
VitaBench creates the most sophisticated life-service simulation environment to date, with several groundbreaking features:
Three Core Business Domains
- Food & Product Delivery
- In-store Consumption
- Online Travel Services
Unprecedented Scale & Complexity
- **66 integrated tools** covering read, write, and general operations
- **400 carefully designed tasks** (100 cross-domain + 300 single-domain)
- **Massive databases** with 1,324 service providers, 6,946 products, and 447 transactions
Real-World Data Foundation
Each task derives from multiple authentic user requests, manually reviewed and refined to preserve real-world ambiguity while maintaining multiple valid solution paths.
The Three-Dimensional Complexity Framework
VitaBench’s core theoretical contribution is a comprehensive framework for understanding agent task complexity:
1. Reasoning Complexity (𝒞_reason)
Quantifies the cognitive demands of processing information in partially observable environments:
```
# Example metrics
η = 1 - |𝒪|/|𝒮|   # partial observability degree
H(𝒪)              # observation space entropy
```
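As a rough illustration, both quantities can be computed from a count of observable versus total states and from the observation distribution. The following is a minimal sketch under those assumptions; the paper's exact state and observation definitions may differ.

```python
import math

def partial_observability(num_observable: int, num_states: int) -> float:
    """η = 1 - |O|/|S|: the fraction of the state space hidden from the agent."""
    return 1.0 - num_observable / num_states

def observation_entropy(obs_probs: list[float]) -> float:
    """H(O) = -Σ p·log2(p) over the observation distribution."""
    return -sum(p * math.log2(p) for p in obs_probs if p > 0)

# A toy environment where the agent can observe 120 of 1,000 states:
print(partial_observability(120, 1_000))       # 0.88
print(observation_entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```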
Real-world manifestations include:
- **Multi-constraint integration**: simultaneously addressing time, location, budget, and preference constraints
- **Implicit requirement inference**: identifying unstated preferences from user dialogue
- **Long-horizon planning**: coordinating multiple interdependent sub-tasks
2. Tool Complexity (𝒞_tool)
Models tool sets as directed graphs to quantify structural complexity:
```
G = (V, E)             # tool dependency graph
|V| = 66               # tool count
|E| = 512              # dependency edges
ρ = |E|/(|V|(|V|-1))   # edge density
```
This graph-based design naturally encodes domain rules without verbose policy documents. For example, the `modify_order` tool requires prior execution of `get_order_detail`, reflecting authentic workflow dependencies.
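A minimal sketch of this idea using `networkx`; the edges shown are illustrative, not VitaBench's actual tool graph:

```python
import networkx as nx

# A directed edge u -> v encodes "u must be executed before v".
G = nx.DiGraph()
G.add_edge("get_order_detail", "modify_order")
G.add_edge("search_restaurants", "get_restaurant_detail")

def edge_density(g: nx.DiGraph) -> float:
    """ρ = |E| / (|V|(|V|-1)) for a directed graph."""
    n = g.number_of_nodes()
    return g.number_of_edges() / (n * (n - 1))

# Prerequisites of modify_order, recovered from the graph
# rather than from a policy document:
print(list(G.predecessors("modify_order")))  # ['get_order_detail']
print(f"{edge_density(G):.3f}")
```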
3. Interaction Complexity (𝒞_interact)
Captures dynamic challenges in multi-turn conversations:
- **User profile modeling**: incorporating age, gender, and dietary restrictions
- **Behavioral attribute variation**: including cooperation levels, goal ambiguity, and emotional expression
- **Dynamic state evolution**: tracking changing user preferences and intentions throughout the dialogue
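One hypothetical way to represent such a simulated user; the field names are illustrative, not VitaBench's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    age: int
    gender: str
    dietary_restrictions: list[str] = field(default_factory=list)
    cooperation: float = 0.5       # 0 = uncooperative, 1 = fully cooperative
    goal_ambiguity: float = 0.5    # how vaguely the user states their goals
    emotional_expression: str = "neutral"

# The user simulator conditions its replies on a profile like this one:
user = UserProfile(
    age=45,
    gender="male",
    dietary_restrictions=["no high-purine foods", "no fried foods"],
    cooperation=0.3,
    emotional_expression="cold",
)
```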
Innovative Evaluation: Rubric-Based Sliding Window Assessment
Evaluating long-horizon agent behavior presents significant challenges. VitaBench introduces a novel assessment approach:
Atomic Rubric Design
Each task comes with manually crafted atomic evaluation criteria. Example rubrics:
- Restaurant within 500 m of the port
- User only eats vegetarian food
- Order must arrive by 11:30 AM
Sliding Window Processing
- Segments long trajectories into overlapping dialogue windows (sketched below)
- Maintains consistent rubric state tracking across windows
- Overcomes model context length limitations
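A minimal sketch of the segmentation step, assuming a fixed window size and stride (the actual window parameters are not specified here):

```python
def sliding_windows(turns: list[str], window: int = 8, stride: int = 4):
    """Yield overlapping windows of dialogue turns."""
    for start in range(0, max(len(turns) - window, 0) + 1, stride):
        yield turns[start:start + window]

# A 20-turn trajectory becomes four overlapping 8-turn windows,
# so the evaluator never has to hold the full dialogue in context.
turns = [f"turn {i}" for i in range(20)]
for w in sliding_windows(turns):
    print(w[0], "...", w[-1])
```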
Strict Scoring Protocol
VitaBench employs all-or-nothing scoring: a task succeeds only when every rubric is satisfied.
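In code, the protocol reduces to a conjunction over per-rubric verdicts (a trivial but clarifying sketch):

```python
def task_success(rubric_verdicts: dict[str, bool]) -> bool:
    """All-or-nothing: one unmet rubric fails the whole task."""
    return all(rubric_verdicts.values())

print(task_success({
    "restaurant_within_500m_of_port": True,
    "user_only_eats_vegetarian": True,
    "order_arrives_by_1130": False,
}))  # False
```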
This approach achieves high inter-rater agreement with human evaluators (Cohen’s κ ≥ 0.81), ensuring assessment reliability.
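Agreement of this kind can be checked with scikit-learn's `cohen_kappa_score`; the sketch below uses made-up paired verdicts purely for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Paired pass/fail judgments on the same rubrics (illustrative data only).
llm_judge   = [1, 1, 0, 1, 0, 1, 1, 0]
human_judge = [1, 1, 0, 1, 1, 1, 1, 0]

print(cohen_kappa_score(llm_judge, human_judge))  # ~0.71 on this toy data
```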
Key Findings: The Current Limits of AI Capability
VitaBench evaluation results reveal significant limitations in today’s most advanced models:
Overall Performance Landscape
In cross-domain tasks, even the best-performing model (o3-high) achieves only a 30.0% success rate. On single-domain tasks, the top model (Claude-4.1-Opus) reaches just 48.3%.
Thinking vs. Non-Thinking Models
Thinking models consistently demonstrate advantages:
- Claude-4.1-Opus improves from 21.8% (non-thinking) to 29.0% (thinking)
- GLM-4.5 advances from 20.0% to 22.8%
More importantly, thinking models achieve better performance in fewer dialogue turns, demonstrating superior efficiency.
Critical Stability Issues
Pass@k versus Pass^k metrics reveal fundamental stability challenges:
- **Pass@4** measures the probability of at least one success across four attempts (rewarding exploration)
- **Pass^4** measures the probability that all four attempts succeed (assessing consistency)

Even top models with strong Pass@4 performance see their Pass^4 scores plummet toward zero, indicating that output consistency remains a critical unsolved problem.
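Both quantities have standard unbiased estimators given n trials per task with c successes; a sketch under that assumption (with n = k = 4, Pass@4 asks for any success and Pass^4 for all four):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one success among k trials drawn from n trials with c successes)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k drawn trials succeed): the consistency metric."""
    return comb(c, k) / comb(n, k)

# A task solved on 2 of 4 attempts counts fully for Pass@4 but not Pass^4:
print(pass_at_k(4, 2, 4))   # 1.0
print(pass_hat_k(4, 2, 4))  # 0.0
```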
Deep Analysis: Error Patterns and Improvement Opportunities
Detailed analysis of 76 failed rubrics identifies three primary error categories:
Reasoning Failures (61.8%)
- **Spatio-temporal reasoning breakdowns**: failing to coordinate geographic and temporal constraints
- **Commonsense reasoning gaps**: overlooking obvious business logic or physical limitations
- **Multi-constraint integration difficulties**: struggling with conflicting requirements
Tool Usage Errors (21.1%)
- **Incorrect tool selection**: choosing inappropriate APIs in complex tool graphs
- **Parameter-passing mistakes**: format errors or missing required parameters
- **Poor failure recovery**: lacking alternative strategies after tool invocation failures
Interaction Management Issues (7.9%)
- **Insufficient proactive clarification**: not querying ambiguous requirements
- **Preference tracking loss**: forgetting user preferences expressed earlier in the conversation
- **Poor strategy adaptation**: failing to adjust interaction approaches based on user feedback
Getting Started with VitaBench
Installation & Setup
```bash
git clone https://github.com/meituan/vitabench.git
cd vitabench
pip install -e .

# Configure model parameters
export VITA_MODEL_CONFIG_PATH=/path/to/your/models.yaml
```
Sample Model Configuration
```yaml
models:
  - name: gpt-4.1
    max_tokens: 4096
    temperature: 0.0
    thinking:
      type: "enabled"
      budget_tokens: 4000
    cost_1m_token_dollar:
      prompt_price: 10.0
      completion_price: 30.0
```
Running Evaluations
```bash
# Cross-domain evaluation
vita run --domain delivery,instore,ota \
  --user-llm gpt-4.1 \
  --agent-llm claude-3.7-sonnet \
  --enable-think \
  --evaluator-llm claude-3.7-sonnet \
  --num-tasks 10 \
  --max-steps 300 \
  --csv-output results.csv
```
Analysis & Visualization
```bash
# View detailed trajectories
vita view --file data/simulations/simulation_001.json

# Re-evaluate specific trajectories
vita run --re-evaluate-file data/simulations/simulation_001.json \
  --evaluation-type strict \
  --save-to reevaluated_simulation.json
```
Real Task Example: Cross-Domain Family Trip Planning
This representative VitaBench task demonstrates the benchmark’s complexity:
User Profile:
- Occupation: blue-collar worker
- Personality: cold and concise in expression, with little emotional communication
- Dietary restrictions: avoids high-purine and fried foods
Task Instruction:
“This summer, your three-generation family is taking a cruise trip with final preparations needed. On the 27th at 3 PM, you’ll board the ship in Dalian. Find a restaurant near the port for a family gathering first—suitable for three generations, with accessibility facilities and elderly/child-friendly dishes. After selection, book a table for 6 people at 12 noon that day. Prepare special travel items for elderly family members including a walking cane and adult diapers, but bringing them would be troublesome, so arrange delivery to the restaurant around 12 noon to take directly onto the ship. Your aunt is coming from Beijing via high-speed train to meet up. Help purchase a suitable morning train ticket for that day—she prefers first class and arrival before 11 AM.”
Complexity Analysis:
- **Multi-domain coordination**: restaurant booking + product delivery + ticket purchase
- **Spatio-temporal constraints**: boarding time, delivery timing, train arrival
- **Special requirements**: accessibility facilities, elderly/child-friendly items, travel necessities
- **User characteristics**: a cold personality that shapes the interaction strategy
Key Takeaways and Future Directions
VitaBench represents a significant shift in agent evaluation paradigms—from isolated tool-calling accuracy to comprehensive assessment of real-world application complexity. Our findings indicate that even state-of-the-art models face substantial challenges when confronting real-world complexity.
Core Insights
- **Cross-domain coordination remains a bottleneck**: models perform adequately in single domains, but success rates plummet on cross-scenario tasks
- **Reasoning capabilities need improvement**: over 60% of failures stem from complex multi-constraint reasoning
- **Stability requires attention**: model output consistency lags far behind single-attempt success rates
- **Thinking mechanisms show promise**: explicit reasoning processes improve both effectiveness and efficiency
Research Opportunities
VitaBench provides a rich testing ground for:
- **Reinforcement learning**: improving agent strategies through environmental feedback
- **Planning algorithms**: enhancing long-horizon task decomposition and coordination
- **Error recovery**: developing better failure recovery and strategy adjustment capabilities
- **Personalized interaction**: dynamically adapting interaction styles based on user profiles
We believe VitaBench will become an essential resource for advancing the next generation of practical AI agents, helping the research community bridge the gap between laboratory performance and real-world application.
Access VitaBench Resources
- Paper: arXiv:2509.26490
- Website: https://vitabench.github.io/
- Codebase: https://github.com/meituan/vitabench
- Dataset: HuggingFace
- Leaderboard: Live Updates
```bibtex
@article{he2025vitabench,
  title={VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications},
  author={He, Wei and Sun, Yueqing and Hao, Hongyan and Hao, Xueyuan and Xia, Zhikang and Gu, Qi and Han, Chengcheng and Zhao, Dengchang and Su, Hui and Zhang, Kefeng and Gao, Man and Su, Xi and Cai, Xiaodong and Cai, Xunliang and Yang, Yu and Zhao, Yunke},
  journal={arXiv preprint arXiv:2509.26490},
  year={2025}
}
```
