
K2 Vendor Verifier: Ensuring Reliable Tool Calls for Kimi K2

In the rapidly evolving world of AI, where new models and capabilities emerge almost daily, one critical aspect often gets overlooked: reliability. When it comes to AI agents—systems designed to perform tasks independently—the ability to accurately interact with external tools (known as “tool calls”) can make or break their usefulness. This is especially true for Kimi K2, a model specifically built with a focus on “agentic loop”—the continuous cycle of an AI agent receiving inputs, processing information, using tools, and generating outputs.

Recognizing the importance of consistent tool call performance, the team behind Kimi K2 developed a solution: K2 Vendor Verifier (K2VV). This comprehensive evaluation system monitors and enhances the quality of K2 APIs across different providers, ensuring that users get the reliable performance they expect, regardless of which vendor they choose.

What Is K2 Vendor Verifier?

K2 Vendor Verifier, or K2VV for short, is a specialized tool created to address a growing concern in the AI community: significant variations in tool call performance among different open-source solutions and vendors offering Kimi K2 services.

When businesses and developers select an AI service provider, they often focus on obvious factors like cost and response speed. While these are important, they can overshadow a more fundamental issue: how accurately the AI model can execute tool calls. These seemingly small differences in accuracy can have a big impact on user experience and can even affect how well K2 performs in various benchmark tests.

The primary goal of K2VV is simple yet crucial: to monitor and improve the quality of all K2 APIs. By providing clear, objective evaluations of each vendor’s performance, K2VV helps ensure that everyone—from individual developers to large enterprises—can access a consistent, high-performing Kimi K2 model.

Latest Evaluation Results (October 23, 2025)

The most recent evaluation of K2 service providers was conducted on October 23, 2025. This comprehensive test looked at 18 different providers offering the kimi-k2-0905-preview model, measuring their performance across two key metrics: ToolCall-Trigger Similarity and ToolCall-Schema Accuracy.

Here’s a detailed breakdown of how each provider performed:

| Model Name | Provider | ToolCall-Trigger Similarity | count_finish_reason_tool_calls | count_successful_tool_call | ToolCall-Schema Accuracy |
| --- | --- | --- | --- | --- | --- |
| kimi-k2-0905-preview | MoonshotAI | — | 1274 | 1274 | 100.00% |
| kimi-k2-0905-preview | Moonshot AI Turbo | ≥80% | 1296 | 1296 | 100.00% |
| kimi-k2-0905-preview | DeepInfra | ≥80% | 1405 | 1405 | 100.00% |
| kimi-k2-0905-preview | Infinigence | ≥80% | 1249 | 1249 | 100.00% |
| kimi-k2-0905-preview | NovitaAI | ≥80% | 1263 | 1263 | 100.00% |
| kimi-k2-0905-preview | SiliconFlow | ≥80% | 1280 | 1276 | 99.69% |
| kimi-k2-0905-preview | Chutes | ≥80% | 1225 | 1187 | 96.90% |
| kimi-k2-0905-preview | vLLM | ≥80% | 1325 | 1007 | 76.00% |
| kimi-k2-0905-preview | SGLang | ≥80% | 1269 | 928 | 73.13% |
| kimi-k2-0905-preview | PPIO | ≥80% | 1294 | 945 | 73.03% |
| kimi-k2-0905-preview | AtlasCloud | ≥80% | 1272 | 925 | 72.72% |
| kimi-k2-0905-preview | Baseten | ≥80% | 1363 | 982 | 72.05% |
| kimi-k2-0905-preview | Together | ≥80% | 1260 | 900 | 71.43% |
| kimi-k2-0905-preview | Volc | ≥80% | 1344 | 962 | 71.58% |
| kimi-k2-0905-preview | Fireworks | 79.68% | 1443 | 1443 | 100.00% |
| kimi-k2-0905-preview | Groq | 68.21% | 1016 | 1016 | 100.00% |
| kimi-k2-0905-preview | Nebius | 48.59% | 636 | 549 | 86.32% |

Understanding the Results

To put these numbers in perspective, it’s helpful to look at what they mean in practical terms:

  • Top Performers: MoonshotAI, Moonshot AI Turbo, DeepInfra, Infinigence, and NovitaAI all achieved a perfect 100% schema accuracy. This means that every time these services triggered a tool call, they did so in the correct format. SiliconFlow was very close behind at 99.69%, meaning only 0.31% of their tool calls had formatting issues.

  • Mid-Range Performers: Chutes scored 96.90%, which is still very good; only about 3 out of every 100 tool calls had formatting problems. This level of performance would be acceptable for many applications, though not for the most demanding use cases.

  • Lower Performers: The numbers drop significantly after Chutes. vLLM, at 76.00%, had formatting issues with about 1 in 4 tool calls. SGLang and PPIO were slightly lower still, with just over a quarter of their tool calls failing schema validation.

  • Special Cases: Fireworks and Groq are interesting because they showed perfect schema accuracy (100%) but lower ToolCall-Trigger Similarity scores (79.68% and 68.21% respectively). This means that when they did trigger tool calls, those calls were correctly formatted, but they didn’t trigger tool calls as often as they should have (or triggered them when they shouldn’t have).

  • Lowest Performer: Nebius had the lowest ToolCall-Trigger Similarity score at 48.59% and a schema accuracy of 86.32%. This means not only did it struggle with when to trigger tool calls, but even when it did, about 1 in 7 calls had formatting issues.

The evaluation team also tested the official K2 API multiple times to understand the normal range of performance. They found that the lowest tool_call_f1 score was 82.71%, with an average of 84%. Given that AI models have some inherent randomness in their outputs, they determined that a tool_call_f1 score above 80% is generally acceptable for most practical purposes.

Evaluation Metrics Explained

To understand the results, it’s important to know how the evaluation was conducted. K2VV uses two main metrics to assess performance: ToolCall-Trigger Similarity and ToolCall-Schema Accuracy. Let’s break down what each of these means in simple terms.

ToolCall-Trigger Similarity

This metric, measured by a score called tool_call_f1, evaluates whether the model decides to trigger a tool call at the right times. To calculate this, evaluators compare each provider’s performance against the official Moonshot AI API, which serves as the gold standard.

Here are the key components:

  • True Positive (TP): Both the tested provider and the official API decided to trigger a tool call. This is a “correct” decision.
  • False Positive (FP): The tested provider triggered a tool call, but the official API did not. This is like a false alarm.
  • False Negative (FN): The official API triggered a tool call, but the tested provider did not. This is like missing an important action.
  • True Negative (TN): Neither the tested provider nor the official API triggered a tool call. This is another “correct” decision.

Using these components, two additional measures are calculated:

  • tool_call_precision: This is the proportion of triggered tool calls that should have been triggered. The formula is TP / (TP + FP). In simple terms, it’s how often the model is “right” when it decides to trigger a tool call.
  • tool_call_recall: This is the proportion of tool calls that should have been triggered and actually were. The formula is TP / (TP + FN). This measures how many of the necessary tool calls the model actually makes.

The tool_call_f1 score combines these two measures into a single number using the formula: 2 * tool_call_precision * tool_call_recall / (tool_call_precision + tool_call_recall). This gives a balanced view of how well the model is doing both at avoiding unnecessary tool calls and making necessary ones.
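
To make the formulas concrete, here is a minimal Python sketch that turns raw TP/FP/FN counts into the three scores. The function name and the example counts are illustrative, not part of the K2VV codebase.

# Illustrative only: compute the trigger-similarity scores from pre-counted
# TP/FP/FN values. Names and example numbers are assumptions, not K2VV code.
def trigger_similarity(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # tool_call_precision
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # tool_call_recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tool_call_precision": precision, "tool_call_recall": recall, "tool_call_f1": f1}

# Example with made-up counts: 900 agreed triggers, 60 false alarms, 140 misses
print(trigger_similarity(tp=900, fp=60, fn=140))  # tool_call_f1 is about 0.90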

ToolCall-Schema Accuracy

While ToolCall-Trigger Similarity is about when to trigger tool calls, ToolCall-Schema Accuracy is about how well those tool calls are formatted. This is measured by schema_accuracy.

The key components here are:

  • count_finish_reason_tool_calls: This is simply the total number of times the model decided to trigger a tool call (indicated by a “finish reason” of “tool_calls”).
  • count_successful_tool_call: This is the number of those tool calls that passed a strict schema validation check. In other words, these tool calls were formatted correctly according to the required specifications.

The schema_accuracy score is then calculated as count_successful_tool_call / count_finish_reason_tool_calls, or the percentage of tool calls that were formatted correctly.

This metric is crucial because even if a model triggers a tool call at the right time, if the format is incorrect, the tool won’t understand the request and can’t properly respond. It’s like dialing the right phone number but speaking a language the person on the other end doesn’t understand.
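
The exact validator K2VV uses is not spelled out here, but the general idea can be sketched with the jsonschema package: parse the arguments the model produced and check them against the tool's declared parameter schema. The tool definition below is hypothetical, and the real check may be stricter.

# Illustrative sketch of schema checking, not the actual K2VV validator.
# Requires the third-party "jsonschema" package; the tool schema is made up.
import json
from jsonschema import ValidationError, validate

search_parameters = {
    "type": "object",
    "properties": {"query": {"type": "string"}, "top_k": {"type": "integer"}},
    "required": ["query"],
    "additionalProperties": False,
}

def passes_schema(arguments_json: str) -> bool:
    """True if the model's argument string parses as JSON and matches the schema."""
    try:
        validate(instance=json.loads(arguments_json), schema=search_parameters)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

calls = ['{"query": "kimi k2"}', '{"q": "kimi k2"}']  # the second call is malformed
print(f"schema_accuracy = {sum(map(passes_schema, calls)) / len(calls):.2%}")  # 50.00%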

How the Testing Works

The evaluation process is designed to be thorough and fair. Here’s a step-by-step look at how it’s done:

  1. Test Set: The evaluators use a set of 4,000 different requests to test each provider. This diverse set of inputs ensures that the model is tested under many different scenarios.

  2. Comparison Standard: Each provider’s responses to these 4,000 requests are collected and compared directly against the responses from the official Moonshot AI API. This ensures a consistent benchmark for all providers.

  3. Periodic Evaluation: K2 vendors are evaluated on a regular basis. This means that the performance rankings can change over time as providers improve their services or new vendors enter the market.

  4. Inclusion Opportunity: If a vendor isn’t on the evaluation list but wants to be included, they can contact the K2VV team to request participation in future evaluations.

  5. Transparency: To ensure transparency, detailed sample data from the tests is provided in a file called samples.jsonl. This allows anyone to review the types of requests used in the evaluation.

This rigorous testing process helps ensure that the results are reliable and that users can trust the comparisons between different providers.
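
As a rough illustration of step 2 above, the comparison boils down to tallying, request by request, whether the provider and the official API agree on triggering a tool call. The sketch below assumes two JSONL result files with a finish_reason field per line; the file names and field layout are hypothetical, not the K2VV output format.

# Sketch of the per-request comparison only; file names and the "finish_reason"
# field layout are assumptions for illustration.
import json

def load_finish_reasons(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line).get("finish_reason") for line in f]

official = load_finish_reasons("official_results.jsonl")  # hypothetical file
provider = load_finish_reasons("provider_results.jsonl")  # hypothetical file

tp = fp = fn = tn = 0
for o, p in zip(official, provider):
    o_call, p_call = (o == "tool_calls"), (p == "tool_calls")
    if o_call and p_call:
        tp += 1          # both triggered a tool call
    elif p_call:
        fp += 1          # provider triggered, official did not
    elif o_call:
        fn += 1          # official triggered, provider did not
    else:
        tn += 1          # neither triggered
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

These counts are exactly what feeds the precision, recall, and f1 formulas described earlier.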

Recommendations for Vendors

Based on the evaluation results, the K2VV team has identified several key areas where vendors can improve their Kimi K2 performance. These recommendations are designed to help all providers reach the high standards set by the top performers.

1. Use the Correct Versions

One of the most common issues found in lower-performing providers is the use of incorrect software versions. Using outdated or incompatible versions can lead to all sorts of performance issues, including problems with tool calls.

The K2VV team recommends the following specific versions:

  • For vllm: version 0.11.0 (available at https://github.com/vllm-project/vllm/releases/tag/v0.11.0)
  • For sglang: version 0.5.3rc0 (available at https://github.com/sgl-project/sglang/releases/tag/v0.5.3rc0)
  • For the model itself: moonshotai/Kimi-K2-Instruct-0905 with the specific commit 94a4053eb8863059dd8afc00937f054e1365abbd (available on Hugging Face)

Using these recommended versions ensures that the software and model are working together as intended, which is the first step toward reliable tool call performance.

2. Rename Tool Call IDs

The Kimi-K2 model expects all tool call IDs in historical messages to follow a specific format: functions.func_name:idx. For example, a valid ID might look like functions.search:0.

However, in previous test cases, evaluators found some tool IDs that didn’t follow this format, such as search:0 (missing the “functions.” prefix). These malformed IDs can confuse the Kimi-K2 model, leading it to generate incorrect tool call IDs that then fail validation.

To address this issue, the K2VV team manually adds the functions. prefix to all previous tool calls in their testing environment. They recommend that both users and vendors adopt this fix in their own systems.

Interestingly, this isn’t an issue for the official Moonshot API because it automatically renames all tool call IDs to the correct functions.func_name:idx format before sending them to the K2 model. This small but important step helps ensure that the model receives consistent, correctly formatted information.
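
A minimal sketch of that fix is shown below, assuming OpenAI-style chat message dictionaries; adapt it to however your client stores tool calls and tool results.

# Minimal sketch of the recommended fix: prepend "functions." to tool call IDs
# in historical messages that lack it. Assumes OpenAI-style message dicts.
def normalize_tool_call_ids(messages: list) -> list:
    for msg in messages:
        for call in msg.get("tool_calls") or []:           # assistant tool calls
            if not call["id"].startswith("functions."):
                call["id"] = "functions." + call["id"]
        tcid = msg.get("tool_call_id")                      # tool result messages
        if tcid and not tcid.startswith("functions."):
            msg["tool_call_id"] = "functions." + tcid
    return messages

history = [
    {"role": "assistant", "tool_calls": [
        {"id": "search:0", "type": "function",
         "function": {"name": "search", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "search:0", "content": "..."},
]
normalize_tool_call_ids(history)  # both IDs become "functions.search:0"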

3. Add Guided Encoding

Large language models like Kimi K2 generate text one piece (or “token”) at a time, based on probability. They don’t have a built-in mechanism to ensure that their output strictly follows a specific JSON schema or format, even if they’re given clear instructions to do so.

This means that even with careful prompting, there’s a chance the model might omit required fields, add extra ones that aren’t needed, or nest information incorrectly in its tool call requests. These small errors can cause the entire tool call to fail.

To address this, the K2VV team recommends adding “guided encoding” to the implementation. Guided encoding is a technique that helps ensure the model’s output follows the correct schema by providing additional structure and constraints during the generation process.

Think of it like using a template when filling out a form—by providing a clear structure, you’re much less likely to make mistakes or miss important information.
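
As a rough example of what this can look like in practice, vLLM's OpenAI-compatible server accepts structured-output options such as a guided_json field passed through the request's extra body. The option name, local endpoint, and schema below are assumptions that depend on your vLLM version and deployment, so treat this as a sketch rather than a drop-in recipe.

# Sketch only: constrain generation to a JSON schema when serving K2 behind
# vLLM's OpenAI-compatible server. The "guided_json" option, local endpoint,
# and schema are assumptions that vary by version and deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

search_args_schema = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct-0905",
    messages=[{"role": "user", "content": "Find recent news about Kimi K2."}],
    extra_body={"guided_json": search_args_schema},  # keep output on-schema
)
print(response.choices[0].message.content)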

How to Verify Performance Yourself

One of the great things about K2VV is that it’s not just a closed evaluation system—anyone can run the same tests to verify performance for themselves. This transparency helps build trust in the results and allows users to check specific providers that are important to them.

To run the evaluation tool with the provided sample data, follow these steps:

  1. Use the following command in your terminal or command prompt:
python tool_calls_eval.py samples.jsonl \
    --model kimi-k2-0905-preview \
    --base-url https://api.moonshot.cn/v1 \
    --api-key YOUR_API_KEY \
    --concurrency 5 \
    --output results.jsonl \
    --summary summary.json
  2. Let’s break down what each part of this command does:
  • samples.jsonl: This is the path to the test set file, which contains the 4,000 test requests in JSONL format (JSON Lines, where each line is a separate JSON object).
  • --model: This specifies the model name, in this case “kimi-k2-0905-preview”.
  • --base-url: This is the API endpoint URL of the provider you’re testing. In the example, it’s set to the official Moonshot API.
  • --api-key: This is your personal API key for authentication. You can also set this as an environment variable called OPENAI_API_KEY instead of including it directly in the command.
  • --concurrency: This sets the maximum number of concurrent requests (default is 5). This controls how many tests run at the same time.
  • --output: This specifies the path where detailed results will be saved (default is results.jsonl).
  • --summary: This specifies where the aggregated summary of results will be saved (default is summary.json).
  • --timeout: This sets the per-request timeout in seconds (default is 600, which is 10 minutes).
  • --retries: This sets the number of retries if a request fails (default is 3).
  • --extra-body: This allows you to add extra JSON body content to each request payload (for example, '{"temperature":0.6}' to set a specific temperature parameter).
  • --incremental: This enables incremental mode, which only reruns failed requests from previous tests.

If you want to test other providers through OpenRouter, you can use this modified command:

python tool_calls_eval.py samples.jsonl \
    --model moonshotai/kimi-k2-0905 \
    --base-url https://openrouter.ai/api/v1 \
    --api-key YOUR_OPENROUTER_API_KEY \
    --concurrency 5 \
    --extra-body '{"provider": {"only": ["YOUR_DESIGNATED_PROVIDER"]}}'

In this case, replace “YOUR_DESIGNATED_PROVIDER” with the specific provider you want to test through OpenRouter.

Running these tests yourself allows you to verify the performance of K2 providers under your specific conditions and for your specific use cases.

Get Involved: Your Input Matters

The K2VV project is ongoing, and the team is constantly working to improve the evaluation process. They actively encourage input from users and vendors alike to make the system as useful and comprehensive as possible.

If there are specific metrics or test cases that you believe are important to include in future evaluations, you can share your thoughts by creating an issue at https://github.com/MoonshotAI/K2-Vendor-Verifier/issues/9.

Similarly, if there’s a vendor that you’d like to see included in future evaluations, you can suggest them by creating an issue at https://github.com/MoonshotAI/K2-Vendor-Verifier/issues/10.

This collaborative approach helps ensure that K2VV continues to meet the needs of the broader AI community and remains relevant as the field evolves.

If you have any other questions or concerns about K2 Vendor Verifier, you can reach out directly to the team at shijuanfeng@moonshot.cn.

Why This Matters for Users

At first glance, discussions about tool call accuracy and schema validation might seem like technical details that only matter to developers. But in reality, these factors have a direct impact on anyone who uses AI-powered tools and services.

Imagine you’re using an AI assistant to help manage your calendar. If the AI fails to trigger a tool call when it should (low recall), it might miss adding an important meeting. If it triggers tool calls when it shouldn’t (low precision), you might end up with phantom events cluttering your schedule. And if the tool calls themselves are formatted incorrectly (low schema accuracy), the calendar might not update at all, or might record the wrong information.

In a business setting, these issues can be even more problematic. A customer service AI that fails to correctly use a tool to look up account information could provide wrong answers to customers. A financial AI that mishandles tool calls when analyzing market data could lead to poor investment decisions.

By providing clear, objective evaluations of K2 providers, K2VV helps users make informed decisions about which services to trust with their important tasks. It also creates healthy competition among providers, encouraging them to improve their performance to meet the high standards set by the top performers.

As AI continues to play a larger role in both our personal and professional lives, tools like K2VV will become increasingly important. They help demystify the often complex world of AI performance, giving users the information they need to choose the right tools for their needs.

In the end, the goal of K2 Vendor Verifier is simple: to ensure that the promise of reliable, effective AI agents becomes a reality for everyone. By focusing on the critical but often overlooked details of tool call performance, K2VV is helping to raise the bar for AI services across the industry.
