Running on a Budget, Yet Smarter—How “Money-Wise” Search Agents Break the Performance Ceiling
Keywords: budget-aware tool use, test-time scaling, search agent, BATS, Budget Tracker, cost-performance Pareto frontier
Opening: Three Quick Questions
- Hand an agent 100 free search calls—will it actually use them?
- If it stops at 30 and calls it a day, will more budget move the accuracy needle?
- Can we teach the machine to check its wallet before every click?
A new joint study by Google, UCSB and NYU says YES.
“Simply letting the model see the remaining balance pushes accuracy up while keeping the tab unchanged—or even smaller.”
Below is a 3,000-word plain-English walk-through of the paper “Budget-Aware Tool-Use Enables Effective Agent Scaling”. No hype, no fluff—just facts, figures, and copy-paste prompts you can run tonight.
Table of Contents
- Background: why “more budget ≠ better”
- 101s: budget awareness, tool calls, test-time scaling
- The light-weight plug-in: Budget Tracker
- The full stack: BATS framework (plan + verify + pivot)
- Experiments: same 100 calls, up to +96 % accuracy gain
- The bill, please: what does one search actually cost?
- Prompts & pseudo-code you can steal
- FAQ: what people always ask
- Take-away: one graphic to remember
1. Background: Why “more budget ≠ better”
Classic scaling wisdom:
“Not enough parameters? Add compute. Not enough compute? Add inference-time steps.”
Translated to search: “Give the agent 50 extra calls and it will dig deeper, right?”
Reality check:
- Most agents feel “satisfied” after about 30 calls and answer early—the other 70 calls expire unused.
- Keep raising the limit and accuracy hits a plateau (see Fig. 3 in the paper).
- Root cause: no built-in sense of budget, so they can’t pace themselves.
2. 101s: Budget Awareness, Tool Calls, Test-Time Scaling
| Term | One-sentence definition |
|---|---|
| Tool-call budget | Hard cap on how many times each tool (search/browse) may be invoked. |
| Budget awareness | The agent sees “used / remaining” counters at every step. |
| Test-time scaling | Boosting performance by spending more compute during inference rather than training. |
| Pareto frontier | The line where you can’t get higher accuracy without paying more, or save money without losing accuracy. |
3. The Light-Weight Plug-In: Budget Tracker
What it does:
Like a car fuel gauge, it appends real-time balance to the context.
Code? Just a few lines:
<budget>
Query Budget Used: 7, Query Budget Remaining: 93
URL Budget Used: 2, URL Budget Remaining: 98
</budget>
When:
After every tool response, before the next thought.
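If you want to wire this up yourself, here is a minimal Python sketch of such a tracker (my own illustration, not the paper’s code; the class and method names are assumptions):

from dataclasses import dataclass

@dataclass
class BudgetTracker:
    # Minimal illustrative tracker; mirrors the <budget> block shown above.
    query_budget: int = 100
    url_budget: int = 100
    query_used: int = 0
    url_used: int = 0

    def record(self, tool: str) -> None:
        # Call once per tool invocation ("search" or "browse").
        if tool == "search":
            self.query_used += 1
        elif tool == "browse":
            self.url_used += 1

    def render(self) -> str:
        # Returns the block to append after each <tool_response>.
        return (
            "<budget>\n"
            f"Query Budget Used: {self.query_used}, Query Budget Remaining: {self.query_budget - self.query_used}\n"
            f"URL Budget Used: {self.url_used}, URL Budget Remaining: {self.url_budget - self.url_used}\n"
            "</budget>"
        )

tracker = BudgetTracker()
tracker.record("search")
print(tracker.render())   # paste this into the agent context before the next thought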
Results:
- BrowseComp dataset: +2–3 % absolute accuracy across models.
- Same accuracy target: 31 % lower total cost (search calls drop 14.2 → 8.5 and browse calls 1.36 → 1.09 on average).
4. The Full Stack: BATS Framework
Budget Tracker makes the agent see money; BATS (Budget-Aware Test-time Scaling) teaches it how to spend money.
Three modules:
4.1 Budget-Aware Planning
- Split clues into two kinds:
  - Exploration clues (broad; surface candidates)
  - Verification clues (narrow; confirm facts)
- Tree checklist with status markers: [ ] pending, [x] done, [!] failed, [~] partial. Never delete history—it avoids redoing work.
- Dynamic pruning: adjust breadth and depth based on the remaining budget (a minimal node layout and pruning rule are sketched below).
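Here is a minimal sketch of what such a checklist node and a crude pruning rule could look like in Python; the layout and the 30 % threshold are my assumptions, not the paper’s spec:

# Hypothetical checklist: node id -> text, status marker, child ids.
tree = {
    1: {"text": "root: identify the person", "status": "[ ]", "children": [2, 3]},
    2: {"text": "exploration: name + birthday search", "status": "[x]", "children": []},
    3: {"text": "verification: confirm employer", "status": "[ ]", "children": []},
}

def dynamic_prune(tree: dict, remaining_ratio: float, keep: int = 2) -> None:
    # When the remaining budget drops low, park all but the first few pending
    # branches ("[~]" = partial) instead of deleting them, so history is kept.
    if remaining_ratio < 0.3:
        pending = [k for k, n in tree.items() if n["status"] == "[ ]"]
        for node_id in pending[keep:]:
            tree[node_id]["status"] = "[~]"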
4.2 Budget-Aware Self-Verification
After each candidate answer:
- Clause-by-clause check of the candidate → each clause marked satisfied / contradicted / unverifiable.
- Decision:
  - SUCCESS – all clauses green.
  - CONTINUE – path looks right and budget is left, so dig deeper.
  - PIVOT – fatal flaw or balance too low; switch route and compress the old trace into a 3-line summary to keep the context short.
- Output is JSON—no hand-written rules (a minimal handler is sketched below).
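A minimal handler for that JSON could look like this (my own glue code; the field names follow the verifier template in Section 7.3):

import json

def handle_verdict(verdict_text: str, calls_remaining: int) -> str:
    # Illustrative dispatch on the verifier's decision field.
    verdict = json.loads(verdict_text)
    decision = verdict["decision"]          # "SUCCESS" | "CONTINUE" | "PIVOT"
    if decision == "SUCCESS":
        return "finalize_answer"
    if decision == "CONTINUE" and calls_remaining > 0:
        return "deepen_current_path"
    # PIVOT, or CONTINUE with nothing left: switch route and carry only the
    # compressed 3-line trajectory_summary forward to keep the context short.
    return "pivot: " + verdict.get("trajectory_summary", "")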
4.3 Unified Cost Metric
Turn everything into cents:
C_unified = token_cost + Σ_i (c_i × P_i)
- Token cost: computed from official API prices (input / output / cache).
- c_i: actual number of calls to tool i; P_i: per-call price, set to 0.1 ¢ in the paper (averaged post-hoc).
Now any two strategies can be compared on real money.
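A hedged Python sketch of the formula (the token prices are the Gemini-2.5-Pro list prices from the table in Section 6; the function shape is mine):

def unified_cost_cents(input_tok: int, cached_tok: int, output_tok: int,
                       calls: dict, price_per_call_cents: dict) -> float:
    # C_unified = token_cost + sum_i(c_i * P_i), everything in cents.
    # 0.15 / 0.075 / 0.6 ¢ per 1k tokens: fresh input, cached input, output.
    token_cost = ((input_tok - cached_tok) / 1000 * 0.15
                  + cached_tok / 1000 * 0.075
                  + output_tok / 1000 * 0.6)
    tool_cost = sum(n * price_per_call_cents.get(tool, 0.1) for tool, n in calls.items())
    return token_cost + tool_cost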
5. Experiments: Same 100 Calls, Up to +96 % Accuracy
Table 1: BrowseComp (budget 100 search + 100 browse)
| Model / Method | Accuracy | Extra Training? | Unified Cost |
|---|---|---|---|
| Gemini-2.5-Pro + ReAct | 12.6 % | No | 9.9 ¢ |
| Gemini-2.5-Pro + BATS | 24.6 % | No | 11.0 ¢ |
| OpenAI Deep Research (task-tuned) | 51.5 % | Yes | — |
⇒ Zero extra training, accuracy almost doubled; only +1.1 ¢.
Figure 2: Early-Stopping curve (budget 5 → 200)
- ReAct plateaus around 30 %.
- BATS keeps climbing, reaching 37.4 % at budget 200—+22 % over ReAct.
6. The Bill, Please: What Does One Search Cost?
| Item | Study Price | Note |
|---|---|---|
| Google Custom Search API | 0.1 ¢/call | 100 calls ≈ 10 ¢ |
| Browse (Jina.ai / Crawl4AI) | 0.1 ¢/call | Pages truncated at 150k characters |
| Gemini-2.5-Pro input | 0.15 ¢/1k tok | Cache hit: 0.075 ¢/1k tok |
| Gemini-2.5-Pro output | 0.6 ¢/1k tok | — |
Typical 20-round trace:
- 14 search + 2 browse calls ≈ 1.6 ¢
- 32k input + 7k output tokens ≈ 6.5 ¢
- Total ≈ 8 ¢ for a multi-hop answer (checked in the snippet below).
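Plugging the trace into the unified_cost_cents sketch from Section 4.3 gives roughly the same bill, assuming most input tokens are cache hits (the exact cache split is my assumption):

# Assumption: ~28k of the 32k input tokens are cache hits, which is what
# brings the token bill down near the ≈6.5 ¢ quoted above.
cost = unified_cost_cents(input_tok=32_000, cached_tok=28_000, output_tok=7_000,
                          calls={"search": 14, "browse": 2},
                          price_per_call_cents={"search": 0.1, "browse": 0.1})
print(round(cost, 1))   # ≈ 8.5 ¢, in the same ballpark as the ≈8 ¢ above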
7. Prompts & Pseudo-Code You Can Steal
7.1 ReAct + Budget Tracker Snippet
You are an AI with search and browse tools. After each <tool_response>
a <budget> block shows remaining units. Adapt strategy accordingly:
HIGH (≥70 %) → 3–5 diverse queries + 2–3 URLs
MEDIUM (30–70 %) → 2–3 precise queries + 1–2 URLs
LOW (10–30 %) → 1 focused query + 1 most-promising URL
CRITICAL (<10 %) → avoid depleted tool; answer only if essential
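If you prefer to compute the tier in code rather than let the model eyeball the counters, a tiny helper might be (thresholds copied from the prompt; the function itself is my sketch):

def budget_tier(remaining: int, total: int) -> str:
    # Maps the remaining-budget ratio to the tiers used in the prompt above.
    ratio = remaining / total
    if ratio >= 0.7:
        return "HIGH"       # 3–5 diverse queries + 2–3 URLs
    if ratio >= 0.3:
        return "MEDIUM"     # 2–3 precise queries + 1–2 URLs
    if ratio >= 0.1:
        return "LOW"        # 1 focused query + 1 most-promising URL
    return "CRITICAL"       # avoid the depleted tool; answer only if essential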
7.2 BATS Planning Pseudo-Code
# Pseudo-code: decompose, all_done, most_promising, execute and
# prune_low_benefit are left abstract.
def plan(question, budget):
    clues = decompose(question)          # split into exploration vs verification clues
    tree = {1: {"text": "root", "status": "[ ]", "cost": (0, 0), "clues": clues}}
    while budget.left and not all_done(tree):
        node = most_promising(tree)      # pick the open branch with the best expected payoff
        if budget.cover(node.est_cost):  # can we afford this step?
            result = execute(node)       # run the search/browse calls for this node
            budget.update(result.cost)
            tree.update(result.nodes)    # record outcomes; never delete history
        else:
            prune_low_benefit(tree)      # balance too low: narrow the tree instead
    return tree
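The pseudo-code assumes a small budget object exposing left / cover / update; a minimal stand-in (my assumption, not the paper’s interface) could be:

from dataclasses import dataclass

@dataclass
class Budget:
    # Hypothetical backing object for the pseudo-code above.
    search_left: int = 100
    browse_left: int = 100

    @property
    def left(self) -> bool:
        return self.search_left > 0 or self.browse_left > 0

    def cover(self, est_cost) -> bool:
        # est_cost = (expected search calls, expected browse calls) for one node.
        return est_cost[0] <= self.search_left and est_cost[1] <= self.browse_left

    def update(self, cost) -> None:
        self.search_left -= cost[0]
        self.browse_left -= cost[1]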
7.3 Verifier JSON Template
{
  "verification": "birth-year OK, profession OK, location contradicts",
  "decision": "PIVOT",
  "justification": "location conflicts with Wikidata; 6 calls left",
  "trajectory_summary": "searched name+bday → found candidate A → resume mismatch",
  "details": {
    "failure_cause": "locked on wrong person via verification clue too early",
    "useful_info": "candidate A university valid, keep",
    "next_suggestion": "back-track to exploration; search uni + profession"
  }
}
8. FAQ
Q1: Do I need to fine-tune?
No. Budget Tracker and BATS are prompt-level only—zero gradients, zero extra weights.
Q2: Token budget vs. tool-call budget—which matters?
The paper focuses on the tool-call budget because:
- It caps external knowledge acquisition directly.
- Token usage is unpredictable before the web pages are seen.
- Real search/browse invoices charge per call, not per token.
Q3: Context explodes with web pages—how to cope?
BATS compresses old trajectories every 10 steps; cache tokens drop 57 % with no accuracy loss.
Q4: Can I port this to my vertical?
Swap search/browse for your SQL / REST / Python tools, update the unit prices P_i, and keep the rest (see the example below).
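For example, a per-call price table for a database-heavy vertical might look like this (the tools and prices are made-up placeholders):

PRICE_PER_CALL_CENTS = {
    "sql_query": 0.0,     # in-house database, effectively free
    "rest_api": 0.05,     # metered third-party endpoint
    "python_exec": 0.02,  # sandboxed code execution
}
# Feed this dict as price_per_call_cents into the unified_cost_cents sketch in Section 4.3.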
9. Take-Away: One Graphic to Remember
┌─────────────────────────────────────────┐
│ Budget awareness = fuel gauge           │
│ Planning         = roadmap              │
│ Self-verify      = early brake or pivot │
│ Unified cost     = real dollars         │
└─────────────────────────────────────────┘
Remember the four blocks and your agent will squeeze every cent—and still outperform the big-spender baseline.
Citation (BibTeX)
@misc{bats2025,
  title  = {Budget-Aware Tool-Use Enables Effective Agent Scaling},
  author = {Zifeng Wang and others},
  url    = {https://arxiv.org/html/2511.17006v1},
  year   = {2025}
}

