Running on a Budget, Yet Smarter—How “Money-Wise” Search Agents Break the Performance Ceiling

Keywords: budget-aware tool use, test-time scaling, search agent, BATS, Budget Tracker, cost-performance Pareto frontier


Opening: Three Quick Questions

  1. Hand an agent 100 free search calls—will it actually use them?
  2. If it stops at 30 and calls it a day, will more budget move the accuracy needle?
  3. Can we teach the machine to check its wallet before every click?

A new joint study by Google, UCSB and NYU says YES.
“Simply letting the model see the remaining balance pushes accuracy up while keeping the tab unchanged—or even smaller.”

Below is a 3,000-word plain-English walk-through of the paper “Budget-Aware Tool-Use Enables Effective Agent Scaling”. No hype, no fluff: just facts, figures and copy-paste prompts you can run tonight.


Table of Contents

  1. Background: Why “more budget ≠ better”
  2. 101s: budget awareness, tool calls, test-time scaling
  3. The light-weight plug-in: Budget Tracker
  4. The full stack: BATS framework (plan + verify + pivot)
  5. Experiments: same 100 calls, up to +96 % accuracy gain
  6. The bill please: what does one search actually cost?
  7. Prompts & pseudo-code you can steal
  8. FAQ: what people always ask
  9. Take-away: one graphic to remember

1. Background: Why “more budget ≠ better”

Classic scaling wisdom:
“Not enough parameters? Add compute. Not enough compute? Add inference-time steps.”
Translated to search: “Give the agent 50 extra calls and it will dig deeper, right?”

Reality check:

  • Most agents feel “satisfied” after about 30 calls and answer early; the remaining 70 calls expire unused.
  • Keep increasing the limit and accuracy hits a plateau (see Fig. 3 in the paper).
  • Root cause: no built-in sense of budget; they can’t pace themselves.

2. 101s: Budget Awareness, Tool Calls, Test-Time Scaling

One-sentence definitions:

  • Tool-call budget: hard cap on how many times each tool (search / browse) may be invoked.
  • Budget awareness: the agent sees “used / remaining” counters at every step.
  • Test-time scaling: boosting performance by spending more compute during inference rather than training.
  • Pareto frontier: the line where you can’t get higher accuracy without paying more, or save money without losing accuracy.

3. The Light-Weight Plug-In: Budget Tracker

What it does:
Like a car fuel gauge, it appends real-time balance to the context.

Code? Just a four-line block:

<budget>
Query Budget Used: 7, Query Budget Remaining: 93
URL Budget Used: 2, URL Budget Remaining: 98
</budget>

When:
After every tool response, before the next thought.
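
A minimal sketch of how such a tracker could be wired into the agent loop; the class and method names are illustrative, not from the paper:

class BudgetTracker:
    """Tracks per-tool usage and renders the <budget> block shown above."""

    def __init__(self, limits):                     # e.g. {"Query": 100, "URL": 100}
        self.limits = dict(limits)
        self.used = {tool: 0 for tool in limits}

    def charge(self, tool, n=1):                    # call once per tool invocation
        self.used[tool] += n

    def render(self):
        lines = [
            f"{tool} Budget Used: {self.used[tool]}, "
            f"{tool} Budget Remaining: {self.limits[tool] - self.used[tool]}"
            for tool in self.limits
        ]
        return "<budget>\n" + "\n".join(lines) + "\n</budget>"

# after every tool response: context.append(tracker.render())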

Results:

  • BrowseComp data set: +2–3 % absolute accuracy across models.
  • Same accuracy target: 31 % lower total cost (average search calls 14.2 → 8.5, browse calls 1.36 → 1.09).

4. The Full Stack: BATS Framework

Budget Tracker makes the agent see money; BATS (Budget-Aware Test-time Scaling) teaches it how to spend money.

Three modules:

4.1 Budget-Aware Planning

  1. Split clues:

    • Exploration clues (broad, surf candidates)
    • Verification clues (narrow, confirm facts)
  2. Tree checklist (see the sketch after this list):
    [ ] pending   [x] done   [!] failed   [~] partial
    History is never deleted, so no work is redone.
  3. Dynamic pruning:
    Adjust breadth/depth based on remaining budget.
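
A minimal sketch of what the checklist tree might look like as plain data (the field names and clue texts are illustrative, not from the paper):

# statuses: [ ] pending, [x] done, [!] failed, [~] partial
checklist = {
    "text": "root question",
    "status": "[~]",
    "children": [
        {"text": "exploration: broad search for candidate entities", "status": "[x]", "children": []},
        {"text": "verification: confirm birth year",                 "status": "[ ]", "children": []},
        {"text": "verification: confirm location",                   "status": "[!]", "children": []},
    ],
}
# under a tight budget, low-value open branches are marked [!] (pruned)
# rather than deleted, so the record of what was tried stays visible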

4.2 Budget-Aware Self-Verification

After each candidate answer:

  1. Clause-by-clause check → satisfied / contradicted / unverifiable.

  2. Decision:

    • SUCCESS – all green.
    • CONTINUE – path OK, budget left, dig deeper.
    • PIVOT – fatal flaw or balance too low; switch route and compress old trace to 3-line summary to keep context short.
  3. Output the verdict as JSON (no hand-written rules); a sketch of how the harness can act on it follows below.
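
The verdict itself comes from the model, but the harness still has to act on it. A minimal routing sketch, assuming the verifier returns JSON shaped like the template in Section 7.3 and that a budget object with a .left counter is available (both names are illustrative):

import json

def route(verifier_output, budget, trace):
    """Act on the verifier's SUCCESS / CONTINUE / PIVOT decision."""
    verdict = json.loads(verifier_output)
    decision = verdict["decision"]

    if decision == "SUCCESS":
        return "finish"                              # all clauses satisfied
    if decision == "CONTINUE" and budget.left > 0:
        return "continue"                            # same path, keep digging
    # PIVOT, or a CONTINUE we can no longer afford: switch route and
    # keep only the short summary so the context stays small
    trace[:] = [verdict["trajectory_summary"]]
    return "pivot"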

4.3 Unified Cost Metric

Turn everything into cents:

C_unified = token_cost + Σ_i (c_i × P_i)

  • token_cost: tokens priced at the official API rates (input / output / cached).
  • c_i: actual number of calls to tool i; P_i: per-call price, set to 0.1 ¢ in the paper (averaged post-hoc).

Now any two strategies can be compared on real money.
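
As a minimal sketch (function and variable names are my own), the metric in code, using the 0.1 ¢ per-call price and the Gemini-2.5-Pro token prices listed in Section 6:

# all prices in cents
TOKEN_PRICE = {"input": 0.15 / 1000, "cached": 0.075 / 1000, "output": 0.6 / 1000}
TOOL_PRICE  = {"search": 0.1, "browse": 0.1}        # P_i, 0.1 ¢ per call in the paper

def unified_cost(tokens, calls):
    """C_unified = token_cost + sum(c_i * P_i), everything in cents."""
    token_cost = sum(TOKEN_PRICE[kind] * n for kind, n in tokens.items())
    tool_cost  = sum(TOOL_PRICE[tool] * n for tool, n in calls.items())
    return token_cost + tool_cost

# e.g. unified_cost({"input": 32_000, "output": 7_000}, {"search": 14, "browse": 2})
# = 4.8 + 4.2 + 1.6 = 10.6 ¢ before any cache discount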


5. Experiments: Same 100 Calls, Up to +96 % Accuracy

Table 1: BrowseComp (budget 100 search + 100 browse)

Model / Method | Accuracy | Extra Training? | Unified Cost
Gemini-2.5-Pro + ReAct | 12.6 % | No | 9.9 ¢
Gemini-2.5-Pro + BATS | 24.6 % | No | 11.0 ¢
OpenAI Deep Research (task-tuned) | 51.5 % | Yes | -

Zero extra training, accuracy almost doubled; only +1.1 ¢.

Figure 2: Early-Stopping curve (budget 5 → 200)

  • ReAct plateaus at 30 %.
  • BATS keeps climbing, reaching 37.4 % at a budget of 200, +22 % over ReAct.

6. The Bill Please: What Does One Search Cost?

Item | Price in the study | Note
Google Custom Search API | 0.1 ¢ / call | 100 calls ≈ 10 ¢
Browse (Jina.ai / Crawl4AI) | 0.1 ¢ / call | pages truncated at 150 k chars
Gemini-2.5-Pro input | 0.15 ¢ / 1k tokens | cache hit: 0.075 ¢ / 1k tokens
Gemini-2.5-Pro output | 0.6 ¢ / 1k tokens | -

Typical 20-round trace (a quick arithmetic check follows the list):

  • 14 search + 2 browse ≈ 1.6 ¢
  • 32k input + 7k output tok ≈ 6.5 ¢
  • Total ≈ 8 ¢ for a multi-hop answer.
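
A rough check of that arithmetic, assuming most input tokens are billed at the cache-hit rate (that assumption is mine; the trace above only reports the totals):

tool_cost  = (14 + 2) * 0.1            # 16 calls at 0.1 ¢ each       = 1.6 ¢
token_cost = 32 * 0.075 + 7 * 0.6      # cached input + output tokens = 2.4 + 4.2 = 6.6 ¢
total      = tool_cost + token_cost    # ≈ 8.2 ¢, i.e. the "≈ 8 ¢" above
print(f"{total:.1f} ¢")                # 8.2 ¢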

7. Prompts & Pseudo-Code You Can Steal

7.1 ReAct + Budget Tracker Snippet

You are an AI with search and browse tools. After each <tool_response>
a <budget> block shows remaining units. Adapt strategy accordingly:

HIGH (≥70 %) → 3–5 diverse queries + 2–3 URLs  
MEDIUM (30–70 %) → 2–3 precise queries + 1–2 URLs  
LOW (10–30 %) → 1 focused query + 1 most-promising URL  
CRITICAL (<10 %) → avoid the depleted tool; make further calls only if essential
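
If you prefer the harness, rather than the model, to compute the tier label, a tiny helper like this could do it (the thresholds mirror the prompt above; the function itself is not from the paper):

def budget_tier(used, limit):
    """Map remaining budget to the tiers used in the prompt above."""
    remaining = 1 - used / limit
    if remaining >= 0.70:
        return "HIGH"
    if remaining >= 0.30:
        return "MEDIUM"
    if remaining >= 0.10:
        return "LOW"
    return "CRITICAL"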

7.2 BATS Planning Pseudo-Code

def plan(question, budget):
    # decompose / all_done / most_promising / execute / prune_low_benefit
    # stand in for LLM-driven steps; this loop is the harness skeleton
    clues = decompose(question)            # split into exploration vs verification clues
    tree  = {1: {"text": "root", "status": "[ ]", "cost": (0, 0)}}
    while budget.left and not all_done(tree):
        node = most_promising(tree)        # pick the open branch with the best expected payoff
        if budget.cover(node.est_cost):    # can the remaining budget cover this step?
            result = execute(node)         # run the search / browse calls for this node
            budget.update(result.cost)
            tree.update(result.nodes)      # add new children; old nodes are never deleted
        else:
            prune_low_benefit(tree)        # mark low-value branches [!] instead of expanding them
    return tree

7.3 Verifier JSON Template

{
  "verification": "birth-year OK, profession OK, location contradicts",
  "decision": "PIVOT",
  "justification": "location conflicts with Wikidata; 6 calls left",
  "trajectory_summary": "searched name+bday → found candidate A → resume mismatch",
  "details": {
    "failure_cause": "locked on wrong person via verification clue too early",
    "useful_info": "candidate A university valid, keep",
    "next_suggestion": "back-track to exploration; search uni + profession"
  }
}

8. FAQ

Q1: Do I need to fine-tune?
No. Budget Tracker and BATS are prompt-level only—zero gradients, zero extra weights.

Q2: Token budget vs tool-call budget: which matters more?
The paper focuses on the tool-call budget because:

  • It caps external knowledge directly.
  • Token usage is unpredictable before seeing web pages.
  • Real invoices charge per call, not per token.

Q3: Context explodes with web pages—how to cope?
BATS compresses old trajectories every 10 steps; cache tokens drop 57 % with no accuracy loss.

Q4: Can I port this to my vertical?
Swap search / browse for your SQL / REST / Python tools, update the per-call prices P_i, and keep the rest (see the sketch below).
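
For example, reusing the unified_cost sketch from Section 4.3, the port can be as small as swapping the price table (tool names and prices below are made up):

TOOL_PRICE = {"sql": 0.0, "rest_api": 0.05, "python_sandbox": 0.0}   # cents per call, hypothetical
LIMITS     = {"sql": 50, "rest_api": 20, "python_sandbox": 30}       # per-tool call caps, hypothetical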


9. Take-Away: One Graphic to Remember

┌─────────────────────────────────────────┐
│ Budget awareness = fuel gauge           │
│ Planning         = roadmap              │
│ Self-verify      = early brake or pivot │
│ Unified cost     = real dollars         │
└─────────────────────────────────────────┘

Remember the four blocks and your agent will squeeze every cent—and still outperform the big-spender baseline.


Citation (BibTeX)
@misc{bats2025,
  title  = {Budget-Aware Tool-Use Enables Effective Agent Scaling},
  author = {Wang, Zifeng and others},
  url    = {https://arxiv.org/html/2511.17006v1},
  year   = {2025}
}