Claude Can Now Use Tools Like a Developer—Here’s What Changed
Original white-paper: Introducing advanced tool use on the Claude Developer Platform
Author: Anthropic Engineering Team
Re-worked for global audiences by: EEAT Technical Communication Group
Reading level: college (associate degree and up)
Estimated reading time: 18 minutes
1. The Short Version
Claude gained three new abilities:
- Tool Search – loads only the tools it needs, cutting context size by roughly 85 %.
- Programmatic Tool Calling – writes and runs Python to call many tools in one shot; only the final answer re-enters the chat.
- Tool-Use Examples – real JSON samples baked into the tool definition so Claude copies correct formats instead of guessing.
Together they turn Claude from a “manual one-tool-at-a-time user” into an “orchestrator that can skim, script, and copy examples.”
2. Why the Old Way Breaks
In the traditional setup, every tool definition is loaded into context up front, every intermediate result is pushed back through the model, and schemas alone leave Claude guessing at parameter formats. The result: prompts that blow past 10k tokens before the conversation even starts, wrong-tool picks, and malformed calls. The three features below attack each of those failure modes in turn.
3. Feature Deep Dive
3.1 Tool Search Tool – “Just-in-time” tool discovery
How it works (plain English)
- You hand Anthropic all your tool definitions but tag the rarely used ones with defer_loading: true.
- Claude first sees only a tiny search tool plus 3–5 high-frequency tools.
- When it realises “I need GitHub,” it searches the keyword; the platform injects only the matching tools (≈3k tokens).
- Everything else stays on the shelf.
Head-to-head numbers from an internal eval on a 58-tool set: an 85 % token haircut and a 25-point accuracy jump on the same test.
Quick implementation
{
  "tools": [
    { "type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex" },
    {
      "name": "github.createPullRequest",
      "description": "Open a new pull request",
      "input_schema": { ... },
      "defer_loading": true
    }
  ]
}
For MCP servers you can defer the whole bundle and whitelist the one function you constantly need:
{
  "type": "mcp_toolset",
  "mcp_server_name": "google-drive",
  "default_config": { "defer_loading": true },
  "configs": {
    "search_files": { "defer_loading": false }
  }
}
When to enable
✅ Tool defs already >10k tokens
✅ You notice wrong-tool errors
✅ 10+ tools live in the project
Skip it if you run <10 tools or every tool fires every turn.
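If you already have dozens of definitions, you do not need to hand-edit each one. A small helper such as the hypothetical defer_all_but below (a sketch assuming your tool definitions are plain Python dicts) keeps your 3–5 high-frequency tools always loaded and defers the tail:

from typing import Iterable

def defer_all_but(tools: list[dict], keep_loaded: Iterable[str]) -> list[dict]:
    """Mark every tool with defer_loading=True except the high-frequency ones."""
    keep = set(keep_loaded)
    return [
        tool if tool.get("name") in keep else {**tool, "defer_loading": True}
        for tool in tools
    ]

# Illustrative definitions only; real ones carry full input_schema objects.
all_tools = [
    {"name": "search_files", "description": "Search Drive files by name or content"},
    {"name": "get_expenses", "description": "List expense line items for a user and quarter"},
    {"name": "github.createPullRequest", "description": "Open a new pull request"},
]

tools = defer_all_but(all_tools, keep_loaded={"search_files", "get_expenses"})
# github.createPullRequest now carries defer_loading=True; the other two stay resident.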
3.2 Programmatic Tool Calling – Claude writes a Python script
Problem solved
- Intermediate results no longer flood context.
- Parallel calls happen in one code block; no repeated model inference.
- Claude expresses logic (loops, filters, sums) in code instead of prose—fewer off-by-one mistakes.
Concrete example: “Who busted their Q3 travel budget?”
Available tools
- get_team_members(department)
- get_expenses(user_id, quarter)
- get_budget_by_level(level)
Old flow
- Fetch 20 members
- Call expenses 20 times → 2,000 line items back into context
- Model eyeballs every row, adds them up, compares to budget
New flow (Claude authors Python)
import asyncio
import json

team = await get_team_members("engineering")

# One budget lookup per distinct level, fetched in parallel.
levels = list(set(m["level"] for m in team))
budgets = dict(zip(levels, await asyncio.gather(*[get_budget_by_level(l) for l in levels])))

# All 20 expense calls fire concurrently; raw line items stay in the sandbox.
expenses = await asyncio.gather(*[get_expenses(m["id"], "Q3") for m in team])

exceeded = []
for member, exp in zip(team, expenses):
    budget = budgets[member["level"]]["travel_limit"]
    spent = sum(e["amount"] for e in exp)
    if spent > budget:
        exceeded.append({"name": member["name"], "spent": spent, "limit": budget})

print(json.dumps(exceeded))  # only this output re-enters the conversation
Only the final exceeded list (≈1k tokens) re-enters the chat; the 2,000 expense lines never touch context.
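Note that get_team_members and friends are not functions you wrote: on the platform, the API auto-wraps your marked tools into async Python functions (see the FAQ). If you want to dry-run a script like this outside the sandbox, hypothetical stubs are enough; the ones below return made-up data purely for local testing:

import asyncio

# Hypothetical local stand-ins; on the platform these wrappers are generated
# from your tool definitions, so you never write them yourself.
async def get_team_members(department: str) -> list[dict]:
    return [{"id": "USR-1", "name": "Jane", "level": "senior"},
            {"id": "USR-2", "name": "Ravi", "level": "staff"}]

async def get_budget_by_level(level: str) -> dict:
    return {"travel_limit": 5000 if level == "staff" else 3000}

async def get_expenses(user_id: str, quarter: str) -> list[dict]:
    return [{"amount": 1200.0}, {"amount": 2400.0}]

async def main() -> None:
    # Paste the script above into this coroutine to run it locally.
    team = await get_team_members("engineering")
    print(team)

asyncio.run(main())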
Measured gains
Four-step wiring
1. Tag the tools you are happy to have called from code:
{ "name": "get_expenses", ..., "allowed_callers": ["code_execution_20250825"] }
2. Claude sends a code-execution request containing the script.
3. You receive tool calls with a caller field; send results back to the API—the sandbox digests them, not the model (see the sketch after this list).
4. When the script ends, the sandbox returns stdout; only that final output appears in context.
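A minimal sketch of that client-side loop, assuming the standard Messages tool-use pattern (the caller field comes from this beta; run_my_tool is a hypothetical dispatcher into your own backend, and the exact block shapes may differ):

import anthropic

client = anthropic.Anthropic()

tools = []  # the search tool, the code_execution tool, and your business tools

def run_my_tool(name: str, args: dict) -> str:
    """Hypothetical dispatcher; replace with calls into your real systems."""
    raise NotImplementedError(name)

messages = [{"role": "user", "content": "Which team members exceeded their Q3 travel budget?"}]

while True:
    response = client.beta.messages.create(
        betas=["advanced-tool-use-2025-11-20"],
        model="claude-sonnet-4-5-20250929",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # done: only the script's stdout ever entered the conversation

    results = []
    for block in response.content:
        if block.type == "tool_use":
            # A caller field marks requests that originate from the sandboxed
            # script rather than from the model directly.
            output = run_my_tool(block.name, block.input)
            results.append({"type": "tool_result", "tool_use_id": block.id, "content": output})
    messages.append({"role": "user", "content": results})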
When to use
✅ Large data sets you immediately aggregate
✅ Three or more chained tool calls
✅ Parallel checks (50 endpoints, 100 files)
✅ Intermediate data must not bias later reasoning
Skip it for single quick lookups or when you want Claude to see every raw line.
3.3 Tool-Use Examples – show, don’t tell
What a schema alone misses
due_date: string → 2024-11-06? Nov 6, 2024? ISO-8601?
reporter.id → UUID, “USR-12345”, or 12345?
escalation.level: 2 → what SLA hours map to that?
Fix: embed sample calls inside the definition
"input_examples": [
{
"title": "Login page returns 500 error",
"priority": "critical",
"reporter": { "id": "USR-12345", "contact": { "email": "jane@acme.com" } },
"due_date": "2024-11-06",
"escalation": { "level": 2, "notify_manager": true, "sla_hours": 4 }
},
{
"title": "Add dark mode support",
"labels": ["feature-request", "ui"],
"reporter": { "id": "USR-67890" }
}
]
Claude now copies the patterns: critical bug = full contact + escalation; feature request = light payload; date format = YYYY-MM-DD.
Accuracy on complex parameter handling jumped from 72 % to 90 % in Anthropic’s eval.
When to add examples
✅ Deeply nested optional structs
✅ Domain conventions (kebab-case labels, USR-xxx IDs)
✅ Several similar tools that differ only by subtle fields
Skip obvious single-field tools or standard formats Claude already knows (URL, email).
4. Putting It Together – a practical checklist
- Start with your biggest pain:
  - Context choking → Tool Search first
  - Huge messy returns → Programmatic Calling first
  - Constant 400 errors → Examples first
- Layer the rest later; the three features are complementary, not mutually exclusive.
- Write searchable descriptions.
  Good: "Search customer orders by date, status or total. Returns shipping & payment info."
  Bad: "Query DB"
- Keep the top 3–5 tools always loaded; defer the tail.
- Document return shapes clearly so Claude can generate correct parsing code (see the sketch after this list).
- Use realistic values in examples (real cities, IDs, prices). One to five per tool is enough.
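To make the last few points concrete, here is one illustrative definition written as a Python dict, the form it takes when passed through the SDK (the order-search tool, its fields and values are hypothetical): the description is keyword-rich, spells out the return shape, and the examples use realistic values.

search_orders_tool = {
    # Hypothetical tool; names, fields and values are illustrative only.
    "name": "search_customer_orders",
    "description": (
        "Search customer orders by date range, status or order total. "
        "Returns a list of {order_id: str, status: str, total: float, "
        "placed_at: 'YYYY-MM-DD', shipping: {...}, payment: {...}} objects."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "status": {"type": "string"},
            "placed_after": {"type": "string", "description": "YYYY-MM-DD"},
            "placed_before": {"type": "string", "description": "YYYY-MM-DD"},
            "min_total": {"type": "number"},
        },
    },
    "input_examples": [
        {"status": "shipped", "placed_after": "2024-07-01", "placed_before": "2024-09-30"},
        {"status": "pending", "min_total": 250.0},
    ],
}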
5. FAQ – what engineers ask first
Q1. Does the search step slow things down?
A tiny bit. If your tool set consumes >10k tokens or you see wrong-tool mistakes, the saved context plus accuracy gain outweighs the extra hop.
Q2. Could deferred tools be missed?
As long as the name or description contains the keyword, regex/BM25 will surface it. You can also plug your own embedding search.
Q3. Which languages can the sandbox run?
Currently Python 3 only. The API auto-wraps your marked tools into async Python functions.
Q4. Can I pip install extra packages?
The hosted sandbox does not support custom libraries. Roll your own code-execution environment and expose it through MCP if you need them.
Q5. Will examples bloat my prompt?
Each example is 0.5–1k tokens. In Anthropic tests the uplift in accuracy paid for the cost; add them only where usage is ambiguous.
6. Quick-start one-liner
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    betas=["advanced-tool-use-2025-11-20"],
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    tools=[
        {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
        {"type": "code_execution_20250825", "name": "code_execution"},
        # …your business tools with defer_loading / allowed_callers / input_examples
    ],
    messages=[{"role": "user", "content": "Which team members exceeded their Q3 travel budget?"}],
)
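And one illustrative business tool showing the three features combined on a single definition (the field names defer_loading, allowed_callers and input_examples come from the sections above; the schema and values are hypothetical):

get_expenses_tool = {
    "name": "get_expenses",
    "description": "List expense line items for a user and quarter. "
                   "Returns [{amount: float, category: str, date: 'YYYY-MM-DD'}].",
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {"type": "string"},
            "quarter": {"type": "string"},
        },
        "required": ["user_id", "quarter"],
    },
    "defer_loading": True,                           # loaded only when Tool Search surfaces it
    "allowed_callers": ["code_execution_20250825"],  # callable from Claude-authored Python
    "input_examples": [
        {"user_id": "USR-12345", "quarter": "Q3"},   # shows the ID and quarter conventions
    ],
}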
7. Key Takeaways
- Claude no longer needs the whole toolbox in its lap.
- It can write Python to orchestrate dozens of calls, hide the mess, and show only the answer.
- Real worked examples teach Claude your naming and formatting conventions better than any schema alone.
Use the feature that fixes your current bottleneck, then layer the others. You will finish with shorter prompts, faster runs, and far fewer “oops, wrong tool” moments—exactly what production agents need to earn user trust. Happy building!

