Claude Can Now Use Tools Like a Developer—Here’s What Changed
Original white-paper: Introducing advanced tool use on the Claude Developer Platform
Author: Anthropic Engineering Team
Re-worked for global audiences by: EEAT Technical Communication Group
Reading level: college (associate degree and up)
Estimated reading time: 18 minutes
1. The Short Version
Claude gained three new abilities:
- Tool Search – loads only the tools it needs, cutting context size by roughly 85 %.
- Programmatic Tool Calling – writes and runs Python to call many tools in one shot; only the final answer re-enters the chat.
- Tool-Use Examples – real JSON samples baked into the tool definition so Claude copies correct formats instead of guessing.
Together they turn Claude from a “manual one-tool-at-a-time user” into an “orchestrator that can skim, script, and copy examples.”
2. Why the Old Way Breaks
In the traditional setup, every tool definition is loaded into context up front, every intermediate result is pushed back through the model, and schemas alone leave Claude guessing at parameter formats. The result: prompts that blow past 10k tokens before the conversation even starts, wrong-tool picks, and malformed calls. The three features below attack each of those failure modes in turn.
3. Feature Deep Dive
3.1 Tool Search Tool – “Just-in-time” tool discovery
How it works (plain English)
- You hand Anthropic all your tool definitions but tag the rarely used ones with defer_loading: true.
- Claude first sees only a tiny search tool plus 3–5 high-frequency tools.
- When it realises “I need GitHub,” it searches the keyword; the platform injects only the matching tools (≈3k tokens).
- Everything else stays on the shelf.
Head-to-head numbers from an internal eval on a 58-tool set: an 85 % token haircut and a 25-point accuracy jump on the same test.
Quick implementation
{
  "tools": [
    { "type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex" },
    {
      "name": "github.createPullRequest",
      "description": "Open a new pull request",
      "input_schema": { ... },
      "defer_loading": true
    }
  ]
}
For MCP servers you can defer the whole bundle and whitelist the one function you constantly need:
{
  "type": "mcp_toolset",
  "mcp_server_name": "google-drive",
  "default_config": { "defer_loading": true },
  "configs": {
    "search_files": { "defer_loading": false }
  }
}
When to enable
✅ Tool defs already >10k tokens
✅ You notice wrong-tool errors
✅ 10+ tools live in the project
Skip it if you run <10 tools or every tool fires every turn.
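If you already have dozens of definitions, you do not need to hand-edit each one. A small helper such as the hypothetical defer_all_but below (a sketch assuming your tool definitions are plain Python dicts) keeps your 3–5 high-frequency tools always loaded and defers the tail:

from typing import Iterable

def defer_all_but(tools: list[dict], keep_loaded: Iterable[str]) -> list[dict]:
    """Mark every tool with defer_loading=True except the high-frequency ones."""
    keep = set(keep_loaded)
    return [
        tool if tool.get("name") in keep else {**tool, "defer_loading": True}
        for tool in tools
    ]

# Illustrative definitions only; real ones carry full input_schema objects.
all_tools = [
    {"name": "search_files", "description": "Search Drive files by name or content"},
    {"name": "get_expenses", "description": "List expense line items for a user and quarter"},
    {"name": "github.createPullRequest", "description": "Open a new pull request"},
]

tools = defer_all_but(all_tools, keep_loaded={"search_files", "get_expenses"})
# github.createPullRequest now carries defer_loading=True; the other two stay resident.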
3.2 Programmatic Tool Calling – Claude writes a Python script
Problem solved
- Intermediate results no longer flood context.
- Parallel calls happen in one code block; no repeated model inference.
- Claude expresses logic (loops, filters, sums) in code instead of prose—fewer off-by-one mistakes.
Concrete example: “Who busted their Q3 travel budget?”
Available tools
- get_team_members(department)
- get_expenses(user_id, quarter)
- get_budget_by_level(level)
Old flow
- Fetch 20 members
- Call expenses 20 times → 2,000 line items back into context
- Model eyeballs every row, adds them up, compares to budget
New flow (Claude authors Python)
import asyncio
import json

team = await get_team_members("engineering")

# One budget lookup per distinct level, fetched in parallel.
levels = list(set(m["level"] for m in team))
budgets = dict(zip(levels, await asyncio.gather(*[get_budget_by_level(l) for l in levels])))

# All 20 expense calls fire concurrently; raw line items stay in the sandbox.
expenses = await asyncio.gather(*[get_expenses(m["id"], "Q3") for m in team])

exceeded = []
for member, exp in zip(team, expenses):
    budget = budgets[member["level"]]["travel_limit"]
    spent = sum(e["amount"] for e in exp)
    if spent > budget:
        exceeded.append({"name": member["name"], "spent": spent, "limit": budget})

print(json.dumps(exceeded))  # only this output re-enters the conversation
Only the final exceeded list (≈1k tokens) re-enters the chat; the 2,000 expense lines never touch context.
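Note that get_team_members and friends are not functions you wrote: on the platform, the API auto-wraps your marked tools into async Python functions (see the FAQ). If you want to dry-run a script like this outside the sandbox, hypothetical stubs are enough; the ones below return made-up data purely for local testing:

import asyncio

# Hypothetical local stand-ins; on the platform these wrappers are generated
# from your tool definitions, so you never write them yourself.
async def get_team_members(department: str) -> list[dict]:
    return [{"id": "USR-1", "name": "Jane", "level": "senior"},
            {"id": "USR-2", "name": "Ravi", "level": "staff"}]

async def get_budget_by_level(level: str) -> dict:
    return {"travel_limit": 5000 if level == "staff" else 3000}

async def get_expenses(user_id: str, quarter: str) -> list[dict]:
    return [{"amount": 1200.0}, {"amount": 2400.0}]

async def main() -> None:
    # Paste the script above into this coroutine to run it locally.
    team = await get_team_members("engineering")
    print(team)

asyncio.run(main())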
Measured gains
Four-step wiring
1. Tag the tools you are happy to have called from code:
{ "name": "get_expenses", ..., "allowed_callers": ["code_execution_20250825"] }
2. Claude sends a code-execution request containing the script.
3. You receive tool calls with a caller field; send results back to the API—the sandbox digests them, not the model (see the sketch after this list).
4. When the script ends, the sandbox returns stdout; only that final output appears in context.
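A minimal sketch of that client-side loop, assuming the standard Messages tool-use pattern (the caller field comes from this beta; run_my_tool is a hypothetical dispatcher into your own backend, and the exact block shapes may differ):

import anthropic

client = anthropic.Anthropic()

tools = []  # the search tool, the code_execution tool, and your business tools

def run_my_tool(name: str, args: dict) -> str:
    """Hypothetical dispatcher; replace with calls into your real systems."""
    raise NotImplementedError(name)

messages = [{"role": "user", "content": "Which team members exceeded their Q3 travel budget?"}]

while True:
    response = client.beta.messages.create(
        betas=["advanced-tool-use-2025-11-20"],
        model="claude-sonnet-4-5-20250929",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # done: only the script's stdout ever entered the conversation

    results = []
    for block in response.content:
        if block.type == "tool_use":
            # A caller field marks requests that originate from the sandboxed
            # script rather than from the model directly.
            output = run_my_tool(block.name, block.input)
            results.append({"type": "tool_result", "tool_use_id": block.id, "content": output})
    messages.append({"role": "user", "content": results})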
When to use
✅ Large data sets you immediately aggregate
✅ Three or more chained tool calls
✅ Parallel checks (50 endpoints, 100 files)
✅ Intermediate data must not bias later reasoning
Skip it for single quick lookups or when you want Claude to see every raw line.
3.3 Tool-Use Examples – show, don’t tell
What a schema alone misses
due_date: string → 2024-11-06? Nov 6, 2024? ISO-8601?
reporter.id → UUID, “USR-12345”, or 12345?
escalation.level: 2 → what SLA hours map to that?
Fix: embed sample calls inside the definition
"input_examples": [
{
"title": "Login page returns 500 error",
"priority": "critical",
"reporter": { "id": "USR-12345", "contact": { "email": "jane@acme.com" } },
"due_date": "2024-11-06",
"escalation": { "level": 2, "notify_manager": true, "sla_hours": 4 }
},
{
"title": "Add dark mode support",
"labels": ["feature-request", "ui"],
"reporter": { "id": "USR-67890" }
}
]
Claude now copies the patterns: critical bug = full contact + escalation; feature request = light payload; date format = YYYY-MM-DD.
Accuracy on complex parameter handling jumped from 72 % to 90 % in Anthropic’s eval.
When to add examples
✅ Deeply nested optional structs
✅ Domain conventions (kebab-case labels, USR-xxx IDs)
✅ Several similar tools that differ only by subtle fields
Skip obvious single-field tools or standard formats Claude already knows (URL, email).
4. Putting It Together – a practical checklist
- Start with your biggest pain:
  - Context choking → Tool Search first
  - Huge messy returns → Programmatic Calling first
  - Constant 400 errors → Examples first
- Layer the rest later; the three features are complementary, not mutually exclusive.
- Write searchable descriptions.
  Good: "Search customer orders by date, status or total. Returns shipping & payment info."
  Bad: "Query DB"
- Keep the top 3–5 tools always loaded; defer the tail.
- Document return shapes clearly so Claude can generate correct parsing code (see the sketch after this list).
- Use realistic values in examples (real cities, IDs, prices). One to five per tool is enough.
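To make the last few points concrete, here is one illustrative definition written as a Python dict, the form it takes when passed through the SDK (the order-search tool, its fields and values are hypothetical): the description is keyword-rich, spells out the return shape, and the examples use realistic values.

search_orders_tool = {
    # Hypothetical tool; names, fields and values are illustrative only.
    "name": "search_customer_orders",
    "description": (
        "Search customer orders by date range, status or order total. "
        "Returns a list of {order_id: str, status: str, total: float, "
        "placed_at: 'YYYY-MM-DD', shipping: {...}, payment: {...}} objects."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "status": {"type": "string"},
            "placed_after": {"type": "string", "description": "YYYY-MM-DD"},
            "placed_before": {"type": "string", "description": "YYYY-MM-DD"},
            "min_total": {"type": "number"},
        },
    },
    "input_examples": [
        {"status": "shipped", "placed_after": "2024-07-01", "placed_before": "2024-09-30"},
        {"status": "pending", "min_total": 250.0},
    ],
}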
5. FAQ – what engineers ask first
Q1. Does the search step slow things down?
A tiny bit. If your tool set consumes >10k tokens or you see wrong-tool mistakes, the saved context plus accuracy gain outweighs the extra hop.
Q2. Could deferred tools be missed?
As long as the name or description contains the keyword, regex/BM25 will surface it. You can also plug your own embedding search.
Q3. Which languages can the sandbox run?
Currently Python 3 only. The API auto-wraps your marked tools into async Python functions.
Q4. Can I pip install extra packages?
The hosted sandbox does not support custom libraries. Roll your own code-execution environment and expose it through MCP if you need them.
Q5. Will examples bloat my prompt?
Each example is 0.5–1k tokens. In Anthropic tests the uplift in accuracy paid for the cost; add them only where usage is ambiguous.
6. Quick-start one-liner
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    betas=["advanced-tool-use-2025-11-20"],
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    tools=[
        {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
        {"type": "code_execution_20250825", "name": "code_execution"},
        # …your business tools with defer_loading / allowed_callers / input_examples
    ],
    messages=[{"role": "user", "content": "Which team members exceeded their Q3 travel budget?"}],
)
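And one illustrative business tool showing the three features combined on a single definition (the field names defer_loading, allowed_callers and input_examples come from the sections above; the schema and values are hypothetical):

get_expenses_tool = {
    "name": "get_expenses",
    "description": "List expense line items for a user and quarter. "
                   "Returns [{amount: float, category: str, date: 'YYYY-MM-DD'}].",
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {"type": "string"},
            "quarter": {"type": "string"},
        },
        "required": ["user_id", "quarter"],
    },
    "defer_loading": True,                           # loaded only when Tool Search surfaces it
    "allowed_callers": ["code_execution_20250825"],  # callable from Claude-authored Python
    "input_examples": [
        {"user_id": "USR-12345", "quarter": "Q3"},   # shows the ID and quarter conventions
    ],
}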
7. Key Takeaways
- Claude no longer needs the whole toolbox in its lap.
- It can write Python to orchestrate dozens of calls, hide the mess, and show only the answer.
- Real worked examples teach Claude your naming and formatting conventions better than any schema alone.
Use the feature that fixes your current bottleneck, then layer the others. You will finish with shorter prompts, faster runs, and far fewer “oops, wrong tool” moments—exactly what production agents need to earn user trust. Happy building!

