Scaling AI Agents: When Adding More Models Hurts Performance
Core question: Does adding more AI agents always improve results?
Short answer: Only when the task is parallelizable, tool-light, and single-agent accuracy is below ~45%. Otherwise, coordination overhead eats all gains.
What This Article Answers
- How can you predict whether multi-agent coordination will help or hurt before you deploy?
- What do 180 controlled configurations across finance, web browsing, planning, and office workflows reveal?
- Which practical checklist can you copy-paste into your next design doc?
1 The Setup: 180 Experiments, One Variable—Coordination Structure
Summary: Researchers locked prompts, tools, and token budgets, then varied only how agents talk to each other. This isolates “architecture” from “more compute.”
1.1 Hardware & Budget
- Token ceiling: 4 800 per task, rigidly enforced
- Model families: OpenAI GPT-5 variants, Google Gemini 2.x, Anthropic Claude Sonnet 3.7→4.5
- Intelligence Index range: 34–66 (composite reasoning score)
- Agent count: 1–9, but 3–4 is the practical ceiling under the token cap
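As a mental model (not the paper's actual harness), the locked knobs can be pictured as a single config object in which only the topology and agent count vary; the class and field names below are my own:

from dataclasses import dataclass

# Hypothetical sketch of the locked experimental knobs; only topology and
# agent count change between runs, everything else is held fixed.
@dataclass
class RunConfig:
    token_ceiling: int = 4800        # hard per-task budget, shared by the whole team
    topology: str = "SAS"            # SAS, MAS-I, MAS-C, MAS-D, or MAS-H
    n_agents: int = 1                # 1-9 in the grid; 3-4 is the practical ceiling
    intelligence_index: int = 34     # composite reasoning score, 34-66 across families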
1.2 Benchmarks Chosen
- Finance-Agent: parallelizable financial analysis
- BrowseComp-Plus: dynamic web browsing and search
- Workbench: tool-heavy but largely linear office workflows
- PlanCraft: sequential planning where every step mutates world state
Author reflection: I once stacked eight agents on a web-scraping pipeline; latency ballooned to 18 s and accuracy dropped 12 %. Seeing the token-budget rule here explains why—every extra “hey, here’s what I found” message steals tokens from actual reasoning.
2 The Architectures in Plain Language
Summary: Five blueprints, zero hype. Pick one, you pick your pain point.
2.1 Single-Agent System (SAS)
- One LLM, one loop, zero chatter
- Complexity: O(k) turns, O(k) memory
2.2 Independent (MAS-I)
- n agents, no communication, final majority vote
- Complexity: O(n·k) turns, error amplification 17×
2.3 Centralized (MAS-C)
- 1 orchestrator + n workers, star topology
- Complexity: O(r·n·k) turns, overhead 285 %, error amplification contained to 4.4×
2.4 Decentralized (MAS-D)
- n agents, all-to-all debate, consensus vote
- Complexity: O(d·n·k) turns, overhead 263 %, redundancy 50 %
2.5 Hybrid (MAS-H)
- Star topology + selective peer edges
- Complexity: O(r·n·k + p·n) turns, overhead 515 %, fragile protocol (a turn-count comparison of all five blueprints is sketched below)
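To compare the five blueprints at a glance, here is a back-of-envelope turn estimator. It is my own sketch of the O(·) notes above, not code from the study, with k = turns per agent, r = orchestrator rounds, d = debate rounds, and p = peer edges per agent; constants are folded into k.

def estimated_turns(topology, n, k=6, r=2, d=2, p=1):
    # Rough total turn counts mirroring the complexity notes above.
    if topology == "SAS":
        return k                      # one agent, one loop, zero chatter
    if topology == "MAS-I":
        return n * k                  # independent agents, no messages
    if topology == "MAS-C":
        return r * n * k              # orchestrator rounds over n workers
    if topology == "MAS-D":
        return d * n * k              # all-to-all debate rounds
    if topology == "MAS-H":
        return r * n * k + p * n      # star plus selective peer edges
    raise ValueError(f"unknown topology: {topology}")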
3 The Numbers: From +81 % Win to −70 % Loss
Summary: Task structure, not team size, drives the delta.
3.1 Finance-Agent (Parallelizable)
- Best: Centralized +80.9 % (0.631 vs 0.349 SAS)
- Mechanism: agents independently analyze revenue, cost, and market; the orchestrator synthesizes
3.2 BrowseComp-Plus (Dynamic)
- Best: Decentralized +9.2 %
- Mechanism: peer debate corrects stale web data; too much hierarchy lags behind live pages
3.3 Workbench (Tool-Heavy but Linear)
- Best: Decentralized +5.7 %
- Reason: gains from parallel tool calls, but the 12-tool budget burns tokens fast
3.4 PlanCraft (Sequential)
- Worst: Independent −70 %
- Reason: every crafting step mutates world state; agents work on divergent realities
4 The Predictive Formula: Plug in Your Own Task
Summary: A 20-coefficient mixed-effects model explains 51 % of variance on unseen configs—no dataset-specific parameters.
4.1 Key Interaction Terms (standardized)
Performance ≈ 0.256·I²                 // accelerating returns from smarter models
            − 0.330·Ec×T               // efficiency–tools trade-off (strongest)
            − 0.141·O%×T               // overhead explodes with tool count
            − 0.408·PSA×log(1+n)       // baseline paradox: high PSA → negative returns
4.2 Worked Example
Task: internal BI dashboard generator
- T = 14 tools, PSA = 0.63, I = 58, planned n = 4
- Measured Ec ≈ 0.08, O% ≈ 400 %
Plug-in: multi-agent terms = −0.330×0.08×14 − 0.141×4×14 − 0.408×0.63×1.6 (raw values shown for readability; the fitted model standardizes each factor first) ≈ −0.29 in standardized units → predicted drop of ~15 raw accuracy points.
Decision: stay Single; the shipped single-agent version came in 8 % faster with 6 % fewer errors.
Author reflection: We used to argue “let’s just try four agents overnight.” Having an explicit threshold (PSA ≈ 0.45) turns the debate into a three-minute spreadsheet exercise—engineering time reclaimed.
5 Scenario Playbook: Three Common Task Types
5.1 Code Snippet: Quick Evaluator
import math

def predict_delta(I, T, PSA, n, Ec, O):
    # Predicted performance delta in standardized units;
    # positive means multi-agent is recommended.
    return (0.256 * I**2
            - 0.330 * Ec * T
            - 0.141 * O * T
            - 0.408 * PSA * math.log(1 + n))
Run it in a Jupyter cell against your next feature spec—takes 30 s.
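For orientation, a hypothetical call with already-standardized (z-scored) inputs; the values below are placeholders, not measurements from the study:

delta = predict_delta(I=0.8, T=1.2, PSA=0.6, n=4, Ec=-0.5, O=1.5)
print("multi-agent" if delta > 0 else "stay single", round(delta, 2))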
6 Micro-Lessons from the Lab
Summary: Small details that don’t fit a table but save hours of pain.
- Turn-scaling exponent 1.724: beyond 4 agents, per-agent reasoning depth collapses, so don't chase 10× parallelism.
- Message saturation: once density exceeds 0.4 messages per turn, extra chatter adds < 2 % accuracy; kill the thread.
- Error taxonomy: logical contradictions and numerical drift dominate; Centralized cuts the former by 36 %, but Hybrid accidentally amplifies rounding errors by 10 %.
- Family quirks: Anthropic models tolerate orchestrator load best; OpenAI Hybrid peaks higher but costs 3×; Gemini stays flat across topologies, great if you hate surprises.
Author reflection: I used to insist on “one architecture for all products.” The data screams “match topology to task.” We now keep a living sheet: every new customer scenario gets a 15-minute complexity score → architecture → budget. Simpler code, happier customers.
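To make the 1.724 turn-scaling exponent concrete, here is a rough back-of-envelope helper; it is my own sketch, not from the paper, and the base_turns value is an arbitrary placeholder. It shows how the token budget available per agent turn collapses as the team grows:

def tokens_per_agent_turn(n, budget=4800, base_turns=6, exponent=1.724):
    # Total team turns grow roughly as n**exponent (the empirical fit),
    # so the budget available to each individual agent turn shrinks super-linearly.
    total_turns = base_turns * n ** exponent
    return budget / total_turns

for n in (1, 2, 4, 8):
    print(n, round(tokens_per_agent_turn(n)))   # roughly 800, 242, 73, 22 tokens per turn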
7 Action Checklist / Implementation Steps
1. Run a single-agent baseline → record PSA and tool count T.
2. If PSA > 0.45 or T > 12 → default to Single, skip to step 7.
3. Score task decomposability (0–1): can sub-outputs be merged without order?
4. Estimate Ec and O% from a small 3-agent dry run (100 traces are enough).
5. Plug into the predictor; positive Δ → proceed, negative → stay Single (a minimal decision helper is sketched below the checklist).
6. Choose the architecture:
   - High decomposability → Centralized
   - Dynamic environment → Decentralized
   - Budget luxury + need for creativity → Hybrid (accept 6× tokens)
7. Cap the team at 4 agents; beyond this, latency grows super-linearly.
8. Monitor production: Ec < 0.1 or error amplification Ae > 10 → roll back.
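The go/no-go path through this checklist fits in a few lines. A minimal sketch, assuming predict_delta from section 5.1 and a decomposability cutoff of 0.7 that I chose for illustration; the PSA 0.45, 12-tool, and 4-agent thresholds come from the article:

def choose_architecture(PSA, T, decomposability, dynamic_env, delta):
    # delta is the output of predict_delta on standardized inputs (section 5.1).
    if PSA > 0.45 or T > 12:
        return "Single"                        # step 2: strong baseline or tool-heavy task
    if delta <= 0:
        return "Single"                        # step 5: predictor sees no multi-agent gain
    if decomposability >= 0.7:                 # illustrative cutoff, not from the paper
        return "Centralized, <= 4 agents"      # step 6: parallel analytics
    if dynamic_env:
        return "Decentralized, <= 4 agents"    # step 6: live, changing environments
    return "Hybrid, only if the token budget is elastic"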
One-page Overview
Adding agents only helps when three stars align: task can be split, tool count is modest, and single-agent accuracy is below ~45 %. A data-driven formula predicts the delta before deployment; 180 experiments show gains from +81 % to −70 % depending on task structure, not model brand. Centralized topology wins for parallel analytics, Decentralized for dynamic search, Hybrid only if budget is elastic. Keep teams ≤ 4 and always measure coordination overhead—token tax scales super-linearly with agent count.
FAQ
Q1: What is the most important single variable to check?
A: Tool count T interacting with efficiency Ec; this term dwarfs the rest, carrying roughly 57 % more weight than the next-largest.
Q2: Can I stretch beyond 4 agents if latency is not an issue?
A: Token budget inflates as n^1.72; even with infinite money, per-agent reasoning depth becomes paper-thin.
Q3: Why does Independent always lose?
A: No error correction—mistakes multiply 17×; useful only for ultra-cheap ensembles where 30 % error is acceptable.
Q4: Does model family matter more than architecture?
A: No—architecture-task fit explains 3× more variance, but within the same task, picking the wrong family can swing 30 %.
Q5: Is the formula open source?
A: All coefficients and normalization stats are in the paper; no proprietary data needed to re-implement.
Q6: How do I measure Ec quickly?
A: Run 50 trials, record success and total turns; Ec = success / (turns / SAS_baseline_turns).
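A rough sketch of that measurement, assuming each trial logs a (success, turns) pair; the function and variable names are illustrative:

def measure_ec(trials, sas_baseline_turns):
    # trials: list of (success: bool, turns: int) from ~50 multi-agent runs
    success_rate = sum(1 for ok, _ in trials if ok) / len(trials)
    avg_turns = sum(turns for _, turns in trials) / len(trials)
    turn_inflation = avg_turns / sas_baseline_turns
    return success_rate / turn_inflation   # Ec = success / (turns / SAS baseline turns)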
Q7: Will this hold for multimodal or robotics tasks?
A: Paper stays in text+tool domain; if your embodied task has similar sequential dependency and tool tax, the same trade-off logic applies but coefficients need re-fitting.

