Scaling AI Agents: When Adding More Models Hurts Performance
Core question: Does adding more AI agents always improve results?
Short answer: Only when the task is parallelizable, tool-light, and single-agent accuracy is below ~45%. Otherwise, coordination overhead eats all gains.
What This Article Answers
- How can you predict whether multi-agent coordination will help or hurt before you deploy?
- What do 180 controlled configurations across finance, web browsing, planning, and office workflows reveal?
- Which practical checklist can you copy-paste into your next design doc?
1 The Setup: 180 Experiments, One Variable—Coordination Structure
Summary: Researchers locked prompts, tools, and token budgets, then varied only how agents talk to each other. This isolates “architecture” from “more compute.”
1.1 Hardware & Budget
- Token ceiling: 4 800 per task, rigidly enforced
- Model families: OpenAI GPT-5 variants, Google Gemini 2.x, Anthropic Claude Sonnet 3.7→4.5
- Intelligence Index range: 34–66 (composite reasoning score)
- Agent count: 1–9, but 3–4 is the practical ceiling under the token cap
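As a mental model (not the paper's actual harness), the locked knobs can be pictured as a single config object in which only the topology and agent count vary; the class and field names below are my own:

from dataclasses import dataclass

# Hypothetical sketch of the locked experimental knobs; only topology and
# agent count change between runs, everything else is held fixed.
@dataclass
class RunConfig:
    token_ceiling: int = 4800        # hard per-task budget, shared by the whole team
    topology: str = "SAS"            # SAS, MAS-I, MAS-C, MAS-D, or MAS-H
    n_agents: int = 1                # 1-9 in the grid; 3-4 is the practical ceiling
    intelligence_index: int = 34     # composite reasoning score, 34-66 across families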
1.2 Benchmarks Chosen
- Finance-Agent: parallelizable financial analysis
- BrowseComp-Plus: dynamic web browsing and search
- Workbench: tool-heavy but largely linear office workflows
- PlanCraft: sequential planning where every step mutates world state
Author reflection: I once stacked eight agents on a web-scraping pipeline; latency ballooned to 18 s and accuracy dropped 12 %. Seeing the token-budget rule here explains why—every extra “hey, here’s what I found” message steals tokens from actual reasoning.
2 The Architectures in Plain Language
Summary: Five blueprints, zero hype. Pick one, you pick your pain point.
2.1 Single-Agent System (SAS)
- One LLM, one loop, zero chatter
- Complexity: O(k) turns, O(k) memory
2.2 Independent (MAS-I)
- n agents, no communication, final majority vote
- Complexity: O(n·k) turns, error amplification 17×
2.3 Centralized (MAS-C)
- 1 orchestrator + n workers, star topology
- Complexity: O(r·n·k) turns, overhead 285 %, error amplification contained to 4.4×
2.4 Decentralized (MAS-D)
- n agents, all-to-all debate, consensus vote
- Complexity: O(d·n·k) turns, overhead 263 %, redundancy 50 %
2.5 Hybrid (MAS-H)
- Star topology + selective peer edges
- Complexity: O(r·n·k + p·n) turns, overhead 515 %, fragile protocol (a turn-count comparison of all five blueprints is sketched below)
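To compare the five blueprints at a glance, here is a back-of-envelope turn estimator. It is my own sketch of the O(·) notes above, not code from the study, with k = turns per agent, r = orchestrator rounds, d = debate rounds, and p = peer edges per agent; constants are folded into k.

def estimated_turns(topology, n, k=6, r=2, d=2, p=1):
    # Rough total turn counts mirroring the complexity notes above.
    if topology == "SAS":
        return k                      # one agent, one loop, zero chatter
    if topology == "MAS-I":
        return n * k                  # independent agents, no messages
    if topology == "MAS-C":
        return r * n * k              # orchestrator rounds over n workers
    if topology == "MAS-D":
        return d * n * k              # all-to-all debate rounds
    if topology == "MAS-H":
        return r * n * k + p * n      # star plus selective peer edges
    raise ValueError(f"unknown topology: {topology}")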
3 The Numbers: From +81 % Win to −70 % Loss
Summary: Task structure, not team size, drives the delta.
3.1 Finance-Agent (Parallelizable)
- Best: Centralized +80.9 % (0.631 vs 0.349 SAS)
- Mechanism: agents independently analyze revenue, cost, and market; the orchestrator synthesizes
3.2 BrowseComp-Plus (Dynamic)
- Best: Decentralized +9.2 %
- Mechanism: peer debate corrects stale web data; too much hierarchy lags behind live pages
3.3 Workbench (Tool-Heavy but Linear)
- Best: Decentralized +5.7 %
- Reason: gains from parallel tool calls, but the 12-tool budget burns tokens fast
3.4 PlanCraft (Sequential)
- Worst: Independent −70 %
- Reason: every crafting step mutates world state; agents work on divergent realities
4 The Predictive Formula: Plug in Your Own Task
Summary: A 20-coefficient mixed-effects model explains 51 % of variance on unseen configs—no dataset-specific parameters.
4.1 Key Interaction Terms (standardized)
Performance ≈ 0.256·I²                 // accelerating returns from smarter models
            − 0.330·Ec×T               // efficiency–tools trade-off (strongest)
            − 0.141·O%×T               // overhead explodes with tool count
            − 0.408·PSA×log(1+n)       // baseline paradox: high PSA → negative returns
4.2 Worked Example
Task: internal BI dashboard generator
- T = 14 tools, PSA = 0.63, I = 58, planned n = 4
- Measured Ec ≈ 0.08, O% ≈ 400 %
Plug-in: multi-agent terms = −0.330×0.08×14 − 0.141×4×14 − 0.408×0.63×1.6 (raw values shown for readability; the fitted model standardizes each factor first) ≈ −0.29 in standardized units → predicted drop of ~15 raw accuracy points.
Decision: stay Single; the shipped single-agent version came in 8 % faster with 6 % fewer errors.
Author reflection: We used to argue “let’s just try four agents overnight.” Having an explicit threshold (PSA ≈ 0.45) turns the debate into a three-minute spreadsheet exercise—engineering time reclaimed.
5 Scenario Playbook: Three Common Task Types
5.1 Code Snippet: Quick Evaluator
import math

def predict_delta(I, T, PSA, n, Ec, O):
    # Predicted performance delta in standardized units;
    # positive means multi-agent is recommended.
    return (0.256 * I**2
            - 0.330 * Ec * T
            - 0.141 * O * T
            - 0.408 * PSA * math.log(1 + n))
Run it in a Jupyter cell against your next feature spec—takes 30 s.
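For orientation, a hypothetical call with already-standardized (z-scored) inputs; the values below are placeholders, not measurements from the study:

delta = predict_delta(I=0.8, T=1.2, PSA=0.6, n=4, Ec=-0.5, O=1.5)
print("multi-agent" if delta > 0 else "stay single", round(delta, 2))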
6 Micro-Lessons from the Lab
Summary: Small details that don’t fit a table but save hours of pain.
- Turn-scaling exponent 1.724: beyond 4 agents, per-agent reasoning depth collapses, so don't chase 10× parallelism.
- Message saturation: once density exceeds 0.4 messages per turn, extra chatter adds < 2 % accuracy; kill the thread.
- Error taxonomy: logical contradictions and numerical drift dominate; Centralized cuts the former by 36 %, but Hybrid accidentally amplifies rounding errors by 10 %.
- Family quirks: Anthropic models tolerate orchestrator load best; OpenAI Hybrid peaks higher but costs 3×; Gemini stays flat across topologies, great if you hate surprises.
Author reflection: I used to insist on “one architecture for all products.” The data screams “match topology to task.” We now keep a living sheet: every new customer scenario gets a 15-minute complexity score → architecture → budget. Simpler code, happier customers.
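To make the 1.724 turn-scaling exponent concrete, here is a rough back-of-envelope helper; it is my own sketch, not from the paper, and the base_turns value is an arbitrary placeholder. It shows how the token budget available per agent turn collapses as the team grows:

def tokens_per_agent_turn(n, budget=4800, base_turns=6, exponent=1.724):
    # Total team turns grow roughly as n**exponent (the empirical fit),
    # so the budget available to each individual agent turn shrinks super-linearly.
    total_turns = base_turns * n ** exponent
    return budget / total_turns

for n in (1, 2, 4, 8):
    print(n, round(tokens_per_agent_turn(n)))   # roughly 800, 242, 73, 22 tokens per turn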
7 Action Checklist / Implementation Steps
1. Run a single-agent baseline → record PSA and tool count T.
2. If PSA > 0.45 or T > 12 → default to Single, skip to step 7.
3. Score task decomposability (0–1): can sub-outputs be merged without order?
4. Estimate Ec and O% from a small 3-agent dry run (100 traces are enough).
5. Plug into the predictor; positive Δ → proceed, negative → stay Single (a minimal decision helper is sketched below the checklist).
6. Choose the architecture:
   - High decomposability → Centralized
   - Dynamic environment → Decentralized
   - Budget luxury + need for creativity → Hybrid (accept 6× tokens)
7. Cap the team at 4 agents; beyond this, latency grows super-linearly.
8. Monitor production: Ec < 0.1 or error amplification Ae > 10 → roll back.
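The go/no-go path through this checklist fits in a few lines. A minimal sketch, assuming predict_delta from section 5.1 and a decomposability cutoff of 0.7 that I chose for illustration; the PSA 0.45, 12-tool, and 4-agent thresholds come from the article:

def choose_architecture(PSA, T, decomposability, dynamic_env, delta):
    # delta is the output of predict_delta on standardized inputs (section 5.1).
    if PSA > 0.45 or T > 12:
        return "Single"                        # step 2: strong baseline or tool-heavy task
    if delta <= 0:
        return "Single"                        # step 5: predictor sees no multi-agent gain
    if decomposability >= 0.7:                 # illustrative cutoff, not from the paper
        return "Centralized, <= 4 agents"      # step 6: parallel analytics
    if dynamic_env:
        return "Decentralized, <= 4 agents"    # step 6: live, changing environments
    return "Hybrid, only if the token budget is elastic"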
One-page Overview
Adding agents only helps when three stars align: task can be split, tool count is modest, and single-agent accuracy is below ~45 %. A data-driven formula predicts the delta before deployment; 180 experiments show gains from +81 % to −70 % depending on task structure, not model brand. Centralized topology wins for parallel analytics, Decentralized for dynamic search, Hybrid only if budget is elastic. Keep teams ≤ 4 and always measure coordination overhead—token tax scales super-linearly with agent count.
FAQ
Q1: What is the most important single variable to check?
A: Tool count T interacting with efficiency Ec; this term dwarfs the rest, carrying roughly 57 % more weight than the next-largest.
Q2: Can I stretch beyond 4 agents if latency is not an issue?
A: Token budget inflates as n^1.72; even with infinite money, per-agent reasoning depth becomes paper-thin.
Q3: Why does Independent always lose?
A: No error correction—mistakes multiply 17×; useful only for ultra-cheap ensembles where 30 % error is acceptable.
Q4: Does model family matter more than architecture?
A: No—architecture-task fit explains 3× more variance, but within the same task, picking the wrong family can swing 30 %.
Q5: Is the formula open source?
A: All coefficients and normalization stats are in the paper; no proprietary data needed to re-implement.
Q6: How do I measure Ec quickly?
A: Run 50 trials, record success and total turns; Ec = success / (turns / SAS_baseline_turns).
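A rough sketch of that measurement, assuming each trial logs a (success, turns) pair; the function and variable names are illustrative:

def measure_ec(trials, sas_baseline_turns):
    # trials: list of (success: bool, turns: int) from ~50 multi-agent runs
    success_rate = sum(1 for ok, _ in trials if ok) / len(trials)
    avg_turns = sum(turns for _, turns in trials) / len(trials)
    turn_inflation = avg_turns / sas_baseline_turns
    return success_rate / turn_inflation   # Ec = success / (turns / SAS baseline turns)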
Q7: Will this hold for multimodal or robotics tasks?
A: Paper stays in text+tool domain; if your embodied task has similar sequential dependency and tool tax, the same trade-off logic applies but coefficients need re-fitting.

