How a Single Permission Change Nearly Shut Down the Internet
A Forensic Analysis of the Cloudflare November 18 Outage (Technical Deep Dive)
Stance Declaration
This article includes analytical judgment about Cloudflare’s architecture, operational processes, and systemic risks. These judgments are based solely on the official incident report provided and should be considered professional interpretation—not definitive statements of fact.
1. Introduction: An Internet-Scale Outage That Was Not an Attack
On November 18, 2025, Cloudflare—the backbone for a significant portion of the global Internet—experienced its most severe outage since 2019. Websites across the world began returning HTTP 5xx errors, authentication systems failed, dashboards could not load, and multiple Cloudflare services simultaneously degraded.
At first glance, the incident looked eerily similar to a massive DDoS attack or a coordinated supply-chain compromise.
Surprisingly, Cloudflare’s postmortem clarified:
No attack was involved.
The root cause was a database permission change that triggered an unexpected explosion in the size of a key machine-learning “feature file”—a shared input used by Cloudflare’s Bot Management module.
This seemingly harmless internal change caused:
- Cloudflare’s core proxy engines (FL and FL2) to crash
- KV storage to fail
- Access authentication to break
- Turnstile verification to stop loading
- The Cloudflare Dashboard to become unreachable
- Global traffic flow to repeatedly collapse and recover
And all of this because a single file grew larger than expected.
This is an anatomy of that failure—and what it means for the future of Internet infrastructure.
2. The Global Timeline: From 11:05 to 17:06 UTC
The outage unfolded over roughly six hours, with a chain of escalating failures.
```mermaid
timeline
    title Cloudflare November 18 Outage Timeline
    11:05 : ClickHouse permission change deployed
    11:20 : Bot feature file begins generating incorrect entries
    11:28 : First global HTTP 5xx errors observed
    11:32-13:05 : KV and Access services show massive degradation
    13:05 : Emergency bypass for KV and Access deployed
    13:37 : Decision made to roll back Bot feature file
    14:24 : Feature file generation halted
    14:30 : Core traffic flow returns to normal
    17:06 : All Cloudflare systems fully restored
```
This timeline highlights one uncomfortable truth:
The entire outage stemmed from a single systemic assumption that broke at 11:05.
3. The Root Cause: A Permission Change with Unintended Consequences
Cloudflare uses ClickHouse to generate machine-learning “feature files” consumed by its Bot Management system. These files are:
- Updated every few minutes
- Distributed globally
- Consumed by every Cloudflare proxy node
The problem started when Cloudflare updated database permissions so that internal queries could also see the underlying `r0` tables—not just the `default` database.
Under normal conditions, a query such as:
```sql
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
```
would only return metadata for tables in the `default` database.
After the permission change:
- The same query also returned metadata for `r0` (the underlying tables)
- Resulting in duplicate or redundant feature rows
- Doubling (or more) the number of ML features detected
- Causing the feature file to exceed internal limits baked into Cloudflare’s proxy engine
Cloudflare’s ML module enforces a strict cap (~200 features), primarily for memory preallocation and runtime safety.
Once the new oversized file propagated:
The Bot Management module crashed.
The proxy engine panicked.
All traffic depending on FL and FL2 began failing.
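Under the assumption that the duplicate rows differed only in their source database, a defensive consumer could have collapsed them before counting features. The following is a minimal Rust sketch of that idea; the types, field names, and data are hypothetical and are not taken from Cloudflare’s code:

```rust
use std::collections::HashSet;

/// One row of column metadata, as returned by a query against `system.columns`.
/// The struct and field names are illustrative only.
struct ColumnRow {
    database: String, // e.g. "default" or "r0"
    table: String,
    name: String,
}

/// Collapse metadata rows to unique (table, column) pairs, so the same column
/// reported from both the `default` views and the underlying `r0` tables is
/// counted once instead of twice.
fn unique_features(rows: &[ColumnRow]) -> Vec<String> {
    let mut seen = HashSet::new();
    rows.iter()
        .filter(|r| seen.insert((r.table.clone(), r.name.clone())))
        .map(|r| r.name.clone())
        .collect()
}

fn main() {
    let rows = vec![
        ColumnRow { database: "default".into(), table: "http_requests_features".into(), name: "feat_a".into() },
        ColumnRow { database: "r0".into(), table: "http_requests_features".into(), name: "feat_a".into() },
        ColumnRow { database: "r0".into(), table: "http_requests_features".into(), name: "feat_b".into() },
    ];
    for r in &rows {
        println!("row from {}: {}.{}", r.database, r.table, r.name);
    }
    // Duplicate r0 rows no longer inflate the count: 2 features, not 3 rows.
    assert_eq!(unique_features(&rows).len(), 2);
}
```

Whether the right fix is deduplication in the consumer, a database filter in the query itself, or both is a design question; the point is that the consumer had no such guard.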
4. Architectural Insight: Why a Single File Could Break the Entire Network
Cloudflare’s strength—its unified global proxy architecture—is also its Achilles’ heel.
A simplified view of the Cloudflare request pipeline looks like this:
```mermaid
flowchart TD
    A[Client Request] --> B[TLS/HTTP Termination]
    B --> C["Core Proxy (FL/FL2)"]
    C --> D["Security Modules: WAF, Bot, DDoS, Firewalls"]
    D --> E["Cache / Workers / R2 / Origin Fetch"]
```
Because the Bot Management module sits inside the critical path:
- Every request passes through it
- Every Cloudflare node loads the same feature file
- Any invalid file becomes a global single point of failure
No fallback.
No graceful degradation.
No isolation between modules.
This is why:
A single malformed internal file caused a global outage.
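To make the coupling concrete, here is a deliberately oversimplified Rust sketch of a hot path in which bot management is a mandatory stage. All names and types are hypothetical; the sketch only illustrates how an error in one required module becomes a failed request:

```rust
// A simplified model of a proxy hot path where every module, including bot
// management, is a mandatory stage. All names are hypothetical.
struct Request;
struct Response(&'static str);
#[derive(Debug)]
struct ProxyError(&'static str);

fn terminate_tls(_req: &Request) -> Result<(), ProxyError> {
    Ok(())
}

// If the bot module cannot load its feature file, the error propagates
// straight up with `?` and the request is lost; there is no path around it.
fn bot_management(_conn: &()) -> Result<u8, ProxyError> {
    Err(ProxyError("feature file invalid"))
}

fn security_filters(_conn: &(), _bot_score: u8) -> Result<Request, ProxyError> {
    Ok(Request)
}

fn fetch_from_cache_or_origin(_req: Request) -> Result<Response, ProxyError> {
    Ok(Response("200 OK"))
}

fn handle(req: &Request) -> Result<Response, ProxyError> {
    let conn = terminate_tls(req)?;
    let score = bot_management(&conn)?; // critical-path dependency
    let filtered = security_filters(&conn, score)?;
    fetch_from_cache_or_origin(filtered)
}

fn main() {
    match handle(&Request) {
        Ok(resp) => println!("served: {}", resp.0),
        // With the bot module failing, every request fails: the 5xx pattern
        // described above.
        Err(e) => println!("request failed: {}", e.0),
    }
}
```

The real proxy is far more involved, but the shape of the dependency is the same: there is no branch in which a request is served without the bot stage succeeding.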
5. The Confusing Symptom: Why the Outage Looked Like a DDoS Attack
One of the most important—and misleading—elements of this outage was the cyclical pattern of failures.
Cloudflare observed:
- Traffic failing for several minutes
- Then mysteriously recovering
- Then failing again
- Repeating every 5 minutes
This behavior closely resembled:
- Adaptive botnet bursts
- Pulsing DDoS attacks
- Traffic exhaustion cycles
The real reason was far more mundane:
- ClickHouse nodes were at different stages of the permission rollout
- Some nodes generated correct feature files
- Others generated broken ones
- Each new file propagated globally
- Causing the proxy to alternately crash → recover → crash again
Even worse:
- Cloudflare’s own Status Page coincidentally went down
- Leading internal responders to suspect a multi-vector attack
This misdirection consumed critical investigation time.
6. The Cascade Failure: KV, Access, Turnstile, Dashboard
Once the core proxy engines failed, multiple Cloudflare subsystems collapsed.
6.1 Workers KV
KV depends heavily on Cloudflare’s proxy tier.
- Proxy panic → KV gateway unreachable
- Result: massive 5xx error spikes
6.2 Cloudflare Access
Access authentication relies on KV and the proxy:
- KV failures prevented Access session lookups
- New logins failed
- Existing sessions stayed valid but could not be refreshed
6.3 Turnstile
Turnstile (CAPTCHA alternative):
- Could not load because Access and the proxy were unstable
- This broke Cloudflare Dashboard logins
6.4 Dashboard
The Cloudflare Dashboard:
- Used Turnstile for login
- Used KV for config operations
- Became nearly unreachable
This is a textbook example of horizontal dependency collapse:
```mermaid
flowchart TD
    A[Bot Feature File Error] --> B[Proxy Panic]
    B --> C[KV Gateway Fails]
    C --> D[Access Auth Fails]
    D --> E[Turnstile Unusable]
    E --> F[Dashboard Unreachable]
```
A single failing component cascaded across the entire Cloudflare ecosystem.
7. Technical Deep Dive: How the Feature File “Blew Up”
Before the permission change:
- ~60 ML features
- Fit comfortably inside the proxy’s memory limits
After the permission change:
- Duplicate `r0` metadata rows doubled the feature count
- Some features appeared more than twice
- The total count exceeded the hard limit (~200)
Cloudflare’s internal code (simplified):
```rust
if features.len() > MAX_FEATURES {
    panic!("too many features");
}
```
This is:
- Valid Rust
- Correct behavior under normal circumstances
- Extremely dangerous when input assumptions are violated
This single panic:
- Crashed a core proxy thread
- Produced 5xx errors
- Disrupted all downstream products
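For readers less familiar with Rust panics, the following self-contained sketch (hypothetical names, not Cloudflare’s code) shows both halves of the problem: an over-limit feature set triggers a panic, and unless the caller explicitly catches the unwind, that panic takes the request with it. `std::panic::catch_unwind` is one way to contain it:

```rust
use std::panic::{self, AssertUnwindSafe};

const MAX_FEATURES: usize = 200; // illustrative cap mirroring the limit described above

/// Simplified stand-in for the module that loads the feature file.
/// It panics when its preallocation assumption is violated.
fn load_features(features: Vec<String>) -> usize {
    if features.len() > MAX_FEATURES {
        panic!("too many features");
    }
    features.len()
}

/// A caller that contains the panic: the unwind is caught and converted into
/// an error, so the request can continue in a degraded mode instead of dying.
fn handle_request(features: Vec<String>) -> Result<usize, String> {
    panic::catch_unwind(AssertUnwindSafe(|| load_features(features)))
        .map_err(|_| "bot module unavailable; serving without a bot score".to_string())
}

fn main() {
    // An oversized feature set no longer aborts the whole pipeline.
    // (The default panic hook will still print the panic message to stderr.)
    let oversized: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();
    match handle_request(oversized) {
        Ok(n) => println!("loaded {n} features"),
        Err(e) => println!("degraded mode: {e}"),
    }
}
```

Cloudflare’s own remediation may look nothing like this; the sketch only shows that the panic itself, not the oversized file, is what turned a bad input into a 5xx.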
8. The Real Issue: A Broken System Assumption, Not a Bug
Cloudflare’s incident reveals a key engineering truth:
Most large-scale outages are not caused by bugs.
They are caused by assumptions that silently stop being true.
Cloudflare assumed:
- Permission changes would not alter metadata query output
- Feature files would always be small
- Internal files did not require strict validation
- The ML module would never receive invalid data
- The proxy would never load a malformed file
Once any of these assumptions breaks, the system behaves unpredictably.
This is a classic example of systemic fragility in a high-coupling architecture.
9. Industry Comparison: Why AWS and Google Cloud Rarely Suffer Global Failures
Cloudflare’s architecture is unique:
| Platform | Architectural Style | Failure Propagation |
|---|---|---|
| Cloudflare | Single global proxy layer; unified config | Local fault → global outage |
| AWS | Region isolation; AZ sandboxing | Region fault → limited blast radius |
| GCP | Service-level isolation; strong canary culture | Misconfiguration → localized impact |
Cloudflare’s advantages:
- Consistent performance
- Global propagation in minutes
- Unified rules and execution
Cloudflare’s risk:
- Misconfigurations propagate at the speed of automation
- Internal files are global dependencies
- Module coupling creates global single points of failure
This outage exposes the downside of Cloudflare’s “global uniformity” model.
10. Forward-Looking Predictions: What Cloudflare Is Likely to Change
(Speculative but logically justified)
10.1 Sandbox and Validate Feature Files
Cloudflare will likely introduce:
- Schema validation
- Size and duplication checks (sketched below)
- Static analysis
- Canary rollouts
- Shadow traffic evaluation
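What such a pre-propagation gate might look like, as a rough Rust sketch. The limits, types, and error cases are invented for illustration and are not Cloudflare’s actual thresholds or tooling:

```rust
use std::collections::HashSet;

// Invented limits; real thresholds would be derived from the proxy's
// memory-preallocation constraints.
const MAX_FEATURES: usize = 200;
const MAX_FILE_BYTES: usize = 5 * 1024 * 1024;

#[derive(Debug)]
enum ValidationError {
    TooLarge(usize),
    TooManyFeatures(usize),
    DuplicateFeature(String),
}

/// Validate a candidate feature file before it is allowed to propagate.
/// `raw` is the serialized file; `feature_names` is its parsed feature list.
fn validate_feature_file(raw: &[u8], feature_names: &[String]) -> Result<(), ValidationError> {
    if raw.len() > MAX_FILE_BYTES {
        return Err(ValidationError::TooLarge(raw.len()));
    }
    if feature_names.len() > MAX_FEATURES {
        return Err(ValidationError::TooManyFeatures(feature_names.len()));
    }
    let mut seen = HashSet::new();
    for name in feature_names {
        if !seen.insert(name) {
            return Err(ValidationError::DuplicateFeature(name.clone()));
        }
    }
    Ok(())
}

fn main() {
    // A feature list like the one produced on November 18 is rejected here,
    // before it ever reaches a proxy node.
    let names: Vec<String> = (0..250).map(|i| format!("feature_{i}")).collect();
    println!("{:?}", validate_feature_file(&[0u8; 1024], &names));
}
```

Combined with a canary rollout, a gate like this would turn a bad file into a rejected build artifact rather than a global event.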
10.2 Decoupling Bot Management from Critical Path
Future design may include:
- Independent scoring service
- Graceful fallback when ML modules fail (see the sketch below)
- Less reliance on proxy-embedded ML logic
- Error tolerance instead of panic behavior
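A minimal sketch of what graceful fallback could look like if the scoring module reported errors instead of panicking. Everything here, including the neutral score, is a made-up illustration rather than Cloudflare’s design:

```rust
const MAX_FEATURES: usize = 200;
const NEUTRAL_SCORE: u8 = 50; // placeholder "no opinion" score

/// Hypothetical bot-scoring module that can fail to initialize, for example
/// because its feature file violates an assumption.
struct BotModule {
    features: Vec<String>,
}

impl BotModule {
    /// Returns an error rather than panicking when the feature file is bad.
    fn load(features: Vec<String>) -> Result<Self, String> {
        if features.len() > MAX_FEATURES {
            return Err(format!("too many features: {}", features.len()));
        }
        Ok(Self { features })
    }

    fn score(&self, _request: &str) -> u8 {
        // Stand-in for real scoring logic.
        (self.features.len() % 100) as u8
    }
}

/// The proxy degrades to a neutral score when the module is unavailable,
/// so requests keep flowing instead of turning into 5xx errors.
fn score_request(module: &Result<BotModule, String>, request: &str) -> u8 {
    match module {
        Ok(m) => m.score(request),
        Err(_) => NEUTRAL_SCORE,
    }
}

fn main() {
    let broken: Vec<String> = (0..400).map(|i| format!("f{i}")).collect();
    let module = BotModule::load(broken); // fails, but does not crash anything
    assert_eq!(score_request(&module, "GET /"), NEUTRAL_SCORE);
}
```

The trade-off is that traffic is scored less accurately while the module is down, which is usually preferable to not serving it at all.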
10.3 Treat Internal Files as “Untrusted Inputs”
Cloudflare may apply Zero-Trust principles to its own automated outputs:
- No file is assumed safe
- All must be validated
- Unexpected growth is treated as attack-like behavior
This would mark a shift from trust-based internal automation to zero-trust configuration pipelines.
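One concrete expression of that principle is to bound how much of an internally generated file the consumer will even read, treating unexpected growth as a reason to refuse the file. A small Rust sketch, with an invented size ceiling:

```rust
use std::io::{self, Read};

// Hypothetical hard ceiling on how much of an internally generated file the
// consumer will read, in bytes.
const MAX_CONFIG_BYTES: u64 = 10 * 1024 * 1024;

/// Read an internally generated configuration file as if it were untrusted
/// input: never read more than the ceiling, and treat unexpected growth as a
/// hard failure rather than something to accommodate.
fn read_untrusted_config<R: Read>(source: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // take() caps the read; anything beyond the ceiling is treated as suspicious.
    source.take(MAX_CONFIG_BYTES + 1).read_to_end(&mut buf)?;
    if buf.len() as u64 > MAX_CONFIG_BYTES {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "config file exceeds expected size; refusing to load",
        ));
    }
    Ok(buf)
}

fn main() {
    // A 1 KiB in-memory "file" passes; anything past the ceiling would not.
    let small = vec![0u8; 1024];
    assert!(read_untrusted_config(&small[..]).is_ok());
}
```

The same posture applies to schema and content checks; size is simply the cheapest signal to act on.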
11. The Broader Meaning for the Internet Industry
The Cloudflare outage should be seen as a warning for all cloud providers and platform engineers:
1. The biggest threats now come from inside the system—not outside.
Automation amplifies mistakes faster than humans can contain them.
2. High-coupling architectures are efficient—and dangerously fragile.
Cloudflare optimized for speed, but at the cost of resilience.
3. Internal data pipelines are the new supply chain.
The industry focuses heavily on external supply-chain security but often forgets that internally generated content can be just as harmful.
4. Assumptions will break.
The question is whether systems can survive the moment they do.
This outage is not a Cloudflare failure alone—it is a case study for modern distributed systems.
12. Final Summary: The Entire Outage in 10 Key Points (80/20 Insight)
- A ClickHouse permission change altered metadata query results.
- Duplicate metadata inflated a critical ML feature file.
- The file overflowed the proxy’s built-in limits.
- Proxy engines (FL/FL2) panicked and returned 5xx globally.
- KV, Access, Turnstile, and Dashboard collapsed as collateral damage.
- Cyclic failures every 5 minutes misled responders into suspecting a DDoS.
- The real cause was a broken assumption—not an external actor.
- Cloudflare’s global uniform architecture magnified the blast radius.
- AWS/GCP avoid similar failures through regional isolation.
- The event highlights the growing risk of internal automation failures.

