How a Single Permission Change Nearly Shut Down the Internet
A Forensic Analysis of the Cloudflare November 18 Outage (Technical Deep Dive)
Stance Declaration
This article includes analytical judgment about Cloudflare’s architecture, operational processes, and systemic risks. These judgments are based solely on the official incident report provided and should be considered professional interpretation—not definitive statements of fact.
1. Introduction: An Internet-Scale Outage That Was Not an Attack
On November 18, 2025, Cloudflare—the backbone for a significant portion of the global Internet—experienced its most severe outage since 2019. Websites across the world began returning HTTP 5xx errors, authentication systems failed, dashboards could not load, and multiple Cloudflare services simultaneously degraded.
At first glance, the incident looked eerily similar to a massive DDoS attack or a coordinated supply-chain compromise.
Surprisingly, Cloudflare’s postmortem clarified:
No attack was involved.
The root cause was a database permission change that triggered an unexpected explosion in the size of a key machine-learning “feature file”—a shared input used by Cloudflare’s Bot Management module.
This seemingly harmless internal change caused:
- Cloudflare’s core proxy engines (FL and FL2) to crash
- KV storage to fail
- Access authentication to break
- Turnstile verification to stop loading
- The Cloudflare Dashboard to become unreachable
- Global traffic flow to repeatedly collapse and recover
And all of this because a single file grew larger than expected.
This is an anatomy of that failure—and what it means for the future of Internet infrastructure.
2. The Global Timeline: From 11:05 to 17:06 UTC
The outage unfolded over roughly six hours, with a chain of escalating failures.
```mermaid
timeline
    title Cloudflare November 18 Outage Timeline
    11:05 : ClickHouse permission change deployed
    11:20 : Bot feature file begins generating incorrect entries
    11:28 : First global HTTP 5xx errors observed
    11:32-13:05 : KV and Access services show massive degradation
    13:05 : Emergency bypass for KV and Access deployed
    13:37 : Decision made to roll back Bot feature file
    14:24 : Feature file generation halted
    14:30 : Core traffic flow returns to normal
    17:06 : All Cloudflare systems fully restored
```
This timeline highlights one uncomfortable truth:
The entire outage stemmed from a single systemic assumption that broke at 11:05.
3. The Root Cause: A Permission Change with Unintended Consequences
Cloudflare uses ClickHouse to generate machine-learning “feature files” consumed by its Bot Management system. These files are:
- Updated every few minutes
- Distributed globally
- Consumed by every Cloudflare proxy node
The problem started when Cloudflare updated database permissions so that internal queries could also see the underlying `r0` tables—not just the `default` database.
Under normal conditions, a query such as:
```sql
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
```
would only return metadata for tables in the `default` database.
After the permission change:
- The same query also returned metadata for `r0` (the underlying tables)
- Resulting in duplicate or redundant feature rows
- Doubling (or more) the number of ML features detected
- Causing the feature file to exceed internal limits baked into Cloudflare’s proxy engine
Cloudflare’s ML module enforces a strict cap (~200 features), primarily for memory preallocation and runtime safety.
Once the new oversized file propagated:
The Bot Management module crashed.
The proxy engine panicked.
All traffic depending on FL and FL2 began failing.
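Under the assumption that the duplicate rows differed only in their source database, a defensive consumer could have collapsed them before counting features. The following is a minimal Rust sketch of that idea; the types, field names, and data are hypothetical and are not taken from Cloudflare’s code:

```rust
use std::collections::HashSet;

/// One row of column metadata, as returned by a query against `system.columns`.
/// The struct and field names are illustrative only.
struct ColumnRow {
    database: String, // e.g. "default" or "r0"
    table: String,
    name: String,
}

/// Collapse metadata rows to unique (table, column) pairs, so the same column
/// reported from both the `default` views and the underlying `r0` tables is
/// counted once instead of twice.
fn unique_features(rows: &[ColumnRow]) -> Vec<String> {
    let mut seen = HashSet::new();
    rows.iter()
        .filter(|r| seen.insert((r.table.clone(), r.name.clone())))
        .map(|r| r.name.clone())
        .collect()
}

fn main() {
    let rows = vec![
        ColumnRow { database: "default".into(), table: "http_requests_features".into(), name: "feat_a".into() },
        ColumnRow { database: "r0".into(), table: "http_requests_features".into(), name: "feat_a".into() },
        ColumnRow { database: "r0".into(), table: "http_requests_features".into(), name: "feat_b".into() },
    ];
    for r in &rows {
        println!("row from {}: {}.{}", r.database, r.table, r.name);
    }
    // Duplicate r0 rows no longer inflate the count: 2 features, not 3 rows.
    assert_eq!(unique_features(&rows).len(), 2);
}
```

Whether the right fix is deduplication in the consumer, a database filter in the query itself, or both is a design question; the point is that the consumer had no such guard.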
4. Architectural Insight: Why a Single File Could Break the Entire Network
Cloudflare’s strength—its unified global proxy architecture—is also its Achilles’ heel.
A simplified view of the Cloudflare request pipeline looks like this:
```mermaid
flowchart TD
    A[Client Request] --> B[TLS/HTTP Termination]
    B --> C["Core Proxy (FL/FL2)"]
    C --> D["Security Modules: WAF, Bot, DDoS, Firewalls"]
    D --> E["Cache / Workers / R2 / Origin Fetch"]
```
Because the Bot Management module sits inside the critical path:
- Every request passes through it
- Every Cloudflare node loads the same feature file
- Any invalid file becomes a global single point of failure
No fallback.
No graceful degradation.
No isolation between modules.
This is why:
A single malformed internal file caused a global outage.
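To make the coupling concrete, here is a deliberately oversimplified Rust sketch of a hot path in which bot management is a mandatory stage. All names and types are hypothetical; the sketch only illustrates how an error in one required module becomes a failed request:

```rust
// A simplified model of a proxy hot path where every module, including bot
// management, is a mandatory stage. All names are hypothetical.
struct Request;
struct Response(&'static str);
#[derive(Debug)]
struct ProxyError(&'static str);

fn terminate_tls(_req: &Request) -> Result<(), ProxyError> {
    Ok(())
}

// If the bot module cannot load its feature file, the error propagates
// straight up with `?` and the request is lost; there is no path around it.
fn bot_management(_conn: &()) -> Result<u8, ProxyError> {
    Err(ProxyError("feature file invalid"))
}

fn security_filters(_conn: &(), _bot_score: u8) -> Result<Request, ProxyError> {
    Ok(Request)
}

fn fetch_from_cache_or_origin(_req: Request) -> Result<Response, ProxyError> {
    Ok(Response("200 OK"))
}

fn handle(req: &Request) -> Result<Response, ProxyError> {
    let conn = terminate_tls(req)?;
    let score = bot_management(&conn)?; // critical-path dependency
    let filtered = security_filters(&conn, score)?;
    fetch_from_cache_or_origin(filtered)
}

fn main() {
    match handle(&Request) {
        Ok(resp) => println!("served: {}", resp.0),
        // With the bot module failing, every request fails: the 5xx pattern
        // described above.
        Err(e) => println!("request failed: {}", e.0),
    }
}
```

The real proxy is far more involved, but the shape of the dependency is the same: there is no branch in which a request is served without the bot stage succeeding.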
5. The Confusing Symptom: Why the Outage Looked Like a DDoS Attack
One of the most important—and misleading—elements of this outage was the cyclical pattern of failures.
Cloudflare observed:
- Traffic failing for several minutes
- Then mysteriously recovering
- Then failing again
- Repeating every 5 minutes
This behavior closely resembled:
- Adaptive botnet bursts
- Pulsing DDoS attacks
- Traffic exhaustion cycles
The real reason was far more mundane:
- ClickHouse nodes were at different stages of the permission rollout
- Some nodes generated correct feature files
- Others generated broken ones
- Each new file propagated globally
- Causing the proxy to alternately crash → recover → crash again
Even worse:
- Cloudflare’s own Status Page coincidentally went down
- Leading internal responders to suspect a multi-vector attack
This misdirection consumed critical investigation time.
6. The Cascade Failure: KV, Access, Turnstile, Dashboard
Once the core proxy engines failed, multiple Cloudflare subsystems collapsed.
6.1 Workers KV
KV depends heavily on Cloudflare’s proxy tier.
- Proxy panic → KV gateway unreachable
- Result: massive 5xx error spikes
6.2 Cloudflare Access
Access authentication relies on KV and the proxy:
- KV failures prevented Access session lookups
- New logins failed
- Existing sessions stayed valid but could not be refreshed
6.3 Turnstile
Turnstile (CAPTCHA alternative):
- Could not load because Access and the proxy were unstable
- This broke Cloudflare Dashboard logins
6.4 Dashboard
The Cloudflare Dashboard:
- Used Turnstile for login
- Used KV for config operations
- Became nearly unreachable
This is a textbook example of horizontal dependency collapse:
```mermaid
flowchart TD
    A[Bot Feature File Error] --> B[Proxy Panic]
    B --> C[KV Gateway Fails]
    C --> D[Access Auth Fails]
    D --> E[Turnstile Unusable]
    E --> F[Dashboard Unreachable]
```
A single failing component cascaded across the entire Cloudflare ecosystem.
7. Technical Deep Dive: How the Feature File “Blew Up”
Before the permission change:
- ~60 ML features
- Fit comfortably inside the proxy’s memory limits
After the permission change:
- Duplicate `r0` metadata rows doubled the feature count
- Some features appeared more than twice
- The total count exceeded the hard limit (~200)
Cloudflare’s internal code (simplified):
```rust
if features.len() > MAX_FEATURES {
    panic!("too many features");
}
```
This is:
- Valid Rust
- Correct behavior under normal circumstances
- Extremely dangerous when input assumptions are violated
This single panic:
- Crashed a core proxy thread
- Produced 5xx errors
- Disrupted all downstream products
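For readers less familiar with Rust panics, the following self-contained sketch (hypothetical names, not Cloudflare’s code) shows both halves of the problem: an over-limit feature set triggers a panic, and unless the caller explicitly catches the unwind, that panic takes the request with it. `std::panic::catch_unwind` is one way to contain it:

```rust
use std::panic::{self, AssertUnwindSafe};

const MAX_FEATURES: usize = 200; // illustrative cap mirroring the limit described above

/// Simplified stand-in for the module that loads the feature file.
/// It panics when its preallocation assumption is violated.
fn load_features(features: Vec<String>) -> usize {
    if features.len() > MAX_FEATURES {
        panic!("too many features");
    }
    features.len()
}

/// A caller that contains the panic: the unwind is caught and converted into
/// an error, so the request can continue in a degraded mode instead of dying.
fn handle_request(features: Vec<String>) -> Result<usize, String> {
    panic::catch_unwind(AssertUnwindSafe(|| load_features(features)))
        .map_err(|_| "bot module unavailable; serving without a bot score".to_string())
}

fn main() {
    // An oversized feature set no longer aborts the whole pipeline.
    // (The default panic hook will still print the panic message to stderr.)
    let oversized: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();
    match handle_request(oversized) {
        Ok(n) => println!("loaded {n} features"),
        Err(e) => println!("degraded mode: {e}"),
    }
}
```

Cloudflare’s own remediation may look nothing like this; the sketch only shows that the panic itself, not the oversized file, is what turned a bad input into a 5xx.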
8. The Real Issue: A Broken System Assumption, Not a Bug
Cloudflare’s incident reveals a key engineering truth:
Most large-scale outages are not caused by bugs.
They are caused by assumptions that silently stop being true.
Cloudflare assumed:
- Permission changes would not alter metadata query output
- Feature files would always be small
- Internal files did not require strict validation
- The ML module would never receive invalid data
- The proxy would never load a malformed file
Once any of these assumptions breaks, the system behaves unpredictably.
This is a classic example of systemic fragility in a high-coupling architecture.
9. Industry Comparison: Why AWS and Google Cloud Rarely Suffer Global Failures
Cloudflare’s architecture is unique:
| Platform | Architectural Style | Failure Propagation |
|---|---|---|
| Cloudflare | Single global proxy layer; unified config | Local fault → global outage |
| AWS | Region isolation; AZ sandboxing | Region fault → limited blast radius |
| GCP | Service-level isolation; strong canary culture | Misconfiguration → localized impact |
Cloudflare’s advantages:
- Consistent performance
- Global propagation in minutes
- Unified rules and execution
Cloudflare’s risk:
- Misconfigurations propagate at the speed of automation
- Internal files are global dependencies
- Module coupling creates global single points of failure
This outage exposes the downside of Cloudflare’s “global uniformity” model.
10. Forward-Looking Predictions: What Cloudflare Is Likely to Change
(Speculative but logically justified)
10.1 Sandbox and Validate Feature Files
Cloudflare will likely introduce:
- Schema validation
- Size and duplication checks (sketched below)
- Static analysis
- Canary rollouts
- Shadow traffic evaluation
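What such a pre-propagation gate might look like, as a rough Rust sketch. The limits, types, and error cases are invented for illustration and are not Cloudflare’s actual thresholds or tooling:

```rust
use std::collections::HashSet;

// Invented limits; real thresholds would be derived from the proxy's
// memory-preallocation constraints.
const MAX_FEATURES: usize = 200;
const MAX_FILE_BYTES: usize = 5 * 1024 * 1024;

#[derive(Debug)]
enum ValidationError {
    TooLarge(usize),
    TooManyFeatures(usize),
    DuplicateFeature(String),
}

/// Validate a candidate feature file before it is allowed to propagate.
/// `raw` is the serialized file; `feature_names` is its parsed feature list.
fn validate_feature_file(raw: &[u8], feature_names: &[String]) -> Result<(), ValidationError> {
    if raw.len() > MAX_FILE_BYTES {
        return Err(ValidationError::TooLarge(raw.len()));
    }
    if feature_names.len() > MAX_FEATURES {
        return Err(ValidationError::TooManyFeatures(feature_names.len()));
    }
    let mut seen = HashSet::new();
    for name in feature_names {
        if !seen.insert(name) {
            return Err(ValidationError::DuplicateFeature(name.clone()));
        }
    }
    Ok(())
}

fn main() {
    // A feature list like the one produced on November 18 is rejected here,
    // before it ever reaches a proxy node.
    let names: Vec<String> = (0..250).map(|i| format!("feature_{i}")).collect();
    println!("{:?}", validate_feature_file(&[0u8; 1024], &names));
}
```

Combined with a canary rollout, a gate like this would turn a bad file into a rejected build artifact rather than a global event.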
10.2 Decoupling Bot Management from Critical Path
Future design may include:
- Independent scoring service
- Graceful fallback when ML modules fail (see the sketch below)
- Less reliance on proxy-embedded ML logic
- Error tolerance instead of panic behavior
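A minimal sketch of what graceful fallback could look like if the scoring module reported errors instead of panicking. Everything here, including the neutral score, is a made-up illustration rather than Cloudflare’s design:

```rust
const MAX_FEATURES: usize = 200;
const NEUTRAL_SCORE: u8 = 50; // placeholder "no opinion" score

/// Hypothetical bot-scoring module that can fail to initialize, for example
/// because its feature file violates an assumption.
struct BotModule {
    features: Vec<String>,
}

impl BotModule {
    /// Returns an error rather than panicking when the feature file is bad.
    fn load(features: Vec<String>) -> Result<Self, String> {
        if features.len() > MAX_FEATURES {
            return Err(format!("too many features: {}", features.len()));
        }
        Ok(Self { features })
    }

    fn score(&self, _request: &str) -> u8 {
        // Stand-in for real scoring logic.
        (self.features.len() % 100) as u8
    }
}

/// The proxy degrades to a neutral score when the module is unavailable,
/// so requests keep flowing instead of turning into 5xx errors.
fn score_request(module: &Result<BotModule, String>, request: &str) -> u8 {
    match module {
        Ok(m) => m.score(request),
        Err(_) => NEUTRAL_SCORE,
    }
}

fn main() {
    let broken: Vec<String> = (0..400).map(|i| format!("f{i}")).collect();
    let module = BotModule::load(broken); // fails, but does not crash anything
    assert_eq!(score_request(&module, "GET /"), NEUTRAL_SCORE);
}
```

The trade-off is that traffic is scored less accurately while the module is down, which is usually preferable to not serving it at all.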
10.3 Treat Internal Files as “Untrusted Inputs”
Cloudflare may apply Zero-Trust principles to its own automated outputs:
- No file is assumed safe
- All must be validated
- Unexpected growth is treated as attack-like behavior
This would mark a shift from trust-based internal automation to zero-trust configuration pipelines.
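One concrete expression of that principle is to bound how much of an internally generated file the consumer will even read, treating unexpected growth as a reason to refuse the file. A small Rust sketch, with an invented size ceiling:

```rust
use std::io::{self, Read};

// Hypothetical hard ceiling on how much of an internally generated file the
// consumer will read, in bytes.
const MAX_CONFIG_BYTES: u64 = 10 * 1024 * 1024;

/// Read an internally generated configuration file as if it were untrusted
/// input: never read more than the ceiling, and treat unexpected growth as a
/// hard failure rather than something to accommodate.
fn read_untrusted_config<R: Read>(source: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // take() caps the read; anything beyond the ceiling is treated as suspicious.
    source.take(MAX_CONFIG_BYTES + 1).read_to_end(&mut buf)?;
    if buf.len() as u64 > MAX_CONFIG_BYTES {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "config file exceeds expected size; refusing to load",
        ));
    }
    Ok(buf)
}

fn main() {
    // A 1 KiB in-memory "file" passes; anything past the ceiling would not.
    let small = vec![0u8; 1024];
    assert!(read_untrusted_config(&small[..]).is_ok());
}
```

The same posture applies to schema and content checks; size is simply the cheapest signal to act on.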
11. The Broader Meaning for the Internet Industry
The Cloudflare outage should be seen as a warning for all cloud providers and platform engineers:
1. The biggest threats now come from inside the system—not outside.
Automation amplifies mistakes faster than humans can contain them.
2. High-coupling architectures are efficient—and dangerously fragile.
Cloudflare optimized for speed, but at the cost of resilience.
3. Internal data pipelines are the new supply chain.
The industry focuses heavily on external supply-chain security but often forgets that internally generated content can be just as harmful.
4. Assumptions will break.
The question is whether systems can survive the moment they do.
This outage is not a Cloudflare failure alone—it is a case study for modern distributed systems.
12. Final Summary: The Entire Outage in 10 Key Points (80/20 Insight)
- A ClickHouse permission change altered metadata query results.
- Duplicate metadata inflated a critical ML feature file.
- The file overflowed the proxy’s built-in limits.
- Proxy engines (FL/FL2) panicked and returned 5xx globally.
- KV, Access, Turnstile, and Dashboard collapsed as collateral damage.
- Cyclic failures every 5 minutes misled responders into suspecting a DDoS.
- The real cause was a broken assumption—not an external actor.
- Cloudflare’s global uniform architecture magnified the blast radius.
- AWS/GCP avoid similar failures through regional isolation.
- The event highlights the growing risk of internal automation failures.

