How to reliably control external crawlers and reduce crawl load — practical guide with nginx rate-limiting
Direct answer: Use robots.txt for cooperative guidance, but rely on server-side controls (nginx) for immediate, reliable protection. This article explains why robots.txt sometimes doesn’t work, how to diagnose the problem, and how to implement a safe, production-ready, nginx-based, per-user-agent rate limiting strategy that preserves access while protecting your servers.
What this article answers
Central question: How can I control aggressive crawlers (for example AhrefsBot) when robots.txt changes don’t reduce crawl traffic, and what practical nginx configuration will reliably slow them down without disrupting normal users?
Short summary: this article walks through pragmatic diagnosis steps, shows robots.txt examples, explains why changes may not take effect immediately, and provides two nginx strategies: an emergency reject (fast) method and a smooth, production-friendly limit_req approach that is recommended for most situations. Full configs, testing commands, monitoring tips, and a compact action checklist are included.
Table of contents
- Overview and quick answer
- How robots.txt and Crawl-delay work (practical view)
- Why edits to robots.txt sometimes appear ineffective — diagnosis checklist
- Emergency option: immediate reject with Retry-After (nginx snippet)
- Recommended option: smooth rate limiting with limit_req (complete nginx implementation)
- Parameter tuning, testing and monitoring commands
- CDN and proxy interaction considerations
- Application scenarios and operational examples
- Author reflection: lessons learned from hands-on ops work
- Action checklist / Implementation steps
- One-page overview
- FAQ (short Q&A)
Overview and quick answer
Direct answer: robots.txt is your polite request — useful, but not always honored or instantaneous. For immediate and reliable control, configure server-side limits. Use limit_req in nginx to throttle targeted user-agents (e.g., AhrefsBot) by client IP, and use a 429/503 response with Retry-After for short emergency throttling.
Central question: Can I stop a crawler quickly if it’s overloading my server?
Yes — by combining a robots.txt guidance file and server-side nginx controls. The guidance file is the long-term cooperative mechanism; the nginx controls are the enforcement layer you rely on for immediate protection.
How robots.txt and Crawl-delay work (practical view)
Direct answer: robots.txt sits at the site root and tells cooperative crawlers which paths they may access and, via Crawl-delay, how long to wait between requests, but support for Crawl-delay is inconsistent across crawlers.
Central question: What should I expect when I put Crawl-delay in my robots.txt?
Short summary: include Crawl-delay for crawlers that respect it (many third-party crawlers do), but expect a lag before it takes effect and variable support. Use it as a friendly guidance layer rather than a hard enforcement mechanism.
Core points (from our discussion)
- robots.txt must be placed at the site root, e.g. https://yourdomain/robots.txt.
- The format is simple; you can target specific user-agents:
  User-agent: AhrefsBot
  Crawl-delay: 5
- Many crawlers (Ahrefs, Bing, Yandex) typically respect Crawl-delay. Googlebot does not honor Crawl-delay.
- Crawl-delay is non-standard (not uniformly implemented). Relying on it alone can be insufficient.
Operational example
A working robots.txt excerpt used in a real incident:
User-agent: *
Allow: /
User-agent: AhrefsBot
Crawl-delay: 5
User-agent: AhrefsSiteAudit
Crawl-delay: 5
This tells Ahrefs’ crawlers to wait 5 seconds between requests, while still allowing all user agents general access. In practice, you should confirm that this file is publicly accessible and not cached by a CDN.
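One quick way to check both points at once is to fetch the live file and inspect the response headers. This is a minimal sketch; the caching headers shown (Age, X-Cache) are assumptions that vary by CDN provider:
# Confirm robots.txt is reachable and see whether a cached copy is being served
curl -sI https://yourdomain/robots.txt | grep -Ei '^HTTP/|cache-control|^age:|x-cache'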
Why edits to robots.txt sometimes appear ineffective — diagnosis checklist
Direct answer: Edits may not take effect because of CDN caching, a wrong file location, a mismatch between logged user-agent strings and your rules, or simply because the crawler doesn’t honor Crawl-delay. Always check logs and CDN caches first.
Central question: I changed robots.txt but traffic from a crawler didn’t drop — what should I check?
Short summary: confirm accessibility, rule matching, CDN cache state, and the actual UA strings in your logs; if immediate relief is needed, apply server-level throttling.
Step-by-step diagnosis checklist
- Confirm public availability
  - curl -I https://yourdomain/robots.txt — ensure it returns 200 and the new content.
- Check for CDN caching
  - If you use a CDN, robots.txt may be cached. Purge or update the CDN copy.
- Review access logs
  - Inspect /var/log/nginx/access.log or your web server logs to see the actual user-agent strings hitting your server:
    tail -f /var/log/nginx/access.log | grep -E "AhrefsBot|AhrefsSiteAudit"
- Confirm the user-agent match
  - Some crawlers append versions or other tokens to their UA string; make sure your User-agent rule matches the actual UA (a log aggregation sketch follows this list).
- Allow for adjustment latency
  - Some crawlers may take hours or up to 1–2 days to change behavior after robots.txt edits.
- If urgent, apply server-level limits
  - Use nginx to immediately control request rates.
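If you are unsure which user-agent strings are actually hitting the server, a quick aggregation over the access log helps; this sketch assumes the default combined log format, where the user agent is the sixth double-quoted field:
# Top 20 user agents by request count in the current access log
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20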
Common pitfalls
- Placing robots.txt in a subfolder or behind authentication.
- Not purging CDN caches.
- Matching User-agent incorrectly (e.g., a case-sensitive or exact match when the crawler uses varying tokens).
- Expecting robots.txt to be an enforcement mechanism rather than a cooperative guideline.
Emergency option: immediate reject with Retry-After (fast, short-term)
Direct answer: For urgent overloads, return a 429 or 503 with a Retry-After header to targeted user-agents. This gets immediate respect from properly behaving crawlers.
Central question: How can I immediately make a crawler back off while I stabilize the server?
Short summary: use an nginx rule that detects the crawler UA and returns 429 or 503 with Retry-After, which many well-behaved bots will honor.
Why use this
- Quick to implement and immediately reduces traffic from compliant crawlers.
- Appropriate for short maintenance windows or sudden spikes.
- Not recommended as a long-term solution because it temporarily blocks crawling.
Example nginx snippet (emergency)
Place the map in your http { ... } context and the user-agent check inside a location block in your server configuration:
http {
    map $http_user_agent $is_ahrefs {
        default 0;
        "~*AhrefsBot" 1;
        "~*AhrefsSiteAudit" 1;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            if ($is_ahrefs) {
                # "always" is required so the header is attached to a 429 response;
                # add_header is only valid in "if in location", hence the check lives here.
                add_header Retry-After "120" always;
                return 429;
            }
            proxy_pass http://backend;
        }
    }
}
Notes
- 429 Too Many Requests is explicit and indicates rate limiting.
- 503 Service Unavailable + Retry-After is also commonly respected and emphasizes a temporary outage.
- Use this only for short durations; long-term blocking can reduce legitimate indexing and monitoring.
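To confirm the rule behaves as intended before relying on it, a quick curl check (against the hypothetical example.com host from the snippet) should show the 429 status and the Retry-After header for a matching UA:
# Expect a 429 status line and a Retry-After header for the throttled UA
curl -sI -A "AhrefsBot" http://example.com/ | grep -Ei '^HTTP/|retry-after'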
Recommended option: smooth rate limiting with limit_req (detailed guide)
Direct answer: Implement a limit_req_zone and an internal rewrite for targeted user-agents; this allows controlled, per-IP throttling with burst tolerance and minimizes collateral impact.
Central question: How do I implement a production-grade, smooth throttle for specific crawlers in nginx?
Short summary: use map to detect the UA, limit_req_zone to define a token bucket, and an internal location for the limited path. This preserves access while smoothing load.
Why this approach
- Smooth handling: allows short bursts without outright blocking.
- Targeted: only affects the specified UA strings.
- Configurable: tune rate and burst to balance load and crawler coverage.
Full nginx configuration (copy-and-paste ready)
Put the limit_req_zone and map directives in your http { ... } context, and the server block in your site configuration:
# http { ... } scope
map $http_user_agent $is_ahrefs {
    default 0;
    "~*AhrefsBot" 1;
    "~*AhrefsSiteAudit" 1;
}

limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=1r/s;
limit_req_zone $binary_remote_addr zone=normal_zone:10m rate=20r/s;

# server { ... } scope
server {
    listen 80;
    server_name example.com;

    location / {
        if ($is_ahrefs) {
            # "last" re-runs location matching so the request lands in the
            # internal throttled location below ("break" would keep it here).
            rewrite ^(.*)$ /__ahrefs_throttle$1 last;
        }
        limit_req zone=normal_zone burst=30 nodelay;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_pass http://backend;
    }

    location ~ ^/__ahrefs_throttle(.*) {
        internal;
        limit_req zone=ahrefs_zone burst=5 nodelay;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        # Forward the original path and query string to the backend.
        proxy_pass http://backend$1$is_args$args;
    }

    # Simple test endpoint.
    location = /__throttle_test {
        return 200 "ok\n";
    }
}
Key configuration elements explained
- map $http_user_agent $is_ahrefs — sets a flag when the request UA matches a pattern.
- limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=1r/s — defines a token bucket for per-IP limiting with an average rate of 1 request/sec.
- rewrite ... /__ahrefs_throttle$1 last — internal rewrite to a special location with strict limits; the last flag re-runs location matching so the throttled location is selected.
- limit_req zone=ahrefs_zone burst=5 nodelay — allows small bursts of extra requests, then enforces limits.
- internal — prevents external clients from requesting the throttled location directly.
- proxy_pass http://backend$1$is_args$args — forwards the original path and query string to the backend.
Behavior and defaults
- When the limit (rate plus burst) is exceeded and nodelay is present, nginx rejects the request with 503 by default, signalling that it was throttled.
- You can remove nodelay to have nginx delay (queue) requests within the burst rather than serve them immediately; requests beyond the burst are still rejected (see the variant sketch after this list).
- burst allows short spikes without immediate rejection; tune it carefully.
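If you would rather have the throttled location delay excess requests and reject with 429 instead of the default 503, a variant might look like the sketch below (based on the configuration above; limit_req_status requires nginx 1.3.15 or later):
location ~ ^/__ahrefs_throttle(.*) {
    internal;
    # No nodelay: requests within the burst are delayed to conform to the rate.
    limit_req zone=ahrefs_zone burst=5;
    # Reject with 429 instead of the default 503 once the burst is exhausted.
    limit_req_status 429;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $host;
    proxy_pass http://backend$1$is_args$args;
}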
Parameter tuning: practical guidance
Direct answer: Start conservatively (e.g., rate=1r/s, burst=5) and monitor logs; adjust rate and burst based on observed bot behavior and server capacity.
Central question: What values should I set for rate and burst initially, and how do I change them safely?
Short summary: use 1r/s as a practical starting point; for more conservative throttling use rate=12r/m (nginx expresses rates below one request per second in requests per minute), which is roughly one request every 5 seconds. Increase burst if you expect short, allowable surges.
Practical tuning tips (from the discussion)
- Starting values
  - rate=1r/s with burst=5 is a common starting point.
  - To be more conservative, rate=12r/m approximates one request every 5 seconds (nginx does not accept fractional r/s values; see the zone definition sketch after this list).
- Burst behavior
  - burst=5 allows occasional spikes without rejecting immediately.
  - A larger burst accepts more spikes; a smaller burst tightens control.
- Zone size
  - zone=ahrefs_zone:10m typically suffices; increase it if you have many concurrent client IPs.
- nodelay
  - With nodelay, requests beyond the burst are rejected immediately (503).
  - Without nodelay, excess requests are delayed (queued) instead.
- Monitoring adjustments
  - Start with the conservative settings and watch the logs for false positives (legitimate clients being throttled).
  - Adjust if you see too many 503 responses or if bot traffic still overloads the server.
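As a concrete illustration of the conservative alternative, the zone definition would look like the sketch below (it reuses the ahrefs_zone name from the configuration above; adjust the name and size to your setup):
# Roughly one request every 5 seconds per client IP; sub-1r/s rates are written in r/m
limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=12r/m;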
Testing and monitoring: concrete commands and checks
Direct answer: Use nginx -t and then systemctl reload nginx, test with curl using different User-Agent strings, and monitor access logs filtered for the relevant user-agents.
Central question: How do I verify the configuration works and monitor its effect?
Short summary: validate nginx configs, simulate requests, and use live log monitoring to observe user-agent behavior and throttling responses.
Commands to run
- Test and reload nginx:
  nginx -t && systemctl reload nginx
- Simulate requests:
  # Simulate a normal user
  curl -I -A "Mozilla/5.0" http://example.com/
  # Simulate the Ahrefs user-agent
  curl -I -A "AhrefsBot" http://example.com/somepath
- Watch the logs for targeted user-agent entries:
  tail -f /var/log/nginx/access.log | grep -E "AhrefsBot|AhrefsSiteAudit"
- Find 503 responses (indicative of throttle rejections):
  grep " 503 " /var/log/nginx/access.log | head
What to look for
- Confirm that requests with the Ahrefs UA are routed to the throttled internal location.
- Look for 503 responses when throttling hits the limit with nodelay (a combined count command follows this list).
- Verify that normal UA requests are unaffected.
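A small follow-up check combines the two log filters above to show how many 503 rejections are actually attributable to the targeted crawlers (a sketch, assuming the default combined log format):
# Count 503 responses whose user agent matches the throttled crawlers
grep " 503 " /var/log/nginx/access.log | grep -cE "AhrefsBot|AhrefsSiteAudit"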
CDN and reverse proxy considerations
Direct answer: If you place nginx behind a CDN or load balancer, make sure to recover the real client IP (realip module) so per-IP rate limiting works properly.
Central question: My nginx is behind a CDN — how does that affect the rate limiting?
Short summary: without the real client IP, rate limits might be applied per CDN node IP rather than per client, so configure real IP headers and trust only known CDN IP ranges.
Minimal real IP configuration
Place this in http { ... }:
real_ip_header X-Forwarded-For;
set_real_ip_from 0.0.0.0/0;
Caution: in production, replace 0.0.0.0/0 with explicit CDN IP ranges to avoid trusting unverified sources.
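For reference, a production-leaning version of the same block might look like this sketch; the ranges shown are documentation placeholders (TEST-NET), so substitute your CDN provider's published IP ranges:
# Trust only known CDN ranges when resolving the real client IP
set_real_ip_from 203.0.113.0/24;   # placeholder: replace with your CDN's ranges
set_real_ip_from 198.51.100.0/24;  # placeholder
real_ip_header X-Forwarded-For;
real_ip_recursive on;              # walk past trusted proxies in X-Forwarded-For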
Practical effect
- With real IP configured, $binary_remote_addr refers to the true client IP, making the per-IP limiting effective.
- Without it, throttling may appear ineffective or over-aggressive because many clients share the same CDN edge IPs.
Application scenarios and operational examples
Direct answer: Use robots.txt and nginx together to handle scenarios ranging from a benign indexing increase to emergency bot storms and regular scheduled crawling.
Central question: In what real situations should I pick robots.txt vs the nginx emergency reject vs limit_req?
Short summary: For routine, low-impact changes, use robots.txt. For sudden load or immediate problems, use the emergency reject. For sustained, controlled throttling while preserving access, use limit_req.
Scenario 1 — Content site with frequent updates
Problem: A content site publishes daily; third-party crawlers increase activity and cause spikes.
Response: Add Crawl-delay entries in robots.txt and monitor. If spikes persist, deploy limit_req with a moderate rate (e.g., 1r/s) to smooth load.
Operational example:
- Add to robots.txt:
  User-agent: *
  Allow: /
  User-agent: AhrefsBot
  Crawl-delay: 5
- Deploy the limit_req config with rate=1r/s, burst=5 and monitor for 24–48 hours.
Scenario 2 — Sudden bot storm causing timeouts
Problem: A third-party site audit tool starts scanning heavily and your endpoints time out.
Response: Use the emergency nginx snippet to return 429 + Retry-After (temporary) until you apply smoother throttling.
Operational example:
- The emergency config returns 429 for the matching UA with Retry-After: 120.
- After traffic stabilizes, replace the emergency rule with the limit_req approach.
Scenario 3 — Large site with CDN in front
Problem: Many pages and heavy bot coverage; the CDN hides real client IP addresses.
Response: Configure realip correctly, ensure the CDN forwards X-Forwarded-For, and use limit_req per client IP. Monitor CDN logs as well.
Operational example:
- Ensure real_ip_header and set_real_ip_from are configured with CDN IP ranges.
- Deploy limit_req with zone=10m and adjust based on server capacity.
Author reflection / lessons learned
Direct answer: Practical protection needs both polite cooperation and enforceable controls; in real incidents, combining robots.txt with server-side throttling saved services without long-term damage to monitoring or indexing.
Reflection: From the operational examples above, my experience is that robots.txt is the right first step — it documents intent and helps well-behaved crawlers — but the server is where you must keep control. The limit_req pattern with an internal rewrite provides a reliable balance: it preserves access while preventing overload. In emergencies, a short 429/503 with Retry-After buys you time to implement smoother controls.
Action Checklist / Implementation Steps
Direct answer: Follow these prioritized steps to go from detection to protection quickly and safely.
- Verify robots.txt is correct and public
  - curl -I https://yourdomain/robots.txt
- Check access logs for real UA strings
  - tail -f /var/log/nginx/access.log | grep -E "AhrefsBot|AhrefsSiteAudit"
- Purge the CDN cache if applicable
  - Ensure the updated robots.txt is delivered.
- If urgent, deploy the emergency reject
  - Add the map and the if ($is_ahrefs) { add_header Retry-After "120" always; return 429; } snippet, then reload nginx.
- Deploy smooth limit_req
  - Add the map and limit_req_zone in http {}, plus the internal rewrite and throttled location in server {}.
- Test with curl and monitor logs
  - curl -I -A "AhrefsBot" http://example.com/somepath
  - Monitor access and error logs for 24–48 hours.
- Tune rate, burst, and zone size
  - Adjust to balance server load and crawler coverage.
- If behind a CDN, configure real_ip_header and set_real_ip_from
  - Use CDN IP ranges, not 0.0.0.0/0, in production.
- Document the change and a revert plan
  - Keep a short rollback plan in case of unexpected side effects.
One-page overview (quick reference)
Problem: A crawler is overloading the site and robots.txt changes don’t provide immediate relief.
Immediate response:
- Emergency: return 429 or 503 with Retry-After for the matching UA via nginx.
Recommended long-term:
- Use limit_req with an internal rewrite:
  - Detect the UA via map $http_user_agent.
  - Define limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=1r/s.
  - Rewrite matched UAs to an internal throttled location protected by limit_req.
Testing: nginx -t && systemctl reload nginx, then curl -I -A "AhrefsBot" ... and monitor logs.
CDN note: Ensure the real client IP is visible to nginx (real_ip_header + set_real_ip_from) so throttling behaves per client IP.
Tuning start values: rate=1r/s, burst=5, zone=10m; conservative alternative rate=12r/m (about one request every 5 seconds).
FAQ (concise answers based on article content)
Q1: I changed robots.txt but the crawler still hits hard; what do I check first?
A1: Verify robots.txt is publicly available, purge the CDN cache, and inspect access logs to confirm the crawler’s exact UA string.
Q2: How do I immediately stop a crawler that is overwhelming my server?
A2: Use an emergency nginx rule that detects the crawler UA and returns 429 or 503 with a Retry-After header; then follow up with smoother throttling.
Q3: What is a safe starting limit_req configuration?
A3: limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=1r/s; with burst=5 is a reasonable start. Use rate=12r/m for more conservative throttling.
Q4: Will this throttle affect normal users?
A4: If UA detection is precise and the real client IP is used, normal users should be unaffected. Monitor logs to catch any false positives.
Q5: What if my server is behind a CDN?
A5: Configure real_ip_header and set_real_ip_from so nginx sees the real client IP, and prefer explicit CDN IP ranges in set_real_ip_from.
Q6: How long until Crawl-delay in robots.txt is respected?
A6: It varies; some crawlers adjust within hours, others may take longer. Do not expect instant enforcement.
Q7: Which response code should I use for emergency rejection?
A7: 429 (Too Many Requests) is clear for rate limiting; 503 with Retry-After is also commonly used for temporary unavailability. Both are respected by well-behaved crawlers.
Q8: How do I test the change?
A8: Use curl with the crawler UA string and monitor the nginx access logs for the expected routing and response codes.