How to reliably control external crawlers and reduce crawl load — practical guide with nginx rate-limiting

Direct answer: Use robots.txt for cooperative guidance, but rely on server-side controls (nginx) for immediate, reliable protection. This article explains why robots.txt sometimes doesn’t work, how to diagnose the problem, and how to implement a safe, production-ready nginx-based, per-user-agent rate limiting strategy that preserves access while protecting your servers.


What this article answers

Central question: How can I control aggressive crawlers (for example AhrefsBot) when robots.txt changes don’t reduce crawl traffic, and what practical nginx configuration will reliably slow them down without disrupting normal users?

Short summary: this article walks through pragmatic diagnosis steps, shows robots.txt examples, explains why changes may not take effect immediately, and provides two nginx strategies: an emergency reject (fast) method and a smooth, production-friendly limit_req approach that is recommended for most situations. Full configs, testing commands, monitoring tips, and a compact action checklist are included.


Table of contents

  • Overview and quick answer
  • How robots.txt and Crawl-delay work (practical view)
  • Why edits to robots.txt sometimes appear ineffective — diagnosis checklist
  • Emergency option: immediate reject with Retry-After (nginx snippet)
  • Recommended option: smooth rate limiting with limit_req (complete nginx implementation)
  • Parameter tuning, testing and monitoring commands
  • CDN and proxy interaction considerations
  • Application scenarios and operational examples
  • Author reflection: lessons learned from hands-on ops work
  • Action checklist / Implementation steps
  • One-page overview
  • FAQ (short Q&A)

Overview and quick answer

Direct answer: robots.txt is your polite request — useful but not always honored or instantaneous. For immediate and reliable control, configure server-side limits. Use limit_req in nginx to throttle targeted user-agents (e.g., AhrefsBot) per client IP, and return 429/503 with a Retry-After header for short emergency throttling.

Central question: Can I stop a crawler quickly if it’s overloading my server?
Yes — by combining a robots.txt guidance file and server-side nginx controls. The guidance file is the long-term cooperative mechanism; the nginx controls are the enforcement layer you rely on for immediate protection.


How robots.txt and Crawl-delay work (practical view)

Direct answer: robots.txt sits at the site root and tells cooperative crawlers which paths they may access and, via Crawl-delay, how long to wait between requests, but support for Crawl-delay is inconsistent across crawlers.

Central question: What should I expect when I put Crawl-delay in my robots.txt?
Short summary: include Crawl-delay for crawlers that respect it (many third-party crawlers do), but expect a delay before it takes effect and uneven support across crawlers. Use it as a friendly guidance layer rather than a hard enforcement mechanism.

Core points (from our discussion)

  • robots.txt must be placed at the root, e.g. https://yourdomain/robots.txt.
  • Format is simple; you can target specific user-agents:
User-agent: AhrefsBot
Crawl-delay: 5
  • Many crawlers (Ahrefs, Bing, Yandex) typically respect Crawl-delay. Googlebot does not honor Crawl-delay.
  • Crawl-delay is non-standard (not uniformly implemented). Relying on it alone can be insufficient.

Operational example

A working robots.txt excerpt used in a real incident:

User-agent: *
Allow: /

User-agent: AhrefsBot
Crawl-delay: 5

User-agent: AhrefsSiteAudit
Crawl-delay: 5

This tells Ahrefs’ crawlers to wait 5 seconds between requests, while still allowing all user agents general access. In practice, you should confirm that this file is publicly accessible and not cached by a CDN.
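
A quick way to confirm what crawlers actually receive (a minimal check; replace yourdomain with your real hostname):

# Fetch the live content and inspect caching headers for signs of a stale CDN copy
curl -fsS https://yourdomain/robots.txt
curl -sI https://yourdomain/robots.txt | grep -iE "HTTP/|cache-control|age:"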


Why edits to robots.txt sometimes appear ineffective — diagnosis checklist

Direct answer: Edits may not take effect because of CDN caching, wrong file location, mismatch between logged user-agent strings and rules, or simply because the crawler doesn’t honor Crawl-delay. Always check logs and CDN caches first.

Central question: I changed robots.txt but traffic from a crawler didn’t drop — what should I check?
Short summary: confirm accessibility, rule matching, CDN cache state, and actual UA strings in your logs; if immediate relief is needed, apply server-level throttling.

Step-by-step diagnosis checklist

  1. Confirm public availability

    • curl -I https://yourdomain/robots.txt — ensure it returns 200 and the new content.
  2. Check for CDN caching

    • If you use a CDN, robots.txt may be cached. Purge or update the CDN copy.
  3. Review access logs

    • Inspect /var/log/nginx/access.log or your web server logs to see the actual user-agent strings hitting your server:
    tail -f /var/log/nginx/access.log | grep -E "AhrefsBot|AhrefsSiteAudit"
    
  4. Confirm user-agent match

    • Some crawlers append versions or other tokens to their UA string; make sure your User-agent rule matches the actual UA.
  5. Allow for adjustment latency

    • Some crawlers may take hours or up to 1–2 days to change behavior after robots.txt edits.
  6. If urgent, apply server-level limits

    • Use nginx to immediately control request rates.

Common pitfalls

  • Placing robots.txt in a subfolder or behind authentication.
  • Not purging CDN caches.
  • Matching User-agent incorrectly (e.g., case-sensitive or exact match when the crawler uses varying tokens).
  • Expecting robots.txt to be an enforcement mechanism rather than a cooperative guideline.
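
Before changing anything, it helps to quantify the traffic. A small log sketch (assuming the default combined access log format at /var/log/nginx/access.log):

# Top user-agents by request count
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head

# Requests per hour from Ahrefs crawlers (log lines are chronological)
grep -E "AhrefsBot|AhrefsSiteAudit" /var/log/nginx/access.log | awk '{print substr($4, 2, 14)}' | uniq -c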

Emergency option: immediate reject with Retry-After (fast, short-term)

Direct answer: For urgent overloads, return a 429 or 503 with a Retry-After header to targeted user-agents. This gets immediate respect from properly behaving crawlers.

Central question: How can I immediately make a crawler back off while I stabilize the server?
Short summary: use an nginx rule that detects the crawler UA and returns 429 or 503 with Retry-After, which many respected bots will honor.

Why use this

  • Quick to implement and immediately reduces traffic from compliant crawlers.
  • Appropriate for short maintenance windows or sudden spikes.
  • Not recommended as a long-term solution because it temporarily blocks crawling.

Example nginx snippet (emergency)

The map belongs in the http { ... } context; the rejection rule goes inside a location in your server block (add_header is not allowed in an if at server level, and the always flag is needed so the header is attached to a 429 response):

http {
    map $http_user_agent $is_ahrefs {
        default 0;
        "~*AhrefsBot" 1;
        "~*AhrefsSiteAudit" 1;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            # Reject matching crawlers and ask them to retry after 120 seconds
            if ($is_ahrefs) {
                add_header Retry-After "120" always;
                return 429;
            }

            proxy_pass http://backend;
        }
    }
}

Notes

  • 429 Too Many Requests is explicit and indicates rate limiting.
  • 503 Service Unavailable + Retry-After is also commonly respected and emphasizes a temporary outage.
  • Use only for short durations; long-term blocking can reduce legitimate indexing and monitoring.
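
A quick check once the rule is live (a hedged example, assuming example.com is your host and the emergency rule above is loaded):

# Expect HTTP 429 with a Retry-After header
curl -sS -o /dev/null -D - -A "AhrefsBot" http://example.com/ | grep -iE "^HTTP/|^retry-after"

# Expect the normal response for a regular browser UA
curl -sS -o /dev/null -D - -A "Mozilla/5.0" http://example.com/ | grep -i "^HTTP/"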

Recommended option: smooth rate limiting with limit_req (detailed guide)

Direct answer: Implement a limit_req_zone and internal rewrite for targeted user-agents; this allows controlled, per-IP throttling with burst tolerance and minimizes collateral impact.

Central question: How do I implement a production-grade, smooth throttle for specific crawlers in nginx?
Short summary: use map to detect the UA, limit_req_zone to define a per-IP leaky bucket, and an internal location for the limited path. This preserves access while smoothing load.

Why this approach

  • Smooth handling: allows short bursts without outright blocking.
  • Targeted: only affects the specified UA strings.
  • Configurable: tune rate and burst to balance load and crawler coverage.

Full nginx configuration (copy-and-paste ready)

Put the limit_req_zone and map in your http { ... } context, and the server block in your site configuration:

# http { ... } scope
map $http_user_agent $is_ahrefs {
    default 0;
    "~*AhrefsBot" 1;
    "~*AhrefsSiteAudit" 1;
}

limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=1r/s;
limit_req_zone $binary_remote_addr zone=normal_zone:10m rate=20r/s;

# server { ... } scope
server {
    listen 80;
    server_name example.com;

    location / {
        if ($is_ahrefs) {
            rewrite ^(.*)$ /__ahrefs_throttle$1 last;
        }

        limit_req zone=normal_zone burst=30 nodelay;

        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_pass http://backend;
    }

    location ~ ^/__ahrefs_throttle(.*) {
        internal;
        limit_req zone=ahrefs_zone burst=5 nodelay;

        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_pass http://backend$1$is_args$args;
    }

    location = /__throttle_test {
        return 200 "ok\n";
    }
}

Key configuration elements explained

  • map $http_user_agent $is_ahrefs — sets a flag when the request UA matches a pattern.
  • limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=1r/s — defines a per-IP leaky bucket with an average rate of 1 request/sec.
  • rewrite ... /__ahrefs_throttle$1 last — internal redirect to a stricter location; the last flag starts a new location search for the rewritten URI (break would keep the request in the current location and bypass the throttle).
  • limit_req zone=ahrefs_zone burst=5 nodelay — allows small bursts of extra requests, then enforces limits.
  • internal — prevents external clients from requesting the throttled location directly.
  • proxy_pass http://backend$1$is_args$args — forwards the original path and query string to the backend.

Behavior and defaults

  • Requests that exceed the configured rate plus burst are rejected; nginx returns 503 by default (the status can be changed with limit_req_status, shown in the sketch below).
  • Removing nodelay makes nginx delay requests within the burst so they conform to the rate, instead of serving them immediately; requests beyond the burst are still rejected.
  • burst allows short spikes without immediate rejections; tune it carefully.
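
If you prefer throttled requests to be reported as rate limiting rather than a server error, the limit_req_status directive changes the rejection code. A minimal sketch for the throttled location from the configuration above (429 instead of the default 503):

location ~ ^/__ahrefs_throttle(.*) {
    internal;
    limit_req zone=ahrefs_zone burst=5 nodelay;
    limit_req_status 429;

    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $host;
    proxy_pass http://backend$1$is_args$args;
}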

Parameter tuning: practical guidance

Direct answer: Start conservatively (e.g., rate=1r/s, burst=5) and monitor logs; adjust rate and burst based on observed bot behavior and server capacity.

Central question: What values should I set for rate and burst initially, and how do I change them safely?
Short summary: use 1r/s as a practical starting point and drop to rate=12r/m (about one request every 5 seconds) for more conservative throttling; increase burst if you expect short allowable surges.

Practical tuning tips (from the discussion)

  • Starting values

    • rate=1r/s with burst=5 is a common starting point.
    • To be more conservative, use rate=12r/m (12 requests per minute, roughly one every 5 seconds); nginx rates must be whole numbers in r/s or r/m, so fractional values like 0.2r/s are not accepted. See the sketch after this list.
  • Burst behavior

    • burst=5 allows occasional spikes without rejecting immediately.
    • Larger burst accepts more spikes; smaller burst tightens control.
  • Zone size

    • zone=ahrefs_zone:10m typically suffices; increase if you have many concurrent IPs.
  • nodelay

    • With nodelay, requests within the burst are served immediately; anything beyond rate plus burst is rejected (503 by default).
    • Without nodelay, requests within the burst are delayed (queued) to match the configured rate; requests beyond the burst are still rejected.
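
A conservative variant of the zone and throttle directives (a sketch reusing the names from the configuration above):

# http { ... } scope: roughly one request every 5 seconds per client IP
limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=12r/m;

# throttled location: allow a small burst, reject anything beyond it
limit_req zone=ahrefs_zone burst=3 nodelay;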

Monitoring adjustments

  • Start with the conservative settings and watch logs for false positives (legitimate clients being throttled).
  • Adjust if you see too many 503 responses or if bot traffic still overloads the server.

Testing and monitoring: concrete commands and checks

Direct answer: Use nginx -t then systemctl reload nginx, test with curl using different User-Agent strings, and monitor access logs filtered for relevant user-agents.

Central question: How do I verify the configuration works and monitor its effect?
Short summary: validate nginx configs, simulate requests, and use live log monitoring to observe user-agent behavior and throttling responses.

Commands to run

  • Test and reload nginx:
nginx -t && systemctl reload nginx
  • Simulate requests:
# Simulate normal user
curl -I -A "Mozilla/5.0" http://example.com/

# Simulate Ahrefs user-agent
curl -I -A "AhrefsBot" http://example.com/somepath
  • Watch logs for targeted user-agent entries:
tail -f /var/log/nginx/access.log | grep -E "AhrefsBot|AhrefsSiteAudit"
  • Find 503 responses (indicative of throttle rejections):
grep " 503 " /var/log/nginx/access.log | head

What to look for

  • Confirm that requests with Ahrefs UA are routed to the throttled internal location.
  • Look for 503 responses if throttling hits the limit with nodelay.
  • Verify that normal UA requests are unaffected.

CDN and reverse proxy considerations

Direct answer: If you place nginx behind a CDN or load balancer, make sure to use the real client IP (realip) so per-IP rate limiting works properly.

Central question: My nginx is behind a CDN — how does that affect the rate limiting?
Short summary: without the real client IP, rate limits might be applied per CDN node IP rather than per client, so configure real IP headers and trust only known CDN IP ranges.

Minimal real IP configuration

Place in http { ... }:

real_ip_header X-Forwarded-For;
set_real_ip_from 0.0.0.0/0;

Caution: in production, replace 0.0.0.0/0 with explicit CDN IP ranges to avoid trusting unverified sources.
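
A slightly fuller sketch with placeholder ranges (the addresses below are documentation networks used only as stand-ins; substitute the ranges your CDN publishes):

real_ip_header X-Forwarded-For;
real_ip_recursive on;
# Placeholder ranges; replace with your CDN's published IP ranges
set_real_ip_from 203.0.113.0/24;
set_real_ip_from 198.51.100.0/24;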

Practical effect

  • With real IP configured, $binary_remote_addr refers to the true client IP, making the per-IP limiting effective.
  • Without it, throttling may appear ineffective or over-aggressive because many clients share the same CDN edge IPs.

Application scenarios and operational examples

Direct answer: Use robots.txt and nginx together to handle scenarios ranging from a benign indexing increase to emergency bot storms and regular scheduled crawling.

Central question: In what real situations should I pick robots.txt vs nginx emergency reject vs limit_req?
Short summary: For routine, low-impact changes, use robots.txt. For sudden load or immediate problems, use emergency reject. For sustained, controlled throttling while preserving access, use limit_req.

Scenario 1 — Content site with frequent updates

Problem: Content site publishes daily; third-party crawlers increase activity and cause spikes.
Response: Add Crawl-delay entries in robots.txt and monitor. If spikes persist, deploy limit_req with moderate rate (e.g., 1r/s) to smooth load.

Operational example:

  • Add to robots.txt:
User-agent: *
Allow: /

User-agent: AhrefsBot
Crawl-delay: 5
  • Deploy limit_req config with rate=1r/s, burst=5 and monitor for 24–48 hours.

Scenario 2 — Sudden bot storm causing timeouts

Problem: Third-party site audit tool starts scanning heavily and your endpoints timeout.
Response: Use emergency nginx snippet to return 429 + Retry-After (temporary) until you apply smoother throttling.

Operational example:

  • Emergency config returns 429 for matching UA and Retry-After: 120.
  • After traffic stabilizes, replace emergency rule with limit_req approach.

Scenario 3 — Large site with CDN in front

Problem: Many pages, high bot coverage. CDN hides real IP addresses.
Response: Configure realip correctly, ensure CDN forwards X-Forwarded-For, and use limit_req per client IP. Monitor CDN logs as well.

Operational example:

  • Ensure real_ip_header and set_real_ip_from are configured with CDN IP ranges.
  • Deploy limit_req with zone=10m and adjust based on server capacity.

Author reflection / lessons learned

Direct answer: Practical protection needs both polite cooperation and enforceable controls; in real incidents, combining robots.txt with server-side throttling saved services without long-term damage to monitoring or indexing.

Reflection: From the operational examples above, my experience is that robots.txt is the right first step — it documents intent and helps well-behaved crawlers — but the server is where you must keep control. The limit_req pattern with an internal rewrite provides a reliable balance: it preserves access while preventing overload. In emergencies, a short 429/503 with Retry-After buys you time to implement smoother controls.


Action Checklist / Implementation Steps

Direct answer: Follow these prioritized steps to go from detection to protection quickly and safely.

  1. Verify robots.txt is correct and public

    • curl -I https://yourdomain/robots.txt
  2. Check access logs for real UA strings

    • tail -f /var/log/nginx/access.log | grep -E "AhrefsBot|AhrefsSiteAudit"
  3. Purge CDN cache if applicable

    • Ensure the updated robots.txt is delivered.
  4. If urgent, deploy emergency reject

    • Add the map and, inside a location, the if ($is_ahrefs) { add_header Retry-After "120" always; return 429; } rule, then reload nginx.
  5. Deploy smooth limit_req

    • Add map, limit_req_zone in http {}, internal rewrite and throttled location in server {}.
  6. Test with curl and monitor logs

    • curl -I -A "AhrefsBot" http://example.com/somepath
    • Monitor access and error logs for 24–48 hours.
  7. Tune rate, burst, and zone size

    • Adjust to balance server load and crawler coverage.
  8. If behind CDN, configure real_ip_header and set_real_ip_from

    • Use CDN IP ranges, not 0.0.0.0/0 in production.
  9. Document the change and revert plan

    • Keep a short rollback plan in case of unexpected side effects.
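
For step 9, a minimal rollback sketch (the paths are assumptions; adjust to your own layout, and remember to also revert any http-scope zones or maps you added elsewhere):

# Before editing: keep a backup of the site config
cp -a /etc/nginx/sites-available/example.com.conf /etc/nginx/sites-available/example.com.conf.bak

# Rollback: restore the backup, validate, reload
cp -a /etc/nginx/sites-available/example.com.conf.bak /etc/nginx/sites-available/example.com.conf
nginx -t && systemctl reload nginx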

One-page overview (quick reference)

Problem: A crawler is overloading the site and robots.txt changes don’t provide immediate relief.

Immediate response:

  • Emergency: return 429 or 503 with Retry-After for matching UA via nginx.

Recommended long-term:

  • Use limit_req with internal rewrite:

    • Detect UA via map $http_user_agent.
    • Define limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=1r/s.
    • Rewrite matched UA to an internal throttled location protected by limit_req.

Testing: nginx -t && systemctl reload nginx, then curl -I -A "AhrefsBot" ... and monitor logs.

CDN note: Ensure real client IP is visible to nginx (real_ip_header + set_real_ip_from) so throttling behaves per client IP.

Tuning start values: rate=1r/s, burst=5, zone=10m; conservative alternative rate=12r/m (about one request every 5 seconds).


FAQ (concise answers based on article content)

Q1: I changed robots.txt but the crawler still hits hard; what do I check first?
A1: Verify robots.txt is publicly available, purge CDN cache, and inspect access logs to confirm the crawler’s exact UA string.

Q2: How do I immediately stop a crawler that is overwhelming my server?
A2: Use an emergency nginx rule that detects the crawler UA and returns 429 or 503 with a Retry-After header; then follow up with smoother throttling.

Q3: What is a safe starting limit_req configuration?
A3: limit_req_zone $binary_remote_addr zone=ahrefs_zone:10m rate=1r/s; with burst=5 is a reasonable start. Use rate=12r/m (about one request every 5 seconds) for more conservative throttling.

Q4: Will this throttle affect normal users?
A4: If UA detection is precise and real client IP is used, normal users should be unaffected. Monitor logs to catch any false positives.

Q5: What if my server is behind a CDN?
A5: Configure real_ip_header and set_real_ip_from so nginx sees the real client IP, and prefer explicit CDN IP ranges in set_real_ip_from.

Q6: How long until Crawl-delay in robots.txt is respected?
A6: It varies; some crawlers adjust within hours, others may take longer. Do not expect instant enforcement.

Q7: Which response code should I use for emergency rejection?
A7: 429 (Too Many Requests) is clear for rate limiting; 503 with Retry-After is also commonly used for temporary unavailability. Both are respected by well-behaved crawlers.

Q8: How do I test the change?
A8: Use curl with the crawler UA string and monitor nginx access logs for expected routing and response codes.