Scrapling: The Python Web Scraping Framework That Survives Website Redesigns
You spent hours building a scraper. It worked perfectly. Then the target site updated its layout, and every CSS selector broke overnight. If you’ve done any amount of web scraping, that story is painfully familiar. Scrapling was built to make it a thing of the past.
Table of Contents
- What Is Scrapling?
- The Three Problems It Actually Solves
- Core Modules Explained
- How Fast Is It? Benchmarks
- Installation Guide
- Code Examples: From Basics to Production
- CLI Tools: Scrape Without Writing Code
- Using Scrapling With AI: MCP Server Mode
- Frequently Asked Questions
- Responsible Use
What Is Scrapling?
Scrapling is an adaptive web scraping framework written in Python. It handles everything from a single HTTP request to large-scale concurrent crawls — and it does so without requiring you to stitch together half a dozen different libraries.
The word that defines it best is adaptive.
Traditional scrapers depend on fixed CSS selectors or XPath expressions to locate elements. The moment a website rearranges its HTML, those paths break. You fix them manually, re-deploy, and hope the site doesn’t change again next week. Scrapling’s parser takes a different approach: it uses a similarity algorithm to automatically relocate elements after a site redesign, without requiring you to touch the code.
Beyond adaptive parsing, Scrapling ships with:
- Fetchers that bypass anti-bot systems like Cloudflare Turnstile out of the box
- A Scrapy-style Spider framework with concurrency, multi-session routing, and pause/resume support
- A CLI tool that lets you extract web data directly from the terminal, no Python required
- A built-in MCP server for integrating web scraping capabilities into AI workflows with tools like Claude or Cursor
The goal is one library that handles the full scraping pipeline — no compromises.
The Three Problems It Actually Solves
Before getting into feature details, it’s worth being specific about the real-world pain points Scrapling addresses.
Problem 1: Your scraper breaks every time the site changes
This is the most common frustration in web scraping. CSS class names get renamed, DOM hierarchies shift, and IDs disappear. Scrapling’s adaptive element tracking saves the structural fingerprint of an element the first time you find it. If the site redesigns, you pass adaptive=True and the parser finds the equivalent element in the new layout:
```python
# First run: save the element's structural fingerprint
products = page.css('.product', auto_save=True)

# After a site redesign: find them again automatically
products = page.css('.product', adaptive=True)
```
Two parameters. No manual selector hunting.
Problem 2: Anti-bot systems block your requests
Many sites use Cloudflare or similar services to detect and block scrapers. Scrapling’s StealthyFetcher handles this by mimicking a real browser’s TLS fingerprint and request headers, and automatically resolving Cloudflare Turnstile challenges. It works out of the box — no extra configuration needed.
Problem 3: Scaling a single script into a real crawler is painful
Growing from “scrape one page” to “scrape thousands of pages concurrently” usually means rebuilding from scratch. Scrapling’s Spider framework provides concurrency controls, multi-session routing, checkpoint-based pause/resume, and real-time streaming — ready to use, not something you have to implement yourself.
Core Modules Explained
Spider Framework
The Spider module is designed for production-scale crawls. If you’ve used Scrapy, the API will feel immediately familiar.
Concurrent crawling is configurable via concurrent_requests. You can set per-domain throttling and download delays to avoid hammering target servers.
Multi-session support lets you use different fetching strategies within a single Spider. Route standard pages to a fast HTTP session and bot-protected pages to a stealthy headless browser session — all managed through session IDs in a unified interface.
Pause and resume works through checkpoints. Pass a crawldir to your Spider, press Ctrl+C to gracefully stop, and restart with the same crawldir to pick up exactly where you left off. No lost progress, no re-crawling already-visited URLs.
Streaming mode lets you process results in real time using async for item in spider.stream(), rather than waiting for the entire crawl to complete. This is ideal for feeding live pipelines or updating a UI while the crawl is running.
Blocked request detection automatically identifies requests that were rejected by the target site and supports custom retry logic.
Built-in export via result.items.to_json() and result.items.to_jsonl() — no extra setup required.
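The streaming consumption pattern described above can be sketched with plain asyncio. In this sketch a stand-in async generator plays the role of spider.stream(); the names fake_stream and consume are illustrative only and not part of Scrapling's API:

```python
import asyncio

# A stand-in async generator playing the role of spider.stream();
# the real Spider yields scraped items as pages complete.
async def fake_stream():
    for i in range(3):
        await asyncio.sleep(0)  # simulate pages finishing over time
        yield {"page": i}

async def consume():
    results = []
    # The same consumption pattern as Scrapling's streaming mode:
    # handle each item the moment it arrives instead of waiting
    # for the whole crawl to finish.
    async for item in fake_stream():
        results.append(item)
    return results

items = asyncio.run(consume())
print(items)
```

Anything inside the `async for` body (writing to a database, pushing to a UI) runs while the crawl is still in flight.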
Fetchers
Scrapling offers three fetcher types, each targeting a distinct scenario:
| Fetcher | Best For | Key Capabilities |
|---|---|---|
| Fetcher / FetcherSession | Fast HTTP requests, speed-first | TLS fingerprint spoofing, HTTP/3 support |
| StealthyFetcher / StealthySession | Anti-bot protected sites | Advanced browser fingerprinting, automatic Cloudflare bypass |
| DynamicFetcher / DynamicSession | JavaScript-rendered pages | Full Playwright-based browser automation (Chromium / Chrome) |
Every fetcher has a corresponding Session class. Sessions persist cookies and browser state across requests, which is essential for workflows that require login or multi-step navigation.
All fetchers also have fully async variants (AsyncStealthySession, AsyncDynamicSession, etc.) for high-concurrency pipelines built on asyncio.
Adaptive Parser
The parser is where Scrapling’s technical depth shows. It supports multiple element selection methods:
- CSS selectors: page.css('.quote')
- XPath: page.xpath('//div[@class="quote"]')
- BeautifulSoup-style: page.find_all('div', class_='quote')
- Text-based search: page.find_by_text('quote', tag='div')
- Automatic similar-element discovery: element.find_similar()
The parser can also run standalone — you don’t need a Fetcher to use it. Pass an HTML string directly:
```python
from scrapling.parser import Selector

page = Selector("<html>...</html>")
# Full API available, identical to working with a Fetcher response
```
DOM navigation is equally rich:
```python
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')  # Adjacent sibling
parent = first_quote.parent                             # Parent node
similar = first_quote.find_similar()                    # Elements with similar structure
below = first_quote.below_elements()                    # Elements below in the document
```
Proxy Rotation
The built-in ProxyRotator supports round-robin and custom rotation strategies, works with all session types, and allows per-request proxy overrides when you need them. Browser-based fetchers also support domain blocking — useful for suppressing ad trackers or analytics scripts during a crawl.
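To make the round-robin strategy concrete, here is a minimal pure-Python sketch of the idea. This is not Scrapling's ProxyRotator API, just an illustration of what round-robin rotation does; the class and method names here are invented for the example:

```python
from itertools import cycle

# Round-robin: hand out proxies in a fixed repeating order.
class RoundRobinRotator:
    def __init__(self, proxies):
        self._pool = cycle(proxies)  # endless iterator over the proxy list

    def next_proxy(self):
        return next(self._pool)

rotator = RoundRobinRotator([
    "http://proxy-a:8080",
    "http://proxy-b:8080",
])
# Four consecutive requests alternate between the two proxies
picked = [rotator.next_proxy() for _ in range(4)]
print(picked)
```

A custom strategy would replace the cycle with its own selection logic, for example weighting proxies by recent failure rate.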
How Fast Is It? Benchmarks
Speed matters when you’re scraping at scale. The benchmarks below compare Scrapling’s parser against current versions of popular Python scraping libraries, using text extraction on 5,000 nested elements. Each result is averaged over 100+ runs.
Text Extraction Speed
| Rank | Library | Time (ms) | vs. Scrapling |
|---|---|---|---|
| 1 | Scrapling | 2.02 | baseline |
| 2 | Parsel / Scrapy | 2.04 | ~1.01× slower |
| 3 | Raw Lxml | 2.54 | ~1.26× slower |
| 4 | PyQuery | 24.17 | ~12× slower |
| 5 | Selectolax | 82.63 | ~41× slower |
| 6 | MechanicalSoup | 1549.71 | ~767× slower |
| 7 | BS4 + Lxml | 1584.31 | ~784× slower |
| 8 | BS4 + html5lib | 3391.91 | ~1679× slower |
Scrapling and Parsel/Scrapy are essentially tied at the top. PyQuery is 12× slower. BeautifulSoup with Lxml is nearly 800× slower.
Adaptive Element Search Speed
| Library | Time (ms) | vs. Scrapling |
|---|---|---|
| Scrapling | 2.39 | baseline |
| AutoScraper | 12.45 | ~5.2× slower |
For similarity-based element retrieval, Scrapling is over five times faster than AutoScraper.
Beyond raw speed, there are a few other performance characteristics worth noting:
- Low memory footprint through optimized data structures and lazy loading
- JSON serialization that runs 10× faster than Python's standard library
- 92% test coverage with full type hints across the codebase, verified automatically with PyRight and MyPy on every commit
- Real-world validation: the library has been used daily by hundreds of developers for over a year
Installation Guide
Scrapling requires Python 3.10 or higher.
Parser Only (Minimal Install)
```shell
pip install scrapling
```
This installs only the parsing engine and its dependencies — no Fetchers, no CLI tools. Use this if you only need to parse HTML you’ve already obtained.
Fetchers + Browser Dependencies
To use StealthyFetcher, DynamicFetcher, or their Session counterparts:
```shell
pip install "scrapling[fetchers]"
scrapling install
```
The second command downloads the required browsers (Chromium/Chrome), their system dependencies, and the fingerprinting libraries.
Optional Feature Installs
| Feature | Command |
|---|---|
| MCP server for AI integration | pip install "scrapling[ai]" |
| Interactive shell + extract CLI command | pip install "scrapling[shell]" |
| Everything | pip install "scrapling[all]" |
Note: After installing any of the above extras, run scrapling install if you haven’t already — it ensures browser dependencies are fully in place.
Docker
If you’d rather skip local browser configuration entirely, pull the official Docker image. It includes all features and browsers pre-installed:
```shell
# From DockerHub
docker pull pyd4vinci/scrapling

# Or from the GitHub Container Registry
docker pull ghcr.io/d4vinci/scrapling:latest
```
The image is built and pushed automatically via GitHub Actions on every release, so it always tracks the latest version.
Code Examples: From Basics to Production
Example 1: Standard HTTP Scraping
The simplest case — a GET request and CSS selector extraction:
```python
from scrapling.fetchers import Fetcher, FetcherSession

# One-off request
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()

# Session-based (better for multi-step workflows)
with FetcherSession(impersonate='chrome') as session:
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()
```
impersonate='chrome' tells the fetcher to use Chrome’s current TLS fingerprint, which is enough to get past many basic bot detection systems.
Example 2: Bypassing Cloudflare
```python
from scrapling.fetchers import StealthyFetcher, StealthySession

# One-off: opens a browser, completes the request, closes automatically
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()

# Session mode: browser stays open for multiple requests
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()
```
Example 3: JavaScript-Rendered Pages
When data is loaded dynamically via JavaScript, you need a full browser render:
```python
from scrapling.fetchers import DynamicFetcher

# Wait for all network requests to settle before parsing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/', network_idle=True)
data = page.css('.quote .text::text').getall()

# XPath works just as well
data = page.xpath('//span[@class="text"]/text()').getall()
```
Example 4: Building a Full Spider
Scrapy users will feel at home here. This Spider handles pagination, runs 10 concurrent requests, and exports results to JSON:
```python
from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```
Example 5: Pause and Resume Long Crawls
For large jobs that might need to be interrupted:
```python
QuotesSpider(crawldir="./crawl_data").start()
```
Press Ctrl+C at any point. Progress is saved automatically. Restart with the same crawldir and the Spider continues from the last checkpoint.
Example 6: Mixing Multiple Session Types in One Spider
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")  # Route to the stealthy browser
            else:
                yield Request(link, sid="fast", callback=self.parse)  # Keep using fast HTTP
```
Example 7: Async Concurrent Fetching
```python
import asyncio
from scrapling.fetchers import AsyncStealthySession

async def main():
    async with AsyncStealthySession(max_pages=2) as session:
        urls = ['https://example.com/page1', 'https://example.com/page2']
        tasks = [session.fetch(url) for url in urls]
        print(session.get_pool_stats())  # Check tab pool status: busy / idle / error
        results = await asyncio.gather(*tasks)
        print(session.get_pool_stats())

asyncio.run(main())
```
Example 8: Full Adaptive Element Tracking Workflow
```python
from scrapling.fetchers import StealthyFetcher

# Enable adaptive mode globally
StealthyFetcher.adaptive = True

# First run: fingerprint the elements
page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
products = page.css('.product', auto_save=True)

# --- After a site redesign ---
# Scrapling finds the equivalent elements in the new layout
products = page.css('.product', adaptive=True)
```
CLI Tools: Scrape Without Writing Code
Scrapling ships with a command-line interface that lets you extract content from any URL directly in the terminal — no Python script needed.
Launch the interactive scraping shell (requires scrapling[shell]):
```shell
scrapling shell
```
This opens an IPython environment with Scrapling pre-loaded. It can convert curl commands into Scrapling requests and preview results in the browser — useful for testing selectors before writing a full script.
Extract page content from the command line:
```shell
# Extract full page as Markdown
scrapling extract get 'https://example.com' content.md

# Extract plain text
scrapling extract get 'https://example.com' content.txt

# Target a specific element with a CSS selector, impersonating Chrome
scrapling extract get 'https://example.com' content.txt \
    --css-selector '#fromSkipToProducts' --impersonate 'chrome'

# Use a headless browser for dynamic pages
scrapling extract fetch 'https://example.com' content.md \
    --css-selector '#fromSkipToProducts' --no-headless

# Bypass Cloudflare and extract specific content
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html \
    --css-selector '#padded_content a' --solve-cloudflare
```
The output format is determined by the file extension you provide:
- .txt → plain text content of the selected elements
- .md → HTML converted to Markdown
- .html → raw HTML content
Using Scrapling With AI: MCP Server Mode
Scrapling includes a built-in MCP (Model Context Protocol) server, which lets AI tools like Claude and Cursor call Scrapling’s scraping and parsing capabilities directly.
The practical advantage here is significant. When an AI needs to process a web page, passing the entire raw HTML to a large language model is slow and token-expensive. With the MCP server, Scrapling first extracts only the relevant content from the page, then passes that focused, compact output to the AI. This reduces response latency and cuts token usage considerably — which matters both for cost and context-window efficiency.
Install the MCP server feature:
```shell
pip install "scrapling[ai]"
```
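Once installed, the server is registered with an MCP-capable client (such as Claude Desktop or Cursor) through that client's configuration file. The snippet below shows only the general shape of such an entry; the command and args values are placeholders, not a documented Scrapling invocation, so check the Scrapling documentation for the actual way to launch its MCP server:

```json
{
  "mcpServers": {
    "scrapling": {
      "command": "scrapling",
      "args": ["mcp"]
    }
  }
}
```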
Frequently Asked Questions
How is Scrapling different from Scrapy?
Scrapy is a mature, full-featured framework — comprehensive but heavy, with a steep learning curve. Scrapling positions itself as a more modern, lighter-weight alternative. It adds capabilities Scrapy doesn’t have natively: adaptive element tracking, built-in anti-bot bypass, and AI integration. The Spider API is intentionally designed to feel familiar to Scrapy users, so migration is low-effort.
Can StealthyFetcher bypass all anti-bot systems?
According to the documentation, StealthyFetcher automatically handles all variants of Cloudflare Turnstile and Interstitial challenges out of the box. Effectiveness against other anti-bot systems depends on their specific implementation.
How does adaptive element tracking actually work?
When you extract an element with auto_save=True, Scrapling saves its structural characteristics — surrounding text, DOM hierarchy, attribute patterns, and relative position. The next time you use adaptive=True, the parser runs a similarity algorithm against the new page structure to find the best match, rather than relying on a hardcoded path.
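A toy version of this idea, emphatically not Scrapling's actual algorithm, can clarify what "similarity against the new page structure" means. Here each element is reduced to a handful of structural traits and candidates are scored by how many traits they share with the saved fingerprint (all names and the scoring weights below are invented for illustration):

```python
# Saved fingerprint from the first run: tag, classes, nesting depth, parent tag
saved = {"tag": "div", "classes": {"product"}, "depth": 4, "parent": "ul"}

def similarity(fp, candidate):
    """Score 0..1: fraction of structural traits the candidate shares."""
    score = 0.0
    score += fp["tag"] == candidate["tag"]             # same tag name
    score += fp["parent"] == candidate["parent"]       # same parent tag
    score += abs(fp["depth"] - candidate["depth"]) <= 1  # similar nesting depth
    union = fp["classes"] | candidate["classes"]
    score += len(fp["classes"] & candidate["classes"]) / len(union) if union else 0
    return score / 4

# After a redesign the class name changed, but the structure survived:
candidates = [
    {"tag": "div", "classes": {"item-card"}, "depth": 4, "parent": "ul"},
    {"tag": "span", "classes": {"price"}, "depth": 6, "parent": "div"},
]
best = max(candidates, key=lambda c: similarity(saved, c))
print(best["classes"])  # the renamed product card wins despite the new class
```

The real parser weighs richer signals (surrounding text, attribute patterns, relative position), but the principle is the same: rank candidates, pick the best match.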
Can I use Scrapling just for HTML parsing, without making any requests?
Yes. The parser module works independently:
```python
from scrapling.parser import Selector

page = Selector("<html>your HTML here</html>")
# Full selector API available — identical behavior to a Fetcher response
```
What’s the difference between a Session and a one-off request?
Session classes (FetcherSession, StealthySession, etc.) persist cookies, headers, and browser state across multiple requests. They’re the right choice for workflows that require login, multi-step navigation, or consistent identity. One-off calls (Fetcher.get(), StealthyFetcher.fetch()) are stateless — each request is independent, which is fine for simple single-page extraction.
Can I use custom proxies with Scrapling?
Yes. The built-in ProxyRotator supports round-robin and custom strategies, works with all session types, and can be overridden on a per-request basis.
What does the Docker image include?
The official image includes all optional features and pre-installed browsers (Chromium and others). It’s built and pushed automatically via GitHub Actions on every release, so it stays in sync with the latest code.
Responsible Use
A few things worth keeping in mind when using any web scraping tool:
Respect robots.txt and Terms of Service. Before scraping a site, check its /robots.txt file (e.g., https://example.com/robots.txt) to understand which paths are allowed. Read the site’s Terms of Service to see if data collection is explicitly restricted.
Set reasonable rate limits. Even when you technically can send thousands of concurrent requests, configuring sensible download delays and concurrency limits is the right thing to do. Scrapling’s Spider framework has per-domain throttling built in — use it.
Stay compliant with applicable data laws. How you store and use scraped data is subject to regulations that vary by jurisdiction — GDPR in Europe, and various national privacy laws elsewhere. Know what applies to your use case.
The library is a tool, not a permission slip. Scrapling is built for educational and research purposes. Users are responsible for their own actions. The developers and contributors accept no liability for misuse.
Summary
Scrapling is designed around three realities of modern web scraping: websites change, anti-bot systems exist, and projects grow.
It’s not trying to replace every scraping tool out there. But it’s a strong fit when:
- You maintain long-running data pipelines that can’t afford to break every time a site redesigns
- You need to scrape sites protected by Cloudflare or similar systems, without building a custom bypass
- You’re scaling from a single-page script to a concurrent, multi-session crawler and want a structured framework to build on
- You’re integrating web content extraction into an AI workflow and need an efficient middle layer
For Python developers, data engineers, and researchers who do regular web scraping, it’s a library worth having in your toolkit.
GitHub: D4Vinci/Scrapling
Full Documentation: scrapling.readthedocs.io
Install: pip install scrapling
This article is based on the official Scrapling documentation. For the most current technical details, refer to the project’s latest release.
