LittleCrawler: Run Once, Own the Data — An Async Python Framework for XHS, XHY, and Zhihu
What exactly is LittleCrawler?
It is a batteries-included, open-source Python framework that uses Playwright, FastAPI and Next.js to scrape public posts, post details and creator pages from Xiaohongshu (RED), Xianyu (Idle Fish) and Zhihu via a single CLI or a point-and-click web console.
1. Why Yet Another Scraper?
Core question: “My one-off script breaks every month—how can I stop babysitting logins, storage and anti-bot changes?”
One-sentence answer: LittleCrawler moves those chores into pluggable modules so you spend time on data, not duct-tape.
1.1 Pain-points the author kept hitting
I maintain micro-services for three start-ups. Every quarter someone asks:
- "Can we monitor RED for camping keywords?"
- "We need Xianyu price drops for camera gear."
- "Please archive high-score Zhihu answers for our research team."
Each request started as a 50-line script and ended in a fragile mess of hard-coded headers, expired cookies and broken XPath. Abstracting the common parts (login, storage, rate-limit, UI) produced LittleCrawler.
1.2 Where it fits
| Scenario | Platform | Type | Output | Value |
|---|---|---|---|---|
| Trend spotting | RED | search | JSON | Discover next week’s hot topic before it peaks |
| Price intelligence | Xianyu | search | Excel | Daily price mean & median for any SKU |
| Content backup | Zhihu | creator | SQLite | Full off-platform copy of your answers |
2. Architecture at a Glance
Core question: “How is the code organised and why should I trust it will scale from 100 to 100 000 records?”
One-sentence answer: Async Python 3.11 + Playwright delivers concurrent real-browser sessions, while a clean storage interface lets you swap CSV for MongoDB with zero code changes.
2.1 Async pipeline
- An asyncio task limiter keeps 20–50 browser contexts per 4-core VM
- Each context fetches, renders and parses in <3 s; RAM stays under 300 MB
- Results stream into storage through async/await—no blocking I/O (see the sketch below)
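The pattern behind that pipeline is an asyncio.Semaphore bounding the number of live Playwright contexts. Here is a minimal, self-contained sketch of the idea; example.com and the two keywords are placeholders, not LittleCrawler's real internals.

import asyncio
from playwright.async_api import async_playwright

MAX_CONCURRENCY = 20            # mirrors the MAX_CONCURRENCY knob in base_config.py
KEYWORDS = ["camping", "coffee"]

async def fetch(browser, sem, keyword):
    async with sem:                               # cap simultaneous browser contexts
        context = await browser.new_context()
        page = await context.new_page()
        # example.com is a placeholder; the real per-site strategies live in src/platforms/
        await page.goto(f"https://example.com/search?q={keyword}")
        title = await page.title()                # stands in for the parse step
        await context.close()
        return {"keyword": keyword, "title": title}

async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        results = await asyncio.gather(*(fetch(browser, sem, k) for k in KEYWORDS))
        await browser.close()
    print(results)

asyncio.run(main())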
2.2 Anti-detection tricks baked in
| Feature | Default | Purpose |
|---|---|---|
| CDP mode | ON | Talk to Chromium via DevTools Protocol—no WebDriver flag |
| Fingerprint randomiser | ON | User-Agent, WebGL noise, viewport resize every session |
| Proxy rotator | OFF* | Optional IP rotation via services/proxy/ |
*Set ENABLE_IP_PROXY=True and feed a text file of http://user:pass@ip:port lines.
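To make the proxy line and fingerprint settings concrete, here is roughly how they map onto Playwright options. This is an illustrative sketch, not the code in services/proxy/; the User-Agent strings are placeholders and the WebGL noise injection is omitted.

import random
from urllib.parse import urlsplit
from playwright.async_api import async_playwright

async def new_stealth_context(pw, proxy_line: str | None = None):
    """Launch Chromium with an optional proxy and a lightly randomised fingerprint."""
    launch_args = {"headless": True}
    if proxy_line:                                # e.g. "http://user:pass@ip:port"
        parts = urlsplit(proxy_line)
        launch_args["proxy"] = {
            "server": f"{parts.scheme}://{parts.hostname}:{parts.port}",
            "username": parts.username or "",
            "password": parts.password or "",
        }
    browser = await pw.chromium.launch(**launch_args)
    context = await browser.new_context(
        viewport={"width": random.randint(1280, 1920),
                  "height": random.randint(720, 1080)},
        user_agent=random.choice([
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        ]),
    )
    return browser, context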
2.3 Storage abstraction
All adapters inherit BaseStorage and implement connect(), insert(), close().
Switching from JSON to MySQL is a one-word change in base_config.py; no business logic touches SQL.
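To make the contract concrete, here is a minimal sketch of the interface plus a toy JSON-lines adapter. The abstract methods follow the description above; the concrete class is illustrative and not one of the shipped adapters.

import abc
import json

class BaseStorage(abc.ABC):
    """Contract every storage adapter implements (paraphrased)."""

    @abc.abstractmethod
    async def connect(self): ...

    @abc.abstractmethod
    async def insert(self, items: list[dict]): ...

    @abc.abstractmethod
    async def close(self): ...

class JsonLinesStorage(BaseStorage):
    """Toy adapter: append each item as one JSON line."""

    def __init__(self, path: str = "output.jsonl"):
        self.path = path
        self._fh = None

    async def connect(self):
        self._fh = open(self.path, "a", encoding="utf-8")

    async def insert(self, items: list[dict]):
        for item in items:
            self._fh.write(json.dumps(item, ensure_ascii=False) + "\n")

    async def close(self):
        self._fh.close()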
3. Ten-minute Quick-start
Core question: “I have a clean Ubuntu 22.04 box—how long until I see real data?”
One-sentence answer: Four commands and one QR-code scan; you’ll have 1 000 RED posts in under five minutes.
3.1 Install
# 1. Get code
git clone https://github.com/pbeenig/LittleCrawler.git
cd LittleCrawler
# 2. Dependencies (uv is faster than pip)
uv sync # or: pip install -r requirements.txt
playwright install chromium
3.2 Minimum viable config
Edit config/base_config.py:
PLATFORM = "xhs" # RED
KEYWORDS = "camping,coffee"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "json"
3.3 Run
python main.py
A Chromium window opens and shows a QR code; scan it with the RED app.
Terminal prints:
[INFO] 2026-01-10 11:12:03 | Login succeeded
[INFO] 2026-01-10 11:12:04 | Crawling 0–20 ...
After 1 000 posts you’ll find data/xhs_search_20260110_111503.json.
Author’s reflection
The first time I ran this I picked CSV output because “everyone loves Excel”. Sorting 30 000 rows later taught me a lesson—start with SQLite even for prototypes; migrations are free.
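In that spirit, moving an existing JSON dump into SQLite takes only a few lines. The column names (note_id, title, liked_count) are assumptions about the dump's schema; adjust them to your fields.

import json
import sqlite3

# Load a dump produced by a SAVE_DATA_OPTION = "json" run (filename from section 3.3).
with open("data/xhs_search_20260110_111503.json", encoding="utf-8") as f:
    posts = json.load(f)

conn = sqlite3.connect("data/xhs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS xhs (
    note_id     TEXT PRIMARY KEY,
    title       TEXT,
    liked_count INTEGER,
    insert_date TEXT DEFAULT (date('now'))
)""")
conn.executemany(
    "INSERT OR IGNORE INTO xhs (note_id, title, liked_count) VALUES (?, ?, ?)",
    [(p.get("note_id"), p.get("title"), p.get("liked_count")) for p in posts],
)
conn.commit()
conn.close()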
4. Web Console: Click, Run, Download
Core question: “Can my non-engineering teammate launch a crawl and download Excel?”
One-sentence answer: Yes—compile the Next.js front-end once and the UI guides her through login, keyword entry and file export.
4.1 Build & launch
# build front-end into api/ui
cd web && npm run build
# start API + static pages
uv run uvicorn api.main:app --port 8080 --reload
Visit http://127.0.0.1:8080 → three screens:
- Login – QR or Cookie
- Run – dropdown for platform/type/keyword
- Data – live log + download button when finished
4.2 Dev mode (API only)
If you already have a separate front-end team:
API_ONLY=1 uv run uvicorn api.main:app --port 8080 --reload
cd web && npm run dev # hot-reload for UI tweaks
4.3 UI preview
| Screen | What you see | Why it matters |
|---|---|---|
| Login | QR code refreshes every 5 s | No need to restart if it times out |
| Run | progress bar + log tail | Phone-friendly; you can monitor while commuting |
| Export | buttons for JSON/Excel/SQLite | Same data, multiple formats without rerunning |
5. Deep Dive: Three Real Workloads
Core question: “Show me complete examples—config, run, gotcha, result.”
One-sentence answer: Below are verbatim configs and author notes from recent production jobs.
5.1 Use-case A – RED camping trend (30-day daily pull)
Goal Top 1 000 newest posts for keyword “camping”, compute daily like-growth.
Config
PLATFORM = "xhs"
KEYWORDS = "camping"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "sqlite"
Notes
- Added column insert_date DEFAULT (date('now'))
- Next run skips duplicates with WHERE note_id NOT IN (SELECT note_id FROM xhs WHERE insert_date = date('now','-1 day'))
- A SQLite query builds the top-50 ranking, which is exported to Markdown and pasted into the Feishu group every morning (see the sketch below).
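The morning ranking boils down to one query plus string formatting. This sketch assumes the xhs table from above and the assumed title/liked_count columns.

import sqlite3

conn = sqlite3.connect("data/xhs.db")
rows = conn.execute("""
    SELECT title, liked_count
    FROM xhs
    WHERE insert_date = date('now')
    ORDER BY liked_count DESC
    LIMIT 50
""").fetchall()
conn.close()

lines = ["| # | Title | Likes |", "|---|---|---|"]
lines += [f"| {i} | {title} | {likes} |" for i, (title, likes) in enumerate(rows, start=1)]
print("\n".join(lines))   # paste into the Feishu group, or post via a webhook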
5.2 Use-case B – Xianyu price alert for Fuji X100V
Goal Email when 7-day median < CNY 4 500.
Config
PLATFORM = "xhy"
KEYWORDS = "富士X100V"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "mongodb"
ENABLE_IP_PROXY = True
Notes
- A MongoDB aggregation pipeline calculates the rolling median.
- A 5-line Python helper watches the result collection and sends a SendGrid email when the condition is met (a sketch follows this list).
- Without a proxy you hit a captcha after ~200 requests; with a proxy pool (free tier, 50 IPs) we routinely collect 5 000 items/day.
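Here is a sketch of that helper. It computes the 7-day median in Python with pymongo instead of reproducing the exact aggregation pipeline, and uses smtplib so the example has no SendGrid dependency; the collection and field names (xhy_items, price, crawl_time) are assumptions.

import smtplib
import statistics
from datetime import datetime, timedelta
from email.message import EmailMessage

from pymongo import MongoClient

THRESHOLD = 4500  # CNY

client = MongoClient("mongodb://localhost:27017")
coll = client["littlecrawler"]["xhy_items"]                 # assumed DB/collection names

since = datetime.utcnow() - timedelta(days=7)
prices = [d["price"] for d in coll.find({"crawl_time": {"$gte": since}}, {"price": 1})]

if prices and statistics.median(prices) < THRESHOLD:
    msg = EmailMessage()
    msg["Subject"] = f"Fuji X100V 7-day median below {THRESHOLD} CNY"
    msg["From"] = "alerts@example.com"
    msg["To"] = "me@example.com"
    msg.set_content(f"Median {statistics.median(prices):.0f} CNY over {len(prices)} listings")
    with smtplib.SMTP("localhost") as smtp:                 # swap for SendGrid or any relay
        smtp.send_message(msg)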
5.3 Use-case C – Zhihu high-score answer export
Goal 500 top-voted answers from topic “2026 Spring Festival travel” in a single Excel file for offline reading on a train.
Config
PLATFORM = "zhihu"
KEYWORDS = "2026春运"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "excel"
Notes
- The Excel column =HYPERLINK("https://www.zhihu.com/question/"&A2,"open") gives a one-click jump.
- Zhihu's rate limit is the toughest; we use cookie login (z_c0) and the exponential back-off built into the base crawler (max 5 retries, 30 s → 5 min; a sketch of the pattern follows this list).
- Headless runs fine on a 2-core cloud box; 500 answers take ≈12 min.
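The back-off mentioned above is a standard pattern; below is a generic sketch with the same bounds (5 retries, 30 s growing toward 5 min), not the base crawler's actual implementation.

import asyncio
import random

async def with_backoff(coro_factory, max_retries=5, base_delay=30, max_delay=300):
    """Retry an async call with exponential back-off and jitter."""
    for attempt in range(max_retries):
        try:
            return await coro_factory()
        except Exception as exc:                     # narrow the exception type in real code
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * 2 ** attempt, max_delay)
            delay *= random.uniform(0.8, 1.2)        # jitter avoids lockstep retries
            print(f"attempt {attempt + 1} failed ({exc}); sleeping {delay:.0f}s")
            await asyncio.sleep(delay)

# Usage (hypothetical call): await with_backoff(lambda: fetch_answer(answer_id))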
6. Code Walk-through: Where to Change What
Core question: “I need to add TikTok support and store to BigQuery—where do I poke?”
One-sentence answer: Add a folder under platforms/ for crawler logic and a file under storage/ for BigQuery adapter—each <200 lines thanks to base classes.
6.1 Project tree (annotated)
LittleCrawler
├── main.py # CLI entry, argparse
├── config/base_config.py # single source of truth
├── src
│ ├── core
│ │ ├── base_crawler.py # defines async run(), login(), save()
│ │ └── context.py # keeps one browser instance per task
│ ├── platforms # one package per site
│ │ ├── xhs/search.py # RED search strategy
│ │ ├── xhy/search.py # Xianyu search strategy
│ │ └── zhihu/search.py # Zhihu search strategy
│ ├── storage # plug-ins
│ │ ├── sqlite_storage.py
│ │ ├── mongodb_storage.py
│ │ └── excel_storage.py
│ └── models # Pydantic schemas for type safety
├── api/main.py # FastAPI routes
├── web # Next.js front-end source
└── libs # third-party JS injected into pages
6.2 Adding a new platform (example skeleton)
Create platforms/tiktok/crawler.py:
from src.core.base_crawler import BaseCrawler

class TikTokCrawler(BaseCrawler):
    async def login(self): ...
    async def crawl(self): ...
Register in platforms/__init__.py and add "tiktok" to the UI dropdown—done.
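Registration usually amounts to one dictionary entry. The registry variable and the RED class name below are assumptions about the project's internals, shown only to illustrate the shape.

# src/platforms/__init__.py (illustrative; the real names may differ)
from .xhs.search import XhsSearchCrawler     # assumed existing class name
from .tiktok.crawler import TikTokCrawler

CRAWLERS = {
    "xhs": XhsSearchCrawler,
    "tiktok": TikTokCrawler,                 # now selectable via PLATFORM = "tiktok"
}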
6.3 Adding a new storage backend
Create storage/bigquery_storage.py:
from src.storage.base_storage import BaseStorage

class BigQueryStorage(BaseStorage):
    async def connect(self): ...
    async def insert(self, items): ...
    async def close(self): ...
Set SAVE_DATA_OPTION = "bigquery" in config; no other code changes.
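For a fuller picture, here is a hedged sketch of what the adapter body could look like with the official google-cloud-bigquery client. That client is synchronous, so calls are pushed to a thread with asyncio.to_thread, and the table ID is a placeholder.

import asyncio
from google.cloud import bigquery

from src.storage.base_storage import BaseStorage

class BigQueryStorage(BaseStorage):
    def __init__(self, table_id: str = "my-project.crawler.posts"):   # placeholder ID
        self.table_id = table_id
        self.client = None

    async def connect(self):
        # bigquery.Client() reads credentials from GOOGLE_APPLICATION_CREDENTIALS
        self.client = await asyncio.to_thread(bigquery.Client)

    async def insert(self, items: list[dict]):
        errors = await asyncio.to_thread(self.client.insert_rows_json, self.table_id, items)
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")

    async def close(self):
        await asyncio.to_thread(self.client.close)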
7. Troubleshooting Cheat-sheet
Core question: “Captcha, 403, broken XPath—what do I do quickly?”
One-sentence answer: Use built-in logs, captcha folder, retry policy and proxy toggles before touching code.
| Symptom | Quick fix |
|---|---|
| QR code shows “environment abnormal” | Enable proxy + CDP; restart browser |
| 403 after 200 requests (Xianyu) | Set ENABLE_IP_PROXY=True |
| MongoDB inserts slow | Batch size auto 100; increase in insert() if needed |
| Excel corrupt on Chinese title | openpyxl used by default; set quote_prefix=True |
| Playwright timeout | Lower concurrency in base_config.py (MAX_CONCURRENCY) |
8. Practical Take-away & Action Checklist
- Install: uv sync && playwright install chromium
- Tweak four lines in config/base_config.py
- Run: python main.py (CLI) or uvicorn api.main:app (web)
- Scan QR / paste cookie; data lands in data/ or your DB
- Switch storage by changing one string—no refactoring
- Schedule with cron or GitHub Actions
- Extend platforms/storage by inheriting base classes—≈30 min job
One-page Overview
LittleCrawler is an MIT-licensed async framework that couples Playwright-driven browsers with pluggable storage (CSV → BigQuery) and a FastAPI/Next.js console. Out-of-the-box it logs in, circumvents basic anti-bot checks and dumps structured data for RED, Xianyu and Zhihu. Scaling from a laptop prototype to a cloud cron job requires only config edits; adding new sources or sinks is a matter of writing one small class. The goal: let engineers analyse data instead of maintaining fragile scrapers.
FAQ
- Does it run on Windows or macOS?
  Yes—any OS that runs Python 3.11 and Chromium.
- How long does the cookie/QR session last?
  RED ≈30 days, Zhihu 7 days, Xianyu 2–3 days; the UI shows expiry and lets you re-auth instantly.
- Can I feed thousands of keywords?
  Comma-separate them in KEYWORDS or start multiple Docker containers with different config files.
- Is headless mode supported?
  Absolutely—set HEADLESS=True; captchas are saved to logs/captcha/ for manual solving.
- Do I violate ToS?
  The tool accesses only publicly visible pages and respects built-in delays; still, you must comply with each platform's terms and local laws.
- Why SQLite over CSV for prototypes?
  Relational queries, indexes and upserts are free, and you can export to Excel later with one command.
- Can I use my existing proxy pool?
  Yes—drop a text file with http://user:pass@ip:port lines and set ENABLE_IP_PROXY=True.
- How hard is it to add TikTok, Instagram or Amazon?
  Create a package under platforms/, inherit BaseCrawler, implement three methods; most developers finish in under two hours.
