
LittleCrawler Python Framework: Master XHS, Xianyu & Zhihu Scraping in Minutes

LittleCrawler: Run Once, Own the Data — An Async Python Framework for XHS, Xianyu, and Zhihu

What exactly is LittleCrawler?
It is a batteries-included, open-source Python framework that uses Playwright, FastAPI and Next.js to scrape public posts, post details and creator pages from Xiaohongshu (RED), Xianyu (Idle Fish) and Zhihu through a single CLI command or a point-and-click web console.


1. Why Yet Another Scraper?

Core question: “My one-off script breaks every month—how can I stop babysitting logins, storage and anti-bot changes?”
One-sentence answer: LittleCrawler moves those chores into pluggable modules so you spend time on data, not duct-tape.

1.1 Pain-points the author kept hitting

I maintain micro-services for three start-ups. Every quarter someone asks:

  • “Can we monitor RED for camping keywords?”
  • “We need Xianyu price drops for camera gear.”
  • “Please archive high-score Zhihu answers for our research team.”

Each request started as a 50-line script and ended in a fragile mess of hard-coded headers, expired cookies and broken XPath selectors. Abstracting the common parts (login, storage, rate limiting, UI) produced LittleCrawler.

1.2 Where it fits

Scenario | Platform | Type | Output | Value
Trend spotting | RED | search | JSON | Discover next week’s hot topic before it peaks
Price intelligence | Xianyu | search | Excel | Daily price mean & median for any SKU
Content backup | Zhihu | creator | SQLite | Full off-platform copy of your answers

2. Architecture at a Glance

Core question: “How is the code organised and why should I trust it will scale from 100 to 100 000 records?”
One-sentence answer: Async Python 3.11 + Playwright delivers concurrent browser contexts, while a clean storage interface lets you swap CSV for MongoDB with zero code changes.

2.1 Async pipeline

  • asyncio task limiter keeps 20–50 browser contexts per 4-core VM
  • Each context fetches, renders and parses in <3 s; RAM stays under 300 MB
  • Results stream into storage through async/await with no blocking I/O (see the sketch below)
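
To make the pattern concrete, here is a minimal sketch of a semaphore-limited Playwright pipeline. It assumes nothing about LittleCrawler's internals; the URL list, concurrency cap and in-memory result list are purely illustrative:

import asyncio
from playwright.async_api import async_playwright

MAX_CONTEXTS = 20  # illustrative cap; tune to your CPU/RAM budget

async def fetch_one(browser, url, sem, results):
    """Fetch and parse one page inside its own browser context."""
    async with sem:  # the semaphore acts as the task limiter
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        results.append({"url": url, "title": await page.title()})
        await context.close()

async def main(urls):
    sem = asyncio.Semaphore(MAX_CONTEXTS)
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        await asyncio.gather(*(fetch_one(browser, u, sem, results) for u in urls))
        await browser.close()
    return results

if __name__ == "__main__":
    print(asyncio.run(main(["https://example.com"])))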

2.2 Anti-detection tricks baked in

Feature | Default | Purpose
CDP mode | ON | Talk to Chromium via DevTools Protocol—no WebDriver flag
Fingerprint randomiser | ON | User-Agent, WebGL noise, viewport resize every session
Proxy rotator | OFF* | Optional IP rotation via services/proxy/

*Set ENABLE_IP_PROXY=True and feed a text file of http://user:pass@ip:port lines.
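
To illustrate the fingerprint-randomiser idea (this is a generic sketch, not LittleCrawler's actual implementation), rotating the User-Agent and viewport per session with plain Playwright looks roughly like this:

import random

# Illustrative pool; the framework ships its own randomiser.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36",
]

async def new_randomised_context(browser):
    """Return a context with a randomised User-Agent and viewport.
    `browser` is a Playwright Browser launched elsewhere."""
    return await browser.new_context(
        user_agent=random.choice(USER_AGENTS),
        viewport={"width": random.randint(1280, 1920),
                  "height": random.randint(720, 1080)},
    )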

2.3 Storage abstraction

All adapters inherit BaseStorage and implement connect(), insert(), close().
Switching from JSON to MySQL is a one-word change in base_config.py; no business logic touches SQL.
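
A minimal sketch of what that contract can look like, using the method names above; the JSON adapter and its file path are illustrative, not the framework's real code:

import json
from abc import ABC, abstractmethod

class BaseStorage(ABC):
    """Contract every storage adapter implements."""

    @abstractmethod
    async def connect(self): ...

    @abstractmethod
    async def insert(self, items): ...

    @abstractmethod
    async def close(self): ...

class JsonStorage(BaseStorage):
    """Buffer items in memory, flush to a JSON file on close."""

    def __init__(self, path="data/output.json"):  # path is illustrative
        self.path = path
        self._buffer = []

    async def connect(self):
        self._buffer = []

    async def insert(self, items):
        self._buffer.extend(items)

    async def close(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self._buffer, f, ensure_ascii=False, indent=2)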


3. Ten-minute Quick-start

Core question: “I have a clean Ubuntu 22.04 box—how long until I see real data?”
One-sentence answer: Four commands and one QR-code scan; you’ll have 1 000 RED posts in under five minutes.

3.1 Install

# 1. Get code
git clone https://github.com/pbeenig/LittleCrawler.git
cd LittleCrawler

# 2. Dependencies (uv is faster than pip)
uv sync                 # or: pip install -r requirements.txt
playwright install chromium

3.2 Minimum viable config

Edit config/base_config.py:

PLATFORM = "xhs"           # RED
KEYWORDS = "camping,coffee"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "json"

3.3 Run

python main.py

A Chromium window opens and shows a QR code; scan it with the RED app.
Terminal prints:

[INFO] 2026-01-10 11:12:03 | Login succeeded
[INFO] 2026-01-10 11:12:04 | Crawling 0–20 ...

After 1 000 posts you’ll find data/xhs_search_20260110_111503.json.
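
A quick, optional sanity check before piping the dump anywhere else, assuming the file holds a JSON array of post objects (field names depend on the crawler output):

import json

with open("data/xhs_search_20260110_111503.json", encoding="utf-8") as f:
    posts = json.load(f)

print(len(posts), "posts")
print(sorted(posts[0].keys()))  # inspect which fields are available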

Author’s reflection
The first time I ran this I picked CSV output because “everyone loves Excel”. Sorting 30 000 rows later taught me a lesson—start with SQLite even for prototypes; migrations are free.


4. Web Console: Click, Run, Download

Core question: “Can my non-engineering teammate launch a crawl and download Excel?”
One-sentence answer: Yes—compile the Next.js front-end once and the UI guides her through login, keyword entry and file export.

4.1 Build & launch

# build front-end into api/ui
cd web && npm run build

# start API + static pages
uv run uvicorn api.main:app --port 8080 --reload

Visit http://127.0.0.1:8080 → three screens:

  1. Login – QR or Cookie
  2. Run – dropdown for platform/type/keyword
  3. Data – live log + download button when finished

4.2 Dev mode (API only)

If you already have a separate front-end team:

API_ONLY=1 uv run uvicorn api.main:app --port 8080 --reload
cd web && npm run dev      # hot-reload for UI tweaks

4.3 UI preview

Screen | What you see | Why it matters
Login | QR code refreshes every 5 s | No need to restart if it times out
Run | Progress bar + log tail | Phone-friendly; you can monitor while commuting
Export | Buttons for JSON/Excel/SQLite | Same data, multiple formats without rerunning

5. Deep Dive: Three Real Workloads

Core question: “Show me complete examples—config, run, gotcha, result.”
One-sentence answer: Below are verbatim configs and author notes from recent production jobs.

5.1 Use-case A – RED camping trend (30-day daily pull)

Goal: Top 1 000 newest posts for keyword “camping”, compute daily like-growth.
Config

PLATFORM = "xhs"
KEYWORDS = "camping"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "sqlite"

Notes

  • Added column insert_date DEFAULT (date('now'))
  • Next run skips duplicates with
    WHERE note_id NOT IN (SELECT note_id FROM xhs WHERE insert_date = date('now','-1 day'))
  • A SQLite query builds the top-50 ranking, which is exported to Markdown and pasted into the Feishu group every morning.
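
For reference, the daily like-growth from the goal can be computed straight from the SQLite dump. A sketch, assuming a table named xhs with a liked_count column plus the insert_date column added above (paths and names are illustrative):

import sqlite3

# Path and schema are assumptions; adjust to your actual SQLite output.
conn = sqlite3.connect("data/xhs.db")
rows = conn.execute(
    """
    WITH daily AS (
        SELECT insert_date, SUM(liked_count) AS total_likes
        FROM xhs
        GROUP BY insert_date
    )
    SELECT insert_date,
           total_likes,
           total_likes - LAG(total_likes) OVER (ORDER BY insert_date) AS like_growth
    FROM daily
    ORDER BY insert_date
    """
).fetchall()

for insert_date, total_likes, like_growth in rows:
    print(insert_date, total_likes, like_growth)
conn.close()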

5.2 Use-case B – Xianyu price alert for Fuji X100V

Goal: Email when the 7-day median < CNY 4 500.
Config

PLATFORM = "xhy"
KEYWORDS = "富士X100V"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "mongodb"
ENABLE_IP_PROXY = True

Notes

  • A MongoDB aggregation calculates the rolling median (sketched below).
  • A 5-line Python helper watches the result collection and calls SendGrid when the condition is met.
  • Without a proxy you hit a captcha after ~200 requests; with a proxy pool (free tier, 50 IPs) we routinely collect 5 000 items/day.
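
A rough sketch of such a watcher, assuming documents with price and crawled_at fields in a xianyu collection; the connection string, names and the alert stub are placeholders, not LittleCrawler internals:

from datetime import datetime, timedelta
from statistics import median

from pymongo import MongoClient

THRESHOLD = 4500  # CNY

def send_alert(msg):
    """Placeholder: plug in SendGrid, SMTP, or a webhook here."""
    print("ALERT:", msg)

client = MongoClient("mongodb://localhost:27017")
coll = client["littlecrawler"]["xianyu"]

# Collect prices from the last 7 days and compare the median to the threshold.
week_ago = datetime.utcnow() - timedelta(days=7)
prices = [doc["price"] for doc in coll.find({"crawled_at": {"$gte": week_ago}}, {"price": 1})]

if prices and median(prices) < THRESHOLD:
    send_alert(f"7-day median {median(prices):.0f} CNY is below {THRESHOLD}")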

5.3 Use-case C – Zhihu high-score answer export

Goal: 500 top-voted answers from topic “2026 Spring Festival travel” in a single Excel file for offline reading on a train.
Config

PLATFORM = "zhihu"
KEYWORDS = "2026春运"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "excel"

Notes

  • Excel column =HYPERLINK("https://www.zhihu.com/question/"&A2,"open") gives one-click jump.
  • Zhihu’s rate limit is the toughest; we use cookie login (z_c0) and the exponential back-off built into the base crawler (max 5 retries, 30 s → 5 min; illustrated below).
  • Headless mode runs fine on a 2-core cloud box; 500 answers take ≈ 12 min.
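
For the curious, a generic sketch of the stated back-off policy (30 s doubling up to 5 min, max 5 retries); this is not the framework's actual code:

import asyncio
import random

async def with_backoff(coro_factory, max_retries=5, base_delay=30, max_delay=300):
    """Retry an async call with exponential back-off: 30 s doubling up to 5 min."""
    for attempt in range(max_retries):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * 2 ** attempt, max_delay)
            await asyncio.sleep(delay + random.uniform(0, 3))  # small jitter

Call it as await with_backoff(lambda: fetch_page(url)) around any flaky request, where fetch_page is whatever coroutine performs the fetch.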

6. Code Walk-through: Where to Change What

Core question: “I need to add TikTok support and store to BigQuery—where do I poke?”
One-sentence answer: Add a folder under platforms/ for crawler logic and a file under storage/ for BigQuery adapter—each <200 lines thanks to base classes.

6.1 Project tree (annotated)

LittleCrawler
├── main.py                  # CLI entry, argparse
├── config/base_config.py    # single source of truth
├── src
│   ├── core
│   │   ├── base_crawler.py  # defines async run(), login(), save()
│   │   └── context.py       # keeps one browser instance per task
│   ├── platforms            # one package per site
│   │   ├── xhs/search.py    # RED search strategy
│   │   ├── xhy/search.py    # Xianyu search strategy
│   │   └── zhihu/search.py  # Zhihu search strategy
│   ├── storage              # plug-ins
│   │   ├── sqlite_storage.py
│   │   ├── mongodb_storage.py
│   │   └── excel_storage.py
│   └── models               # Pydantic schemas for type safety
├── api/main.py              # FastAPI routes
├── web                      # Next.js front-end source
└── libs                     # third-party JS injected into pages

6.2 Adding a new platform (example skeleton)

Create platforms/tiktok/crawler.py:

from src.core.base_crawler import BaseCrawler
class TikTokCrawler(BaseCrawler):
    async def login(self): ...
    async def crawl(self): ...

Register in platforms/__init__.py and add "tiktok" to the UI dropdown—done.

6.3 Adding a new storage backend

Create storage/bigquery_storage.py:

from src.storage.base_storage import BaseStorage
class BigQueryStorage(BaseStorage):
    async def connect(self): ...
    async def insert(self, items): ...
    async def close(self): ...

Set SAVE_DATA_OPTION = "bigquery" in config; no other code changes.


7. Troubleshooting Cheat-sheet

Core question: “Captcha, 403, broken XPath—what do I do quickly?”
One-sentence answer: Use built-in logs, captcha folder, retry policy and proxy toggles before touching code.

Symptom | Quick fix
QR code shows “environment abnormal” | Enable proxy + CDP; restart the browser
403 after ~200 requests (Xianyu) | Set ENABLE_IP_PROXY=True
MongoDB inserts slow | Batch size defaults to 100; increase it in insert() if needed
Excel corrupt on Chinese titles | openpyxl is used by default; set quote_prefix=True
Playwright timeout | Lower concurrency via MAX_CONCURRENCY in base_config.py

8. Practical Take-away & Action Checklist

  1. Install: uv sync && playwright install chromium
  2. Tweak four lines in config/base_config.py
  3. Run: python main.py (CLI) or uvicorn api.main:app (web)
  4. Scan QR / paste cookie; data lands in data/ or your DB
  5. Switch storage by changing one string—no refactoring
  6. Schedule with cron or GitHub Actions
  7. Extend platforms/storage by inheriting base classes—≈30 min job

One-page Overview

LittleCrawler is an MIT-licensed async framework that couples Playwright-driven browsers with pluggable storage (CSV → BigQuery) and a FastAPI/Next.js console. Out-of-the-box it logs in, circumvents basic anti-bot checks and dumps structured data for RED, Xianyu and Zhihu. Scaling from a laptop prototype to a cloud cron job requires only config edits; adding new sources or sinks is a matter of writing one small class. The goal: let engineers analyse data instead of maintaining fragile scrapers.


FAQ

  1. Does it run on Windows or macOS?
    Yes—any OS that runs Python 3.11 and Chromium.

  2. How long does the cookie/QR session last?
    RED ≈30 days, Zhihu 7 days, Xianyu 2–3 days; the UI shows expiry and lets you re-auth instantly.

  3. Can I feed thousands of keywords?
    Comma-separate them in KEYWORDS, or start multiple Docker containers with different config files.

  4. Is headless mode supported?
    Absolutely—set HEADLESS=True; captchas are saved to logs/captcha/ for manual solving.

  5. Do I violate ToS?
    The tool accesses only publicly visible pages and respects built-in delays; still, you must comply with each platform’s terms and local laws.

  6. Why SQLite over CSV for prototypes?
    Relational queries, indexes and upserts are free, and you can export to Excel later with one command.

  7. Can I use my existing proxy pool?
    Yes—drop a text file with http://user:pass@ip:port lines and set ENABLE_IP_PROXY=True.

  8. How hard is it to add TikTok, Instagram or Amazon?
    Create a package under platforms/, inherit BaseCrawler, implement three methods; most developers finish in under two hours.
