
LittleCrawler Python Framework: Master XHS, Xianyu & Zhihu Scraping in Minutes

LittleCrawler: Run Once, Own the Data — An Async Python Framework for XHS, Xianyu, and Zhihu

What exactly is LittleCrawler?
It is a batteries-included, open-source Python framework that uses Playwright, FastAPI and Next.js to scrape public posts, post details and creator pages from Xiaohongshu (RED), Xianyu (Idle Fish) and Zhihu through a single CLI command or a point-and-click web console.


1. Why Yet Another Scraper?

Core question: “My one-off script breaks every month—how can I stop babysitting logins, storage and anti-bot changes?”
One-sentence answer: LittleCrawler moves those chores into pluggable modules so you spend time on data, not duct-tape.

1.1 Pain-points the author kept hitting

I maintain micro-services for three start-ups. Every quarter someone asks:

  • “Can we monitor RED for camping keywords?”
  • “We need Xianyu price drops for camera gear.”
  • “Please archive high-score Zhihu answers for our research team.”

Each request started as a 50-line script and ended in a fragile mess of hard-coded headers, expired cookies and broken XPath selectors. Abstracting the common parts (login, storage, rate limiting, UI) produced LittleCrawler.

1.2 Where it fits

Scenario | Platform | Type | Output | Value
Trend spotting | RED | search | JSON | Discover next week’s hot topic before it peaks
Price intelligence | Xianyu | search | Excel | Daily price mean & median for any SKU
Content backup | Zhihu | creator | SQLite | Full off-platform copy of your answers

2. Architecture at a Glance

Core question: “How is the code organised and why should I trust it will scale from 100 to 100 000 records?”
One-sentence answer: Async Python 3.11 + Playwright delivers concurrent browser contexts, while a clean storage interface lets you swap CSV for MongoDB with zero code changes.

2.1 Async pipeline

  • asyncio task limiter keeps 20–50 browser contexts per 4-core VM
  • Each context fetches, renders and parses in <3 s; RAM stays under 300 MB
  • Results stream into storage through async/await with no blocking I/O (see the sketch below)
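
To make the pattern concrete, here is a minimal sketch of a semaphore-limited Playwright pipeline. It assumes nothing about LittleCrawler's internals; the URL list, concurrency cap and in-memory result list are purely illustrative:

import asyncio
from playwright.async_api import async_playwright

MAX_CONTEXTS = 20  # illustrative cap; tune to your CPU/RAM budget

async def fetch_one(browser, url, sem, results):
    """Fetch and parse one page inside its own browser context."""
    async with sem:  # the semaphore acts as the task limiter
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        results.append({"url": url, "title": await page.title()})
        await context.close()

async def main(urls):
    sem = asyncio.Semaphore(MAX_CONTEXTS)
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        await asyncio.gather(*(fetch_one(browser, u, sem, results) for u in urls))
        await browser.close()
    return results

if __name__ == "__main__":
    print(asyncio.run(main(["https://example.com"])))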

2.2 Anti-detection tricks baked in

Feature | Default | Purpose
CDP mode | ON | Talk to Chromium via DevTools Protocol—no WebDriver flag
Fingerprint randomiser | ON | User-Agent, WebGL noise, viewport resize every session
Proxy rotator | OFF* | Optional IP rotation via services/proxy/

*Set ENABLE_IP_PROXY=True and feed a text file of http://user:pass@ip:port lines.
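
To illustrate the fingerprint-randomiser idea (this is a generic sketch, not LittleCrawler's actual implementation), rotating the User-Agent and viewport per session with plain Playwright looks roughly like this:

import random

# Illustrative pool; the framework ships its own randomiser.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36",
]

async def new_randomised_context(browser):
    """Return a context with a randomised User-Agent and viewport.
    `browser` is a Playwright Browser launched elsewhere."""
    return await browser.new_context(
        user_agent=random.choice(USER_AGENTS),
        viewport={"width": random.randint(1280, 1920),
                  "height": random.randint(720, 1080)},
    )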

2.3 Storage abstraction

All adapters inherit BaseStorage and implement connect(), insert(), close().
Switching from JSON to MySQL is a one-word change in base_config.py; no business logic touches SQL.
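
A minimal sketch of what that contract can look like, using the method names above; the JSON adapter and its file path are illustrative, not the framework's real code:

import json
from abc import ABC, abstractmethod

class BaseStorage(ABC):
    """Contract every storage adapter implements."""

    @abstractmethod
    async def connect(self): ...

    @abstractmethod
    async def insert(self, items): ...

    @abstractmethod
    async def close(self): ...

class JsonStorage(BaseStorage):
    """Buffer items in memory, flush to a JSON file on close."""

    def __init__(self, path="data/output.json"):  # path is illustrative
        self.path = path
        self._buffer = []

    async def connect(self):
        self._buffer = []

    async def insert(self, items):
        self._buffer.extend(items)

    async def close(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self._buffer, f, ensure_ascii=False, indent=2)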


3. Ten-minute Quick-start

Core question: “I have a clean Ubuntu 22.04 box—how long until I see real data?”
One-sentence answer: Four commands and one QR-code scan; you’ll have 1 000 RED posts in under five minutes.

3.1 Install

# 1. Get code
git clone https://github.com/pbeenig/LittleCrawler.git
cd LittleCrawler

# 2. Dependencies (uv is faster than pip)
uv sync                 # or: pip install -r requirements.txt
playwright install chromium

3.2 Minimum viable config

Edit config/base_config.py:

PLATFORM = "xhs"           # RED
KEYWORDS = "camping,coffee"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "json"

3.3 Run

python main.py

A Chromium window opens and shows a QR code; scan it with the RED app.
Terminal prints:

[INFO] 2026-01-10 11:12:03 | Login succeeded
[INFO] 2026-01-10 11:12:04 | Crawling 0–20 ...

After 1 000 posts you’ll find data/xhs_search_20260110_111503.json.
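
A quick, optional sanity check before piping the dump anywhere else, assuming the file holds a JSON array of post objects (field names depend on the crawler output):

import json

with open("data/xhs_search_20260110_111503.json", encoding="utf-8") as f:
    posts = json.load(f)

print(len(posts), "posts")
print(sorted(posts[0].keys()))  # inspect which fields are available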

Author’s reflection
The first time I ran this I picked CSV output because “everyone loves Excel”. Sorting 30 000 rows later taught me a lesson—start with SQLite even for prototypes; migrations are free.


4. Web Console: Click, Run, Download

Core question: “Can my non-engineering teammate launch a crawl and download Excel?”
One-sentence answer: Yes—compile the Next.js front-end once and the UI guides her through login, keyword entry and file export.

4.1 Build & launch

# build front-end into api/ui
cd web && npm run build

# start API + static pages
uv run uvicorn api.main:app --port 8080 --reload

Visit http://127.0.0.1:8080 → three screens:

  1. Login – QR or Cookie
  2. Run – dropdown for platform/type/keyword
  3. Data – live log + download button when finished

4.2 Dev mode (API only)

If you already have a separate front-end team:

API_ONLY=1 uv run uvicorn api.main:app --port 8080 --reload
cd web && npm run dev      # hot-reload for UI tweaks

4.3 UI preview

Screen | What you see | Why it matters
Login | QR code refreshes every 5 s | No need to restart if it times out
Run | Progress bar + log tail | Phone-friendly; you can monitor while commuting
Export | Buttons for JSON/Excel/SQLite | Same data, multiple formats without rerunning

5. Deep Dive: Three Real Workloads

Core question: “Show me complete examples—config, run, gotcha, result.”
One-sentence answer: Below are verbatim configs and author notes from recent production jobs.

5.1 Use-case A – RED camping trend (30-day daily pull)

Goal: Top 1 000 newest posts for keyword “camping”, compute daily like-growth.
Config

PLATFORM = "xhs"
KEYWORDS = "camping"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "sqlite"

Notes

  • Added column insert_date DEFAULT (date('now'))
  • Next run skips duplicates with
    WHERE note_id NOT IN (SELECT note_id FROM xhs WHERE insert_date = date('now','-1 day'))
  • A SQLite query builds the top-50 ranking, which is exported to Markdown and pasted into the Feishu group every morning.
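
For reference, the daily like-growth from the goal can be computed straight from the SQLite dump. A sketch, assuming a table named xhs with a liked_count column plus the insert_date column added above (paths and names are illustrative):

import sqlite3

# Path and schema are assumptions; adjust to your actual SQLite output.
conn = sqlite3.connect("data/xhs.db")
rows = conn.execute(
    """
    WITH daily AS (
        SELECT insert_date, SUM(liked_count) AS total_likes
        FROM xhs
        GROUP BY insert_date
    )
    SELECT insert_date,
           total_likes,
           total_likes - LAG(total_likes) OVER (ORDER BY insert_date) AS like_growth
    FROM daily
    ORDER BY insert_date
    """
).fetchall()

for insert_date, total_likes, like_growth in rows:
    print(insert_date, total_likes, like_growth)
conn.close()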

5.2 Use-case B – Xianyu price alert for Fuji X100V

Goal: Email when the 7-day median < CNY 4 500.
Config

PLATFORM = "xhy"
KEYWORDS = "富士X100V"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "mongodb"
ENABLE_IP_PROXY = True

Notes

  • A MongoDB aggregation calculates the rolling median (sketched below).
  • A 5-line Python helper watches the result collection and calls SendGrid when the condition is met.
  • Without a proxy you hit a captcha after ~200 requests; with a proxy pool (free tier, 50 IPs) we routinely collect 5 000 items/day.
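
A rough sketch of such a watcher, assuming documents with price and crawled_at fields in a xianyu collection; the connection string, names and the alert stub are placeholders, not LittleCrawler internals:

from datetime import datetime, timedelta
from statistics import median

from pymongo import MongoClient

THRESHOLD = 4500  # CNY

def send_alert(msg):
    """Placeholder: plug in SendGrid, SMTP, or a webhook here."""
    print("ALERT:", msg)

client = MongoClient("mongodb://localhost:27017")
coll = client["littlecrawler"]["xianyu"]

# Collect prices from the last 7 days and compare the median to the threshold.
week_ago = datetime.utcnow() - timedelta(days=7)
prices = [doc["price"] for doc in coll.find({"crawled_at": {"$gte": week_ago}}, {"price": 1})]

if prices and median(prices) < THRESHOLD:
    send_alert(f"7-day median {median(prices):.0f} CNY is below {THRESHOLD}")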

5.3 Use-case C – Zhihu high-score answer export

Goal: 500 top-voted answers from topic “2026 Spring Festival travel” in a single Excel file for offline reading on a train.
Config

PLATFORM = "zhihu"
KEYWORDS = "2026春运"
CRAWLER_TYPE = "search"
SAVE_DATA_OPTION = "excel"

Notes

  • Excel column =HYPERLINK("https://www.zhihu.com/question/"&A2,"open") gives one-click jump.
  • Zhihu’s rate limit is the toughest; we use cookie login (z_c0) and the exponential back-off built into the base crawler (max 5 retries, 30 s → 5 min; illustrated below).
  • Headless mode runs fine on a 2-core cloud box; 500 answers take ≈ 12 min.
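
For the curious, a generic sketch of the stated back-off policy (30 s doubling up to 5 min, max 5 retries); this is not the framework's actual code:

import asyncio
import random

async def with_backoff(coro_factory, max_retries=5, base_delay=30, max_delay=300):
    """Retry an async call with exponential back-off: 30 s doubling up to 5 min."""
    for attempt in range(max_retries):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * 2 ** attempt, max_delay)
            await asyncio.sleep(delay + random.uniform(0, 3))  # small jitter

Call it as await with_backoff(lambda: fetch_page(url)) around any flaky request, where fetch_page is whatever coroutine performs the fetch.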

6. Code Walk-through: Where to Change What

Core question: “I need to add TikTok support and store to BigQuery—where do I poke?”
One-sentence answer: Add a folder under platforms/ for crawler logic and a file under storage/ for BigQuery adapter—each <200 lines thanks to base classes.

6.1 Project tree (annotated)

LittleCrawler
├── main.py                  # CLI entry, argparse
├── config/base_config.py    # single source of truth
├── src
│   ├── core
│   │   ├── base_crawler.py  # defines async run(), login(), save()
│   │   └── context.py       # keeps one browser instance per task
│   ├── platforms            # one package per site
│   │   ├── xhs/search.py    # RED search strategy
│   │   ├── xhy/search.py    # Xianyu search strategy
│   │   └── zhihu/search.py  # Zhihu search strategy
│   ├── storage              # plug-ins
│   │   ├── sqlite_storage.py
│   │   ├── mongodb_storage.py
│   │   └── excel_storage.py
│   └── models               # Pydantic schemas for type safety
├── api/main.py              # FastAPI routes
├── web                      # Next.js front-end source
└── libs                     # third-party JS injected into pages

6.2 Adding a new platform (example skeleton)

Create platforms/tiktok/crawler.py:

from src.core.base_crawler import BaseCrawler
class TikTokCrawler(BaseCrawler):
    async def login(self): ...
    async def crawl(self): ...

Register in platforms/__init__.py and add "tiktok" to the UI dropdown—done.

6.3 Adding a new storage backend

Create storage/bigquery_storage.py:

from src.storage.base_storage import BaseStorage
class BigQueryStorage(BaseStorage):
    async def connect(self): ...
    async def insert(self, items): ...
    async def close(self): ...

Set SAVE_DATA_OPTION = "bigquery" in config; no other code changes.


7. Troubleshooting Cheat-sheet

Core question: “Captcha, 403, broken XPath—what do I do quickly?”
One-sentence answer: Use built-in logs, captcha folder, retry policy and proxy toggles before touching code.

Symptom | Quick fix
QR code shows “environment abnormal” | Enable proxy + CDP; restart the browser
403 after ~200 requests (Xianyu) | Set ENABLE_IP_PROXY=True
MongoDB inserts slow | Batch size defaults to 100; increase it in insert() if needed
Excel corrupt on Chinese titles | openpyxl is used by default; set quote_prefix=True
Playwright timeout | Lower concurrency via MAX_CONCURRENCY in base_config.py

8. Practical Take-away & Action Checklist

  1. Install: uv sync && playwright install chromium
  2. Tweak four lines in config/base_config.py
  3. Run: python main.py (CLI) or uvicorn api.main:app (web)
  4. Scan QR / paste cookie; data lands in data/ or your DB
  5. Switch storage by changing one string—no refactoring
  6. Schedule with cron or GitHub Actions
  7. Extend platforms/storage by inheriting base classes—≈30 min job

One-page Overview

LittleCrawler is an MIT-licensed async framework that couples Playwright-driven browsers with pluggable storage (CSV → BigQuery) and a FastAPI/Next.js console. Out-of-the-box it logs in, circumvents basic anti-bot checks and dumps structured data for RED, Xianyu and Zhihu. Scaling from a laptop prototype to a cloud cron job requires only config edits; adding new sources or sinks is a matter of writing one small class. The goal: let engineers analyse data instead of maintaining fragile scrapers.


FAQ

  1. Does it run on Windows or macOS?
    Yes—any OS that runs Python 3.11 and Chromium.

  2. How long does the cookie/QR session last?
    RED ≈30 days, Zhihu 7 days, Xianyu 2–3 days; the UI shows expiry and lets you re-auth instantly.

  3. Can I feed thousands of keywords?
    Comma-separate them in KEYWORDS, or start multiple Docker containers with different config files.

  4. Is headless mode supported?
    Absolutely—set HEADLESS=True; captchas are saved to logs/captcha/ for manual solving.

  5. Do I violate ToS?
    The tool accesses only publicly visible pages and respects built-in delays; still, you must comply with each platform’s terms and local laws.

  6. Why SQLite over CSV for prototypes?
    Relational queries, indexes and upserts are free, and you can export to Excel later with one command.

  7. Can I use my existing proxy pool?
    Yes—drop a text file with http://user:pass@ip:port lines and set ENABLE_IP_PROXY=True.

  8. How hard is it to add TikTok, Instagram or Amazon?
    Create a package under platforms/, inherit BaseCrawler, implement three methods; most developers finish in under two hours.
