sese-engine: Build a Personal Search Engine on Raspberry Pi for Under $12/Year

Core question answered in one line: Can a single Python script replace Google for your private web corner? Yes—sese-engine builds a personal index you control, on hardware cheaper than a pizza.


1 Why Bother Building Another Search Engine?

Core question: “Google and Baidu already exist—why roll my own?”
Because ranking secrecy, ads, and disappearing pages hurt research. sese-engine keeps crawl rules, index data, and ranking weights on your disk, visible and editable.

Author’s reflection: After losing half a day scrolling past ads for “best VPN” while hunting RFC drafts, I decided my lab deserved an engine that returns technical docs—nothing else.


2 What Exactly Is sese-engine? (30-Second Primer)

Core question: “What am I installing, in plain English?”
A zero-database, pure-Python crawler + indexer + search API. Feed it URLs, it spits back JSON search results. No cloud account, no fees, no black-box algorithm.


3 How It Works: From URL to JSON in Four Steps

Core question: “What happens inside the box after I type a URL?”

| Step | Module | Input → Output | Default Resource |
|------|--------|----------------|------------------|
| ① Fetch | Spider | seed list → raw HTML | 1–2 CPU cores |
| ② Parse | Cleaner | HTML → plain text | RAM spike ~200 MB |
| ③ Index | Indexer | text → inverted lists | 1–2 GB RAM |
| ④ Query | Search | keyword → ranked JSON | <50 MB RAM |

Scenario: A grad student wants only .edu slides on “SDN”. She lists four university domains in WHITE_DOMAIN, limits crawl depth to 2, and within an hour has 6,000 pages (PDFs filtered out) searchable by keyword.
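The four steps can be condensed into a toy model. This is not the project's actual code (module names and data structures are simplified for illustration), but it shows the data flow from seed pages to keyword lookup:

```python
import re
from collections import defaultdict

# ② Parse: strip tags, keep plain text (a crude stand-in for the Cleaner)
def to_text(html):
    return re.sub(r"<[^>]+>", " ", html)

# ③ Index: build inverted lists mapping each token to the pages containing it
def build_index(pages):
    index = defaultdict(set)
    for url, html in pages.items():
        for token in to_text(html).lower().split():
            index[token].add(url)
    return index

# ④ Query: intersect the inverted lists for every keyword
def search(index, query):
    hits = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()

# ① Fetch is simulated here with a hard-coded seed → HTML mapping
pages = {
    "https://example.edu/sdn": "<h1>SDN slides</h1><p>openflow basics</p>",
    "https://example.edu/nfv": "<h1>NFV intro</h1><p>virtual functions</p>",
}
index = build_index(pages)
print(search(index, "sdn openflow"))  # → {'https://example.edu/sdn'}
```

The real engine persists the inverted lists to disk in segments, but the lookup idea is the same: one set per token, intersected per query.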


4 Hardware & Cost Reality Check

Core question: “How small is ‘small’ hardware?”
Official reference host: 2 vCPU, 4 GB RAM, 128 GB SSD, 5 Mbps line, about US $12 per year on a discounted cloud plan.
A Raspberry Pi 4 (4 GB) with an old 128 GB USB stick handles 100 k pages happily, and the Pi idles below 3 W.


5 Installation Walk-Through (3 Steps, 5 Minutes)

Core question: “What is the absolute fastest path to a first search result?”

  1. Install Python 3.8 (3.9+ fails on some wheels).
  2. Clone & fetch deps
    git clone https://github.com/YunYouJun/sese-engine.git
    cd sese-engine
    pip install -r requirements.txt
    
  3. Launch
    • Windows: double-click 启动.cmd
    • Linux/macOS: bash 启动.sh

Validation:

curl "http://127.0.0.1/search?q=test"

First call may pause 2–3 s if the host swapped the process out; afterwards <200 ms is normal.


6 Configuration Deep Dive: Five Knobs That Save Your CPU

Core question: “Which settings keep the crawler polite and my cloud bill zero?”

| Parameter | Default | When to Tweak (Example Scenario) |
|-----------|---------|----------------------------------|
| MAX_DEPTH | 3 | Forum thread only → 1; full site mirror → 5 |
| CONCURRENT | 8 | 1-core box → 2; 8-core idle → 20 |
| DELAY | 0.5 s | robots.txt asks 1 s → 1.2 |
| WHITE_DOMAIN | [] | Gov open data only → ["gov.cn"] |
| INDEX_SEGMENT | 10 000 | Million-page tier → 50 000 to reduce merge frequency |

Author’s reflection: I once blasted CONCURRENT to 50 on a free tier VM; the target academic site returned 502 errors and my IP got a 24 h ban. Polite crawling is faster than being blocked.
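What "polite" means in code is simply a concurrency cap plus a per-request pause. A minimal sketch reusing the knob names from the table above (the project's real internals differ; the fetch is a stand-in):

```python
import threading
import time

CONCURRENT = 2   # at most 2 requests in flight (the knob from the table)
DELAY = 0.1      # pause per request; use 0.5-1 s against real sites

semaphore = threading.Semaphore(CONCURRENT)
fetched = []

def polite_fetch(url):
    with semaphore:          # never exceed CONCURRENT simultaneous fetches
        time.sleep(DELAY)    # rate-limit so the target server stays happy
        fetched.append(url)  # stand-in for the real HTTP download

urls = [f"https://example.com/page/{i}" for i in range(6)]
threads = [threading.Thread(target=polite_fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(fetched))  # → 6, but never more than 2 downloads at once
```

Cranking CONCURRENT up only moves the bottleneck to the target server; as the 502-and-ban story shows, the semaphore is what keeps throughput sustainable.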


7 Search API: Browser, Python, Shell—Your Choice

Core question: “How do I actually query the index?”

Endpoint:

GET /search?q=<keyword>&page=<page>&size=<size>

Example: Python script exports top 50 hits for each keyword in a list.

import csv
import requests

queries = ["sdn", "nfv", "tsn"]
rows = [["query", "title", "url"]]  # CSV header row

for q in queries:
    # Ask the local engine for the top 50 hits per keyword
    r = requests.get("http://127.0.0.1/search", params={"q": q, "size": 50})
    r.raise_for_status()
    for item in r.json()["results"]:
        rows.append([q, item["title"], item["url"]])

with open("export.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

Run it, open export.csv, you have a ready reference table with zero mouse clicks.
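Results beyond the first page come from the same endpoint with a higher page value. A small generator that keeps paging until the engine returns an empty list; the fetch function is injectable so you can plug in requests.get, and whether page counts from 0 or 1 is worth checking against your own instance:

```python
def iter_all_results(q, fetch, size=50):
    """Yield every hit for `q`, paging until a page comes back empty.

    `fetch(q, page, size)` must return the list found under the API's
    "results" key for that page.
    """
    page = 0
    while True:
        batch = fetch(q, page, size)
        if not batch:
            return
        yield from batch
        page += 1

# Demo with a stub standing in for the HTTP call: 120 hits = 2 full
# pages, one partial page, then an empty page that stops the loop
data = [{"title": f"t{i}", "url": f"u{i}"} for i in range(120)]
stub = lambda q, page, size: data[page * size:(page + 1) * size]
hits = list(iter_all_results("sdn", stub))
print(len(hits))  # → 120
```

For the live engine, the fetch argument would be a thin wrapper that calls the /search endpoint and returns `r.json()["results"]`.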


8 Front-End & Docker: Giving the Engine a Face

Core question: “I don’t want to curl for guests—can I have a search box?”

  • Official UI repo: YunYouJun/sese-engine-ui
    Clone → npm i && npm run dev → responsive page ready.

  • Generic Docker (x86):

    docker run -p 8080:8080 -v ${PWD}/data:/app/data xiongnemo/sese-engine
    
  • ARM Docker for Pi:

    docker run -p 8080:8080 -v ${PWD}/data:/app/data mengguyi/sese-engine-docker
    

Scenario: A librarian runs the ARM image on a Pi tucked behind the desk; students access http://10.0.0.88 for a curated tech-report collection without leaving the intranet.


9 Monitoring with Grafana: Spot a Hung Crawl in One Glance

Core question: “How do I know the crawler is alive and not stuck?”

Import the dashboard JSON in grafana/. Key panels:

  • Queue size trending → add more seeds or reduce depth
  • Download success rate → detect IP ban early
  • Index merge duration → decide on SEGMENT size
  • Query P99 latency → keep user experience acceptable
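If Grafana feels like overkill, the P99 panel can be approximated from a handful of timed queries. A nearest-rank percentile helper that takes raw latencies in milliseconds (collect them however you like, for instance by timing repeated requests calls):

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: the sample at rank ceil(p% of n)."""
    ordered = sorted(latencies_ms)
    # clamp the rank into the valid index range
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

samples = [12, 15, 14, 13, 900, 16, 14, 13, 15, 14]  # one slow outlier
print(percentile(samples, 99))  # → 900
print(percentile(samples, 50))  # → 14
```

The point of P99 over an average: one 900 ms outlier barely moves the mean, but it is exactly what an unlucky user feels, and exactly what the red traffic-light threshold should catch.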


Screenshot: traffic-light colours show green <100 ms, yellow <500 ms, red >1 s.


10 Performance Baseline: How Much Index Can 70 Yuan Buy?

Core question: “What scale fits the cheapest cloud instance?”

| Pages | Raw Text | Index Size | Search Latency | Notes |
|-------|----------|------------|----------------|-------|
| 100 k | 4 GB | 1.1 GB | 30 ms | Pi 4B, 30 % idle |
| 1 M | 40 GB | 11 GB | 60 ms | 2 vCPU, 70 % spike during merge |
| 5 M | 200 GB | 55 GB | 120 ms | Recommend 4 vCPU & 20 Mbps |

Move the index folder to an SSD and search latency drops roughly threefold on the same box.
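The table implies two fairly stable ratios: roughly 40 KB of raw text per page, and an index around 27–28 % of the raw text at every tier. A back-of-envelope estimator built on just those two numbers (an extrapolation, not a guarantee; measure before provisioning):

```python
KB_PER_PAGE = 40     # raw text per page, from the 100 k → 4 GB row
INDEX_RATIO = 0.275  # index size / raw text, consistent across all rows

def estimate(pages):
    """Return (raw_text_gb, index_gb) for a target page count."""
    raw_gb = pages * KB_PER_PAGE / 1024 / 1024
    return raw_gb, raw_gb * INDEX_RATIO

raw, idx = estimate(100_000)
print(f"raw ≈ {raw:.1f} GB, index ≈ {idx:.1f} GB")  # → raw ≈ 3.8 GB, index ≈ 1.0 GB
```

Plugging in 5 000 000 reproduces the last table row closely, which is a decent sanity check that the scaling really is near-linear.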


11 Limitations & Work-Arounds

Core question: “Where should I stop expecting Google-grade magic?”

  • No built-in PageRank—attach your own domain score field if needed
  • No JavaScript rendering—use headless Chrome upstream, then feed static HTML
  • Chinese word segmentation not bundled—plug jieba or pkuseg inside 配置.py
  • Not real-time—default is crawl-then-index; trigger incremental merge for fresher data

Author’s reflection: I once tried to index a Twitter-like feed and learnt the hard way: fire-hose sites with aggressive anti-scraping are simply out of scope for a hobby-box. sese-engine shines on open, static, or permission-granted collections.


12 When to Pick sese-engine, When to Stay With Google

| Requirement | Recommendation |
|-------------|----------------|
| Full control over ranking weights | sese-engine |
| Trillion-page, real-time discovery | Google/Baidu |
| <5 Mbps bandwidth budget | sese-engine |
| Legal need to keep data in-house | sese-engine |
| Zero maintenance staff | Google/Baidu |

Rule of thumb: data sovereignty > scale ⇒ sese-engine; scale > sovereignty ⇒ public engines.


13 Practical Action Checklist

  1. Install Python 3.8 → pip install -r requirements.txt
  2. Edit 配置.py: list white-domains, set CONCURRENT 2–8, delay 0.5–1 s
  3. Launch with 启动.sh, then run curl "http://127.0.0.1/search?q=test" as a smoke test
  4. Optional: clone UI repo, change VITE_API_URL, npm run build
  5. Optional: docker run with volume mount for portable deployment
  6. Backup: tar data/index/ nightly—your index is rebuildable but not free to re-crawl

14 One-Page Overview

sese-engine = spider + indexer + HTTP API in one Python folder. No database, no cloud keys. Crawl targets you choose, build inverted index locally, search via JSON calls. Runs on $12-a-year VPS or Pi, handles 100 k pages while idling. UI and Docker images exist; Grafana dashboard included. Trade-offs: no JS execution, no real-time fire-hose, limited by polite-crawl speed. Best for vertical, permission-allowed, small-to-medium corpora where owning the ranking logic beats global scale.


15 FAQ

Q1: Python 3.9 install fails—fix?
A: Some wheels compile only up to 3.8; stay on 3.8.x.

Q2: Daily incremental updates?
A: Add cron bash 启动.sh --incremental and enable MERGE_ON_EXIT in config.

Q3: Index corruption after power loss?
A: Remove .lock files under data/index/, restart; engine rolls back to last commit.

Q4: Commercial use allowed?
A: MIT license—yes, but respect target-site ToS and copyright.

Q5: First query 3 s delay—normal?
A: Yes, cheap host swapped process to disk; later calls <200 ms once resident.

Q6: English stemming possible?
A: Swap default space-splitter for nltk or spaCy inside 配置.py.

Q7: Front-end framework agnostic?
A: API returns plain JSON; any stack (React, Vue, mobile app) can consume it.
