sese-engine: A Pocket-Sized Search Engine You Can Run on a Raspberry Pi
Core question answered in one line: Can a single Python script replace Google for your private web corner? Yes—sese-engine builds a personal index you control, on hardware cheaper than a pizza.
1 Why Bother Building Another Search Engine?
Core question: “Google and Baidu already exist—why roll my own?”
Because ranking secrecy, ads, and disappearing pages hurt research. sese-engine keeps crawl rules, index data, and ranking weights on your disk, visible and editable.
Author’s reflection: After losing half a day scrolling past ads for “best VPN” while hunting RFC drafts, I decided my lab deserved an engine that returns technical docs—nothing else.
2 What Exactly Is sese-engine? (30-Second Primer)
Core question: “What am I installing, in plain English?”
A zero-database, pure-Python crawler + indexer + search API. Feed it URLs, it spits back JSON search results. No cloud account, no fees, no black-box algorithm.
3 How It Works: From URL to JSON in Four Steps
Core question: “What happens inside the box after I type a URL?”
Step | Module | Input → Output | Default Resource |
---|---|---|---|
① Fetch | Spider | seed list → raw HTML | 1-2 CPU cores |
② Parse | Cleaner | HTML → plain text | RAM spike ~200 MB |
③ Index | Indexer | text → inverted lists | 1-2 GB RAM |
④ Query | Search | keyword → ranked JSON | <50 MB RAM |
Scenario: A grad student wants only .edu slides on "SDN". She lists four university domains in WHITE_DOMAIN, limits crawl depth to 2, and within an hour has 6,000 PDF-free pages searchable by keyword.
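The four-step pipeline can be sketched in a few lines of Python. This is a toy model, not sese-engine's actual code: the real module names differ, and the Fetch and Parse steps are stubbed with canned documents.

```python
import json
from collections import defaultdict

# Toy corpus standing in for the output of steps ① Fetch and ② Parse;
# in the real engine this text comes from the spider and HTML cleaner.
documents = {
    "https://example.edu/sdn-intro": "sdn separates control plane from data plane",
    "https://example.edu/nfv-notes": "nfv virtualises network functions on servers",
}

# Step ③ Index: build an inverted index mapping term -> list of URLs.
index = defaultdict(list)
for url, text in documents.items():
    for term in set(text.split()):
        index[term].append(url)

# Step ④ Query: answer a keyword lookup as JSON, like the /search endpoint.
def search(keyword):
    return json.dumps({"query": keyword, "results": index.get(keyword, [])})

print(search("sdn"))
```

The whole trick is that step ③ inverts the document→words mapping once, so step ④ is a dictionary lookup rather than a scan, which is why query RAM stays tiny compared to indexing RAM.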
4 Hardware & Cost Reality Check
Core question: “How small is ‘small’ hardware?”
Official reference host: 2 vCPU, 4 GB RAM, 128 GB SSD, 5 Mbps line—about US $12 per year on discount cloud.
Raspberry Pi 4 (4 GB) with an old 128 GB USB-stick handles 100 k pages happily; Pi’s idle power <3 W.
5 Installation Walk-Through (3 Steps, 5 Minutes)
Core question: “What is the absolute fastest path to a first search result?”
1. Install Python 3.8 (some dependency wheels fail to build on 3.9+).
2. Clone and fetch dependencies:
   ```shell
   git clone https://github.com/YunYouJun/sese-engine.git
   cd sese-engine
   pip install -r requirements.txt
   ```
3. Launch:
   - Windows: double-click `启动.cmd`
   - Linux/macOS: `bash 启动.sh`

Validation:

```shell
curl "http://127.0.0.1/search?q=test"
```
The first call may pause 2–3 s if the host has swapped the process out; after that, under 200 ms is normal.
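The cold-vs-warm gap is easy to measure. The sketch below times five consecutive calls; the bundled dummy /search server is there only to make the snippet self-contained, so point `base` at your real instance instead.

```python
import json, threading, time, urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class DummySearch(BaseHTTPRequestHandler):
    """Stand-in for a running sese-engine; always returns empty results."""
    def do_GET(self):
        body = json.dumps({"results": []}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # keep the demo quiet
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), DummySearch)  # port 0 = auto-pick
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"  # swap in your real host here

timings = []
for _ in range(5):
    start = time.perf_counter()
    urllib.request.urlopen(f"{base}/search?q=test").read()
    timings.append((time.perf_counter() - start) * 1000)
server.shutdown()
print(f"first call: {timings[0]:.1f} ms, warm median: {sorted(timings)[2]:.1f} ms")
```

If the first number dwarfs the median on your box, the process was swapped out or the index pages were cold; a much larger steady-state median points at the host, not the engine.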
6 Configuration Deep Dive: Five Knobs That Save Your CPU
Core question: “Which settings keep the crawler polite and my cloud bill zero?”
Parameter | Default | When to Tweak (Example) |
---|---|---|
MAX_DEPTH | 3 | Forum thread only → 1; full-site mirror → 5 |
CONCURRENT | 8 | 1-core box → 2; idle 8-core → 20 |
DELAY | 0.5 s | robots.txt asks for 1 s → set 1.2 |
WHITE_DOMAIN | [] | Gov open data only → ["gov.cn"] |
INDEX_SEGMENT | 10 000 | Million-page tier → 50 000 to reduce merge frequency |
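In the config file these knobs are plain Python assignments. A hypothetical sketch follows; the variable names here are assumptions for illustration, not necessarily the identifiers 配置.py actually uses.

```python
# Illustrative config fragment; names are assumed, not the shipped file.
MAX_DEPTH = 3           # how many links deep to follow from each seed
CONCURRENT = 8          # simultaneous fetches; drop to 2 on a 1-core box
DELAY = 0.5             # seconds between requests to one host; honour robots.txt
WHITE_DOMAIN = []       # empty = crawl anywhere; ["gov.cn"] restricts to that domain
INDEX_SEGMENT = 10_000  # pages per index segment before a merge is triggered
```

Because it is ordinary Python, you can also compute values, e.g. derive CONCURRENT from `os.cpu_count()`, rather than hard-coding them.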
Author’s reflection: I once raised CONCURRENT to 50 on a free-tier VM; the target academic site returned 502 errors and my IP got a 24-hour ban. Polite crawling is faster than being blocked.
7 Search API: Browser, Python, Shell—Your Choice
Core question: “How do I actually query the index?”
Endpoint:

```
GET /search?q=<keyword>&page=<page>&size=<size>
```
Example: Python script exports top 50 hits for each keyword in a list.
```python
import requests, csv

queries = ["sdn", "nfv", "tsn"]
rows = []
for q in queries:
    # Ask the local engine for the top 50 hits per keyword.
    r = requests.get("http://127.0.0.1/search", params={"q": q, "size": 50})
    for item in r.json()["results"]:
        rows.append([q, item["title"], item["url"]])

# Dump everything into one CSV: keyword, title, URL.
with open("export.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```

Run it and open export.csv: a ready reference table with zero mouse clicks.
8 Front-End & Docker: Giving the Engine a Face
Core question: “I don’t want to curl for guests—can I have a search box?”
- Official UI repo: YunYouJun/sese-engine-ui. Clone it, then `npm i && npm run dev` for a responsive page.
- Generic Docker (x86):
  ```shell
  docker run -p 8080:8080 -v ${PWD}/data:/app/data xiongnemo/sese-engine
  ```
- ARM Docker for Pi:
  ```shell
  docker run -p 8080:8080 -v ${PWD}/data:/app/data mengguyi/sese-engine-docker
  ```

Scenario: A librarian runs the ARM image on a Pi tucked behind the desk; students access http://10.0.0.88 for a curated tech-report collection without leaving the intranet.
9 Monitoring with Grafana: Spot a Hung Crawl in One Glance
Core question: “How do I know the crawler is alive and not stuck?”
Import the dashboard JSON from `grafana/`. Key panels:

- Queue size trend → add more seeds or reduce depth
- Download success rate → detect an IP ban early
- Index merge duration → decide on INDEX_SEGMENT size
- Query P99 latency → keep the user experience acceptable
Screenshot: traffic-light colours show green <100 ms, yellow <500 ms, red >1 s.
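If you wire your own panel instead of importing the JSON, the traffic-light thresholds above reduce to a trivial mapping. A sketch, using the thresholds from the screenshot:

```python
def latency_colour(p99_ms):
    """Map a P99 query latency in milliseconds to a traffic-light colour."""
    if p99_ms < 100:
        return "green"   # comfortably interactive
    if p99_ms < 500:
        return "yellow"  # noticeable but tolerable
    return "red"         # users will feel the lag

print(latency_colour(30))
```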
10 Performance Baseline: How Much Index Can 70 Yuan Buy?
Core question: “What scale fits the cheapest cloud instance?”
Pages | Raw Text | Index Size | Search Latency | Notes |
---|---|---|---|---|
100 k | 4 GB | 1.1 GB | 30 ms | Pi 4B 30 % idle |
1 M | 40 GB | 11 GB | 60 ms | 2 vCPU 70 % spike during merge |
5 M | 200 GB | 55 GB | 120 ms | Recommend 4 vCPU & 20 Mbps |
Move the index folder to an SSD and latency drops to roughly a third on the same box.
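The table implies two fairly stable ratios: roughly 40 KB of raw text per page and an index about 27.5 % the size of the raw text. A back-of-envelope estimator under those assumptions (constants derived only from the table above, so treat them as rough):

```python
RAW_KB_PER_PAGE = 40   # from the table: 4 GB raw text / 100 k pages
INDEX_RATIO = 0.275    # from the table: 1.1 GB index / 4 GB raw text

def estimate_gb(pages):
    """Rough raw-text and index size in (decimal) GB for a given page count."""
    raw = pages * RAW_KB_PER_PAGE / 1e6
    return raw, raw * INDEX_RATIO

raw, idx = estimate_gb(1_000_000)
print(f"1 M pages ≈ {raw:.0f} GB raw text, {idx:.1f} GB index")
```

Plugging in your disk budget tells you the page ceiling before you start crawling, rather than after the volume fills up.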
11 Limitations & Work-Arounds
Core question: “Where should I stop expecting Google-grade magic?”
- No built-in PageRank: attach your own domain-score field if needed.
- No JavaScript rendering: run headless Chrome upstream, then feed in the static HTML.
- Chinese word segmentation is not bundled: plug jieba or pkuseg into 配置.py.
- Not real-time: the default is crawl-then-index; trigger an incremental merge for fresher data.
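For the JavaScript limitation, the heavy lifting (rendering) happens upstream in a headless browser; the hand-off to the engine is just text extraction from the rendered HTML. A stdlib sketch of that second half, where the `rendered` string is a stand-in for headless-Chrome output:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Strip tags, keeping visible text: a minimal stand-in for the Cleaner step."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = False
    def handle_starttag(self, tag, attrs):
        self.skip = tag in ("script", "style")  # ignore non-visible content
    def handle_endtag(self, tag):
        self.skip = False
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

# Pretend this came from a headless-Chrome dump of a JS-heavy page.
rendered = "<html><body><h1>SDN slides</h1><script>app()</script><p>Lecture 1</p></body></html>"
parser = TextOnly()
parser.feed(rendered)
print(" ".join(parser.chunks))  # plain text, ready for the indexer
```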
Author’s reflection: I once tried to index a Twitter-like feed and learnt the hard way: fire-hose sites with aggressive anti-scraping are simply out of scope for a hobby-box. sese-engine shines on open, static, or permission-granted collections.
12 When to Pick sese-engine, When to Stay With Google
Requirement | Recommendation |
---|---|
Full control over ranking weights | sese-engine |
Trillion-page, real-time discovery | Google/Baidu |
<5 Mbps bandwidth budget | sese-engine |
Legal need to keep data in-house | sese-engine |
Zero maintenance staff | Google/Baidu |
Rule of thumb: data sovereignty > scale ⇒ sese-engine; scale > sovereignty ⇒ public engines.
13 Practical Action Checklist
- Install Python 3.8, then `pip install -r requirements.txt`.
- Edit `配置.py`: list white-listed domains, set CONCURRENT to 2–8 and DELAY to 0.5–1 s.
- Launch with `启动.sh`, then `curl "http://127.0.0.1/search?q=test"` as a smoke test.
- Optional: clone the UI repo, change `VITE_API_URL`, run `npm run build`.
- Optional: `docker run` with a volume mount for portable deployment.
- Backup: tar `data/index/` nightly; your index is rebuildable but not free to re-crawl.
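The nightly backup needs nothing beyond the stdlib. A sketch using `tarfile`, which you would schedule from cron or a systemd timer; the `data/index` path follows the checklist above, and the `backups/` destination is an assumption:

```python
import tarfile, time
from pathlib import Path

def backup_index(index_dir="data/index", dest_dir="backups"):
    """Write a dated .tar.gz snapshot of the index directory and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"index-{time.strftime('%Y%m%d')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(index_dir, arcname="index")  # store under a stable top-level name
    return archive
```

Dated filenames give you a trivial retention policy: delete snapshots older than N days with one `find -mtime` line.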
14 One-Page Overview
sese-engine = spider + indexer + HTTP API in one Python folder. No database, no cloud keys. Crawl targets you choose, build inverted index locally, search via JSON calls. Runs on $12-a-year VPS or Pi, handles 100 k pages while idling. UI and Docker images exist; Grafana dashboard included. Trade-offs: no JS execution, no real-time fire-hose, limited by polite-crawl speed. Best for vertical, permission-allowed, small-to-medium corpora where owning the ranking logic beats global scale.
15 FAQ
Q1: Python 3.9 install fails—fix?
A: Some wheels compile only up to 3.8; stay on 3.8.x.
Q2: Daily incremental updates?
A: Add a cron job for `bash 启动.sh --incremental` and enable `MERGE_ON_EXIT` in the config.
Q3: Index corruption after power loss?
A: Remove `.lock` files under `data/index/` and restart; the engine rolls back to the last commit.
Q4: Commercial use allowed?
A: MIT license—yes, but respect target-site ToS and copyright.
Q5: First query 3 s delay—normal?
A: Yes, cheap host swapped process to disk; later calls <200 ms once resident.
Q6: English stemming possible?
A: Swap the default space-splitter for nltk or spaCy inside `配置.py`.
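A toy suffix-stripper shows where the swap happens; in practice you would replace `crude_stem` with `nltk.stem.PorterStemmer().stem` or a spaCy lemma. The tokenizer hook itself is an assumption about how 配置.py is wired, so treat this as a shape, not the engine's API:

```python
def crude_stem(word):
    """Toy English suffix stripper; use nltk's PorterStemmer for real work."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Drop-in for the default space-splitter: split, lowercase, then stem."""
    return [crude_stem(w) for w in text.lower().split()]

print(tokenize("Switching switches switched"))
```

With stemming in place, "switching", "switches", and "switched" all land in one posting list, so a query for any of them finds all three forms.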
Q7: Front-end framework agnostic?
A: API returns plain JSON; any stack (React, Vue, mobile app) can consume it.