sese-engine: Build a Personal Search Engine on Raspberry Pi for Under $12/Year

Core question answered in one line: Can a single Python script replace Google for your private web corner? Yes—sese-engine builds a personal index you control, on hardware cheaper than a pizza.


1 Why Bother Building Another Search Engine?

Core question: “Google and Baidu already exist—why roll my own?”
Because ranking secrecy, ads, and disappearing pages hurt research. sese-engine keeps crawl rules, index data, and ranking weights on your disk, visible and editable.

Author’s reflection: After losing half a day scrolling past ads for “best VPN” while hunting RFC drafts, I decided my lab deserved an engine that returns technical docs—nothing else.


2 What Exactly Is sese-engine? (30-Second Primer)

Core question: “What am I installing, in plain English?”
A zero-database, pure-Python crawler + indexer + search API. Feed it URLs, it spits back JSON search results. No cloud account, no fees, no black-box algorithm.


3 How It Works: From URL to JSON in Four Steps

Core question: “What happens inside the box after I type a URL?”

| Step | Module | Input → Output | Default Resource |
|------|--------|----------------|------------------|
| ① Fetch | Spider | seed list → raw HTML | 1–2 CPU cores |
| ② Parse | Cleaner | HTML → plain text | RAM spike ~200 MB |
| ③ Index | Indexer | text → inverted lists | 1–2 GB RAM |
| ④ Query | Search | keyword → ranked JSON | <50 MB RAM |

Scenario: A grad student wants only .edu slides on “SDN”. She lists four university domains in WHITE_DOMAIN, limits crawl depth to 2, and within an hour has 6,000 pages (PDFs filtered out) searchable by keyword.
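The four steps can be condensed into a toy model. This is not the project's actual code (module names and data structures are simplified for illustration), but it shows the data flow from seed pages to keyword lookup:

```python
import re
from collections import defaultdict

# ② Parse: strip tags, keep plain text (a crude stand-in for the Cleaner)
def to_text(html):
    return re.sub(r"<[^>]+>", " ", html)

# ③ Index: build inverted lists mapping each token to the pages containing it
def build_index(pages):
    index = defaultdict(set)
    for url, html in pages.items():
        for token in to_text(html).lower().split():
            index[token].add(url)
    return index

# ④ Query: intersect the inverted lists for every keyword
def search(index, query):
    hits = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()

# ① Fetch is simulated here with a hard-coded seed → HTML mapping
pages = {
    "https://example.edu/sdn": "<h1>SDN slides</h1><p>openflow basics</p>",
    "https://example.edu/nfv": "<h1>NFV intro</h1><p>virtual functions</p>",
}
index = build_index(pages)
print(search(index, "sdn openflow"))  # → {'https://example.edu/sdn'}
```

The real engine persists the inverted lists to disk in segments, but the lookup idea is the same: one set per token, intersected per query.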


4 Hardware & Cost Reality Check

Core question: “How small is ‘small’ hardware?”
Official reference host: 2 vCPU, 4 GB RAM, 128 GB SSD, 5 Mbps line, about US $12 per year on a discounted cloud plan.
A Raspberry Pi 4 (4 GB) with an old 128 GB USB stick handles 100 k pages happily, and the Pi idles below 3 W.


5 Installation Walk-Through (3 Steps, 5 Minutes)

Core question: “What is the absolute fastest path to a first search result?”

  1. Install Python 3.8 (3.9+ fails on some wheels).
  2. Clone & fetch deps
    git clone https://github.com/YunYouJun/sese-engine.git
    cd sese-engine
    pip install -r requirements.txt
    
  3. Launch
    • Windows: double-click 启动.cmd
    • Linux/macOS: bash 启动.sh

Validation:

curl "http://127.0.0.1/search?q=test"

First call may pause 2–3 s if the host swapped the process out; afterwards <200 ms is normal.


6 Configuration Deep Dive: Five Knobs That Save Your CPU

Core question: “Which settings keep the crawler polite and my cloud bill zero?”

| Parameter | Default | When to Tweak (Example Scenario) |
|-----------|---------|----------------------------------|
| MAX_DEPTH | 3 | Forum thread only → 1; full site mirror → 5 |
| CONCURRENT | 8 | 1-core box → 2; 8-core idle → 20 |
| DELAY | 0.5 s | robots.txt asks 1 s → 1.2 |
| WHITE_DOMAIN | [] | Gov open data only → ["gov.cn"] |
| INDEX_SEGMENT | 10 000 | Million-page tier → 50 000 to reduce merge frequency |

Author’s reflection: I once blasted CONCURRENT to 50 on a free tier VM; the target academic site returned 502 errors and my IP got a 24 h ban. Polite crawling is faster than being blocked.
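What "polite" means in code is simply a concurrency cap plus a per-request pause. A minimal sketch reusing the knob names from the table above (the project's real internals differ; the fetch is a stand-in):

```python
import threading
import time

CONCURRENT = 2   # at most 2 requests in flight (the knob from the table)
DELAY = 0.1      # pause per request; use 0.5-1 s against real sites

semaphore = threading.Semaphore(CONCURRENT)
fetched = []

def polite_fetch(url):
    with semaphore:          # never exceed CONCURRENT simultaneous fetches
        time.sleep(DELAY)    # rate-limit so the target server stays happy
        fetched.append(url)  # stand-in for the real HTTP download

urls = [f"https://example.com/page/{i}" for i in range(6)]
threads = [threading.Thread(target=polite_fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(fetched))  # → 6, but never more than 2 downloads at once
```

Cranking CONCURRENT up only moves the bottleneck to the target server; as the 502-and-ban story shows, the semaphore is what keeps throughput sustainable.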


7 Search API: Browser, Python, Shell—Your Choice

Core question: “How do I actually query the index?”

Endpoint:

GET /search?q=<keyword>&page=<page>&size=<size>

Example: Python script exports top 50 hits for each keyword in a list.

import csv
import requests

queries = ["sdn", "nfv", "tsn"]
rows = [["query", "title", "url"]]  # CSV header row

for q in queries:
    # Ask the local engine for the top 50 hits per keyword
    r = requests.get("http://127.0.0.1/search", params={"q": q, "size": 50})
    r.raise_for_status()
    for item in r.json()["results"]:
        rows.append([q, item["title"], item["url"]])

with open("export.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

Run it, open export.csv, you have a ready reference table with zero mouse clicks.
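Results beyond the first page come from the same endpoint with a higher page value. A small generator that keeps paging until the engine returns an empty list; the fetch function is injectable so you can plug in requests.get, and whether page counts from 0 or 1 is worth checking against your own instance:

```python
def iter_all_results(q, fetch, size=50):
    """Yield every hit for `q`, paging until a page comes back empty.

    `fetch(q, page, size)` must return the list found under the API's
    "results" key for that page.
    """
    page = 0
    while True:
        batch = fetch(q, page, size)
        if not batch:
            return
        yield from batch
        page += 1

# Demo with a stub standing in for the HTTP call: 120 hits = 2 full
# pages, one partial page, then an empty page that stops the loop
data = [{"title": f"t{i}", "url": f"u{i}"} for i in range(120)]
stub = lambda q, page, size: data[page * size:(page + 1) * size]
hits = list(iter_all_results("sdn", stub))
print(len(hits))  # → 120
```

For the live engine, the fetch argument would be a thin wrapper that calls the /search endpoint and returns `r.json()["results"]`.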


8 Front-End & Docker: Giving the Engine a Face

Core question: “I don’t want to curl for guests—can I have a search box?”

  • Official UI repo: YunYouJun/sese-engine-ui
    Clone → npm i && npm run dev → responsive page ready.

  • Generic Docker (x86):

    docker run -p 8080:8080 -v ${PWD}/data:/app/data xiongnemo/sese-engine
    
  • ARM Docker for Pi:

    docker run -p 8080:8080 -v ${PWD}/data:/app/data mengguyi/sese-engine-docker
    

Scenario: A librarian runs the ARM image on a Pi tucked behind the desk; students access http://10.0.0.88 for a curated tech-report collection without leaving the intranet.


9 Monitoring with Grafana: Spot a Hung Crawl in One Glance

Core question: “How do I know the crawler is alive and not stuck?”

Import the dashboard JSON in grafana/. Key panels:

  • Queue size trending → add more seeds or reduce depth
  • Download success rate → detect IP ban early
  • Index merge duration → decide on SEGMENT size
  • Query P99 latency → keep user experience acceptable
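If Grafana feels like overkill, the P99 panel can be approximated from a handful of timed queries. A nearest-rank percentile helper that takes raw latencies in milliseconds (collect them however you like, for instance by timing repeated requests calls):

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: the sample at rank ceil(p% of n)."""
    ordered = sorted(latencies_ms)
    # clamp the rank into the valid index range
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

samples = [12, 15, 14, 13, 900, 16, 14, 13, 15, 14]  # one slow outlier
print(percentile(samples, 99))  # → 900
print(percentile(samples, 50))  # → 14
```

The point of P99 over an average: one 900 ms outlier barely moves the mean, but it is exactly what an unlucky user feels, and exactly what the red traffic-light threshold should catch.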


Screenshot: traffic-light colours show green <100 ms, yellow <500 ms, red >1 s.


10 Performance Baseline: How Much Index Can 70 Yuan Buy?

Core question: “What scale fits the cheapest cloud instance?”

| Pages | Raw Text | Index Size | Search Latency | Notes |
|-------|----------|------------|----------------|-------|
| 100 k | 4 GB | 1.1 GB | 30 ms | Pi 4B, 30 % idle |
| 1 M | 40 GB | 11 GB | 60 ms | 2 vCPU, 70 % spike during merge |
| 5 M | 200 GB | 55 GB | 120 ms | Recommend 4 vCPU & 20 Mbps |

Move the index folder to an SSD and search latency drops roughly threefold on the same box.
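The table implies two fairly stable ratios: roughly 40 KB of raw text per page, and an index around 27–28 % of the raw text at every tier. A back-of-envelope estimator built on just those two numbers (an extrapolation, not a guarantee; measure before provisioning):

```python
KB_PER_PAGE = 40     # raw text per page, from the 100 k → 4 GB row
INDEX_RATIO = 0.275  # index size / raw text, consistent across all rows

def estimate(pages):
    """Return (raw_text_gb, index_gb) for a target page count."""
    raw_gb = pages * KB_PER_PAGE / 1024 / 1024
    return raw_gb, raw_gb * INDEX_RATIO

raw, idx = estimate(100_000)
print(f"raw ≈ {raw:.1f} GB, index ≈ {idx:.1f} GB")  # → raw ≈ 3.8 GB, index ≈ 1.0 GB
```

Plugging in 5 000 000 reproduces the last table row closely, which is a decent sanity check that the scaling really is near-linear.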


11 Limitations & Work-Arounds

Core question: “Where should I stop expecting Google-grade magic?”

  • No built-in PageRank—attach your own domain score field if needed
  • No JavaScript rendering—use headless Chrome upstream, then feed static HTML
  • Chinese word segmentation not bundled—plug jieba or pkuseg inside 配置.py
  • Not real-time—default is crawl-then-index; trigger incremental merge for fresher data

Author’s reflection: I once tried to index a Twitter-like feed and learnt the hard way: fire-hose sites with aggressive anti-scraping are simply out of scope for a hobby-box. sese-engine shines on open, static, or permission-granted collections.


12 When to Pick sese-engine, When to Stay With Google

| Requirement | Recommendation |
|-------------|----------------|
| Full control over ranking weights | sese-engine |
| Trillion-page, real-time discovery | Google/Baidu |
| <5 Mbps bandwidth budget | sese-engine |
| Legal need to keep data in-house | sese-engine |
| Zero maintenance staff | Google/Baidu |

Rule of thumb: data sovereignty > scale ⇒ sese-engine; scale > sovereignty ⇒ public engines.


13 Practical Action Checklist

  1. Install Python 3.8 → pip install -r requirements.txt
  2. Edit 配置.py: list white-domains, set CONCURRENT 2–8, delay 0.5–1 s
  3. Launch with 启动.sh, then run curl "http://127.0.0.1/search?q=test" as a smoke test
  4. Optional: clone UI repo, change VITE_API_URL, npm run build
  5. Optional: docker run with volume mount for portable deployment
  6. Backup: tar data/index/ nightly—your index is rebuildable but not free to re-crawl

14 One-Page Overview

sese-engine = spider + indexer + HTTP API in one Python folder. No database, no cloud keys. Crawl targets you choose, build inverted index locally, search via JSON calls. Runs on $12-a-year VPS or Pi, handles 100 k pages while idling. UI and Docker images exist; Grafana dashboard included. Trade-offs: no JS execution, no real-time fire-hose, limited by polite-crawl speed. Best for vertical, permission-allowed, small-to-medium corpora where owning the ranking logic beats global scale.


15 FAQ

Q1: Python 3.9 install fails—fix?
A: Some wheels compile only up to 3.8; stay on 3.8.x.

Q2: Daily incremental updates?
A: Add cron bash 启动.sh --incremental and enable MERGE_ON_EXIT in config.

Q3: Index corruption after power loss?
A: Remove .lock files under data/index/, restart; engine rolls back to last commit.

Q4: Commercial use allowed?
A: MIT license—yes, but respect target-site ToS and copyright.

Q5: First query 3 s delay—normal?
A: Yes, cheap host swapped process to disk; later calls <200 ms once resident.

Q6: English stemming possible?
A: Swap default space-splitter for nltk or spaCy inside 配置.py.

Q7: Front-end framework agnostic?
A: API returns plain JSON; any stack (React, Vue, mobile app) can consume it.
