DeepScrape: Turn Any Website into Clean, Ready-to-Use Data in One Afternoon
A practical, no-hype walkthrough for junior-college graduates who need web data without the headaches.
Why You Need a “Web-to-Data Translator”
Picture this common assignment:
“Collect the key facts from 50 technical pages and drop them into Excel.”
The usual route:
- Open browser → copy → paste → tidy → repeat 50×.
- Run into pop-ups, lazy-loading images, or login walls; time doubles.
DeepScrape collapses that whole routine into a single command:
“Give me the URLs; I’ll handle the rest.”
What Exactly Is DeepScrape?
One line:
DeepScrape = Browser Robot + AI Reader + Batch Packer.
| Role | What it does | Everyday analogy |
| --- | --- | --- |
| Browser Robot | Uses Playwright to open pages, click buttons, scroll | A helpful intern who flips through books for you |
| AI Reader | Employs GPT-4o or a local model to turn content into JSON | A translator who summarizes each chapter |
| Batch Packer | Handles dozens or hundreds of links, then zips the results | A shipping department that labels and boxes everything |
Three-Minute Setup (Tested on macOS, Linux, Windows WSL)
1. Bring the code home
git clone https://github.com/stretchcloud/deepscrape.git
cd deepscrape
npm install
cp .env.example .env
2. Tell it which brain to use
Open .env and pick one option:
# Option A – cloud, high quality
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-key-here
# Option B – local, fully private
# LLM_PROVIDER=ollama
# LLM_MODEL=llama3:latest
Leave everything else untouched for now.
3. Fire up the server
npm run dev
When you see Server listening on port 3000, half the job is done.
Visit http://localhost:3000/health; a small JSON object {"status":"ok"} confirms success.
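If you start the server from a script, you can wait for that health check instead of watching the log. A minimal sketch using only the /health endpoint shown above (curl and jq assumed to be installed):

# Poll the health endpoint once per second until it reports ok
until curl -s http://localhost:3000/health | jq -e '.status == "ok"' > /dev/null; do
  sleep 1
done
echo "DeepScrape is up"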
Five Commands That Solve 90% of Real Tasks
All snippets are copy-paste ready; replace your-secret-key with the value you set in .env.
1. One-page quick read: turn an article into Markdown
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://example.com/article",
"options": { "extractorFormat": "markdown" }
}' | jq -r '.content' > article.md
Thirty seconds later you have a clean article.md with headings, images, and code blocks preserved.
2. Structured extraction: let AI act as your research assistant
Suppose you only need “title, author, publish date”.
Write a short note—called a JSON Schema—and hand it to DeepScrape:
curl -X POST http://localhost:3000/api/extract-schema \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://news.example.com/tech/123",
"schema": {
"type": "object",
"properties": {
"title": { "type": "string", "description": "Headline of the article" },
"author": { "type": "string", "description": "Name of the writer" },
"publishDate": { "type": "string", "description": "ISO date like 2024-07-21" }
},
"required": ["title"]
}
}' | jq -r '.extractedData'
You’ll get:
{
"title": "Quantum Computing Breakthrough",
"author": "Alex Lee",
"publishDate": "2024-07-21"
}
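If the end goal is a spreadsheet, as in the opening assignment, one more jq call flattens that object into a CSV row Excel can open. A minimal sketch, assuming you saved the command's output to article.json (a hypothetical filename):

# Append one CSV row per article: title, author, publish date
jq -r '[.title, .author, .publishDate] | @csv' article.json >> articles.csv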
3. Batch “harvest”: 50 links in one go
Put URLs in an array, set concurrency, and let the queue run:
curl -X POST http://localhost:3000/api/batch/scrape \
-H "X-API-Key: your-secret-key" \
-d '{
"urls": [
"https://docs.a.com/start",
"https://docs.a.com/api",
"https://docs.a.com/sdk"
],
"concurrency": 3,
"options": { "extractorFormat": "markdown" }
}'
Response:
{
"batchId": "550e8400...",
"statusUrl": "http://localhost:3000/api/batch/scrape/550e8400.../status"
}
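The response comes back right away while the jobs run in the background. You can poll the statusUrl it gives you to watch progress; the exact shape of the status payload isn't shown here, so treat this as a sketch:

# Check on the batch (swap in the batchId you received)
curl "http://localhost:3000/api/batch/scrape/550e8400.../status" \
  -H "X-API-Key: your-secret-key" | jq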
Grab coffee; when you return, download:
curl "http://localhost:3000/api/batch/scrape/550e8400.../download/zip?format=markdown" \
-H "X-API-Key: your-secret-key" \
--output batch.zip
Unzip and you’ll see:
1_start.md
2_api.md
3_sdk.md
batch_summary.json
4. Deep site crawl: archive an entire docs site
Let it spider every page up to two levels deep:
curl -X POST http://localhost:3000/api/crawl \
-H "X-API-Key: your-secret-key" \
-d '{
"url": "https://docs.example.com",
"limit": 100,
"maxDepth": 2,
"scrapeOptions": { "extractorFormat": "markdown" }
}'
When finished, look inside crawl-output/{job-id}/:
2024-07-21_abc123_docs.example.com_intro.md
2024-07-21_abc123_docs.example.com_api_auth.md
...
consolidated.md # All pages stitched together
consolidated.json # Structured metadata
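consolidated.md can get long; a quick way to see what made it in is to list its top-level Markdown headings (substitute your real job ID for {job-id}):

# Rough table of contents for the stitched file
grep -n "^# " crawl-output/{job-id}/consolidated.md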
Privacy & Offline Mode: Keep Your Data at Home
Handling sensitive documents? DeepScrape works 100 % offline.
- Pull a small local model:
  docker run -d -p 11434:11434 --name ollama ollama/ollama
  docker exec ollama ollama pull llama3:latest
- Point .env to your machine:
  LLM_PROVIDER=ollama
  LLM_BASE_URL=http://localhost:11434/v1
  LLM_MODEL=llama3:latest
- Every request runs locally; not even logs leave your computer.
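Before sending the first offline request, it helps to confirm the local model server is actually answering. A minimal check against Ollama's model-listing endpoint on the default port configured above:

# Ask the local Ollama server which models it has pulled
curl -s http://localhost:11434/api/tags | jq '.models[].name'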
Common Questions, Plain Answers
| Question | Straight answer |
| --- | --- |
| I'm not a coder. Is there a GUI? | REST API first; Postman or curl works. A web playground is on the roadmap. |
| Will sites block me? | Stealth mode is on by default, but respect robots.txt and keep concurrency polite. |
| Is it free? | Code is Apache 2.0; you pay only for OpenAI tokens if you choose cloud models. |
| How is it different from BeautifulSoup? | BeautifulSoup only parses HTML you fetch yourself; DeepScrape drives a real browser (Playwright) and adds AI extraction and job queues, so you skip writing selectors. |
Advanced Tricks: Papers, Manuals, and More
Use-case 1: Compare methodologies across three arXiv papers
Hand the AI a “fact sheet”:
{
"type": "object",
"properties": {
"title": { "type": "string" },
"authors": { "type": "array", "items": { "type": "string" } },
"methodology": { "type": "string" },
"results": { "type": "string" },
"keyContributions": { "type": "array", "items": { "type": "string" } }
}
}
Run /api/extract-schema for each PDF landing page, then merge the JSON files into a side-by-side table.
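One way to do that merge without leaving the terminal: run jq over the saved responses and emit one tab-separated row per paper. paper1.json, paper2.json, and paper3.json are hypothetical filenames holding each paper's extractedData object:

# One TSV row per paper: title, methodology, results
jq -r '[.title, .methodology, .results] | @tsv' paper1.json paper2.json paper3.json > comparison.tsv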
Use-case 2: Turn GitHub permission docs into an internal cheat-sheet
Instead of scrolling, ask for “endpoint + required permission” pairs:
{
  "type": "object",
  "properties": {
    "apiEndpoints": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "endpoint": { "type": "string" },
          "requiredPermissions": { "type": "array", "items": { "type": "string" } }
        }
      }
    }
  }
}
Drop the resulting JSON straight into Notion or Airtable—no manual copy-paste.
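If the import tool prefers CSV, a short jq one-liner flattens the array first. github-perms.json is a hypothetical filename holding the extractedData object returned for this schema:

# One CSV row per endpoint, permissions joined into a single cell
jq -r '.apiEndpoints[] | [.endpoint, (.requiredPermissions | join("; "))] | @csv' github-perms.json > permissions.csv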
Roadmap: What’s Coming Next
- Browser pool warm-up – faster startup
- Auto-schema writer – describe what you want in plain English, AI builds the JSON Schema
- Visual reports – automatic charts after every batch
Final Thoughts
DeepScrape closes the gap between “web page” and “usable data.”
You no longer wrestle with regex, XPath, or pagination logic.
Just:
- Hand it a URL.
- Tell it what you need.
- Collect the result.
The saved hours can go into deeper thinking—like turning that fresh data into insights worth sharing.
Found this helpful? Star DeepScrape on GitHub. Run into issues? Open an issue—the community is quick to lend a hand.