DeepScrape: Turn Any Website into Clean, Ready-to-Use Data in One Afternoon

A practical, no-hype walkthrough for junior-college graduates who need web data without the headaches.


Why You Need a “Web-to-Data Translator”

Picture this common assignment:
“Collect the key facts from 50 technical pages and drop them into Excel.”

The usual route:

  1. Open browser → copy → paste → tidy → repeat 50×.
  2. Run into pop-ups, lazy-loading images, or login walls; time doubles.

DeepScrape compresses those two steps into a single command:
“Give me the URLs; I’ll handle the rest.”


What Exactly Is DeepScrape?

One line:
DeepScrape = Browser Robot + AI Reader + Batch Packer.

| Role | What it does | Everyday analogy |
| --- | --- | --- |
| Browser Robot | Uses Playwright to open pages, click buttons, scroll | A helpful intern who flips through books for you |
| AI Reader | Employs GPT-4o or a local model to turn content into JSON | A translator who summarizes each chapter |
| Batch Packer | Handles dozens or hundreds of links, then zips the results | A shipping department that labels and boxes everything |

Three-Minute Setup (Tested on macOS, Linux, Windows WSL)

1. Bring the code home

git clone https://github.com/stretchcloud/deepscrape.git
cd deepscrape
npm install
cp .env.example .env
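
If npm install doesn’t also fetch Playwright’s browser binaries on your machine (whether it does depends on the project’s install scripts, so treat this as an optional extra step), Playwright’s own CLI can pull them:

npx playwright install chromium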

2. Tell it which brain to use

Open .env, pick one option:

# Option A – cloud, high quality
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-key-here

# Option B – local, fully private
# LLM_PROVIDER=ollama
# LLM_MODEL=llama3:latest

Leave everything else untouched for now.

3. Fire up the server

npm run dev

When you see Server listening on port 3000, half the job is done.
Visit http://localhost:3000/health; a small JSON object {"status":"ok"} confirms success.
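
The same check works from the terminal:

curl http://localhost:3000/health
# {"status":"ok"}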


Five Commands That Solve 90% of Real Tasks

All snippets are copy-paste ready; several pipe their output through jq, so install it if you don’t have it, and swap your-secret-key with the value you placed in .env.

1. One-page quick read: turn an article into Markdown

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://example.com/article",
    "options": { "extractorFormat": "markdown" }
  }' | jq -r '.content' > article.md

Thirty seconds later you have a clean article.md with headings, images, and code blocks preserved.

2. Structured extraction: let AI act as your research assistant

Suppose you only need “title, author, publish date”.
Write a short note—called a JSON Schema—and hand it to DeepScrape:

curl -X POST http://localhost:3000/api/extract-schema \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://news.example.com/tech/123",
    "schema": {
      "type": "object",
      "properties": {
        "title":   { "type": "string", "description": "Headline of the article" },
        "author":  { "type": "string", "description": "Name of the writer" },
        "publishDate": { "type": "string", "description": "ISO date like 2024-07-21" }
      },
      "required": ["title"]
    }
  }' | jq -r '.extractedData'

You’ll get:

{
  "title": "Quantum Computing Breakthrough",
  "author": "Alex Lee",
  "publishDate": "2024-07-21"
}

3. Batch “harvest”: 50 links in one go

Put URLs in an array, set concurrency, and let the queue run:

curl -X POST http://localhost:3000/api/batch/scrape \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "urls": [
      "https://docs.a.com/start",
      "https://docs.a.com/api",
      "https://docs.a.com/sdk"
    ],
    "concurrency": 3,
    "options": { "extractorFormat": "markdown" }
  }'

Response:

{
  "batchId": "550e8400...",
  "statusUrl": "http://localhost:3000/api/batch/scrape/550e8400.../status"
}
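
Before downloading, you can poll the statusUrl from that response to see how far the queue has gotten (the exact fields of the status payload aren’t documented here, so just pretty-print whatever comes back with jq):

curl "http://localhost:3000/api/batch/scrape/550e8400.../status" \
  -H "X-API-Key: your-secret-key" | jq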

Grab coffee; when you return, download:

curl "http://localhost:3000/api/batch/scrape/550e8400.../download/zip?format=markdown" \
  -H "X-API-Key: your-secret-key" \
  --output batch.zip

Unzip and you’ll see:

1_start.md
2_api.md
3_sdk.md
batch_summary.json

4. Deep site crawl: archive an entire docs site

Let it spider every page up to two levels deep:

curl -X POST http://localhost:3000/api/crawl \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 100,
    "maxDepth": 2,
    "scrapeOptions": { "extractorFormat": "markdown" }
  }'

When finished, look inside crawl-output/{job-id}/:

2024-07-21_abc123_docs.example.com_intro.md
2024-07-21_abc123_docs.example.com_api_auth.md
...
consolidated.md   # All pages stitched together
consolidated.json # Structured metadata
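
Because every page lands as plain Markdown, ordinary shell tools are enough to mine the archive afterwards; for example, to list the crawled pages that mention authentication (substitute your real job ID for the placeholder):

grep -ril "authentication" crawl-output/{job-id}/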

Privacy & Offline Mode: Keep Your Data at Home

Handling sensitive documents? DeepScrape works 100% offline.

  1. Pull a small local model:

    docker run -d -p 11434:11434 --name ollama ollama/ollama
    docker exec ollama ollama pull llama3:latest
    
  2. Point .env to your machine:

    LLM_PROVIDER=ollama
    LLM_BASE_URL=http://localhost:11434/v1
    LLM_MODEL=llama3:latest
    
  3. Every request runs locally; not even logs leave your computer.
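
To confirm the model from step 1 is actually reachable before you restart DeepScrape, query Ollama’s standard model-listing endpoint (nothing DeepScrape-specific):

curl http://localhost:11434/api/tags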


Common Questions, Plain Answers

| Question | Straight answer |
| --- | --- |
| I’m not a coder; is there a GUI? | The REST API comes first; Postman or curl works fine. A web playground is on the roadmap. |
| Will sites block me? | Stealth mode is on by default, but respect robots.txt and keep concurrency polite. |
| Is it free? | The code is Apache 2.0; you only pay for OpenAI tokens if you choose cloud models. |
| How is it different from BeautifulSoup? | BeautifulSoup only parses static HTML. DeepScrape drives a real browser (Playwright) and adds AI extraction and job queues, so you skip writing selectors. |

Advanced Tricks: Papers, Manuals, and More

Use-case 1: Compare methodologies across three arXiv papers

Hand the AI a “fact sheet”:

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "authors": { "type": "array", "items": { "type": "string" } },
    "methodology": { "type": "string" },
    "results": { "type": "string" },
    "keyContributions": { "type": "array", "items": { "type": "string" } }
  }
}

Run /api/extract-schema for each PDF landing page, then merge the JSON files into a side-by-side table.
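
One way to do that merge, assuming you saved each response’s extractedData to paper1.json, paper2.json, and paper3.json (hypothetical filenames), is a jq one-liner that slurps all three files and writes a CSV you can open in Excel:

jq -r -s '["title","methodology","results"], (.[] | [.title, .methodology, .results]) | @csv' \
  paper1.json paper2.json paper3.json > comparison.csv

Add more fields to both arrays if you want extra columns.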

Use-case 2: Turn GitHub permission docs into an internal cheat-sheet

Instead of scrolling, ask for “endpoint + required permission” pairs:

{
  "type": "object",
  "properties": {
    "apiEndpoints": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "endpoint": { "type": "string" },
          "requiredPermissions": { "type": "array", "items": { "type": "string" } }
        }
      }
    }
  }
}

Drop the resulting JSON straight into Notion or Airtable—no manual copy-paste.
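
If your tool imports CSV more readily than JSON, the same jq trick flattens the list first (permissions.json is a hypothetical file holding the response’s extractedData):

jq -r '.apiEndpoints[] | [.endpoint, (.requiredPermissions | join("; "))] | @csv' \
  permissions.json > permissions.csv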


Roadmap: What’s Coming Next


  • Browser pool warm-up – faster startup

  • Auto-schema writer – describe what you want in plain English, AI builds the JSON Schema

  • Visual reports – automatic charts after every batch

Final Thoughts

DeepScrape closes the gap between “web page” and “usable data.”
You no longer wrestle with regex, XPath, or pagination logic.
Just:

  1. Hand it a URL.
  2. Tell it what you need.
  3. Collect the result.

The saved hours can go into deeper thinking—like turning that fresh data into insights worth sharing.


Found this helpful? Star DeepScrape on GitHub. Run into issues? Open an issue—the community is quick to lend a hand.