Turn any website into clean data in the time it takes to drink a coffee

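This article introduces DeepScrape, a tool that takes web pages apart, cleans them up, and packages the content as structured data. Whether you are doing research, writing a report, or loading large numbers of pages into an AI knowledge base, it can save you hours of manual copy-and-paste.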

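Why you need a "web-to-data translator"

Picture this: your teacher asks you to pull the key information out of 50 technical documents and put it into an Excel sheet.
The traditional route is:

Open the browser → copy → paste → fix the formatting → repeat 50 times.
If the pages have pop-ups, lazy loading, or a login wall, the time easily doubles.

DeepScrape compresses that whole loop into a single command:
"Give me the URL and I'll handle the rest."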

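What is DeepScrape?

In one sentence:
DeepScrape = browser robot + AI reader + batch packager.

Role	What it does	Analogy
Browser robot	Uses Playwright to open pages, click buttons, and scroll	An assistant who turns the pages for you
AI reader	Uses GPT-4o or a local model to turn content into JSON	Someone who condenses the book into key points
Batch packager	Processes dozens or hundreds of links at once and bundles the results as ZIP or JSON	A parcel-packing station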

Up and running in three minutes

The steps below work on macOS, Linux, and Windows WSL.

1. Bring the code home

git clone https://github.com/stretchcloud/deepscrape.git
cd deepscrape
npm install
cp .env.example .env

2. Configure: tell it whose brain to use

Open .env and pick one of the two options:

# Option A: OpenAI (needs internet access, higher quality)
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your-key

# Option B: local Ollama (no external network needed, better privacy)
# LLM_PROVIDER=ollama
# LLM_MODEL=llama3:latest

Leave everything else at its default.
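
Before starting the service, it can help to confirm that .env really selects one usable provider. The following is a minimal Python sketch, not part of DeepScrape: parse_env and check_llm_config are hypothetical helper names, and the rules (Option A needs OPENAI_API_KEY, Option B needs LLM_MODEL) are assumptions read off the two options above.

```python
from __future__ import annotations


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_llm_config(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    provider = env.get("LLM_PROVIDER", "")
    problems = []
    if provider == "openai":
        # Assumption: OpenAI keys start with "sk-", as in the example above.
        if not env.get("OPENAI_API_KEY", "").startswith("sk-"):
            problems.append("OPENAI_API_KEY is missing or does not start with sk-")
    elif provider == "ollama":
        # Assumption: the local option needs a model name to be set.
        if not env.get("LLM_MODEL"):
            problems.append("LLM_MODEL is required for ollama")
    else:
        problems.append(f"unknown LLM_PROVIDER: {provider!r}")
    return problems
```

Calling check_llm_config(parse_env(open(".env").read())) and printing the result gives a quick sanity check before npm run dev.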

3. Start the service

npm run dev

Once you see Server listening on port 3000, you are halfway there.
Open http://localhost:3000/health in a browser; if it returns {"status":"ok"}, the service is up.

Hands-on: five commands that cover 90% of needs

Every command below can be pasted straight into a terminal.
Replace your-secret-key with the API_KEY value set in .env.

1. Quick single-page read: turn an article into Markdown

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://example.com/article",
    "options": { "extractorFormat": "markdown" }
  }' | jq -r '.content' > article.md

About 30 seconds later, a clean article.md appears in the current directory, with images, headings, and code blocks intact.
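
For readers who would rather script this than use curl and jq, here is a rough Python equivalent of the same call, using only the standard library. The endpoint, headers, request body, and the content field of the response are taken from the command above; the function names are illustrative, and the 120-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # the local service started earlier


def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST /api/scrape request."""
    body = json.dumps({"url": url, "options": {"extractorFormat": "markdown"}})
    return urllib.request.Request(
        f"{BASE_URL}/api/scrape",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def scrape_to_file(url: str, api_key: str, path: str) -> None:
    """Send the request and save the 'content' field, like the jq step."""
    request = build_scrape_request(url, api_key)
    with urllib.request.urlopen(request, timeout=120) as resp:
        payload = json.load(resp)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(payload["content"])
```

With the service running, scrape_to_file("https://example.com/article", "your-secret-key", "article.md") should produce the same file as the curl command.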

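2. Structured extraction: let the AI be your "information secretary"

Suppose you only want the title, author, and publish date. First write a small note for the AI, in the form of a JSON Schema: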

curl -X POST http://localhost:3000/api/extract-schema \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://news.example.com/tech/123",
    "schema": {
      "type": "object",
      "properties": {
        "title":   { "type": "string", "description": "Article title" },
        "author":  { "type": "string", "description": "Author name" },
        "publishDate": { "type": "string", "description": "Publish date, e.g. 2024-07-21" }
      },
      "required": ["title"]
    }
  }' | jq -r '.extractedData'

{
  "title": "Latest breakthrough in quantum computing",
  "author": "Li Zhixing",
  "publishDate": "2024-07-21"
}
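
Model output does not always match the schema it was given, so a quick check on the result can catch gaps early. The sketch below is a deliberately small checker for flat schemas like the one above (required keys and basic types only). It is not a full JSON Schema validator and not something DeepScrape ships; check_against_schema is a hypothetical name.

```python
def check_against_schema(data: dict, schema: dict) -> list:
    """Check required keys and simple types for a flat object schema."""
    type_map = {
        "string": str,
        "number": (int, float),
        "boolean": bool,
        "array": list,
        "object": dict,
    }
    # Report every required field that is absent from the data.
    problems = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in data
    ]
    # Report present fields whose Python type does not match the declared type.
    for name, spec in schema.get("properties", {}).items():
        expected = type_map.get(spec.get("type"))
        if name in data and expected and not isinstance(data[name], expected):
            problems.append(f"{name} should be of type {spec['type']}")
    return problems
```

An empty list means the extracted object has every required field and no obvious type mismatch; anything else is worth a second extraction attempt or a manual look.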

3. Batch "harvest": process 50 links in one go

Put the links in an array, set the concurrency, and let it run in the background:

curl -X POST http://localhost:3000/api/batch/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "urls": [
      "https://docs.a.com/start",
      "https://docs.a.com/api",
      "https://docs.a.com/sdk"
    ],
    "concurrency": 3,
    "options": { "extractorFormat": "markdown" }
  }'

{
  "batchId": "550e8400...",
  "statusUrl": "http://localhost:3000/api/batch/scrape/550e8400.../status"
}

Grab a coffee; when you come back, you can download everything as one archive:

curl "http://localhost:3000/api/batch/scrape/550e8400.../download/zip?format=markdown" \
  -H "X-API-Key: your-secret-key" \
  --output batch.zip

After unzipping you get:

1_start.md
2_api.md
3_sdk.md
batch_summary.json
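
With 50 links, typing the urls array by hand gets tedious. One possible approach, sketched here in Python, is to keep the links in a text file (one per line) and generate the request body for /api/batch/scrape from it. The field names match the batch command above; build_batch_payload and the urls.txt file are hypothetical, and dropping duplicates and non-http lines is a design choice of this sketch, not documented DeepScrape behavior.

```python
from urllib.parse import urlparse


def build_batch_payload(lines, concurrency=3):
    """Turn raw lines (for example from urls.txt) into a batch request body."""
    urls = []
    for line in lines:
        candidate = line.strip()
        parsed = urlparse(candidate)
        # Keep only absolute http(s) URLs and drop duplicates, preserving order.
        is_web_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
        if is_web_url and candidate not in urls:
            urls.append(candidate)
    return {
        "urls": urls,
        "concurrency": concurrency,
        "options": {"extractorFormat": "markdown"},
    }
```

Serializing the result with json.dumps gives a body that can be passed to curl via -d @payload.json.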

4. Deep crawl: archive a whole site in one step

Crawl an entire documentation site; the files are named automatically by date + title:

curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-key" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 100,
    "maxDepth": 2,
    "scrapeOptions": { "extractorFormat": "markdown" }
  }'

When the job finishes, crawl-output/{job-id}/ will contain:

2024-07-21_abc123_docs.example.com_intro.md
2024-07-21_abc123_docs.example.com_api_auth.md
...
consolidated.md   # all content merged into one file
consolidated.json # structured metadata
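
If you want to index the crawled files afterwards, the file names can be split back into their parts. The Python sketch below assumes the pattern date_token_host_slug.md, which is only inferred from the two example names above and may not hold for every file DeepScrape writes. The meaning of the abc123 part is not stated in this article, so it is kept as an opaque token.

```python
from typing import NamedTuple, Optional


class CrawlFile(NamedTuple):
    date: str
    token: str  # the "abc123" part; its exact meaning is an open question here
    host: str
    slug: str


def parse_crawl_filename(name: str) -> Optional[CrawlFile]:
    """Split a per-page file name; return None for names that do not fit."""
    if not name.endswith(".md"):
        return None
    # Split on the first three underscores only, so slugs may contain "_".
    parts = name[:-3].split("_", 3)
    if len(parts) != 4:
        return None
    return CrawlFile(*parts)
```

Files such as consolidated.md do not match the pattern and come back as None, so they can be filtered out of a per-page index.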

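Privacy and offline use: your data never leaves your machine

If you work with internal material or sensitive documents, DeepScrape can run entirely offline:

Use Ollama to pull a small local model in the 7B to 13B range: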

docker run -d -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull llama3:latest

Point .env at the local endpoint:

LLM_PROVIDER=ollama
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=llama3:latest

After that, every model request is handled locally, and not even the logs leave your machine.
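
To confirm that the model was actually pulled before sending work to it, you can ask Ollama which models it has. The Python sketch below uses Ollama's GET /api/tags endpoint, which lists local models; the port comes from the docker command above, while the helper names and the 5-second timeout are assumptions of this example.

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list:
    """Extract model names from the JSON returned by Ollama's /api/tags."""
    return [model["name"] for model in tags_response.get("models", [])]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Query the local Ollama server for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

If "llama3:latest" appears in model_names(fetch_tags()), the LLM_MODEL value in .env refers to a model that is available locally.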

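Common questions (Q&A)

Question	Answer
I don't write code. Is there a graphical interface?	For now it is mainly a REST API, which you can drive from Postman; a community Web UI is on the roadmap.
Will websites block me?	Stealth mode is on by default and mimics human behavior; you should still keep concurrency reasonable and respect robots.txt.
Is it free?	The code is released under the Apache 2.0 license and may be used commercially; the OpenAI portion is billed per token.
How does it compare with BeautifulSoup or Puppeteer?	Under the hood it drives the browser with Playwright, a tool in the same class as Puppeteer, but DeepScrape also has AI extraction and batch management built in, which saves you from writing parsing rules.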

Advanced tips: let the AI read papers and manuals

Scenario 1: compare the methodology of three arXiv papers

Hand the AI an "info card":

{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "authors": {"type": "array", "items": {"type": "string"}},
    "methodology": {"type": "string"},
    "results": {"type": "string"},
    "keyContributions": {"type": "array", "items": {"type": "string"}}
  }
}

Run /api/extract-schema on each of the three papers, then merge the JSON results to produce a side-by-side comparison table.
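
The merge step is not spelled out, so here is one way it might look in Python: each extraction result becomes a row of a CSV file that Excel can open. The column names come from the info card above; joining list values with "; " and leaving missing fields empty are choices of this sketch.

```python
import csv
import io

FIELDS = ["title", "authors", "methodology", "results", "keyContributions"]


def to_comparison_csv(papers: list) -> str:
    """Return CSV text with one row per paper; list fields are joined with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for paper in papers:
        row = {}
        for field in FIELDS:
            value = paper.get(field, "")  # missing fields become empty cells
            row[field] = "; ".join(value) if isinstance(value, list) else value
        writer.writerow(row)
    return buffer.getvalue()
```

Writing the returned text to comparison.csv gives one row per paper and one column per field.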

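Scenario 2: turn the GitHub permissions table into an internal cheat sheet

Official docs long and winding? Have the AI extract just "endpoint + required permissions":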

{
  "type": "object",
  "properties": {
    "apiEndpoints": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "endpoint": {"type": "string"},
          "requiredPermissions": {"type": "array", "items": {"type": "string"}}
        }
      }
    }
  }
}

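The result can go straight into Notion or a Feishu multi-dimensional table (Bitable), so the team no longer has to flip back and forth through web pages.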
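
A cheat sheet is often more useful when it is organized by permission rather than by endpoint. As an illustration, the Python sketch below inverts the extracted list into a "permission → endpoints" lookup. The field names follow the schema above; endpoints_by_permission is a hypothetical helper.

```python
from collections import defaultdict


def endpoints_by_permission(extracted: dict) -> dict:
    """Map each permission to the sorted list of endpoints that require it."""
    lookup = defaultdict(list)
    for item in extracted.get("apiEndpoints", []):
        for permission in item.get("requiredPermissions", []):
            lookup[permission].append(item["endpoint"])
    # Sort both permissions and endpoints so the output is stable.
    return {perm: sorted(eps) for perm, eps in sorted(lookup.items())}
```

Each key of the result can then become one row of the team's lookup table.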

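Roadmap: even less effort ahead

Browser pool warm-up: faster startup.
Automatic schema writing: tell the AI "I want product information" and it generates the JSON Schema for you.
Visual reports: statistical charts generated automatically when a batch job finishes.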

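Final thoughts

DeepScrape uses AI and automation to close the gap between "web pages" and "data".
You no longer need to wrestle with regular expressions, XPath, or pagination logic. You only need to:

Tell it the URL.
Tell it what you want.
Collect the result.

The time you save can go into more creative work, such as writing a deeper report based on that data.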

If this article helped you, please star DeepScrape on GitHub; if you run into a specific problem, open an Issue there and the community will help you work it out.