WaterCrawl: A Powerful Web Crawling and Data Extraction Tool
In today's digital age, data is akin to treasure, and the ability to crawl and extract relevant data from massive numbers of web pages effectively has become a focus for many. WaterCrawl is a powerful web application that leverages technologies like Python, Django, Scrapy, and Celery to help us complete web crawling and data extraction tasks efficiently. Let's dive into what WaterCrawl offers.
Introduction to WaterCrawl
WaterCrawl is a feature-rich web application that acts as a diligent spider, rapidly navigating the ocean of the internet to crawl web pages and extract the data we need. Combining multiple technologies such as Python, Django, Scrapy, and Celery, it provides us with an efficient and stable solution for web crawling and data extraction.
Product Advantages
WaterCrawl boasts numerous remarkable advantages:
- Advanced Web Crawling & Data Extraction: Capable of deep website crawling with highly customizable options. Adjust crawling depth and speed, and precisely target specific content based on your needs.
- Powerful Search Engine: Offers multiple search depth options, including basic, advanced, and ultimate, helping you quickly find relevant content in the vast online world.
- Multi-language Support: Supports multiple languages for searching and crawling, along with country-specific content targeting to meet diverse regional and linguistic needs.
- Asynchronous Processing: Monitor crawling and search progress in real time via Server-Sent Events (SSE), keeping everything under your control.
- Comprehensive REST API: Comes with detailed documentation and client libraries, making integration and secondary development a breeze for developers.
- Rich Ecosystem: Integrated with AI/automation platforms like Dify and N8N, and offers various plugins such as the WaterCrawl plugin and OpenAI plugin.
- Self-hosted & Open Source: Gives you full control over your data, with simple deployment options for your own servers.
- Advanced Result Processing: Supports downloading and processing search results with customizable parameters for further data analysis.
Client SDKs and Integrations
WaterCrawl provides multiple client SDKs for developers using different programming languages:
- Python Client: A full-featured SDK supporting all API endpoints, ideal for Python developers.
- Node.js Client: Enables complete JavaScript/TypeScript integration for Node.js developers.
- Go Client: A powerful SDK supporting all API endpoints to meet Go developers' needs.
- PHP Client: Provides support for PHP developers to leverage WaterCrawl's capabilities.
- Rust Client: Currently under development, something for Rust developers to look forward to.
Additionally, WaterCrawl integrates with multiple platforms:
- Dify Plugin: Available on the Dify platform, with source code publicly available on GitHub for developer review and modification.
- N8N Workflow Node: Features a corresponding workflow node on N8N, with open-source code for customization.
- Dify Knowledge Base: Integrated with Dify's knowledge base for additional knowledge support.
- Langflow: Currently has a related Pull Request pending merge.
- Flowise: Upcoming integration, worth anticipating.
Quick Start with WaterCrawl
Fast Launch in Local Docker Environment
If you want to quickly start WaterCrawl in a local Docker environment, follow these steps:
- Clone the Repository: Open a terminal and run the following commands to clone the WaterCrawl repository:

  ```bash
  git clone https://github.com/watercrawl/watercrawl.git
  cd watercrawl
  ```

  This step brings the WaterCrawl code from GitHub onto your local machine.

- Build and Run the Docker Containers: Continue with the following commands in the same terminal:

  ```bash
  cd docker
  cp .env.example .env
  docker compose up -d
  ```

  Here, `cp .env.example .env` copies the sample environment file to the file actually used at runtime, and `docker compose up -d` starts the Docker containers that run WaterCrawl.

- Access the Application: Open a browser and visit http://localhost to see the WaterCrawl interface.
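If the interface does not come up right away, it helps to confirm that all containers started correctly. A quick check with standard Docker Compose commands (run from the docker/ directory) might look like this:

```bash
cd docker

# List the services and their state; they should all be running
docker compose ps

# Follow the logs of the whole stack to spot startup errors (Ctrl+C to stop)
docker compose logs -f
```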
Important Notice
If you deploy on a domain or IP address other than `localhost`, you must update the MinIO configuration in the `.env` file:

```
# Change this from 'localhost' to your actual domain or IP
MINIO_EXTERNAL_ENDPOINT=your-domain.com

# Also update these URLs accordingly
MINIO_BROWSER_REDIRECT_URL=http://your-domain.com/minio-console/
MINIO_SERVER_URL=http://your-domain.com/
```

Failing to update these settings may cause issues with file uploads and downloads. For more details, refer to DEPLOYMENT.md.
Before deploying to a production environment, make sure you update the configuration values in the `.env` file and set up the required services such as the database and MinIO. The specific steps are covered in the Deployment Guide below.
WaterCrawl Deployment Guide
Preparations Before Deployment
Before deploying WaterCrawl, ensure the following software is installed:
- Docker Engine (20.10.0+): Docker is a containerization technology that helps you deploy and manage applications quickly.
- Docker Compose (2.0.0+): Allows defining and running multiple Docker containers via a configuration file for easy application deployment.
- Git: Used to clone the WaterCrawl code repository.
Environment Configuration
WaterCrawl uses environment variables for configuration. All variables have default values in `docker-compose.yml`, but you can override them in a `.env` file. Let's explore each configuration group in detail.
General Settings
General settings control basic Docker and version information:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `VERSION` | Application version | `v0.8.0` | No |
| `NGINX_PORT` | Port for Nginx service | `80` | No |
Setup Steps:
- Determine which port you want to expose the application on.
- If port 80 is already in use, change `NGINX_PORT` to another value such as 8080, as in the example below.
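For example, assuming port 80 is already occupied on your machine, the override in `docker/.env` could simply be:

```
# Serve WaterCrawl on port 8080 instead of the default 80
NGINX_PORT=8080
```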
Django Core Settings
These settings control the Django backend application:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `SECRET_KEY` | Django security key | Long string | Yes for production |
| `API_ENCRYPTION_KEY` | API encryption key | Long string | Yes for production |
| `DEBUG` | Debug mode (set to False in production) | `True` | No |
| `ALLOWED_HOSTS` | Comma-separated list of allowed hosts | `*` | No |
| `LANGUAGE_CODE` | Language code | `en-us` | No |
| `TIME_ZONE` | Time zone | `UTC` | No |
| `USE_I18N` | Enable internationalization | `True` | No |
| `USE_TZ` | Enable timezone support | `True` | No |
| `STATIC_ROOT` | Static files directory | `storage/static/` | No |
| `MEDIA_ROOT` | Media files directory | `storage/media/` | No |
| `LOG_LEVEL` | Logging level | `INFO` | No |
| `FRONTEND_URL` | Frontend URL for CORS and redirects | `http://localhost` | No |
Setup Steps:
- For production, generate a secure random `SECRET_KEY` using:

  ```bash
  openssl rand -base64 32
  ```

- Set `DEBUG=False` to enhance production security.
- Set `ALLOWED_HOSTS` to your domain, e.g. `example.com,www.example.com`.
- Set `TIME_ZONE` to your local time zone, such as `Europe/Berlin`.
- Set `FRONTEND_URL` to your frontend domain for email links and redirects; a combined example follows this list.
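Putting these together, a minimal production-oriented block in `docker/.env` might look like the following sketch (every value is a placeholder to replace with your own):

```
# Generated with: openssl rand -base64 32
SECRET_KEY="your-generated-secret-key"
DEBUG=False
ALLOWED_HOSTS=example.com,www.example.com
TIME_ZONE=Europe/Berlin
FRONTEND_URL=https://example.com
```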
Database Settings
Database settings control the PostgreSQL database:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `POSTGRES_HOST` | PostgreSQL host | `db` | No |
| `POSTGRES_PORT` | PostgreSQL port | `5432` | No |
| `POSTGRES_PASSWORD` | PostgreSQL password | `postgres` | Yes for production |
| `POSTGRES_USER` | PostgreSQL username | `postgres` | No |
| `POSTGRES_DB` | PostgreSQL database name | `postgres` | No |
Setup Steps:
- In a production environment, set a strong `POSTGRES_PASSWORD` to ensure database security.
- Default values are preconfigured to work with the included PostgreSQL container.
Redis Settings
Redis settings control Redis for caching and task queues:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `CELERY_BROKER_URL` | Redis URL for Celery broker | `redis://redis:6379/0` | No |
| `REDIS_LOCKER_URL` | Redis URL for Django cache/locks | `redis://redis:6379/3` | No |
| `CELERY_RESULT_BACKEND` | Celery results backend | `django-db` | No |
Setup Steps:
- Default values work well with the bundled Redis service.
- Only change these if you are using an external Redis server, as sketched below.
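As an illustration only (the hostname is a placeholder), pointing WaterCrawl at an external Redis server could look like this:

```
# External Redis instance; keep separate database numbers for the broker and the locks
CELERY_BROKER_URL=redis://redis.example.internal:6379/0
REDIS_LOCKER_URL=redis://redis.example.internal:6379/3
```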
JWT Settings
JWT settings control JSON Web Token authentication:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `ACCESS_TOKEN_LIFETIME_MINUTES` | JWT access token lifetime in minutes | `5` | No |
| `REFRESH_TOKEN_LIFETIME_DAYS` | JWT refresh token lifetime in days | `30` | No |
Setup Steps:
- Adjust token lifetimes based on your security requirements.
- Consider shorter lifetimes for more secure environments.
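For example, a stricter configuration (numbers chosen purely for illustration) could be:

```
# Short-lived access tokens, weekly refresh
ACCESS_TOKEN_LIFETIME_MINUTES=5
REFRESH_TOKEN_LIFETIME_DAYS=7
```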
MinIO Settings
MinIO settings control MinIO object storage (S3-compatible):
| Variable | Description | Default | Required? |
|---|---|---|---|
| `MINIO_ENDPOINT` | MinIO endpoint for Django | `minio:9000` | No |
| `MINIO_EXTERNAL_ENDPOINT` | External MinIO endpoint | `localhost` | Yes for production |
| `MINIO_REGION` | MinIO region (optional) | `us-east-1` | No |
| `MINIO_ACCESS_KEY` | MinIO access key (username) | `minio` | Yes for production |
| `MINIO_SECRET_KEY` | MinIO secret key (password) | `minio123` | Yes for production |
| `MINIO_USE_HTTPS` | Use HTTPS for MinIO | `False` | No |
| `MINIO_EXTERNAL_ENDPOINT_USE_HTTPS` | Use HTTPS for external endpoint | `False` | No |
| `MINIO_URL_EXPIRY_HOURS` | MinIO URL expiry in hours | `7` | No |
| `MINIO_CONSISTENCY_CHECK_ON_START` | Check consistency on startup | `True` | No |
| `MINIO_PRIVATE_BUCKET` | Private bucket name | `private` | No |
| `MINIO_PUBLIC_BUCKET` | Public bucket name | `public` | No |
| `MINIO_BUCKET_CHECK_ON_SAVE` | Check bucket existence on save | `False` | No |
| `MINIO_BROWSER_REDIRECT_URL` | MinIO browser redirect URL | `http://localhost/minio-console/` | No |
| `MINIO_SERVER_URL` | MinIO server URL | `http://localhost/` | No |
Setup Steps:
- In production, set strong credentials for `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY`.
- When deploying to a domain other than localhost, you must change `MINIO_EXTERNAL_ENDPOINT` to your domain (e.g. `example.com`), as this variable controls presigned URL generation for file downloads and uploads.
- If using HTTPS, set `MINIO_USE_HTTPS=True` and `MINIO_EXTERNAL_ENDPOINT_USE_HTTPS=True`.
- Update `MINIO_BROWSER_REDIRECT_URL` and `MINIO_SERVER_URL` to match your domain; a combined example follows this list.
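Taken together, a production deployment served over HTTPS on your own domain might configure MinIO roughly as follows (all values are placeholders):

```
# Strong credentials of your own
MINIO_ACCESS_KEY=your-minio-username
MINIO_SECRET_KEY=your-strong-minio-password

# External access goes through your domain over HTTPS
MINIO_EXTERNAL_ENDPOINT=example.com
MINIO_USE_HTTPS=True
MINIO_EXTERNAL_ENDPOINT_USE_HTTPS=True
MINIO_BROWSER_REDIRECT_URL=https://example.com/minio-console/
MINIO_SERVER_URL=https://example.com/
```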
CORS Settings
CORS settings control Cross-Origin Resource Sharing:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `CSRF_TRUSTED_ORIGINS` | Trusted origins for CSRF | Empty | No |
| `CORS_ALLOWED_ORIGINS` | Allowed origins for CORS | Empty | No |
| `CORS_ALLOWED_ORIGIN_REGEXES` | Regexes for CORS origins | Empty | No |
| `CORS_ALLOW_ALL_ORIGINS` | Allow all origins | `False` | No |
Setup Steps:
- In production, add your domain to `CSRF_TRUSTED_ORIGINS` and `CORS_ALLOWED_ORIGINS`, e.g. `CSRF_TRUSTED_ORIGINS=https://example.com,https://www.example.com`.
- Avoid setting `CORS_ALLOW_ALL_ORIGINS=True` in production for security.
Authentication Settings
Authentication settings control user authentication:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `IS_ENTERPRISE_MODE_ACTIVE` | Enterprise mode | `False` | No |
| `IS_LOGIN_ACTIVE` | Enable login functionality | `True` | No |
| `IS_SIGNUP_ACTIVE` | Enable signup functionality | `False` | No |
| `IS_GITHUB_LOGIN_ACTIVE` | Enable GitHub login | `False` | No |
| `IS_GOOGLE_LOGIN_ACTIVE` | Enable Google login | `False` | No |
| `GITHUB_CLIENT_ID` | GitHub OAuth client ID | Empty | Required for GitHub login |
| `GITHUB_CLIENT_SECRET` | GitHub OAuth client secret | Empty | Required for GitHub login |
| `GOOGLE_CLIENT_ID` | Google OAuth client ID | Empty | Required for Google login |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret | Empty | Required for Google login |
Setup Steps:
- Note that signup is disabled by default (`IS_SIGNUP_ACTIVE=False`).
- Social logins are disabled by default.
- For GitHub social login:
  - Create an OAuth app at https://github.com/settings/developers.
  - Set the callback URL to `http://your-domain.com/api/auth/github/callback/`.
  - Add the client ID and secret to the environment variables (see the example after this list).
- For Google social login:
  - Create credentials at https://console.developers.google.com.
  - Set the authorized redirect URI to `http://your-domain.com/api/auth/google/callback/`.
  - Add the client ID and secret to the environment variables.
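As a sketch, enabling GitHub login might then look like this in `docker/.env` (the ID and secret below are placeholders you obtain from your GitHub OAuth app):

```
# Turn on GitHub social login
IS_GITHUB_LOGIN_ACTIVE=True
GITHUB_CLIENT_ID=your-github-oauth-client-id
GITHUB_CLIENT_SECRET=your-github-oauth-client-secret
```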
Email Settings
Email settings control email sending:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `EMAIL_BACKEND` | Email backend | Django SMTP Backend | No |
| `EMAIL_HOST` | SMTP host | Empty | Required for email |
| `EMAIL_PORT` | SMTP port | `587` | No |
| `EMAIL_USE_TLS` | Use TLS for SMTP | `True` | No |
| `EMAIL_HOST_USER` | SMTP username | Empty | Required for email |
| `EMAIL_HOST_PASSWORD` | SMTP password | Empty | Required for email |
| `DEFAULT_FROM_EMAIL` | Default from email | Empty | Required for email |
Setup Steps:
- In production, set up a proper email service.
- For testing, use a service like Mailhog or the built-in Postfix container.
- If using Gmail, generate an App Password; an example configuration follows this list.
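For illustration, a Gmail-based configuration might look like the following (the address and App Password are placeholders; smtp.gmail.com with TLS on port 587 is Gmail's standard SMTP setup, but check your provider's documentation):

```
EMAIL_HOST=smtp.gmail.com
EMAIL_PORT=587
EMAIL_USE_TLS=True
EMAIL_HOST_USER=you@gmail.com
EMAIL_HOST_PASSWORD=your-gmail-app-password
DEFAULT_FROM_EMAIL=you@gmail.com
```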
Scrapy Settings
Scrapy settings control web scraping with Scrapy:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `SCRAPY_USER_AGENT` | User agent for scraping | `WaterCrawl/0.5.0 (+https://github.com/watercrawl/watercrawl)` | No |
| `SCRAPY_ROBOTSTXT_OBEY` | Obey robots.txt rules | `True` | No |
| `SCRAPY_CONCURRENT_REQUESTS` | Concurrent requests | `16` | No |
| `SCRAPY_DOWNLOAD_DELAY` | Download delay (seconds) | `0` | No |
| `SCRAPY_CONCURRENT_REQUESTS_PER_DOMAIN` | Requests per domain | `4` | No |
| `SCRAPY_CONCURRENT_REQUESTS_PER_IP` | Requests per IP | `4` | No |
| `SCRAPY_COOKIES_ENABLED` | Enable cookies | `False` | No |
| `SCRAPY_HTTPCACHE_ENABLED` | Enable HTTP cache | `True` | No |
| `SCRAPY_HTTPCACHE_EXPIRATION_SECS` | HTTP cache expiration | `3600` | No |
| `SCRAPY_HTTPCACHE_DIR` | HTTP cache directory | `httpcache` | No |
| `SCRAPY_LOG_LEVEL` | Scrapy log level | `ERROR` | No |
Setup Steps:
- Adjust these settings based on your scraping needs.
- Increase the concurrent request limits for more aggressive scraping (but be respectful of website rules).
- Add a download delay for more polite scraping, as in the example below.
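A more conservative, "polite" crawling profile (numbers chosen only as an example) could look like this:

```
# Throttle the crawler: obey robots.txt, fewer parallel requests, 1-second pause between downloads
SCRAPY_ROBOTSTXT_OBEY=True
SCRAPY_CONCURRENT_REQUESTS=8
SCRAPY_CONCURRENT_REQUESTS_PER_DOMAIN=2
SCRAPY_DOWNLOAD_DELAY=1
```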
Playwright Settings
Playwright settings control browser automation with Playwright:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `PLAYWRIGHT_SERVER` | Playwright server URL | `http://playwright:8000` | No |
| `PLAYWRIGHT_API_KEY` | Playwright API key | `your-secret-api-key` | Yes for production |
| `PORT` | Playwright service port | `8000` | No |
| `HOST` | Playwright service host | `0.0.0.0` | No |
Setup Steps:
- In production, set a strong `PLAYWRIGHT_API_KEY` for authentication between services.
Integration Settings
Integration settings control third-party integrations:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | Empty | Required for AI features |
| `STRIPE_SECRET_KEY` | Stripe secret key | Empty | Required for payments |
| `STRIPE_WEBHOOK_SECRET` | Stripe webhook secret | Empty | Required for Stripe webhooks |
| `GOOGLE_ANALYTICS_ID` | Google Analytics ID | Empty | Optional |
Setup Steps:
- Get an API key from OpenAI for AI features.
- Set up a Stripe account for payment processing.
- Configure your webhook endpoint in the Stripe dashboard for Stripe webhooks.
Feature Flags
Feature flags control feature availability:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `MAX_CRAWL_DEPTH` | Maximum crawl depth | `-1` (unlimited) | No |
| `CAPTURE_USAGE_HISTORY` | Capture usage history | `True` | No |
Setup Steps:
- Set `MAX_CRAWL_DEPTH` to a positive integer to limit crawl depth, as in the example below.
- Set `CAPTURE_USAGE_HISTORY=False` to disable usage tracking.
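For example, to cap the crawl depth and turn off usage history (values purely illustrative):

```
# Crawl at most three levels deep and do not record usage history
MAX_CRAWL_DEPTH=3
CAPTURE_USAGE_HISTORY=False
```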
Frontend Settings
Frontend settings control the React frontend:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `API_BASE_URL` | API base URL for frontend | `/api` | No |
Setup Steps:
- The default value `/api` works with the Nginx configuration.
- You can use an absolute URL (e.g. `http://localhost/api`) or a relative URL (e.g. `/api`).
Deployment Steps
Follow these steps to deploy WaterCrawl:
- Clone the Repository:

  ```bash
  git clone https://github.com/watercrawl/watercrawl.git
  cd watercrawl
  ```

  This step downloads the WaterCrawl code to your local machine.
- Create the Environment File:

  ```bash
  cp docker/.env.example docker/.env
  ```

  This copies the sample environment file to the file that will actually be used.
- Edit the Environment File: Set the required variables, for example:

  ```
  # At minimum for production, set these:
  SECRET_KEY="your-generated-secret-key"
  API_ENCRYPTION_KEY="your-generated-api-encryption-key"  # Generate a new one using: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
  DEBUG=False
  ALLOWED_HOSTS=your-domain.com
  POSTGRES_PASSWORD=your-strong-password
  MINIO_ACCESS_KEY=your-minio-username
  MINIO_SECRET_KEY=your-minio-password
  MINIO_EXTERNAL_ENDPOINT=your-domain.com  # CRITICAL: Set to your domain
  PLAYWRIGHT_API_KEY=your-strong-api-key
  ```

  Note that `SECRET_KEY` and `API_ENCRYPTION_KEY` must be generated as secure random values, while `POSTGRES_PASSWORD`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and `PLAYWRIGHT_API_KEY` should be set to strong secrets.
- Start the Services:

  ```bash
  cd docker
  docker-compose up -d
  ```

  This starts the Docker containers that run WaterCrawl.
- Initialize the Database (first run only):

  ```bash
  docker-compose exec app python manage.py migrate
  docker-compose exec app python manage.py createsuperuser
  ```

  `python manage.py migrate` creates the database tables, and `python manage.py createsuperuser` creates a superuser for management.
- Access the Application: After deployment, access the different services via the following URLs:
  - Frontend: http://your-domain.com
  - API: http://your-domain.com/api
  - MinIO Console: http://your-domain.com/minio-console
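Once the stack is up, a couple of quick sanity checks can confirm the deployment is reachable; this is just one way to do it, using plain `curl` (replace the domain with your own):

```bash
# All services should be listed as up
docker-compose ps

# The frontend and API should both answer over HTTP (exact status depends on the endpoint)
curl -I http://your-domain.com
curl -I http://your-domain.com/api
```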
Frequently Asked Questions
Connection Issues
- Cannot connect to a service: Use `docker-compose logs <service-name>` to view the Docker logs and understand the specific error messages.
- Database connection error: Ensure PostgreSQL is running with `docker-compose ps`.
- Frontend not loading: Check for JavaScript errors in the browser console to identify the issue.
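For example, the backend service is referred to as `app` elsewhere in this guide; inspecting it (or any other service listed by `docker-compose ps`) might look like this:

```bash
# See which containers are running and which have exited
docker-compose ps

# Show the last 100 log lines of the backend; substitute another service name as needed
docker-compose logs --tail=100 app
```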
Data Persistence Issues
- Data lost after restart: Ensure Docker volumes are correctly configured so data is retained across container restarts.
- Cannot upload files: Check MinIO credentials and bucket configuration for correctness.
Performance Issues
- Slow response times: Use `docker stats` to check resource usage and identify performance bottlenecks.
- Memory issues: Adjust the containers' memory limits or Gunicorn's worker count to optimize memory usage.
If you encounter other issues, you can submit an issue on the GitHub repository for further assistance.
Conclusion
WaterCrawl is a powerful and easy-to-deploy web crawling and data extraction tool that offers rich features and customizable options to meet the needs of different users. In this article we have covered WaterCrawl's characteristics, how to get started quickly, how to deploy it, and how to solve common problems. We hope you can put WaterCrawl to good use and mine more valuable data for your own needs.
When using WaterCrawl, take care to configure the environment variables correctly for each environment so the service runs securely and stably. If you run into problems, don't panic: check the logs, verify your configuration, or ask for help in the community. WaterCrawl can become a powerful assistant for your data crawling and analysis work.