WaterCrawl: A Powerful Web Crawling and Data Extraction Tool
In today's digital age, data is akin to treasure, and the ability to crawl and extract relevant data from massive numbers of web pages effectively has become a focus for many. WaterCrawl is a powerful web application that leverages technologies like Python, Django, Scrapy, and Celery to help us complete web crawling and data extraction tasks efficiently. Let's dive into what WaterCrawl offers.
Introduction to WaterCrawl
WaterCrawl is a feature-rich web application that acts as a diligent spider, rapidly navigating the ocean of the internet to crawl web pages and extract the data we need. Combining multiple technologies such as Python, Django, Scrapy, and Celery, it provides us with an efficient and stable solution for web crawling and data extraction.
Product Advantages
WaterCrawl boasts numerous remarkable advantages:
- Advanced Web Crawling & Data Extraction: Capable of deep website crawling with highly customizable options. Adjust crawling depth and speed, and precisely target specific content based on your needs.
- Powerful Search Engine: Offers multiple search depth options, including basic, advanced, and ultimate, helping you quickly find relevant content in the vast online world.
- Multi-language Support: Supports multiple languages for searching and crawling, along with country-specific content targeting to meet diverse regional and linguistic needs.
- Asynchronous Processing: Monitor crawling and search progress in real time via Server-Sent Events (SSE), keeping everything under your control.
- Comprehensive REST API: Comes with detailed documentation and client libraries, making integration and secondary development a breeze for developers.
- Rich Ecosystem: Integrated with AI/automation platforms like Dify and N8N, and offers various plugins such as the WaterCrawl plugin and OpenAI plugin.
- Self-hosted & Open Source: Gives you full control over your data, with simple deployment options for your own servers.
- Advanced Result Processing: Supports downloading and processing search results with customizable parameters for further data analysis.
Client SDKs and Integrations
WaterCrawl provides multiple client SDKs for developers using different programming languages:
- Python Client: A full-featured SDK supporting all API endpoints, ideal for Python developers.
- Node.js Client: Enables complete JavaScript/TypeScript integration for Node.js developers.
- Go Client: A powerful SDK supporting all API endpoints to meet Go developers' needs.
- PHP Client: Provides support for PHP developers to leverage WaterCrawl's capabilities.
- Rust Client: Currently under development, something for Rust developers to look forward to.
Additionally, WaterCrawl integrates with multiple platforms:
- Dify Plugin: Available on the Dify platform, with source code publicly available on GitHub for developer review and modification.
- N8N Workflow Node: Features a corresponding workflow node on N8N, with open-source code for customization.
- Dify Knowledge Base: Integrated with Dify's knowledge base for additional knowledge support.
- Langflow: Currently has a related Pull Request pending merge.
- Flowise: Upcoming integration, worth anticipating.
Quick Start with WaterCrawl
Fast Launch in Local Docker Environment
If you want to quickly start WaterCrawl in a local Docker environment, follow these steps:
- Clone the Repository: Open a terminal and run the following commands to clone the WaterCrawl repository:

  ```bash
  git clone https://github.com/watercrawl/watercrawl.git
  cd watercrawl
  ```

  This step brings the WaterCrawl code from GitHub onto your local machine.

- Build and Run the Docker Containers: Continue with the following commands in the same terminal:

  ```bash
  cd docker
  cp .env.example .env
  docker compose up -d
  ```

  Here, `cp .env.example .env` copies the sample environment file to the file actually used at runtime, and `docker compose up -d` starts the Docker containers that run WaterCrawl.

- Access the Application: Open a browser and visit http://localhost to see the WaterCrawl interface.
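If the interface does not come up right away, it helps to confirm that all containers started correctly. A quick check with standard Docker Compose commands (run from the docker/ directory) might look like this:

```bash
cd docker

# List the services and their state; they should all be running
docker compose ps

# Follow the logs of the whole stack to spot startup errors (Ctrl+C to stop)
docker compose logs -f
```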
Important Notice
If you deploy on a domain or IP address other than `localhost`, you must update the MinIO configuration in the `.env` file:

```
# Change this from 'localhost' to your actual domain or IP
MINIO_EXTERNAL_ENDPOINT=your-domain.com

# Also update these URLs accordingly
MINIO_BROWSER_REDIRECT_URL=http://your-domain.com/minio-console/
MINIO_SERVER_URL=http://your-domain.com/
```

Failing to update these settings may cause issues with file uploads and downloads. For more details, refer to DEPLOYMENT.md.
Before deploying to a production environment, make sure you update the configuration values in the `.env` file and set up the required services such as the database and MinIO. The specific steps are covered in the Deployment Guide below.
WaterCrawl Deployment Guide
Preparations Before Deployment
Before deploying WaterCrawl, ensure the following software is installed:
- Docker Engine (20.10.0+): Docker is a containerization technology that helps you deploy and manage applications quickly.
- Docker Compose (2.0.0+): Allows defining and running multiple Docker containers via a configuration file for easy application deployment.
- Git: Used to clone the WaterCrawl code repository.
Environment Configuration
WaterCrawl uses environment variables for configuration. All variables have default values in `docker-compose.yml`, but you can override them in a `.env` file. Let's explore each configuration group in detail.
General Settings
General settings control basic Docker and version information:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `VERSION` | Application version | `v0.8.0` | No |
| `NGINX_PORT` | Port for Nginx service | `80` | No |
Setup Steps:
- Determine which port you want to expose the application on.
- If port 80 is already in use, change `NGINX_PORT` to another value such as 8080, as in the example below.
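For example, assuming port 80 is already occupied on your machine, the override in `docker/.env` could simply be:

```
# Serve WaterCrawl on port 8080 instead of the default 80
NGINX_PORT=8080
```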
Django Core Settings
These settings control the Django backend application:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `SECRET_KEY` | Django security key | Long string | Yes for production |
| `API_ENCRYPTION_KEY` | API encryption key | Long string | Yes for production |
| `DEBUG` | Debug mode (set to False in production) | `True` | No |
| `ALLOWED_HOSTS` | Comma-separated list of allowed hosts | `*` | No |
| `LANGUAGE_CODE` | Language code | `en-us` | No |
| `TIME_ZONE` | Time zone | `UTC` | No |
| `USE_I18N` | Enable internationalization | `True` | No |
| `USE_TZ` | Enable timezone support | `True` | No |
| `STATIC_ROOT` | Static files directory | `storage/static/` | No |
| `MEDIA_ROOT` | Media files directory | `storage/media/` | No |
| `LOG_LEVEL` | Logging level | `INFO` | No |
| `FRONTEND_URL` | Frontend URL for CORS and redirects | `http://localhost` | No |
Setup Steps:
- For production, generate a secure random `SECRET_KEY` using:

  ```bash
  openssl rand -base64 32
  ```

- Set `DEBUG=False` to enhance production security.
- Set `ALLOWED_HOSTS` to your domain, e.g. `example.com,www.example.com`.
- Set `TIME_ZONE` to your local time zone, such as `Europe/Berlin`.
- Set `FRONTEND_URL` to your frontend domain for email links and redirects; a combined example follows this list.
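Putting these together, a minimal production-oriented block in `docker/.env` might look like the following sketch (every value is a placeholder to replace with your own):

```
# Generated with: openssl rand -base64 32
SECRET_KEY="your-generated-secret-key"
DEBUG=False
ALLOWED_HOSTS=example.com,www.example.com
TIME_ZONE=Europe/Berlin
FRONTEND_URL=https://example.com
```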
Database Settings
Database settings control the PostgreSQL database:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `POSTGRES_HOST` | PostgreSQL host | `db` | No |
| `POSTGRES_PORT` | PostgreSQL port | `5432` | No |
| `POSTGRES_PASSWORD` | PostgreSQL password | `postgres` | Yes for production |
| `POSTGRES_USER` | PostgreSQL username | `postgres` | No |
| `POSTGRES_DB` | PostgreSQL database name | `postgres` | No |
Setup Steps:
- In a production environment, set a strong `POSTGRES_PASSWORD` to ensure database security.
- Default values are preconfigured to work with the included PostgreSQL container.
Redis Settings
Redis settings control Redis for caching and task queues:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `CELERY_BROKER_URL` | Redis URL for Celery broker | `redis://redis:6379/0` | No |
| `REDIS_LOCKER_URL` | Redis URL for Django cache/locks | `redis://redis:6379/3` | No |
| `CELERY_RESULT_BACKEND` | Celery results backend | `django-db` | No |
Setup Steps:
- Default values work well with the bundled Redis service.
- Only change these if you are using an external Redis server, as sketched below.
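As an illustration only (the hostname is a placeholder), pointing WaterCrawl at an external Redis server could look like this:

```
# External Redis instance; keep separate database numbers for the broker and the locks
CELERY_BROKER_URL=redis://redis.example.internal:6379/0
REDIS_LOCKER_URL=redis://redis.example.internal:6379/3
```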
JWT Settings
JWT settings control JSON Web Token authentication:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `ACCESS_TOKEN_LIFETIME_MINUTES` | JWT access token lifetime in minutes | `5` | No |
| `REFRESH_TOKEN_LIFETIME_DAYS` | JWT refresh token lifetime in days | `30` | No |
Setup Steps:
- Adjust token lifetimes based on your security requirements.
- Consider shorter lifetimes for more secure environments.
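For example, a stricter configuration (numbers chosen purely for illustration) could be:

```
# Short-lived access tokens, weekly refresh
ACCESS_TOKEN_LIFETIME_MINUTES=5
REFRESH_TOKEN_LIFETIME_DAYS=7
```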
MinIO Settings
MinIO settings control MinIO object storage (S3-compatible):
| Variable | Description | Default | Required? |
|---|---|---|---|
| `MINIO_ENDPOINT` | MinIO endpoint for Django | `minio:9000` | No |
| `MINIO_EXTERNAL_ENDPOINT` | External MinIO endpoint | `localhost` | Yes for production |
| `MINIO_REGION` | MinIO region (optional) | `us-east-1` | No |
| `MINIO_ACCESS_KEY` | MinIO access key (username) | `minio` | Yes for production |
| `MINIO_SECRET_KEY` | MinIO secret key (password) | `minio123` | Yes for production |
| `MINIO_USE_HTTPS` | Use HTTPS for MinIO | `False` | No |
| `MINIO_EXTERNAL_ENDPOINT_USE_HTTPS` | Use HTTPS for external endpoint | `False` | No |
| `MINIO_URL_EXPIRY_HOURS` | MinIO URL expiry in hours | `7` | No |
| `MINIO_CONSISTENCY_CHECK_ON_START` | Check consistency on startup | `True` | No |
| `MINIO_PRIVATE_BUCKET` | Private bucket name | `private` | No |
| `MINIO_PUBLIC_BUCKET` | Public bucket name | `public` | No |
| `MINIO_BUCKET_CHECK_ON_SAVE` | Check bucket existence on save | `False` | No |
| `MINIO_BROWSER_REDIRECT_URL` | MinIO browser redirect URL | `http://localhost/minio-console/` | No |
| `MINIO_SERVER_URL` | MinIO server URL | `http://localhost/` | No |
Setup Steps:
- In production, set strong credentials for `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY`.
- When deploying to a domain other than localhost, you must change `MINIO_EXTERNAL_ENDPOINT` to your domain (e.g. `example.com`), as this variable controls presigned URL generation for file downloads and uploads.
- If using HTTPS, set `MINIO_USE_HTTPS=True` and `MINIO_EXTERNAL_ENDPOINT_USE_HTTPS=True`.
- Update `MINIO_BROWSER_REDIRECT_URL` and `MINIO_SERVER_URL` to match your domain; a combined example follows this list.
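Taken together, a production deployment served over HTTPS on your own domain might configure MinIO roughly as follows (all values are placeholders):

```
# Strong credentials of your own
MINIO_ACCESS_KEY=your-minio-username
MINIO_SECRET_KEY=your-strong-minio-password

# External access goes through your domain over HTTPS
MINIO_EXTERNAL_ENDPOINT=example.com
MINIO_USE_HTTPS=True
MINIO_EXTERNAL_ENDPOINT_USE_HTTPS=True
MINIO_BROWSER_REDIRECT_URL=https://example.com/minio-console/
MINIO_SERVER_URL=https://example.com/
```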
CORS Settings
CORS settings control Cross-Origin Resource Sharing:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `CSRF_TRUSTED_ORIGINS` | Trusted origins for CSRF | Empty | No |
| `CORS_ALLOWED_ORIGINS` | Allowed origins for CORS | Empty | No |
| `CORS_ALLOWED_ORIGIN_REGEXES` | Regexes for CORS origins | Empty | No |
| `CORS_ALLOW_ALL_ORIGINS` | Allow all origins | `False` | No |
Setup Steps:
- In production, add your domain to `CSRF_TRUSTED_ORIGINS` and `CORS_ALLOWED_ORIGINS`, e.g. `CSRF_TRUSTED_ORIGINS=https://example.com,https://www.example.com`.
- Avoid setting `CORS_ALLOW_ALL_ORIGINS=True` in production for security.
Authentication Settings
Authentication settings control user authentication:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `IS_ENTERPRISE_MODE_ACTIVE` | Enterprise mode | `False` | No |
| `IS_LOGIN_ACTIVE` | Enable login functionality | `True` | No |
| `IS_SIGNUP_ACTIVE` | Enable signup functionality | `False` | No |
| `IS_GITHUB_LOGIN_ACTIVE` | Enable GitHub login | `False` | No |
| `IS_GOOGLE_LOGIN_ACTIVE` | Enable Google login | `False` | No |
| `GITHUB_CLIENT_ID` | GitHub OAuth client ID | Empty | Required for GitHub login |
| `GITHUB_CLIENT_SECRET` | GitHub OAuth client secret | Empty | Required for GitHub login |
| `GOOGLE_CLIENT_ID` | Google OAuth client ID | Empty | Required for Google login |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret | Empty | Required for Google login |
Setup Steps:
- Note that signup is disabled by default (`IS_SIGNUP_ACTIVE=False`).
- Social logins are disabled by default.
- For GitHub social login:
  - Create an OAuth app at https://github.com/settings/developers.
  - Set the callback URL to `http://your-domain.com/api/auth/github/callback/`.
  - Add the client ID and secret to the environment variables (see the example after this list).
- For Google social login:
  - Create credentials at https://console.developers.google.com.
  - Set the authorized redirect URI to `http://your-domain.com/api/auth/google/callback/`.
  - Add the client ID and secret to the environment variables.
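As a sketch, enabling GitHub login might then look like this in `docker/.env` (the ID and secret below are placeholders you obtain from your GitHub OAuth app):

```
# Turn on GitHub social login
IS_GITHUB_LOGIN_ACTIVE=True
GITHUB_CLIENT_ID=your-github-oauth-client-id
GITHUB_CLIENT_SECRET=your-github-oauth-client-secret
```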
Email Settings
Email settings control email sending:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `EMAIL_BACKEND` | Email backend | Django SMTP Backend | No |
| `EMAIL_HOST` | SMTP host | Empty | Required for email |
| `EMAIL_PORT` | SMTP port | `587` | No |
| `EMAIL_USE_TLS` | Use TLS for SMTP | `True` | No |
| `EMAIL_HOST_USER` | SMTP username | Empty | Required for email |
| `EMAIL_HOST_PASSWORD` | SMTP password | Empty | Required for email |
| `DEFAULT_FROM_EMAIL` | Default from email | Empty | Required for email |
Setup Steps:
- In production, set up a proper email service.
- For testing, use a service like Mailhog or the built-in Postfix container.
- If using Gmail, generate an App Password; an example configuration follows this list.
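For illustration, a Gmail-based configuration might look like the following (the address and App Password are placeholders; smtp.gmail.com with TLS on port 587 is Gmail's standard SMTP setup, but check your provider's documentation):

```
EMAIL_HOST=smtp.gmail.com
EMAIL_PORT=587
EMAIL_USE_TLS=True
EMAIL_HOST_USER=you@gmail.com
EMAIL_HOST_PASSWORD=your-gmail-app-password
DEFAULT_FROM_EMAIL=you@gmail.com
```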
Scrapy Settings
Scrapy settings control web scraping with Scrapy:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `SCRAPY_USER_AGENT` | User agent for scraping | `WaterCrawl/0.5.0 (+https://github.com/watercrawl/watercrawl)` | No |
| `SCRAPY_ROBOTSTXT_OBEY` | Obey robots.txt rules | `True` | No |
| `SCRAPY_CONCURRENT_REQUESTS` | Concurrent requests | `16` | No |
| `SCRAPY_DOWNLOAD_DELAY` | Download delay (seconds) | `0` | No |
| `SCRAPY_CONCURRENT_REQUESTS_PER_DOMAIN` | Requests per domain | `4` | No |
| `SCRAPY_CONCURRENT_REQUESTS_PER_IP` | Requests per IP | `4` | No |
| `SCRAPY_COOKIES_ENABLED` | Enable cookies | `False` | No |
| `SCRAPY_HTTPCACHE_ENABLED` | Enable HTTP cache | `True` | No |
| `SCRAPY_HTTPCACHE_EXPIRATION_SECS` | HTTP cache expiration | `3600` | No |
| `SCRAPY_HTTPCACHE_DIR` | HTTP cache directory | `httpcache` | No |
| `SCRAPY_LOG_LEVEL` | Scrapy log level | `ERROR` | No |
Setup Steps:
- Adjust these settings based on your scraping needs.
- Increase the concurrent request limits for more aggressive scraping (but be respectful of website rules).
- Add a download delay for more polite scraping, as in the example below.
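A more conservative, "polite" crawling profile (numbers chosen only as an example) could look like this:

```
# Throttle the crawler: obey robots.txt, fewer parallel requests, 1-second pause between downloads
SCRAPY_ROBOTSTXT_OBEY=True
SCRAPY_CONCURRENT_REQUESTS=8
SCRAPY_CONCURRENT_REQUESTS_PER_DOMAIN=2
SCRAPY_DOWNLOAD_DELAY=1
```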
Playwright Settings
Playwright settings control browser automation with Playwright:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `PLAYWRIGHT_SERVER` | Playwright server URL | `http://playwright:8000` | No |
| `PLAYWRIGHT_API_KEY` | Playwright API key | `your-secret-api-key` | Yes for production |
| `PORT` | Playwright service port | `8000` | No |
| `HOST` | Playwright service host | `0.0.0.0` | No |
Setup Steps:
- In production, set a strong `PLAYWRIGHT_API_KEY` for authentication between services.
Integration Settings
Integration settings control third-party integrations:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | Empty | Required for AI features |
| `STRIPE_SECRET_KEY` | Stripe secret key | Empty | Required for payments |
| `STRIPE_WEBHOOK_SECRET` | Stripe webhook secret | Empty | Required for Stripe webhooks |
| `GOOGLE_ANALYTICS_ID` | Google Analytics ID | Empty | Optional |
Setup Steps:
- Get an API key from OpenAI for AI features.
- Set up a Stripe account for payment processing.
- Configure your webhook endpoint in the Stripe dashboard for Stripe webhooks.
Feature Flags
Feature flags control feature availability:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `MAX_CRAWL_DEPTH` | Maximum crawl depth | `-1` (unlimited) | No |
| `CAPTURE_USAGE_HISTORY` | Capture usage history | `True` | No |
Setup Steps:
- Set `MAX_CRAWL_DEPTH` to a positive integer to limit crawl depth, as in the example below.
- Set `CAPTURE_USAGE_HISTORY=False` to disable usage tracking.
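For example, to cap the crawl depth and turn off usage history (values purely illustrative):

```
# Crawl at most three levels deep and do not record usage history
MAX_CRAWL_DEPTH=3
CAPTURE_USAGE_HISTORY=False
```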
Frontend Settings
Frontend settings control the React frontend:
| Variable | Description | Default | Required? |
|---|---|---|---|
| `API_BASE_URL` | API base URL for frontend | `/api` | No |
Setup Steps:
- The default value `/api` works with the Nginx configuration.
- You can use an absolute URL (e.g. `http://localhost/api`) or a relative URL (e.g. `/api`).
Deployment Steps
Follow these steps to deploy WaterCrawl:
- Clone the Repository:

  ```bash
  git clone https://github.com/watercrawl/watercrawl.git
  cd watercrawl
  ```

  This step downloads the WaterCrawl code to your local machine.
- Create the Environment File:

  ```bash
  cp docker/.env.example docker/.env
  ```

  This copies the sample environment file to the file that will actually be used.
- Edit the Environment File: Set the required variables, for example:

  ```
  # At minimum for production, set these:
  SECRET_KEY="your-generated-secret-key"
  API_ENCRYPTION_KEY="your-generated-api-encryption-key"  # Generate a new one using: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
  DEBUG=False
  ALLOWED_HOSTS=your-domain.com
  POSTGRES_PASSWORD=your-strong-password
  MINIO_ACCESS_KEY=your-minio-username
  MINIO_SECRET_KEY=your-minio-password
  MINIO_EXTERNAL_ENDPOINT=your-domain.com  # CRITICAL: Set to your domain
  PLAYWRIGHT_API_KEY=your-strong-api-key
  ```

  Note that `SECRET_KEY` and `API_ENCRYPTION_KEY` must be generated as secure random values, while `POSTGRES_PASSWORD`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and `PLAYWRIGHT_API_KEY` should be set to strong secrets.
- Start the Services:

  ```bash
  cd docker
  docker-compose up -d
  ```

  This starts the Docker containers that run WaterCrawl.
- Initialize the Database (first run only):

  ```bash
  docker-compose exec app python manage.py migrate
  docker-compose exec app python manage.py createsuperuser
  ```

  `python manage.py migrate` creates the database tables, and `python manage.py createsuperuser` creates a superuser for management.
- Access the Application: After deployment, access the different services via the following URLs:
  - Frontend: http://your-domain.com
  - API: http://your-domain.com/api
  - MinIO Console: http://your-domain.com/minio-console
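Once the stack is up, a couple of quick sanity checks can confirm the deployment is reachable; this is just one way to do it, using plain `curl` (replace the domain with your own):

```bash
# All services should be listed as up
docker-compose ps

# The frontend and API should both answer over HTTP (exact status depends on the endpoint)
curl -I http://your-domain.com
curl -I http://your-domain.com/api
```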
Frequently Asked Questions
Connection Issues
- Cannot connect to a service: Use `docker-compose logs <service-name>` to view the Docker logs and understand the specific error messages.
- Database connection error: Ensure PostgreSQL is running with `docker-compose ps`.
- Frontend not loading: Check for JavaScript errors in the browser console to identify the issue.
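For example, the backend service is referred to as `app` elsewhere in this guide; inspecting it (or any other service listed by `docker-compose ps`) might look like this:

```bash
# See which containers are running and which have exited
docker-compose ps

# Show the last 100 log lines of the backend; substitute another service name as needed
docker-compose logs --tail=100 app
```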
Data Persistence Issues
- Data lost after restart: Ensure Docker volumes are correctly configured so data is retained across container restarts.
- Cannot upload files: Check MinIO credentials and bucket configuration for correctness.
Performance Issues
- Slow response times: Use `docker stats` to check resource usage and identify performance bottlenecks.
- Memory issues: Adjust the containers' memory limits or Gunicorn's worker count to optimize memory usage.
If you encounter other issues, you can submit an issue on the GitHub repository for further assistance.
Conclusion
WaterCrawl is a powerful and easy-to-deploy web crawling and data extraction tool that offers rich features and customizable options to meet the needs of different users. In this article we have covered WaterCrawl's characteristics, how to get started quickly, how to deploy it, and how to solve common problems. We hope you can put WaterCrawl to good use and mine more valuable data for your own needs.
When using WaterCrawl, take care to configure the environment variables correctly for each environment so the service runs securely and stably. If you run into problems, don't panic: check the logs, verify your configuration, or ask for help in the community. WaterCrawl can become a powerful assistant for your data crawling and analysis work.