WaterCrawl: A Powerful Web Crawling and Data Extraction Tool

In today’s digital age, data is akin to treasure, and the ability to effectively crawl and extract relevant data from massive numbers of web pages has become a focus for many. WaterCrawl is a powerful web application that leverages technologies like Python, Django, Scrapy, and Celery to help us efficiently complete web crawling and data extraction tasks. Let’s dive into what WaterCrawl offers.

Introduction to WaterCrawl

WaterCrawl is a feature-rich web application that acts as a diligent spider, rapidly navigating the ocean of the internet to crawl web pages and extract the data we need. Combining multiple technologies such as Python, Django, Scrapy, and Celery, it provides us with an efficient and stable solution for web crawling and data extraction.

Product Advantages

WaterCrawl boasts numerous remarkable advantages:

  • Advanced Web Crawling & Data Extraction: Capable of deep website crawling with highly customizable options. Adjust crawling depth, speed, and precisely target specific content based on your needs.
  • Powerful Search Engine: Offers multiple search depth options including basic, advanced, and ultimate, helping you quickly find relevant content in the vast online world.
  • Multi-language Support: Supports multiple languages for searching and crawling, along with country-specific content targeting to meet diverse regional and linguistic needs.
  • Asynchronous Processing: Monitor crawling and search progress in real-time via Server-Sent Events (SSE), keeping everything under your control.
  • Comprehensive REST API: Comes with detailed documentation and client libraries, making integration and secondary development a breeze for developers.
  • Rich Ecosystem: Integrated with AI/automation platforms like Dify and N8N, and offers various plugins such as the WaterCrawl plugin and OpenAI plugin.
  • Self-hosted & Open Source: Gives you full control over your data with simple deployment options for your own servers.
  • Advanced Result Processing: Supports downloading and processing search results with customizable parameters for further data analysis.

Client SDKs and Integrations

WaterCrawl provides multiple client SDKs for developers using different programming languages:

  • Python Client: A full-featured SDK supporting all API endpoints, ideal for Python developers.
  • Node.js Client: Enables complete JavaScript/TypeScript integration for Node.js developers.
  • Go Client: A powerful SDK supporting all API endpoints to meet Go developers’ needs.
  • PHP Client: Provides support for PHP developers to leverage WaterCrawl’s capabilities.
  • Rust Client: Currently under development; something for Rust developers to look forward to.

Additionally, WaterCrawl integrates with multiple platforms:

  • Dify Plugin: Available on the Dify platform, with source code publicly available on GitHub for developer review and modification.
  • N8N Workflow Node: Features a corresponding workflow node on N8N, with open-source code for customization.
  • Dify Knowledge Base: Integrated with Dify’s knowledge base for additional knowledge support.
  • Langflow: Currently has a related Pull Request pending merge.
  • Flowise: Integration is planned and coming soon.

Quick Start with WaterCrawl

Fast Launch in Local Docker Environment

If you want to quickly start WaterCrawl in a local Docker environment, follow these steps:

  1. Clone the Repository: Open the terminal and run the following command to clone the WaterCrawl repository:
git clone https://github.com/watercrawl/watercrawl.git
cd watercrawl

This step downloads the WaterCrawl source code to your local computer.

  2. Build and Run Docker Containers: Continue executing the following commands in the terminal:
cd docker
cp .env.example .env
docker compose up -d

Here, cp .env.example .env copies the sample environment file to the actual usage file, and docker compose up -d starts the Docker containers to run WaterCrawl.

  3. Access the Application: Open a browser and visit http://localhost to see the WaterCrawl interface.
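
If the interface does not load, a quick check is to confirm that the containers started correctly (run these from the docker directory; both are standard Docker Compose commands):

# List the containers and their status
docker compose ps

# Follow the logs of all services to spot startup errors
docker compose logs -f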

Important Notice

If you deploy on a domain or IP address other than localhost, you must update the MinIO configuration in the .env file:

# Change this from 'localhost' to your actual domain or IP
MINIO_EXTERNAL_ENDPOINT=your-domain.com

# Also update these URLs accordingly
MINIO_BROWSER_REDIRECT_URL=http://your-domain.com/minio-console/
MINIO_SERVER_URL=http://your-domain.com/

Failing to update these settings may cause issues with file uploads and downloads. For more details, refer to DEPLOYMENT.md.

Before deploying to a production environment, make sure you update the configuration values in the .env file and set up required services such as the database and MinIO. See the Deployment Guide below for details.

WaterCrawl Deployment Guide

Preparations Before Deployment

Before deploying WaterCrawl, ensure the following software is installed:

  • Docker Engine (20.10.0+): Docker is a containerization technology that helps quickly deploy and manage applications.
  • Docker Compose (2.0.0+): Allows defining and running multiple Docker containers via a configuration file for easy application deployment.
  • Git: Used to clone the WaterCrawl code repository.

Environment Configuration

WaterCrawl uses environment variables for configuration. All variables have default values in docker-compose.yml, but you can override them in a .env file. Let’s walk through the environment configuration in detail.

General Settings

General settings control basic Docker and version information:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| VERSION | Application version | v0.8.0 | No |
| NGINX_PORT | Port for Nginx service | 80 | No |

Setup Steps:

  1. Determine which port you want to expose the application on.
  2. If port 80 is already in use, change NGINX_PORT to another value like 8080.
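
For example, if port 80 is already taken, a minimal override in .env could look like this (8080 is just an illustrative choice):

# docker/.env
# Serve WaterCrawl on port 8080 instead of the default 80
NGINX_PORT=8080

The application would then be reachable at http://localhost:8080.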

Django Core Settings

These settings control the Django backend application:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| SECRET_KEY | Django security key | Long string | Yes for production |
| API_ENCRYPTION_KEY | API encryption key | Long string | Yes for production |
| DEBUG | Debug mode (set to False in production) | True | No |
| ALLOWED_HOSTS | Comma-separated list of allowed hosts | * | No |
| LANGUAGE_CODE | Language code | en-us | No |
| TIME_ZONE | Time zone | UTC | No |
| USE_I18N | Enable internationalization | True | No |
| USE_TZ | Enable timezone support | True | No |
| STATIC_ROOT | Static files directory | storage/static/ | No |
| MEDIA_ROOT | Media files directory | storage/media/ | No |
| LOG_LEVEL | Logging level | INFO | No |
| FRONTEND_URL | Frontend URL for CORS and redirects | http://localhost | No |

Setup Steps:

  1. For production, generate a secure random SECRET_KEY using:
openssl rand -base64 32
  2. Set DEBUG=False to enhance production security.
  3. Set ALLOWED_HOSTS to your domain, e.g., example.com,www.example.com.
  4. Set TIME_ZONE to your local time zone, such as Europe/Berlin.
  5. Set FRONTEND_URL to your frontend domain for email links and redirects.
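
Putting these steps together (and including API_ENCRYPTION_KEY, which the table above also marks as required for production), the Django section of .env might look like the following sketch. The values are illustrative; generate your own secrets with the commands shown in the comments:

# Generate secrets first:
#   openssl rand -base64 32
#   python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
SECRET_KEY="paste-the-openssl-output-here"
API_ENCRYPTION_KEY="paste-the-fernet-key-here"
DEBUG=False
ALLOWED_HOSTS=example.com,www.example.com
TIME_ZONE=Europe/Berlin
FRONTEND_URL=https://example.com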

Database Settings

Database settings control the PostgreSQL database:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| POSTGRES_HOST | PostgreSQL host | db | No |
| POSTGRES_PORT | PostgreSQL port | 5432 | No |
| POSTGRES_PASSWORD | PostgreSQL password | postgres | Yes for production |
| POSTGRES_USER | PostgreSQL username | postgres | No |
| POSTGRES_DB | PostgreSQL database name | postgres | No |

Setup Steps:

  1. In a production environment, set a strong POSTGRES_PASSWORD to ensure database security.
  2. Default values are preconfigured to work with the included PostgreSQL container.
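
One simple way to do this is to generate a random password and set it in .env (the openssl command is just one option; any strong password works):

# Generate a strong password
openssl rand -base64 24

# docker/.env
POSTGRES_PASSWORD=paste-the-generated-password-here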

Redis Settings

Redis settings control Redis for caching and task queues:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| CELERY_BROKER_URL | Redis URL for Celery broker | redis://redis:6379/0 | No |
| REDIS_LOCKER_URL | Redis URL for Django cache/locks | redis://redis:6379/3 | No |
| CELERY_RESULT_BACKEND | Celery results backend | django-db | No |

Setup Steps:

  1. Default values work well with the bundled Redis service.
  2. Only change these if using an external Redis server.
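
If you do point WaterCrawl at an external Redis server, the overrides might look like this (the host and password are placeholders):

# docker/.env: external Redis (placeholder host and password)
CELERY_BROKER_URL=redis://:your-redis-password@redis.example.com:6379/0
REDIS_LOCKER_URL=redis://:your-redis-password@redis.example.com:6379/3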

JWT Settings

JWT settings control JSON Web Token authentication:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| ACCESS_TOKEN_LIFETIME_MINUTES | JWT access token lifetime in minutes | 5 | No |
| REFRESH_TOKEN_LIFETIME_DAYS | JWT refresh token lifetime in days | 30 | No |

Setup Steps:

  1. Adjust token lifetimes based on your security requirements.
  2. Consider shorter lifetimes for more secure environments.
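
For example, a stricter setup could keep short-lived access tokens and shorten the refresh window (the numbers below are just one possible choice):

# docker/.env: shorter-lived tokens
ACCESS_TOKEN_LIFETIME_MINUTES=5
REFRESH_TOKEN_LIFETIME_DAYS=7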

MinIO Settings

MinIO settings control MinIO object storage (S3-compatible):

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| MINIO_ENDPOINT | MinIO endpoint for Django | minio:9000 | No |
| MINIO_EXTERNAL_ENDPOINT | External MinIO endpoint | localhost | Yes for production |
| MINIO_REGION | MinIO region (optional) | us-east-1 | No |
| MINIO_ACCESS_KEY | MinIO access key (username) | minio | Yes for production |
| MINIO_SECRET_KEY | MinIO secret key (password) | minio123 | Yes for production |
| MINIO_USE_HTTPS | Use HTTPS for MinIO | False | No |
| MINIO_EXTERNAL_ENDPOINT_USE_HTTPS | Use HTTPS for external endpoint | False | No |
| MINIO_URL_EXPIRY_HOURS | MinIO URL expiry in hours | 7 | No |
| MINIO_CONSISTENCY_CHECK_ON_START | Check consistency on startup | True | No |
| MINIO_PRIVATE_BUCKET | Private bucket name | private | No |
| MINIO_PUBLIC_BUCKET | Public bucket name | public | No |
| MINIO_BUCKET_CHECK_ON_SAVE | Check bucket existence on save | False | No |
| MINIO_BROWSER_REDIRECT_URL | MinIO browser redirect URL | http://localhost/minio-console/ | No |
| MINIO_SERVER_URL | MinIO server URL | http://localhost/ | No |

Setup Steps:

  1. In production, set strong credentials for MINIO_ACCESS_KEY and MINIO_SECRET_KEY.
  2. When deploying to a domain other than localhost, you must change MINIO_EXTERNAL_ENDPOINT to your domain (e.g., example.com), as this variable controls presigned URL generation for file downloads and uploads.
  3. If using HTTPS, set MINIO_USE_HTTPS=True and MINIO_EXTERNAL_ENDPOINT_USE_HTTPS=True.
  4. Update MINIO_BROWSER_REDIRECT_URL and MINIO_SERVER_URL to match your domain.
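
For a deployment at example.com served over HTTPS, the MinIO section of .env might look like this sketch (example.com and the credentials are placeholders; see step 3 above for the HTTPS flags):

# docker/.env: MinIO for a production domain (placeholder values)
MINIO_ACCESS_KEY=your-minio-username
MINIO_SECRET_KEY=your-strong-minio-password
MINIO_EXTERNAL_ENDPOINT=example.com
MINIO_EXTERNAL_ENDPOINT_USE_HTTPS=True
MINIO_BROWSER_REDIRECT_URL=https://example.com/minio-console/
MINIO_SERVER_URL=https://example.com/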

CORS Settings

CORS settings control Cross-Origin Resource Sharing:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| CSRF_TRUSTED_ORIGINS | Trusted origins for CSRF | Empty | No |
| CORS_ALLOWED_ORIGINS | Allowed origins for CORS | Empty | No |
| CORS_ALLOWED_ORIGIN_REGEXES | Regexes for CORS origins | Empty | No |
| CORS_ALLOW_ALL_ORIGINS | Allow all origins | False | No |

Setup Steps:

  1. In production, add your domain to CSRF_TRUSTED_ORIGINS and CORS_ALLOWED_ORIGINS, e.g., CSRF_TRUSTED_ORIGINS=https://example.com,https://www.example.com.
  2. Avoid setting CORS_ALLOW_ALL_ORIGINS=True in production for security.
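
Continuing the example.com example, the CORS section of .env might look like this (the domains are placeholders):

# docker/.env: trust only your own origins
CSRF_TRUSTED_ORIGINS=https://example.com,https://www.example.com
CORS_ALLOWED_ORIGINS=https://example.com,https://www.example.com
CORS_ALLOW_ALL_ORIGINS=False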

Authentication Settings

Authentication settings control user authentication:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| IS_ENTERPRISE_MODE_ACTIVE | Enterprise mode | False | No |
| IS_LOGIN_ACTIVE | Enable login functionality | True | No |
| IS_SIGNUP_ACTIVE | Enable signup functionality | False | No |
| IS_GITHUB_LOGIN_ACTIVE | Enable GitHub login | False | No |
| IS_GOOGLE_LOGIN_ACTIVE | Enable Google login | False | No |
| GITHUB_CLIENT_ID | GitHub OAuth client ID | Empty | Required for GitHub login |
| GITHUB_CLIENT_SECRET | GitHub OAuth client secret | Empty | Required for GitHub login |
| GOOGLE_CLIENT_ID | Google OAuth client ID | Empty | Required for Google login |
| GOOGLE_CLIENT_SECRET | Google OAuth client secret | Empty | Required for Google login |

Setup Steps:

  1. Note that signup is disabled by default (IS_SIGNUP_ACTIVE=False).
  2. Social logins are disabled by default.
  3. For GitHub social login:

    • Create an OAuth app at https://github.com/settings/developers.
    • Set the callback URL to http://your-domain.com/api/auth/github/callback/.
    • Add client ID and secret to environment variables.
  4. For Google social login:

    • Create credentials at https://console.developers.google.com.
    • Set authorized redirect URI to http://your-domain.com/api/auth/google/callback/.
    • Add client ID and secret to environment variables.
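
As an illustration, enabling signup together with GitHub login might look like this in .env (the client ID and secret come from the OAuth app you created; the values here are placeholders):

# docker/.env: enable signup and GitHub OAuth (placeholder credentials)
IS_SIGNUP_ACTIVE=True
IS_GITHUB_LOGIN_ACTIVE=True
GITHUB_CLIENT_ID=your-github-oauth-client-id
GITHUB_CLIENT_SECRET=your-github-oauth-client-secret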

Email Settings

Email settings control email sending:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| EMAIL_BACKEND | Email backend | Django SMTP Backend | No |
| EMAIL_HOST | SMTP host | Empty | Required for email |
| EMAIL_PORT | SMTP port | 587 | No |
| EMAIL_USE_TLS | Use TLS for SMTP | True | No |
| EMAIL_HOST_USER | SMTP username | Empty | Required for email |
| EMAIL_HOST_PASSWORD | SMTP password | Empty | Required for email |
| DEFAULT_FROM_EMAIL | Default from email | Empty | Required for email |

Setup Steps:

  1. In production, set up a proper email service.
  2. For testing, use services like Mailhog or the built-in Postfix container.
  3. If using Gmail, generate an App Password.
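
A typical SMTP configuration might look like this (the host, account, and password are placeholders; use the values from your email provider):

# docker/.env: SMTP via an external provider (placeholder values)
EMAIL_HOST=smtp.example.com
EMAIL_PORT=587
EMAIL_USE_TLS=True
EMAIL_HOST_USER=notifications@example.com
EMAIL_HOST_PASSWORD=your-smtp-or-app-password
DEFAULT_FROM_EMAIL=notifications@example.com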

Scrapy Settings

Scrapy settings control web scraping with Scrapy:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| SCRAPY_USER_AGENT | User agent for scraping | WaterCrawl/0.5.0 (+https://github.com/watercrawl/watercrawl) | No |
| SCRAPY_ROBOTSTXT_OBEY | Obey robots.txt rules | True | No |
| SCRAPY_CONCURRENT_REQUESTS | Concurrent requests | 16 | No |
| SCRAPY_DOWNLOAD_DELAY | Download delay (seconds) | 0 | No |
| SCRAPY_CONCURRENT_REQUESTS_PER_DOMAIN | Requests per domain | 4 | No |
| SCRAPY_CONCURRENT_REQUESTS_PER_IP | Requests per IP | 4 | No |
| SCRAPY_COOKIES_ENABLED | Enable cookies | False | No |
| SCRAPY_HTTPCACHE_ENABLED | Enable HTTP cache | True | No |
| SCRAPY_HTTPCACHE_EXPIRATION_SECS | HTTP cache expiration | 3600 | No |
| SCRAPY_HTTPCACHE_DIR | HTTP cache directory | httpcache | No |
| SCRAPY_LOG_LEVEL | Scrapy log level | ERROR | No |

Setup Steps:

  1. Adjust these settings based on your scraping needs.
  2. Increase concurrent requests for more aggressive scraping (but be respectful of website rules).
  3. Add a download delay for more polite scraping.
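
For instance, a politer crawling profile could lower concurrency and add a small delay (the numbers are illustrative; tune them for the sites you crawl):

# docker/.env: a politer Scrapy profile (illustrative values)
SCRAPY_ROBOTSTXT_OBEY=True
SCRAPY_CONCURRENT_REQUESTS=8
SCRAPY_CONCURRENT_REQUESTS_PER_DOMAIN=2
SCRAPY_DOWNLOAD_DELAY=1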

Playwright Settings

Playwright settings control browser automation with Playwright:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| PLAYWRIGHT_SERVER | Playwright server URL | http://playwright:8000 | No |
| PLAYWRIGHT_API_KEY | Playwright API key | your-secret-api-key | Yes for production |
| PORT | Playwright service port | 8000 | No |
| HOST | Playwright service host | 0.0.0.0 | No |

Setup Steps:

  1. In production, set a strong PLAYWRIGHT_API_KEY for authentication between services.
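
One way to generate and set such a key (the openssl command is just one option):

# Generate a random API key
openssl rand -hex 32

# docker/.env
PLAYWRIGHT_API_KEY=paste-the-generated-key-here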

Integration Settings

Integration settings control third-party integrations:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| OPENAI_API_KEY | OpenAI API key | Empty | Required for AI features |
| STRIPE_SECRET_KEY | Stripe secret key | Empty | Required for payments |
| STRIPE_WEBHOOK_SECRET | Stripe webhook secret | Empty | Required for Stripe webhooks |
| GOOGLE_ANALYTICS_ID | Google Analytics ID | Empty | Optional |

Setup Steps:

  1. Get an API key from OpenAI for AI features.
  2. Set up a Stripe account for payment processing.
  3. Configure your webhook endpoint in the Stripe dashboard for Stripe webhooks.
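
The corresponding .env entries might look like this (all values are placeholders; copy the real keys from your OpenAI and Stripe dashboards):

# docker/.env: third-party integrations (placeholder keys)
OPENAI_API_KEY=sk-your-openai-api-key
STRIPE_SECRET_KEY=sk_live_your-stripe-secret-key
STRIPE_WEBHOOK_SECRET=whsec_your-stripe-webhook-secret
GOOGLE_ANALYTICS_ID=G-XXXXXXXXXX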

Feature Flags

Feature flags control feature availability:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| MAX_CRAWL_DEPTH | Maximum crawl depth | -1 (unlimited) | No |
| CAPTURE_USAGE_HISTORY | Capture usage history | True | No |

Setup Steps:

  1. Set MAX_CRAWL_DEPTH to a positive integer to limit crawl depth.
  2. Set CAPTURE_USAGE_HISTORY=False to disable usage tracking.
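
For example, to cap crawl depth and turn off usage tracking (the depth value is illustrative):

# docker/.env: limit crawls to 3 levels deep and disable usage history
MAX_CRAWL_DEPTH=3
CAPTURE_USAGE_HISTORY=False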

Frontend Settings

Frontend settings control the React frontend:

| Variable | Description | Default | Required? |
| --- | --- | --- | --- |
| API_BASE_URL | API base URL for frontend | /api | No |

Setup Steps:

  1. The default value /api works with the Nginx configuration.
  2. You can use an absolute URL (e.g., http://localhost/api) or a relative URL (e.g., /api).

Deployment Steps

Follow these steps to deploy WaterCrawl:

  1. Clone the Repository:
git clone https://github.com/watercrawl/watercrawl.git
cd watercrawl

This step downloads the WaterCrawl code to your local machine.

  2. Create Environment File:
cp docker/.env.example docker/.env

Copy the sample environment file to the actual usage file.

  3. Edit the Environment File: Set required variables, for example:
# At minimum for production, set these:
SECRET_KEY="your-generated-secret-key"
API_ENCRYPTION_KEY="your-generated-api-encryption-key" # Generate a new one using `python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"`
DEBUG=False
ALLOWED_HOSTS=your-domain.com
POSTGRES_PASSWORD=your-strong-password
MINIO_ACCESS_KEY=your-minio-username
MINIO_SECRET_KEY=your-minio-password
MINIO_EXTERNAL_ENDPOINT=your-domain.com  # CRITICAL: Set to your domain
PLAYWRIGHT_API_KEY=your-strong-api-key

Note that SECRET_KEY and API_ENCRYPTION_KEY require generating secure random values, and POSTGRES_PASSWORD, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, and PLAYWRIGHT_API_KEY should be set to strong passwords.

  4. Start Services:
cd docker
docker-compose up -d

Start Docker containers to run WaterCrawl.

  5. Initialize the Database (first run only):
docker-compose exec app python manage.py migrate
docker-compose exec app python manage.py createsuperuser

python manage.py migrate creates database tables, and python manage.py createsuperuser creates a superuser for management.

  6. Access the Application:
    After deployment, access different services via the following URLs:
  • Frontend: http://your-domain.com
  • API: http://your-domain.com/api
  • MinIO Console: http://your-domain.com/minio-console

Frequently Asked Questions

Connection Issues

  • Cannot connect to a service: Use docker-compose logs <service-name> to view Docker logs and understand specific error messages.
  • Database connection error: Ensure PostgreSQL is running with docker-compose ps.
  • Frontend not loading: Check JavaScript errors in the browser console to identify the issue.

Data Persistence Issues

  • Data lost after restart: Ensure Docker volumes are correctly configured to retain data after container restarts.
  • Cannot upload files: Check MinIO credentials and bucket configuration for correctness.

Performance Issues

  • Slow response times: Use docker stats to check resource usage and identify performance bottlenecks.
  • Memory issues: Adjust container memory limits or Gunicorn worker counts to optimize memory usage.

If you encounter other issues, you can submit an issue on the GitHub repository for further assistance.

Conclusion

WaterCrawl is a powerful and easy-to-deploy web crawling and data extraction tool that offers rich features and customizable options to meet the needs of different users. This article has covered WaterCrawl’s main features, the quick start steps, the deployment guide, and solutions to common problems. We hope it helps you put WaterCrawl to work extracting the data that matters to you.

When using WaterCrawl, take care to configure the environment variables correctly for each environment so the service runs securely and reliably. If you run into problems, check the logs, verify your configuration, or ask for help in the community. With a bit of setup, WaterCrawl can become a powerful assistant for data crawling and analysis.