GPT Crawler: Effortlessly Crawl Websites to Build Your Own AI Assistant

Have you ever wondered how to quickly transform the wealth of information on a website into a knowledge base for an AI assistant? Imagine being able to ask questions about your project documentation, blog posts, or even an entire website’s content through a smart, custom-built assistant. Today, I’m excited to introduce you to GPT Crawler, a powerful tool that makes this possible. In this comprehensive guide, we’ll explore what GPT Crawler is, how it works, and how you can use it to create your own custom AI assistant. Whether you’re a developer, content creator, or simply curious about AI, this post has something for you.


What is GPT Crawler?

GPT Crawler is an open-source tool designed to crawl websites and extract content, generating knowledge files that can be uploaded to OpenAI to create custom GPTs or assistants. What sets GPT Crawler apart is its efficiency and flexibility. It can automatically crawl multiple pages, filter content based on your specific criteria, and produce a JSON file ready for use with OpenAI’s platforms. Whether you want to build an AI assistant for your documentation or create a knowledge base from a website, GPT Crawler simplifies the process.

Key Features of GPT Crawler

  • Automated Crawling: Crawl one or multiple URLs with ease, saving you from manually copying and pasting content.
  • Content Filtering: Use CSS selectors to extract only the information you need, such as article bodies or specific sections.
  • Customizable Configuration: Set limits on the number of pages to crawl, file sizes, and more to tailor the output to your needs.
  • Multiple Deployment Options: Run it locally, in a Docker container, or as an API service for greater flexibility.
  • Seamless Integration with OpenAI: Generate knowledge files that can be directly uploaded to OpenAI to create custom GPTs or assistants.

Why Use GPT Crawler?

You might be asking, “Why not just manually copy the content?” Here’s why GPT Crawler is a game-changer:

  • Time-Saving: Automate the extraction of large amounts of data from multiple pages.
  • Precision: Target specific parts of a webpage, ensuring you only get the content that matters.
  • Scalability: Easily crawl entire websites or documentation hubs without hassle.
  • Ease of Use: No advanced programming skills are required—just a few configuration tweaks and you’re ready to go.

A Real-World Example

Let’s say you want to create an AI assistant that can answer questions about Builder.io, a popular visual development platform for building websites and apps. Builder.io’s documentation is available at https://www.builder.io/c/docs/developers. With GPT Crawler, you can crawl these docs, generate a knowledge file, and upload it to OpenAI to create an assistant capable of answering questions like, “How do I integrate Builder.io into my website?” This process is straightforward and practical, making it an excellent starting point for your own projects.


How to Get Started with GPT Crawler

Don’t worry—I’ll guide you through every step. We’ll begin with the most common method: running GPT Crawler locally. Later, we’ll explore alternative options like using Docker or running it as an API service.

Running GPT Crawler Locally

Step 1: Clone the Repository

You’ll need a computer with Node.js (version 16 or higher) installed. Open your terminal and run the following command to download the GPT Crawler code:

git clone https://github.com/builderio/gpt-crawler

Step 2: Install Dependencies

Navigate to the project folder and install the necessary packages:

cd gpt-crawler
npm i

This step may take a few minutes, so be patient.

Step 3: Configure the Crawler

Open the config.ts file in the project directory. This is where you tell the tool what to crawl and how to crawl it. You’ll need to adjust a few key settings:

  • url: The starting URL for the crawl.
  • match: A pattern to determine which pages to crawl further.
  • selector: The CSS selector that identifies the content you want to extract.
  • maxPagesToCrawl: The maximum number of pages to crawl.
  • outputFileName: The name of the generated JSON file.

For example, to crawl the Builder.io documentation, you can use:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

Understanding Configuration Options

Here’s a breakdown of the most important configuration options:

  • url: The starting URL for the crawl (e.g., https://www.example.com).
  • match: Pattern that determines which pages are crawled (e.g., https://www.example.com/**).
  • selector: CSS selector used for content extraction (e.g., .content).
  • maxPagesToCrawl: Maximum number of pages to crawl (e.g., 50).
  • outputFileName: Name of the output JSON file (e.g., output.json).
  • maxFileSize (optional): Maximum file size in MB (e.g., 10).
  • maxTokens (optional): Maximum number of tokens (e.g., 10000).

Feel free to adjust these settings based on your project. For instance, if you only need to crawl 10 pages, set maxPagesToCrawl to 10.

Step 4: Run the Crawler

Once your configuration is ready, run the following command:

npm start

The crawler will begin its work, and upon completion, you’ll find the output.json file in the project’s root directory. This file contains the extracted content, ready to be uploaded to OpenAI.
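
If you open the file, you’ll see one entry per crawled page. The exact field names can differ between versions of the tool, and the values below are only illustrative, but each entry generally pairs a page’s title and URL with the text captured by your selector:

[
  {
    "title": "Builder.io developer documentation",
    "url": "https://www.builder.io/c/docs/developers",
    "html": "…text extracted from the .docs-builder-container element…"
  }
]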


Alternative Methods for Running GPT Crawler

If local installation isn’t your preference, GPT Crawler offers two other convenient options:

1. Running in a Docker Container

For those comfortable with Docker, you can run GPT Crawler in a containerized environment. Here’s how:

  • Navigate to the containerapp directory.
  • Modify the config.ts file to suit your needs.
  • Follow the provided instructions to start the container.
  • Once the crawl is complete, the output.json file will be generated in the data folder.

This method is ideal for users who want to isolate the crawler’s environment or deploy it in a cloud setting.

2. Running as an API Service

If you prefer to interact with GPT Crawler over a network, you can run it as an API service. Here’s how to get started:

  • In the project root directory, run:

    npm run start:server
    
  • The server will start on port 3000 by default.
  • You can then send a POST request to the /crawl endpoint with your configuration JSON to initiate the crawl.
  • For detailed API documentation, visit /api-docs.

This option is perfect for integrating GPT Crawler into larger systems or automating crawls via API calls.
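
For example, an automated call might look like the rough sketch below. It assumes Node.js 18+ (for the built-in fetch) and that the /crawl endpoint accepts the same fields as config.ts; the response shape is an assumption, so check /api-docs for the exact contract.

// Sketch: trigger a crawl on a locally running GPT Crawler API server.
async function runCrawl() {
  const response = await fetch("http://localhost:3000/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url: "https://www.builder.io/c/docs/developers",
      match: "https://www.builder.io/c/docs/**",
      selector: ".docs-builder-container",
      maxPagesToCrawl: 10,
      outputFileName: "output.json",
    }),
  });

  // Assumption: the server replies with JSON describing the crawl result.
  const result = await response.json();
  console.log(result);
}

runCrawl().catch(console.error);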


Uploading Your Crawled Data to OpenAI

Now that you have your output.json file, it’s time to turn it into an AI assistant. You can upload this file to OpenAI to create either a custom GPT or a custom assistant, depending on your needs.

Creating a Custom GPT

This option is ideal if you want a user-friendly interface for your AI assistant, which you can easily share with others.

Steps to Create a Custom GPT:

  1. Visit https://chat.openai.com/.
  2. Click on your name in the bottom left corner and select “My GPTs.”
  3. Choose “Create a GPT” and then “Configure.”
  4. Under the “Knowledge” section, click “Upload a file” and select your output.json.
  5. If you encounter an error due to file size, try splitting the file or adjusting the maxFileSize or maxTokens settings in your configuration.

Note: You may need a paid ChatGPT plan to access this feature.

Creating a Custom Assistant

This option is best if you want to integrate the AI assistant into your product via API.

Steps to Create a Custom Assistant:

  1. Go to https://platform.openai.com/assistants.
  2. Click “+ Create.”
  3. Choose “upload” and add your output.json file.

With these simple steps, your crawled data will be transformed into a fully functional AI assistant.
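
If you’d rather script this step, the official openai Node.js SDK can at least handle the upload. The sketch below only uploads output.json with the "assistants" purpose and prints the file ID; attaching that file to an assistant is then done in the dashboard or via the Assistants API, and the exact attachment call depends on the API version you’re using, so treat the upload as the only firm part here.

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function uploadKnowledgeFile() {
  // Upload the crawler's output so it can be attached to an assistant later.
  const file = await openai.files.create({
    file: fs.createReadStream("output.json"),
    purpose: "assistants",
  });
  console.log("Uploaded file id:", file.id);
}

uploadKnowledgeFile().catch(console.error);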


Frequently Asked Questions (FAQs)

You might still have some questions about GPT Crawler. Let’s address the most common ones:

What websites can GPT Crawler crawl?

GPT Crawler can crawl any publicly accessible website. You define the starting URL and the pattern for which pages to crawl using the url and match configuration options.

How do I find the correct CSS selector?

To find the CSS selector for the content you want to extract:

  • Open your browser and press F12 to access the developer tools.
  • Right-click on the element (e.g., the article body) and select “Inspect.”
  • In the developer tools, use the element’s class or ID to build a selector such as .article or #content (most browsers also offer a “Copy selector” option).
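
A quick way to sanity-check a candidate selector is to try it in the same developer tools console before you crawl:

// Run in the developer tools console on a page you plan to crawl.
document.querySelector(".docs-builder-container");           // should return the main content element
document.querySelectorAll(".docs-builder-container").length;  // ideally 1 (or only a few) matches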

What happens to the crawled content?

The tool extracts the text based on your specified CSS selector and compiles it into a JSON file. This file is structured to be compatible with OpenAI’s platforms.

What if my output file is too large to upload?

If your file exceeds OpenAI’s upload limits, you have two options:

  • Set maxFileSize or maxTokens in the configuration to reduce the file size.
  • Manually split the file into smaller parts and upload them separately.
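
For the first option, the limits are just two more fields in config.ts. The numbers below are illustrative; choose values that keep each file within OpenAI’s current upload limits:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  maxFileSize: 5,      // illustrative: cap the output file at about 5 MB
  maxTokens: 500000,   // illustrative: cap the output at about 500,000 tokens
};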

Can GPT Crawler crawl pages that require login?

Currently, GPT Crawler only supports crawling publicly accessible pages. It cannot handle pages that require authentication.


Best Practices for Using GPT Crawler

To get the most out of GPT Crawler, consider these tips:

  • Choose the Right Selector: Spend time finding the precise CSS selector to avoid extracting unnecessary content like navigation menus or footers.
  • Limit the Crawl Scope: Use the match pattern to focus only on relevant pages. For example, if you’re crawling documentation, set match to include only documentation URLs.
  • Monitor File Size: If you’re dealing with large websites, use maxPagesToCrawl and maxFileSize to keep the output manageable.
  • Test with a Small Crawl: Before running a full crawl, test with a small number of pages to ensure your configuration works as expected.
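
For the last tip, a test run can be as simple as temporarily lowering maxPagesToCrawl and writing to a separate output file:

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 5,                  // small trial crawl
  outputFileName: "test-output.json",  // hypothetical name to keep test output separate
};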

Why Custom AI Assistants Are a Game-Changer

Creating a custom AI assistant using GPT Crawler opens up a world of possibilities:

  • Enhanced User Experience: Provide instant, intelligent answers to user questions without them having to sift through documentation.
  • Time-Saving: Automate responses to common queries, freeing up your time for more complex tasks.
  • Scalable Knowledge Sharing: Turn your website’s content into a dynamic knowledge base that can be accessed anytime, anywhere.

Whether you’re building an assistant for internal use or for your customers, the benefits are clear: improved efficiency, better engagement, and a smarter way to leverage your content.


Troubleshooting Common Issues

Even with a straightforward tool like GPT Crawler, you might encounter some hiccups. Here are solutions to common problems:

Problem: The crawler isn’t extracting the content I want.

Solution: Double-check your CSS selector. Make sure it accurately targets the desired content. You can use your browser’s developer tools to verify the selector.

Problem: The output file is too large.

Solution: Adjust the maxPagesToCrawl setting to limit the number of pages crawled. Alternatively, use maxFileSize or maxTokens to control the file’s size.

Problem: The crawler is taking too long to run.

Solution: Reduce the number of pages to crawl or optimize your match pattern to exclude unnecessary pages.

Problem: I’m getting errors when uploading to OpenAI.

Solution: Ensure your file is in the correct JSON format and within the size limits. If needed, split the file or adjust your crawl settings.


Conclusion

In summary, GPT Crawler is a versatile and powerful tool that simplifies the process of crawling websites to create knowledge files for AI assistants. With its easy setup, flexible configuration, and seamless integration with OpenAI, you can quickly generate the data needed to build your own custom GPT or assistant. Whether you’re looking to enhance your project documentation, provide intelligent Q&A for users, or explore the possibilities of AI, GPT Crawler is an invaluable resource.

Ready to get started? Give GPT Crawler a try and unlock the potential of AI-powered knowledge today!