Extractous: The High-Performance Document Extraction Solution

Introduction

In today’s data-driven world, the ability to efficiently extract content and metadata from a wide range of document formats is crucial for businesses and developers alike. Whether you are processing legal documents, parsing financial reports, or analyzing web content, retrieving information quickly and accurately is essential. Yet most existing tools suffer from performance limitations, complex dependencies, or a reliance on external services.

Enter Extractous – an open-source tool that delivers exceptional performance, simple interfaces, and comprehensive format support for document content extraction.

What is Extractous?

Extractous is a high-performance tool specifically designed for extracting content and metadata from various document formats. It supports numerous common file types including PDF, Word, HTML, and many others, leveraging its Rust foundation and deep integration with Apache Tika to achieve outstanding speed and resource efficiency.

Unlike many tools that depend on external APIs or local services, Extractous operates entirely locally without requiring network calls or additional service processes. This approach results in lower latency, fewer dependencies, and better privacy control.

The Need for Extractous

You may have encountered or used other tools for parsing unstructured data, such as unstructured-io. These tools typically wrap multiple large Python libraries, resulting in slow performance and high memory usage. Additionally, due to Python’s Global Interpreter Lock (GIL) limitations, they often cannot fully utilize multi-core CPU processing capabilities.

Moreover, as these tools evolve, they tend to become increasingly complex, eventually requiring separate deployment as service frameworks. This not only adds complexity to system architecture but also creates unnecessary burdens for many application scenarios.

Extractous maintains a “simple and efficient” design philosophy:

  • Rust-based core: Leverages Rust’s high performance, memory safety, and concurrency capabilities
  • Native format support + Apache Tika extension: Provides native parsing for common formats while compiling Apache Tika into native libraries via GraalVM to support additional file types without virtual machines or garbage collection overhead
  • Multi-language bindings: Currently offers Python bindings with plans to support more languages, helping developers easily integrate into existing tech stacks
  • True multi-core utilization: Fundamentally avoids GIL issues, fully leveraging multi-core CPU advantages

Key Features at a Glance

  • High performance: Optimized for speed and low memory usage
  • Simple API: Provides clear and easy-to-use interfaces for text and metadata extraction
  • Automatic document type recognition: Automatically identifies file types and applies appropriate parsing methods
  • Broad format support: Covers most formats supported by Apache Tika
  • OCR text recognition: Supports text extraction from images and scanned documents through Tesseract integration
  • Multi-language support: Core built in Rust with Python bindings and planned JavaScript/TypeScript support
  • Comprehensive documentation: Offers extensive documentation and code examples to reduce learning curve
  • Commercial friendly: Apache 2.0 licensed, free for commercial use

Getting Started with Extractous

Installing and using Extractous is straightforward. Here are some basic examples to help you get acquainted with its usage.

Python Examples

First, you need to install the Extractous Python package:

pip install extractous

Example 1: Extract File Content to String

from extractous import Extractor

# Initialize extractor
extractor = Extractor()
extractor = extractor.set_extract_string_max_length(1000)
# Uncomment the next line if you need XML output
# extractor = extractor.set_xml_output(True)

# Extract text and metadata from file
result, metadata = extractor.extract_file_to_string("README.md")
print(result)
print(metadata)

Example 2: Stream File Content Extraction

For large files, stream processing can save memory.

from extractous import Extractor

extractor = Extractor()
# Extract file (also supports URL or byte array)
reader, metadata = extractor.extract_file("tests/quarkus.pdf")
# Extract URL
# reader, metadata = extractor.extract_url("https://www.google.com")
# Extract byte array
# with open("tests/quarkus.pdf", "rb") as file:
#     buffer = bytearray(file.read())
# reader, metadata = extractor.extract_bytes(buffer)

import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
result = ""
buffer = reader.read(4096)
while len(buffer) > 0:
    # Incremental decoding handles a multi-byte UTF-8 character
    # that happens to be split across 4096-byte chunks
    result += decoder.decode(buffer)
    buffer = reader.read(4096)
result += decoder.decode(b"", final=True)

print(result)
print(metadata)

Example 3: Extract PDF Content with OCR

First install Tesseract and the appropriate language packs (Debian example, German):

sudo apt install tesseract-ocr tesseract-ocr-deu

Then configure the extractor to use that language:

from extractous import Extractor, TesseractOcrConfig

extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("deu"))
result, metadata = extractor.extract_file_to_string("../../test_files/documents/deu-ocr.pdf")

print(result)
print(metadata)

Rust Examples

If you prefer using Rust directly, add the dependency to your Cargo.toml:

[dependencies]
extractous = "0.1"

Example 1: Extract File Content to String

use extractous::Extractor;

fn main() {
    // Initialize extractor (using consuming builder pattern)
    let mut extractor = Extractor::new().set_extract_string_max_length(1000);
    // Uncomment next line if you need XML output
    // extractor = extractor.set_xml_output(true);

    // Extract file content
    let (text, metadata) = extractor.extract_file_to_string("README.md").unwrap();
    println!("{}", text);
    println!("{:?}", metadata);
}

Example 2: Stream Content Extraction

use std::io::{BufReader, Read};
// use std::fs::File;  // For reading bytes
use extractous::Extractor;

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let file_path = &args[1];

    let extractor = Extractor::new();
    let (stream, metadata) = extractor.extract_file(file_path).unwrap();
    // Extract URL
    // let (stream, metadata) = extractor.extract_url("https://www.google.com/").unwrap();
    // Extract bytes
    // let mut file = File::open(file_path)?;
    // let mut buffer = Vec::new();
    // file.read_to_end(&mut buffer)?;
    // let (stream, metadata) = extractor.extract_bytes(&buffer).unwrap();

    // Since stream implements std::io::Read, we can perform buffered reading
    let mut reader = BufReader::new(stream);
    let mut buffer = Vec::new();
    reader.read_to_end(&mut buffer).unwrap();

    println!("{}", String::from_utf8(buffer).unwrap());
    println!("{:?}", metadata);
}

Example 3: Extract PDF Content with OCR

use extractous::{Extractor, PdfOcrStrategy, PdfParserConfig, TesseractOcrConfig};

fn main() {
    let file_path = "../test_files/documents/deu-ocr.pdf";

    let extractor = Extractor::new()
        .set_ocr_config(TesseractOcrConfig::new().set_language("deu"))
        .set_pdf_config(PdfParserConfig::new().set_ocr_strategy(PdfOcrStrategy::OCR_ONLY));
    // Extract file with configuration
    let (content, metadata) = extractor.extract_file_to_string(file_path).unwrap();
    println!("{}", content);
    println!("{:?}", metadata);
}

Performance Analysis: The Numbers Speak

Extractous was designed with performance as a core objective. But how does it actually perform?

In benchmarks on PDF documents from the SEC 10 filings dataset, Extractous demonstrates significant advantages:

  • Speed: on average 18 times faster than unstructured-io, so a task that previously took an hour finishes in a few minutes.
    [Figure: extraction speed comparison]
  • Memory efficiency: allocates roughly 11 times less memory than unstructured-io, letting Extractous run smoothly in resource-constrained environments while handling more work.
    [Figure: memory efficiency comparison]
  • Output quality: beyond speed and efficiency, extraction accuracy matters just as much, and tests show Extractous also holds a slight edge in content extraction quality.
    [Figure: output quality comparison]

These data points clearly demonstrate that Extractous excels not only in speed but also in resource utilization and output quality.

Supported File Formats

Extractous supports numerous common document formats. Here’s a partial list:

  • Microsoft Office: DOC, DOCX, PPT, PPTX, XLS, XLSX, RTF (legacy and modern Office formats)
  • OpenOffice: ODT, ODS, ODP (OpenDocument formats)
  • PDF: PDF (extracts embedded content; supports OCR)
  • Spreadsheets: CSV, TSV (plain-text spreadsheet formats)
  • Web documents: HTML, XML (parses and extracts web content)
  • E-books: EPUB
  • Text files: TXT, Markdown (plain-text formats)
  • Images: PNG, JPEG, TIFF, BMP, GIF, ICO, PSD, SVG (embedded text extracted via OCR)
  • Email: EML, MSG, MBOX, PST (extracts content, headers, and attachments)
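
Extractous detects the file type automatically, so callers simply pass paths. Even so, it can be handy to pre-filter a directory to the formats listed above before extraction; here is a small standard-library sketch (the extension set is an illustrative subset, not the full supported list):

```python
from pathlib import Path

# Illustrative subset of the extensions listed above; extend as needed
SUPPORTED = {".pdf", ".docx", ".xlsx", ".html", ".epub", ".md", ".txt", ".eml"}

def supported_files(directory):
    """Yield files under `directory` whose extension looks extractable."""
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path
```

Each yielded path can then be handed to Extractor as in the earlier examples; files the filter skips would otherwise just fail at extraction time.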

Frequently Asked Questions

Q: How does Extractous achieve high performance?

A: Extractous’s core is written in Rust, fully utilizing Rust’s zero-cost abstractions and memory safety features. Additionally, it compiles Apache Tika into native libraries via GraalVM, avoiding virtual machine overhead and achieving truly efficient execution.

Q: Is Java Runtime Environment (JRE) installation required?

A: No. Extractous compiles Apache Tika into native libraries through GraalVM, so no Java virtual machine or runtime environment installation is necessary.

Q: Does Extractous support Chinese OCR?

A: Yes. As long as the system has the appropriate Tesseract language pack (such as Chinese) installed and configured correctly through the set_language method, Extractous can handle Chinese OCR tasks.
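
As a configuration sketch for Simplified Chinese (assuming the corresponding Tesseract language pack is installed; the file name below is hypothetical):

```python
from extractous import Extractor, TesseractOcrConfig

# "chi_sim" is Tesseract's code for Simplified Chinese; it requires
# the matching language pack, e.g. on Debian:
#   sudo apt install tesseract-ocr-chi-sim
extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("chi_sim"))
# result, metadata = extractor.extract_file_to_string("scan.pdf")  # hypothetical file
```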

Q: Can it handle encrypted PDF files?

A: The current version of Extractous primarily focuses on content extraction functionality. Support for encrypted PDFs may require testing specific cases or consulting the latest documentation.

Q: How can I contribute?

A: Extractous is an open-source project welcoming community contributions. You can submit suggestions for improvements or new features through Issues or Pull Requests.

Advanced Usage Scenarios

Handling Large-Scale Document Processing

For enterprise environments processing thousands of documents daily, Extractous offers several advantages:

Batch Processing Example:

from extractous import Extractor
from concurrent.futures import ThreadPoolExecutor
import os

def process_file(file_path):
    extractor = Extractor()
    try:
        content, metadata = extractor.extract_file_to_string(file_path)
        return {"file": file_path, "content": content, "metadata": metadata}
    except Exception as e:
        return {"file": file_path, "error": str(e)}

# Process multiple files in parallel
def process_directory(directory_path):
    files = [os.path.join(directory_path, f) for f in os.listdir(directory_path)]
    results = []
    
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_file, files))
    
    return results
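
The dictionaries produced by process_file above can then be partitioned into successes and failures for reporting; a minimal sketch working on that same result shape:

```python
def summarize(results):
    """Split batch-extraction results into successes and failures."""
    ok = [r for r in results if "error" not in r]
    failed = [r for r in results if "error" in r]
    return ok, failed

# Example using the dictionary shape returned by process_file:
ok, failed = summarize([
    {"file": "a.pdf", "content": "...", "metadata": {}},
    {"file": "b.pdf", "error": "unsupported format"},
])
```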

Custom Extraction Configurations

Extractous allows detailed configuration for different document types:

from extractous import Extractor, PdfOcrStrategy, PdfParserConfig, TesseractOcrConfig

# Custom configuration for different scenarios
# (set_ocr_strategy takes a PdfOcrStrategy enum value, not a string)
pdf_config = PdfParserConfig().set_ocr_strategy(PdfOcrStrategy.AUTO)
ocr_config = TesseractOcrConfig().set_language("eng").set_density(300)

extractor = (Extractor()
             .set_pdf_config(pdf_config)
             .set_ocr_config(ocr_config)
             .set_extract_string_max_length(5000))

Integration with Data Pipelines

Extractous can be seamlessly integrated into existing data processing workflows:

Example: Integration with Apache Airflow

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
from extractous import Extractor

def extract_documents(**kwargs):
    extractor = Extractor()
    # Entries in op_kwargs arrive as keyword arguments
    sources = kwargs['sources']

    for source in sources:
        if source.startswith('http'):
            # extract_url returns a stream reader, not a string
            reader, metadata = extractor.extract_url(source)
            chunks = []
            chunk = reader.read(4096)
            while len(chunk) > 0:
                chunks.append(chunk)
                chunk = reader.read(4096)
            content = b"".join(chunks).decode("utf-8")
        else:
            content, metadata = extractor.extract_file_to_string(source)

        # Hand extracted content to downstream processing (user-defined)
        process_extracted_data(content, metadata)

# Define the Airflow DAG
dag = DAG('document_processing', schedule_interval='@daily',
          start_date=datetime(2023, 1, 1))

extract_task = PythonOperator(
    task_id='extract_documents',
    python_callable=extract_documents,
    op_kwargs={'sources': ['file1.pdf', 'file2.docx']},
    dag=dag
)

Performance Optimization Tips

  1. Memory Management: For memory-constrained environments, use stream-based extraction instead of string-based methods
  2. Concurrency: Utilize Rust’s native concurrency or Python threads for parallel processing; the Rust core is not bound by the GIL
  3. Caching: Implement caching mechanisms for frequently processed documents
  4. Configuration Tuning: Adjust parser configurations based on specific document types for optimal performance
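
Tip 3 can be sketched as a thin wrapper that keys results on a SHA-256 hash of the file’s bytes, so re-processing an unchanged document costs nothing. Here extract_fn stands in for any extraction call, such as Extractor().extract_file_to_string; the names are illustrative:

```python
import hashlib

def cached_extract(path, extract_fn, cache):
    """Return extract_fn(path), memoized on the file's content hash."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest not in cache:
        cache[digest] = extract_fn(path)  # only invoked on a cache miss
    return cache[digest]
```

A plain dict suffices within one process; sharing the cache across workers would require an external store such as Redis.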

Conclusion

Extractous represents a significant advancement in document content extraction technology. Through innovative architectural design, it delivers exceptional performance and efficiency while maintaining simplicity and ease of use. Whether processing large volumes of enterprise documents or integrating into complex data processing pipelines, Extractous provides a reliable and high-performance solution.

Its open-source nature means it can be freely used and modified, while comprehensive documentation and examples significantly lower the learning curve. If you’re looking for a tool that can quickly, accurately, and efficiently extract document content, Extractous is certainly worth your attention and consideration.

The project source code and more detailed information can be found on GitHub. As the project continues to evolve, we can expect even more features and improvements that will further enhance its capabilities in the document processing landscape.