Unlocking Historical Archives with AI: The SEB-OCR Technical Guide

Why We Need Intelligent Historical Document Processing

In political science, history, and archival research, vast collections of historical materials exist only as scanned images. Traditional OCR technology can recognize text but struggles with contextual relationships, cross-page references, and semantic structure. SEB-OCR addresses this gap: it uses multimodal AI models to convert disordered historical scans into structured, analyzable datasets.

[Figure: SEB-OCR system workflow. A five-step pipeline transforms images into structured data.]

Technical Architecture: The Five-Step Transformation Process

Step 1: Intelligent OCR Transcription

  • Core Technology: Google’s Gemini multimodal model
  • Key Innovations:

    • Adaptive rate limiter dynamically controls API call frequency
    • Parallel processing accelerates transcription (default: 10 threads)
    • Handcrafted prompts optimize historical document recognition

# Conceptual call structure (simplified; gemini.ocr is an illustrative name, not an actual SDK method)
gemini.ocr(image, prompt="Transcribe 19th-century printed text preserving original spellings")
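
The parallel transcription described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: transcribe_page is a hypothetical stand-in for the real Gemini call, and only the default of 10 threads comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_page(path: str) -> str:
    """Hypothetical stand-in for the real Gemini OCR request."""
    return f"transcription of {path}"


def transcribe_all(paths: list[str], max_workers: int = 10) -> dict[str, str]:
    """Transcribe pages concurrently and return a {path: text} mapping."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each
        # path with its own transcription.
        texts = list(pool.map(transcribe_page, paths))
    return dict(zip(paths, texts))


results = transcribe_all(["001.jpg", "002.jpg", "003.jpg"])
```

Threads suit this workload because each call spends most of its time waiting on the network rather than using the CPU.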

Step 2: Sliding-Window Entity Extraction

Parameter     Default   Function
WINDOW_SIZE   5         Consecutive pages per window
WINDOW_STEP   2         Window slide step (3-page overlap)

[Page1] [Page2] [Page3] [Page4] [Page5]                  ← Window 1
                [Page3] [Page4] [Page5] [Page6] [Page7]  ← Window 2 (3-page overlap)
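
One way to generate such windows is sketched below. The function name and the handling of the final, possibly shorter, window are assumptions; only the defaults (size 5, step 2) come from the table above.

```python
def sliding_windows(pages: list[int], size: int = 5, step: int = 2) -> list[list[int]]:
    """Split an ordered page list into overlapping windows."""
    if not pages:
        return []
    windows = []
    start = 0
    while True:
        windows.append(pages[start:start + size])
        # Stop once the current window reaches the last page.
        if start + size >= len(pages):
            break
        start += step
    return windows


windows = sliding_windows(list(range(1, 10)))
# windows[0] covers pages 1-5, windows[1] covers pages 3-7
```

With a size of 5 and a step of 2, consecutive windows share 5 - 2 = 3 pages, which matches the overlap shown in the diagram.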

Step 3: Incremental Caching System

output/
├── transcriptions/    # Raw OCR text
├── window_outputs/    # Window-level JSON cache
└── final_outputs/     # Processed results
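
A window-level cache along these lines might look like the sketch below. The helper name, the cache key format, and the stored JSON shape are all hypothetical; only the window_outputs/ directory name comes from the layout above.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_window(cache_dir: Path, window_id: str, compute: Callable[[], dict]) -> dict:
    """Return the cached result for a window, computing it only on a miss."""
    cache_file = cache_dir / f"{window_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result


calls = []


def fake_extract() -> dict:
    """Hypothetical extraction step; records how often it actually runs."""
    calls.append(1)
    return {"entities": ["Labour Party"]}


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "window_outputs"
    first = cached_window(cache_dir, "window_001_005", fake_extract)
    second = cached_window(cache_dir, "window_001_005", fake_extract)
```

The second call reads the JSON file instead of re-running the extraction, so an interrupted run can resume without repeating API calls for finished windows.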

Step 4: Semantic Deduplication

  1. Extract candidate entities (people/organizations/locations)
  2. Generate vectors using text-embedding-004
  3. Cluster by cosine similarity to merge duplicates
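
The three steps above can be illustrated with a small greedy clustering sketch. The 2-dimensional vectors are toy values standing in for real text-embedding-004 output, the greedy first-member comparison is an assumed strategy rather than the project's confirmed algorithm, and the 0.85 threshold is the default mentioned in the FAQ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster(names: list[str], vectors: list[list[float]], threshold: float = 0.85) -> list[list[str]]:
    """Greedily group names whose vectors are similar to a cluster's first member."""
    groups: list[list[int]] = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            groups.append([i])
    return [[names[i] for i in group] for group in groups]


names = ["Winston Churchill", "W. Churchill", "Labour Party"]
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]  # illustrative, not real embeddings
merged = cluster(names, toy_vectors)
```

Raising the threshold makes merging stricter and leaves more near-duplicates separate; lowering it merges more aggressively at the risk of joining distinct entities.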

Step 5: Dual-Format Output

  • entities.json: Machine-readable structured data
  • entities.csv: Researcher-friendly spreadsheet format
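
Writing both formats from one entity list could be sketched as follows. The field names (name, type, pages) are hypothetical, since the actual schema is not shown here; only the two file names come from the list above.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_outputs(entities: list[dict], out_dir: Path) -> None:
    """Write the same entities as entities.json and entities.csv."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "entities.json").write_text(
        json.dumps(entities, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    with open(out_dir / "entities.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["name", "type", "pages"])
        writer.writeheader()
        for entity in entities:
            # Flatten the page list so it fits in a single spreadsheet cell.
            writer.writerow({**entity, "pages": ";".join(map(str, entity["pages"]))})


sample = [{"name": "Labour Party", "type": "organization", "pages": [4, 5]}]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "final_outputs"
    write_outputs(sample, out)
    csv_lines = (out / "entities.csv").read_text(encoding="utf-8").splitlines()
    json_back = json.loads((out / "entities.json").read_text(encoding="utf-8"))
```

The JSON file keeps nested values such as page lists intact, while the CSV flattens them for spreadsheet tools.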

Practical Implementation Guide

Environment Setup

# Clone repository
git clone https://github.com/ALucek/seb-ocr.git
cd seb-ocr

# Install dependencies (uv recommended)
uv sync

Configuration Template (.env)

GEMINI_API_KEY = "your-secret-key"  # Required

# Tuning parameters (optional)
GEMINI_MODEL = "gemini-2.5-flash"
MAX_WORKERS = 12      # Parallel threads
WINDOW_SIZE = 6       # Larger context windows
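
For illustration, settings like these could be read from the environment as sketched below. The function name and the returned dictionary keys are hypothetical, and the fallback model name is an assumption; the numeric fallbacks are the defaults stated earlier in this guide (10 threads, window size 5, step 2).

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read pipeline settings, failing early if the API key is missing."""
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required")
    return {
        "api_key": api_key,
        "model": env.get("GEMINI_MODEL", "gemini-2.5-flash"),  # assumed fallback
        "max_workers": int(env.get("MAX_WORKERS", "10")),
        "window_size": int(env.get("WINDOW_SIZE", "5")),
        "window_step": int(env.get("WINDOW_STEP", "2")),
    }


settings = load_settings({"GEMINI_API_KEY": "your-secret-key", "MAX_WORKERS": "12"})
```

Passing a plain dictionary instead of os.environ makes the loader easy to exercise without touching real credentials.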

File Naming Conventions

Place scans in input_images/ with embedded page numbers:

001.jpg      # Valid
page_42.png  # Valid
document.pdf # Invalid (missing page number)
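
A numeric sort over embedded page numbers might be implemented along these lines. Taking the first run of digits in the name is an assumption; the project may use a different rule for names that contain several numbers.

```python
import re
from typing import Optional


def page_number(filename: str) -> Optional[int]:
    """Return the first run of digits in the name as an int, or None."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else None


def sort_pages(filenames: list[str]) -> list[str]:
    """Drop files without a page number and sort the rest numerically."""
    numbered = [(page_number(name), name) for name in filenames]
    return [name for number, name in sorted(pair for pair in numbered if pair[0] is not None)]


ordered = sort_pages(["page_10.png", "page_2.png", "document.pdf", "001.jpg"])
```

Comparing integers rather than strings is what places page_10 after page_2, and files without digits are skipped entirely.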

Execution Modes

# Full pipeline (OCR → extraction → deduplication)
uv run main.py all

# OCR transcription only
uv run main.py transcribe

# Entity extraction (requires existing transcriptions)
uv run main.py extract

Core Technical Advantages

Context-Aware Processing

Traditional OCR tools fail with cross-page content. For example:

  • Political manifesto signatures on page 4
  • Related text on page 5

Window overlap solves this:

  1. Window 1 (pages 1-5) captures signatures
  2. Window 2 (pages 3-7) links to content
  3. System auto-merges entities

Dynamic Resource Management

graph LR
A[API Request] --> B{Quota Check}
B -->|Limit Exceeded| C[Wait]
C --> B
B -->|Within Limit| D[Send Request]
D --> E[Update Counters]
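
The quota check in the diagram can be approximated with a sliding-window limiter like the sketch below. It enforces a fixed quota and does not adapt to server responses, so it is a simplification of the adaptive limiter described earlier. The class name and the injectable clock are illustrative choices, and the 50-requests-per-minute default is the rate quoted in the FAQ.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls requests in any rolling period of seconds."""

    def __init__(self, max_calls=50, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent requests

    def acquire(self) -> None:
        """Block until a request is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget requests that have left the rolling window.
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                self.calls.append(now)  # update counters
                return
            # Quota exhausted: wait until the oldest request expires.
            self.sleep(self.period - (now - self.calls[0]))


class FakeClock:
    """Test clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
limiter = RateLimiter(max_calls=2, period=60.0, clock=fake.time, sleep=fake.sleep)
for _ in range(3):
    limiter.acquire()
```

With a quota of two calls per minute, the third acquire has to wait for the window to roll over, which the fake clock records as 60 simulated seconds.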

Validation Safeguards

  • Outputs validate against Pydantic schemas
  • Automatic retries for API errors
  • Real-time error logging
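
A combined validate, retry, and log loop might look like the following sketch. It uses a plain dictionary check as a standard-library stand-in for the Pydantic schemas, and the field names, retry count, and backoff delays are all assumptions made for illustration.

```python
import logging
import time

logger = logging.getLogger("seb_ocr_sketch")  # hypothetical logger name

REQUIRED_FIELDS = {"name": str, "type": str}  # hypothetical schema


def validate_entity(raw: dict) -> dict:
    """Raise ValueError if a required field is missing or has the wrong type."""
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(raw.get(field), expected):
            raise ValueError(f"invalid or missing field: {field}")
    return raw


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying with exponential backoff and logging each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1:
                raise
            logger.warning("attempt %d failed: %s", attempt + 1, exc)
            sleep(base_delay * 2 ** attempt)


responses = iter([{}, {"name": "Labour Party"}, {"name": "Labour Party", "type": "organization"}])
delays = []
entity = call_with_retries(lambda: validate_entity(next(responses)), sleep=delays.append)
```

Here the first two simulated responses fail validation and trigger retries, and the third passes; treating a schema failure like an API error is a design choice of this sketch, not a confirmed behavior of the tool.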

Frequently Asked Questions (FAQ)

Q: How does it handle non-sequential page numbers?
A: Files sort numerically by embedded digits (“page_10” after “page_2”)

Q: Can it process handwritten documents?
A: Optimized for printed text; handwriting depends on Gemini’s capabilities

Q: How to adjust deduplication sensitivity?
A: Modify cosine similarity threshold in source code (default=0.85)

Q: Supported image formats?
A: .jpg, .png, .webp via PIL library

Q: Processing time for 100 pages?
A: ~25 minutes at default 50 RPM rate

Academic Application Case Study

Research team analyzing 1945-1950 parliamentary records:

  1. Processed 800 scanned pages
  2. Identified 1,247 political entities
  3. Detected 17 duplicate entries
  4. Generated voting behavior network analysis

Conclusion: Democratizing Research Access

SEB-OCR’s significance extends beyond technology: it lowers research barriers. Tasks that once required weeks of manual effort can now be run with a few simple commands. As the creator notes: “We don’t replace researchers; we liberate them to focus on genuine discovery.”

The project is available under the MIT License at https://github.com/ALucek/seb-ocr


Content derived exclusively from seb-ocr documentation. Technical specifications subject to official documentation.