Introduction: The Challenge of Document Understanding in the Digital Age

In today’s enterprise environments, organizations process countless documents daily—contracts, reports, academic papers, technical manuals, and more. While traditional optical character recognition (OCR) technologies can extract text from these documents, they often fail to preserve the underlying structure: tables become disorganized, mathematical formulas render incorrectly, code snippets lose their formatting, and even paragraph sequencing can become disrupted. This structural loss significantly reduces information retrieval efficiency and creates substantial challenges for automated document processing pipelines.

IBM’s recently released Granite-Docling-258M represents a transformative approach to these challenges. This completely open-source, Apache 2.0 licensed multimodal vision-language model specializes in end-to-end document conversion. It goes beyond traditional OCR by understanding document structure and preserving layout elements through an innovative intermediate representation called DocTags.

What Makes Granite-Docling-258M Different?

Granite-Docling-258M is a 258-million parameter multimodal model that accepts both image and text inputs to produce structured text outputs. Its standout capability is preserving original document layout while accurately extracting tables, code, mathematical formulas, lists, captions, and other structural elements. The model outputs DocTags—a structured intermediate representation that seamlessly converts to Markdown, HTML, or JSON formats, providing high-quality structured input for downstream tasks like retrieval-augmented generation (RAG) or data analytics.

Think of it as an intelligent assistant that not only “sees” text but also “understands” document structure and relationships between elements.

Key Improvements Over Previous Models

Granite-Docling-258M serves as the production-ready successor to the earlier SmolDocling-256M preview version. IBM implemented several crucial enhancements in this release:

  • Enhanced Language Model: Replaced the previous language model with Granite 165M, significantly improving text understanding and generation capabilities
  • Advanced Vision Encoder: Implemented SigLIP2 as the vision encoder, substantially boosting image content comprehension accuracy
  • Improved Stability: Addressed previous issues with repetitive outputs or infinite loops, making the model more suitable for enterprise deployment

Quantitative evaluations demonstrate significant across-the-board improvements:

Performance Comparison: Granite-Docling-258M vs. SmolDocling-256M

Task Category                         Granite-Docling-258M   SmolDocling-256M
Layout Analysis (MAP)                 0.27                   0.23
Full-page OCR (F1)                    0.84                   0.80
Code Recognition (F1)                 0.988                  0.915
Mathematical Formula (F1)             0.968                  0.947
Table Recognition (TEDS-structure)    0.97                   0.82
Multimodal Understanding (MMStar)     0.30                   0.17
Comprehensive OCR (OCRBench)          500                    338

The data indicates that the new model not only performs better on every measured task but shows especially large gains in code and table recognition, where it approaches near-perfect scores, alongside clear improvements on the general multimodal and OCR benchmarks.

Architectural Overview: How Granite-Docling-258M Works

Model Architecture

Granite-Docling-258M builds upon the Idefics3 architecture but incorporates two critical modifications:

  1. Vision Encoder: Utilizes siglip2-base-patch16-512 in place of the vision encoder used in the original Idefics3, significantly enhancing image understanding capabilities
  2. Language Model: Employs IBM’s Granite 165M model as the text generation core

Connecting the visual and language modules is a pixel-shuffle projector that ensures efficient transfer of visual information to the language model.
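
As a rough sketch of the idea (not the model’s actual implementation), pixel shuffle performs a space-to-depth rearrangement: each r × r block of visual patch features is merged into a single token carrying r²·C channels, cutting the number of visual tokens the language model must process by a factor of r²:

import torch

def pixel_shuffle(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    # x: (batch, H, W, C) grid of vision-encoder patch features
    b, h, w, c = x.shape
    # Split the spatial grid into r x r neighborhoods
    x = x.reshape(b, h // r, r, w // r, r, c)
    # Bring each neighborhood's r*r patches next to the channel axis
    x = x.permute(0, 1, 3, 2, 4, 5)
    # Flatten: one token per neighborhood, with r*r*c channels each
    return x.reshape(b, (h // r) * (w // r), r * r * c)

A projection layer then maps these merged tokens into the language model’s embedding space.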

Training Data and Methodology

The model training incorporated several specialized datasets:

  • SynthCodeNet: Synthetic code snippets spanning over 50 programming languages
  • SynthFormulaNet: Synthetically generated mathematical expressions paired with LaTeX representations
  • SynthChartNet: Chart images annotated with structured table outputs
  • DoclingMatix: Curated corpus of real-world document pages from diverse domains

Training occurred on IBM’s Blue Vela supercomputing cluster equipped with NVIDIA H100 GPUs, providing substantial computational power for model development.

The training framework was nanoVLM, a lightweight, pure-PyTorch vision-language model training toolkit suited to fast iteration and experimentation.

The DocTags Revolution: Transforming Document AI

Traditional document processing pipelines typically follow a sequential approach: first extracting text with OCR, then attempting to reconstruct structure using rules or smaller models, and finally exporting to Markdown or HTML. This process is not only cumbersome but frequently loses significant structural information.

Granite-Docling-258M offers a more elegant solution: it directly outputs DocTags—a markup language designed by IBM to precisely describe document structure. DocTags provide explicit representation of:

  • Element types (headings, paragraphs, tables, formulas, etc.)
  • Element coordinates within pages
  • Logical relationships between elements (section ownership, reading order, etc.)

This approach enables downstream systems to perform more accurate retrieval, conversion, or analysis based directly on DocTags without worrying about information loss or misinterpretation.
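
For illustration, a simplified, hand-formatted DocTags fragment for a page with a title and a paragraph might look like the following (real output is a flat token stream; tag names and the normalized <loc_…> coordinate grid should be checked against the DocTags reference):

<doctag>
  <title><loc_45><loc_28><loc_455><loc_48>Quarterly Results</title>
  <text><loc_45><loc_60><loc_455><loc_112>Revenue grew 12 percent year over year ...</text>
</doctag>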

Practical Implementation: How to Use Granite-Docling-258M

Approach 1: Using the Docling Library (Recommended for Most Users)

The simplest method involves using the docling command-line tool or Python library:

# Installation
pip install docling

# Convert PDF to HTML and Markdown
docling --to html --to md --pipeline vlm --vlm-model granite_docling "your_document.pdf"

You can also invoke it programmatically in Python. Note that a bare DocumentConverter() uses Docling’s default (non-VLM) pipeline, so the VLM pipeline must be selected explicitly:

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDFs through the VLM pipeline, which defaults to granite-docling
# in recent Docling releases
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline)}
)
doc = converter.convert("your_document.pdf").document
markdown_output = doc.export_to_markdown()

Approach 2: Direct Usage with Transformers Library

For users wanting more direct control over the inference process, the transformers library provides direct access:

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained("ibm-granite/granite-docling-258M")

image = Image.open("your_document.png")

# Render the prompt with the chat template so the image tokens are placed correctly
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Convert this page to docling."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8192)
# Decode only the newly generated tokens and keep the DocTags special tokens
doctags = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
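
The decoded DocTags string can then be materialized as a DoclingDocument for export. Below is a minimal sketch with the docling-core package, following the pattern shown on the model card (API details may shift between versions):

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Pair the DocTags output with the page image it was generated from
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
print(doc.export_to_markdown())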

Approach 3: Batch Processing with vLLM

For users requiring high-volume document processing, vLLM offers much higher inference throughput. The snippet below reuses the chat-template prompt string built in Approach 2 and attaches one page image per request:

from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="ibm-granite/granite-docling-258M")
sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)
# Pair the chat-template prompt from Approach 2 with one image per request
batched_prompts = [{"prompt": prompt, "multi_modal_data": {"image": Image.open(p)}}
                   for p in ["page_1.png", "page_2.png"]]
outputs = llm.generate(batched_prompts, sampling_params=sampling_params)

Approach 4: Apple Silicon Optimization with MLX Version

IBM provides a specialized MLX version optimized for Apple Silicon chips, ideal for local execution on MacBook or Mac Studio devices:

# Install MLX version
pip install mlx-vlm

Detailed usage instructions are available on the Hugging Face MLX model page.
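
As a sketch, a command-line invocation via mlx-vlm could look like the following (flag names vary between mlx-vlm releases, so treat this as illustrative rather than definitive):

# Run the MLX build of the model on a page image
python -m mlx_vlm.generate \
  --model ibm-granite/granite-docling-258M-mlx \
  --max-tokens 4096 \
  --prompt "Convert this page to docling." \
  --image your_document.png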

Multilingual Capabilities

While Granite-Docling-258M is primarily optimized for English documents, it offers preliminary support for Japanese, Arabic, and Chinese. It’s important to note that multilingual support remains experimental, and processing quality for non-English documents may be less consistent than English performance.

Practical Applications and Use Cases

This model is particularly well-suited for several application scenarios:

  • Enterprise Document Digitization: Converting historical paper documents or scans into structured digital formats
  • Academic Paper Processing: Accurately extracting formulas, charts, and references from research papers
  • Legal and Contract Analysis: Preserving original structure and formatting of contractual terms and conditions
  • Technical Documentation Conversion: Processing manuals containing code, tables, and diagrams
  • Retrieval-Augmented Generation (RAG): Providing high-quality structured document sources for knowledge base systems

Understanding Limitations and Responsible Use

Despite its advanced capabilities, Granite-Docling-258M has certain limitations:

  • It is not a general-purpose vision-language model and is unsuitable for processing natural images or general visual question-answering tasks
  • Multilingual support is not yet fully mature, and processing quality for non-English documents may vary
  • Like all AI models, it may occasionally produce errors or hallucinated outputs, suggesting the need for human review in critical applications

IBM emphasizes principles of responsible use, recommending that users avoid applications that might perpetuate biases, disseminate misinformation, or enable automated decision-making without appropriate safeguards.

Evaluation Methodology and Performance Metrics

The comprehensive evaluation of Granite-Docling-258M employed multiple assessment frameworks:

  • docling-eval: For document-related tasks
  • lmms-eval: For MMStar and OCRBench evaluations
  • Task-specific datasets: For specialized capability assessments

The evaluation results demonstrate consistent improvements across all measured capabilities, with particularly notable gains in code recognition (F1 score of 0.988 vs. 0.915) and table recognition (TEDS-structure of 0.97 vs. 0.82).

Supported Instructions and Commands

Granite-Docling-258M responds to various instructional prompts for specialized processing:

  • Full conversion: “Convert this page to docling.”
  • Chart conversion: “Convert chart to table.” (short form: <chart>)
  • Formula conversion: “Convert formula to LaTeX.” (short form: <formula>)
  • Code conversion: “Convert code to text.” (short form: <code>)
  • Table conversion: “Convert table to OTSL.” (Lysak et al., 2023) (short form: <otsl>)
  • Targeted OCR: “OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>” (see the sketch after this list)
  • Element identification: “Identify element at: <loc_247><loc_482><loc_252><loc_486>”
  • Element retrieval: “Find all ‘text’ elements on the page, retrieve all section headers.”
  • Structural element detection: “Detect footer elements on the page.”
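
The location-based instructions can be assembled programmatically. The sketch below assumes the <loc_…> values index a normalized page-coordinate grid (per the DocTags convention) and uses targeted_ocr_prompt, a hypothetical helper of our own:

def targeted_ocr_prompt(left: int, top: int, right: int, bottom: int) -> str:
    # Build a targeted-OCR instruction from a bounding box given in
    # normalized <loc_...> grid coordinates (hypothetical helper)
    return ("OCR the text in a specific location: "
            f"<loc_{left}><loc_{top}><loc_{right}><loc_{bottom}>")

# Reproduces the instruction shown in the list above
print(targeted_ocr_prompt(155, 233, 206, 237))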

Integration with Existing Workflows

Granite-Docling-258M integrates seamlessly with various runtime environments:

  • Transformers: Standard Hugging Face transformers library
  • vLLM: High-throughput deployment for batch processing
  • ONNX: Framework-interoperable deployment
  • MLX: Apple Silicon-optimized execution

The model is designed as a component within larger Docling pipelines rather than as a general-purpose VLM, reflecting IBM’s focused approach to document-specific AI capabilities.

Conclusion: The Future of Document AI

Granite-Docling-258M represents a significant advancement in document AI technology. By combining advanced vision encoding, powerful language modeling, and the innovative DocTags representation, it delivers a comprehensive yet efficient document conversion solution. Its open-source availability under the Apache 2.0 license makes this technology accessible to organizations of all sizes.

Whether you’re researching intelligent document processing systems or developing enterprise-scale document solutions, Granite-Docling-258M offers a powerful tool worth exploring. Its development moves us closer to truly “intelligent” document understanding that preserves both content and context.

Frequently Asked Questions

Q: Can Granite-Docling-258M process handwritten documents?
A: The model is primarily optimized for printed documents and has limited capability with handwritten content.

Q: What document formats does the model support?
A: Through the Docling library, it can process PDF, Word, PowerPoint, and various image formats.

Q: How much computational resources are required to run this model?
A: With only 258 million parameters, the model can run on consumer-grade GPUs or even CPUs, with additional optimizations available for Apple Silicon.

Q: What is the DocTags format and how can it be converted to common formats?
A: DocTags is a structured representation format that can be easily converted to Markdown, HTML, or JSON using the Docling library.

Q: What types of content were included in the training data?
A: The training incorporated both public datasets and IBM’s synthetic document data, covering code, formulas, charts, and various other elements.

Q: Does the model support batch processing?
A: Yes, batch processing is supported through vLLM and other batch processing tools.

For those interested in exploring Granite-Docling-258M, visit the Hugging Face model page to experience the interactive demo or consult the Docling project documentation for comprehensive integration guidance.