Dolphin: A New Star in Multimodal Document Image Parsing

In the digital age, document image parsing has become a crucial task in information processing. Recently, ByteDance has open-sourced a novel multimodal document image parsing model called Dolphin, which brings new breakthroughs to this field. Dolphin focuses on parsing complex document images that contain a mix of text, tables, formulas, images, and other elements. Below, we will delve into this model to explore its working principles, architecture, functions, applications, and more.

Why Document Image Parsing Matters

Document image parsing plays a pivotal role in many information processing scenarios. From office automation to academic research, and from data analysis to content creation, efficient and accurate document image parsing can significantly enhance work efficiency and information utilization. However, the complexity of document images poses numerous challenges: text, tables, formulas, and other elements are often intertwined, and layouts vary greatly. This demands that parsing models possess strong analysis and understanding capabilities.

How Dolphin Works

Dolphin employs a two-stage “analyze-then-parse” approach to tackle the challenges of intertwined elements in documents.

The first stage is page-level layout analysis. It comprehensively analyzes the entire document image and generates a sequence of page elements in the natural reading order. For example, in a document containing text paragraphs and tables, it first identifies the text part and then locates the table.

The second stage is element-level parallel parsing. In this stage, Dolphin uses different types of “heterogeneous anchors” and task-specific prompts to efficiently parse individual elements. For instance, when dealing with table elements, it can recognize the table structure and extract content through specific prompts; for formula elements, it can switch to a formula parsing mode. Moreover, it supports parallel processing, enabling simultaneous parsing of multiple elements and greatly improving overall efficiency.
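The two-stage flow can be sketched in a few lines. In this illustration the layout analyzer and the per-type parsers are stubs standing in for the model's prompted decoding; only the control flow (reading-order analysis first, then parallel element parsing) reflects Dolphin's described design.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_layout(page_image):
    """Stage 1 (stubbed): return page elements in natural reading order.
    In the real model this comes from page-level layout analysis."""
    return [
        {"type": "text", "box": (0.05, 0.05, 0.95, 0.40)},
        {"type": "table", "box": (0.05, 0.45, 0.95, 0.75)},
        {"type": "formula", "box": (0.10, 0.80, 0.90, 0.90)},
    ]

# Task-specific parsers selected per element type; stubbed here, whereas
# Dolphin selects a task-specific prompt for each element.
PARSERS = {
    "text": lambda el: "paragraph text",
    "table": lambda el: "<table>...</table>",
    "formula": lambda el: r"E = mc^2",
}

def parse_page(page_image):
    elements = analyze_layout(page_image)        # stage 1: reading order
    with ThreadPoolExecutor() as pool:           # stage 2: parallel parsing
        results = list(pool.map(lambda el: PARSERS[el["type"]](el), elements))
    return results

print(parse_page("page_1.png"))
```

Because stage two treats each element independently, the elements can be dispatched concurrently, which is where the efficiency gain over sequential decoding comes from.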

Dolphin’s Model Architecture

Dolphin is based on a vision encoder-decoder architecture and integrates several advanced technologies.

The vision encoder leverages the Swin Transformer, a powerful backbone for extracting visual features from document images. It captures rich visual information such as the shape, layout, and color of text, as well as lines in tables and contours in images. These visual features serve as the foundation for subsequent parsing.

The text decoder is based on MBart, a strong text generation model that decodes the visual features extracted by the encoder into readable text content. Whether the input contains continuous text paragraphs, text inside tables, or symbols in formulas, the decoder can produce accurate output.

More importantly, Dolphin features a prompt-based interface that acts as a flexible control layer, directing parsing tasks through natural language prompts. For example, when we want to extract table content, we can provide a prompt like “extract the table”; when we focus on the meaning of text paragraphs, we can input an instruction like “parse the text”. This flexibility allows Dolphin to adapt to a wide variety of parsing requirements.

In addition, Dolphin has been integrated into the Hugging Face Transformers ecosystem. This enables developers to easily incorporate Dolphin into their projects and work collaboratively with other Transformers models, thereby expanding its application scope.
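Because Dolphin ships in the Transformers ecosystem, a typical call looks roughly like the sketch below. The model id, the prompt wording, and the generation details here are illustrative assumptions, not Dolphin's documented interface; the demo scripts in the official repository define the exact prompt format.

```python
# Hypothetical task prompts for illustration; Dolphin's real prompt strings
# are defined in its repository.
TASK_PROMPTS = {
    "layout": "Parse the reading order of this document.",
    "text": "Read text in the image.",
    "table": "Parse the table in the image.",
    "formula": "Read formula in the image.",
}

def prompt_for(element_type: str) -> str:
    """Map an element type to its task-specific prompt (falls back to text)."""
    return TASK_PROMPTS.get(element_type, TASK_PROMPTS["text"])

def parse_image(image_path: str, element_type: str = "text") -> str:
    """One prompted parsing pass; requires transformers, torch, and Pillow.
    Downloads the weights from the Hugging Face Hub on first use."""
    import torch
    from PIL import Image
    from transformers import AutoProcessor, VisionEncoderDecoderModel

    processor = AutoProcessor.from_pretrained("ByteDance/Dolphin")  # assumed id
    model = VisionEncoderDecoderModel.from_pretrained("ByteDance/Dolphin").eval()

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    prompt_ids = processor.tokenizer(
        prompt_for(element_type), add_special_tokens=False, return_tensors="pt"
    ).input_ids
    output_ids = model.generate(
        pixel_values=pixel_values, decoder_input_ids=prompt_ids, max_new_tokens=1024
    )
    return processor.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Swapping the prompt string is all it takes to switch tasks, which is the practical payoff of the prompt-based interface.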

Dolphin’s Functions and Application Scenarios

Dolphin supports two main parsing modes, covering a broad range of document image processing needs.

One is page parsing mode, which is used to process entire document images. For example, when we have a scanned academic paper document and want to digitize all its content, including text, charts, formulas, etc., we can use page parsing mode. Dolphin will parse the elements on the page in the natural reading order and generate complete digital content.

The other is element parsing mode, which focuses on parsing specific element images. For instance, in a financial report, if we are only interested in the data within a particular table, we can crop out that table part and use element parsing mode for precise parsing. Similarly, for formula elements in a document, we can extract the formula content for in-depth analysis.
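A region reported by page-level analysis can be cropped out and fed to element parsing mode. The sketch below assumes the layout stage reports normalized (0–1) bounding boxes, which is an assumption about the output format rather than Dolphin's documented schema.

```python
def to_pixel_box(norm_box, width, height):
    """Convert a normalized (x0, y0, x1, y1) box to integer pixel
    coordinates, clamped to the image bounds."""
    x0, y0, x1, y1 = norm_box
    clamp = lambda v, hi: max(0, min(int(round(v * hi)), hi))
    return (clamp(x0, width), clamp(y0, height), clamp(x1, width), clamp(y1, height))

def crop_element(image_path, norm_box, out_path):
    """Crop one layout region so it can be handed to element parsing mode.
    Requires Pillow."""
    from PIL import Image
    img = Image.open(image_path)
    img.crop(to_pixel_box(norm_box, img.width, img.height)).save(out_path)
```

The same cropping step covers the financial-report scenario above: crop the table's box, save it, and run element parsing on the saved image.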

Dolphin has demonstrated excellent performance in various page-level and element-level parsing tasks. In page-level parsing, it can accurately identify elements such as document titles, paragraphs, and image locations. In element-level parsing, it can effectively handle complex situations like multi-level table headers and merged cells. Its lightweight architecture and parallel mechanism ensure efficient operation. Whether dealing with simple document images or complex multi-element documents, Dolphin can quickly deliver parsing results.

Dolphin’s Technical Specifications

Dolphin has 398 million parameters, a moderate size that ensures model performance while avoiding excessive computational resource consumption. It supports Chinese and English, making it suitable for processing bilingual document images and meeting a wide range of language requirements.

In terms of functionality, Dolphin covers capabilities such as OCR (optical character recognition), layout analysis, and table extraction. Its multimodal nature allows it to comprehensively utilize visual and textual information, enhancing parsing accuracy and comprehensiveness. As a visual-language model, Dolphin offers a new perspective and solution for document image parsing.

Under the MIT License, Dolphin allows developers to freely use, modify, and distribute the model, promoting its dissemination and application. The file format is Safetensors, supporting FP16 and I64 tensor types, which ensures efficient storage and rapid computation.

How to Use Dolphin?

To use Dolphin, you can visit its GitHub repository (https://github.com/bytedance/Dolphin), where detailed usage guidelines and example code are provided.

For page parsing, there is a demo_page_hf.py script available for reference. With this script, developers can feed an entire document image into the Dolphin model and obtain parsing results. For example, given a scanned academic-paper image, Dolphin outputs the structured content of the paper, including title, authors, abstract, body paragraphs, figure captions, references, and more, facilitating subsequent text processing and data analysis.

For element parsing, the demo_element_hf.py script demonstrates how to parse a specific element. For instance, cropping a table out of a product manual and running it through this script precisely extracts the product parameters, prices, and other fields it contains, providing accurate data for product management or market analysis.

Dolphin’s Advantages and Innovations

Compared to traditional document image parsing methods, Dolphin has numerous advantages and innovations.

Firstly, its two-stage parsing process aligns more closely with how humans read and understand documents. Conducting overall layout analysis first to establish the element sequence, followed by in-depth parsing of each element, yields more accurate and logically consistent results.

Secondly, Dolphin’s multimodal architecture is a significant innovation. It combines visual and textual information organically, enabling the full utilization of various cues in document images. For example, when parsing tables, it can not only recognize data through text but also determine table lines and row-column structures via visual features, improving table parsing accuracy.

Moreover, the prompt interface greatly enhances Dolphin’s flexibility and extensibility. Simple natural language prompts can direct the model to perform different parsing tasks without complex retraining or model adjustments. This lets Dolphin quickly adapt to new document image parsing requirements, making it highly practical.

Lastly, Dolphin’s efficiency and lightweight architecture are also notable advantages. While ensuring parsing performance, it can rapidly process document images, saving time and computational resources. This is particularly important for large-scale document image processing tasks.

Future Outlook for Dolphin

The open-sourcing of Dolphin injects new vitality into the field of document image parsing, and its future prospects are promising.

As technology evolves, Dolphin is expected to achieve further breakthroughs in several areas. One is the expansion of language support: it currently focuses on Chinese and English, but support could be extended to more languages to meet global document image parsing needs.

Another area is the enhancement of its ability to understand complex document images. Although Dolphin can already handle many complex document elements, real-world documents vary almost without limit, and many special cases still require optimization. For example, for documents with artistic fonts, intricate layouts, heavy color overlays, or a mix of handwritten and printed text, the parsing algorithms can be further refined to improve adaptability.

Furthermore, Dolphin can achieve deeper integration with other technologies. It can be more closely combined with natural language processing technologies, data visualization technologies, knowledge graph technologies, etc. For instance, it can perform semantic analysis and sentiment analysis on parsed text content, visualize table data, and construct knowledge graphs from the knowledge points in documents, providing users with practical intelligent solutions.

Finally, further optimization of model performance and efficiency is expected. By improving model architecture, optimizing training methods, and applying hardware acceleration technologies, Dolphin can achieve faster parsing speeds and higher parsing accuracy while reducing computational resource requirements. This will enable its application in more devices and scenarios.

In summary, Dolphin, the open-sourced multimodal document image parsing model from ByteDance, boasts an innovative architecture, powerful functionality, and efficient performance. It holds a significant position in the field of document image parsing and offers broad application prospects. Whether for academic research or industrial applications, Dolphin provides valuable tools and technical support, driving the continuous advancement of document image parsing technologies.