MonkeyOCR: Revolutionizing Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

In the digital age, document parsing technology has become indispensable. Whether for academic research, business analysis, or daily office work, we need efficient and accurate tools to extract key information from various documents. Today, I am thrilled to introduce MonkeyOCR, a document parsing tool that adopts a unique Structure-Recognition-Relation (SRR) triplet paradigm, offering a fresh solution to document parsing challenges.

What is MonkeyOCR?

MonkeyOCR is a document parsing tool developed by researchers Zhang Li, Yuliang Liu, and others. It introduces the innovative SRR (Structure-Recognition-Relation) triplet paradigm, aiming to simplify the multi-tool pipeline of traditional modular approaches while avoiding the inefficiency of using large multimodal models for full-page document processing.
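To make the three stages of the SRR paradigm concrete, here is a minimal conceptual sketch in Python. Every function name and data shape below is an illustrative placeholder, not MonkeyOCR's actual API: the point is only to show how structure detection, per-block recognition, and relation (reading-order) prediction compose into one pipeline.

```python
# Conceptual sketch of the SRR triplet flow (illustrative only; these
# are NOT MonkeyOCR's real functions or data structures).

def detect_structure(page):
    # Stage 1 (Structure): locate layout blocks on the page.
    # Here we return two hard-coded blocks to keep the sketch runnable.
    return [
        {"type": "title", "bbox": (0, 0, 100, 20), "text_raw": "MonkeyOCR"},
        {"type": "paragraph", "bbox": (0, 30, 100, 90), "text_raw": "SRR..."},
    ]

def recognize_content(block):
    # Stage 2 (Recognition): read the content of each block
    # (text, formula, or table) independently.
    return {**block, "content": block["text_raw"]}

def predict_relations(blocks):
    # Stage 3 (Relation): order the recognized blocks into a reading
    # sequence; sorting by vertical position stands in for the model.
    return sorted(blocks, key=lambda b: b["bbox"][1])

def parse_page(page):
    blocks = [recognize_content(b) for b in detect_structure(page)]
    return predict_relations(blocks)
```

Because each stage only sees what it needs (a page, a block, a block list), the pipeline avoids both a sprawling multi-tool chain and a single model re-reading the full page for every subtask.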

Advantages of MonkeyOCR


  • High Efficiency: Compared to the pipeline-based method MinerU, MonkeyOCR achieves an average improvement of 5.1% across nine types of Chinese and English documents. It delivers a 15.0% gain on formulas and an 8.6% gain on tables. Against end-to-end models, its 3B-parameter model outperforms competitors like Gemini 2.5 Pro and Qwen2.5 VL-72B, achieving the best average performance on English documents. In terms of multi-page document parsing speed, MonkeyOCR processes 0.84 pages per second, surpassing MinerU (0.65) and Qwen2.5 VL-7B (0.12).

  • Support for Various Document Types: MonkeyOCR currently supports parsing of various Chinese and English document types, including books, slides, financial reports, textbooks, exam papers, magazines, academic papers, notes, and newspapers.
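To put the parsing-speed figures above in perspective, a quick back-of-the-envelope calculation shows what each reported rate means for a 100-page document:

```python
# Approximate wall-clock time to parse a 100-page document at the
# pages-per-second rates reported above.
rates = {"MonkeyOCR": 0.84, "MinerU": 0.65, "Qwen2.5 VL-7B": 0.12}
pages = 100
for name, pps in rates.items():
    print(f"{name}: {pages / pps:.0f} s")  # MonkeyOCR: ~119 s
```

At these rates, MonkeyOCR finishes the document in about two minutes, versus roughly two and a half minutes for MinerU and nearly fourteen minutes for Qwen2.5 VL-7B.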

Quick Installation and Usage

Environment Setup


  • Create and activate a Conda environment: Use the following commands to create a Python 3.10 environment named MonkeyOCR and activate it.
conda create -n MonkeyOCR python=3.10
conda activate MonkeyOCR

  • Clone the project repository: Clone the MonkeyOCR project repository to your local machine.
git clone https://github.com/Yuliang-Liu/MonkeyOCR.git
cd MonkeyOCR

  • Install PyTorch: Based on your CUDA version, refer to the PyTorch official website to find the appropriate installation command. For example, if you are using CUDA 12.4, you can install PyTorch 2.5.1, torchvision 0.20.1, and torchaudio 2.5.1 with the following command.
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

  • Install project dependencies: In the project root directory, run the following command to install the dependencies required by the project.
pip install .

Download Model Weights


  • Install HuggingFace Hub: Install the HuggingFace Hub library using pip.
pip install huggingface_hub

  • Download model weights: Run the following command to download the model weight files for MonkeyOCR.
python download_model.py

Running Inference Tasks


  • Basic inference command: Ensure you are in the MonkeyOCR project root directory and run the following command to parse a specified PDF file.
python parse.py path/to/your.pdf

  • Specify model path and configuration file: If you have downloaded the model weights and configuration file to a custom location, you can point the script at them explicitly.
python parse.py path/to/your.pdf -m model_weight/Recognition -c config.yaml
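To parse many documents in one go, you can wrap the command above in a small batch script. This sketch assumes only what the source shows: that parse.py accepts a single PDF path; the folder name docs/ is a placeholder.

```python
# Batch-parsing sketch: invoke parse.py once per PDF in a folder.
# Assumes parse.py takes a single PDF path, as in the command above.
import subprocess
import sys
from pathlib import Path

def build_commands(pdf_dir):
    # One parse.py invocation per PDF found in pdf_dir, sorted by name.
    return [[sys.executable, "parse.py", str(p)]
            for p in sorted(Path(pdf_dir).glob("*.pdf"))]

if __name__ == "__main__":
    for cmd in build_commands("docs/"):  # "docs/" is a placeholder path
        subprocess.run(cmd, check=True)
```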

Using the Gradio Demo


  • Install Gradio and pdf2image: Install the Gradio and pdf2image libraries in your environment to support the demo functionality.
pip install gradio==5.23.3
pip install pdf2image==1.17.0

  • Start the demo: Run the following command to launch the Gradio demo service.
python demo/demo_gradio.py

With this setup, you can upload PDF or image files through the Gradio interface and intuitively observe the parsing results of MonkeyOCR.

Benchmark Test Results

End-to-End Evaluation Results for Different Tasks

On the OmniDocBench dataset, MonkeyOCR-3B and MonkeyOCR-3B* deliver outstanding performance across multiple task metrics. The following table presents the performance of various models on different tasks:

| Model Type | Methods | Overall Edit↓ EN | Overall Edit↓ ZH | Text Edit↓ EN | Text Edit↓ ZH | Formula Edit↓ EN | Formula Edit↓ ZH | Formula CDM↑ EN | Formula CDM↑ ZH | Table TEDS↑ EN | Table TEDS↑ ZH | Table Edit↓ EN | Table Edit↓ ZH | Read Order Edit↓ EN | Read Order Edit↓ ZH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | MinerU | 0.150 | 0.357 | 0.061 | 0.215 | 0.278 | 0.577 | 57.3 | 42.9 | 78.6 | 62.1 | 0.180 | 0.344 | 0.079 | 0.292 |
| Pipeline Tools | Marker | 0.336 | 0.556 | 0.080 | 0.315 | 0.530 | 0.883 | 17.6 | 11.7 | 67.6 | 49.2 | 0.619 | 0.685 | 0.114 | 0.340 |
| Pipeline Tools | Mathpix | 0.191 | 0.365 | 0.105 | 0.384 | 0.306 | 0.454 | 62.7 | 62.1 | 77.0 | 67.1 | 0.243 | 0.320 | 0.108 | 0.304 |
| Pipeline Tools | Docling | 0.589 | 0.909 | 0.416 | 0.987 | 0.999 | 1 | – | – | 61.3 | 25.0 | 0.627 | 0.810 | 0.313 | 0.837 |
| Pipeline Tools | Pix2Text | 0.320 | 0.528 | 0.138 | 0.356 | 0.276 | 0.611 | 78.4 | 39.6 | 73.6 | 66.2 | 0.584 | 0.645 | 0.281 | 0.499 |
| Pipeline Tools | Unstructured | 0.586 | 0.716 | 0.198 | 0.481 | 0.999 | 1 | 0 | 0.06 | – | – | 1 | 0.998 | 0.145 | 0.387 |
| Pipeline Tools | OpenParse | 0.646 | 0.814 | 0.681 | 0.974 | 0.996 | 1 | 0.11 | 0 | 64.8 | 27.5 | 0.284 | 0.639 | 0.595 | 0.641 |
| Expert VLMs | GOT-OCR | 0.287 | 0.411 | 0.189 | 0.315 | 0.360 | 0.528 | 74.3 | 45.3 | 53.2 | 47.2 | 0.459 | 0.520 | 0.141 | 0.280 |
| Expert VLMs | Nougat | 0.452 | 0.973 | 0.365 | 0.998 | 0.488 | 0.941 | 15.1 | 16.8 | 39.9 | 0 | 0.572 | 1.000 | 0.382 | 0.954 |
| Expert VLMs | Mistral OCR | 0.268 | 0.439 | 0.072 | 0.325 | 0.318 | 0.495 | 64.6 | 45.9 | 75.8 | 63.6 | 0.600 | 0.650 | 0.083 | 0.284 |
| Expert VLMs | OLMOCR-sglang | 0.326 | 0.469 | 0.097 | 0.293 | 0.455 | 0.655 | 74.3 | 43.2 | 68.1 | 61.3 | 0.608 | 0.652 | 0.145 | 0.277 |
| Expert VLMs | SmolDocling-256M | 0.493 | 0.816 | 0.262 | 0.838 | 0.753 | 0.997 | 32.1 | 0.55 | 44.9 | 16.5 | 0.729 | 0.907 | 0.227 | 0.522 |
| General VLMs | GPT4o | 0.233 | 0.399 | 0.144 | 0.409 | 0.425 | 0.606 | 72.8 | 42.8 | 72.0 | 62.9 | 0.234 | 0.329 | 0.128 | 0.251 |
| General VLMs | Qwen2.5-VL-7B | 0.312 | 0.406 | 0.157 | 0.228 | 0.351 | 0.574 | 79.0 | 50.2 | 76.4 | 72.2 | 0.588 | 0.619 | 0.149 | 0.203 |
| General VLMs | InternVL3-8B | 0.314 | 0.383 | 0.134 | 0.218 | 0.417 | 0.563 | 78.3 | 49.3 | 66.1 | 73.1 | 0.586 | 0.564 | 0.118 | 0.186 |
| Mix | MonkeyOCR-3B [Weight] | 0.140 | 0.297 | 0.058 | 0.185 | 0.238 | 0.506 | 78.7 | 51.4 | 80.2 | 77.7 | 0.170 | 0.253 | 0.093 | 0.244 |
| Mix | MonkeyOCR-3B* [Weight] | 0.154 | 0.277 | 0.073 | 0.134 | 0.255 | 0.529 | 78.5 | 50.8 | 78.2 | 76.2 | 0.182 | 0.262 | 0.105 | 0.183 |

A dash (–) marks a score that was not reported for that method.

As the table shows, MonkeyOCR-3B records the lowest English Overall Edit score (0.140), while MonkeyOCR-3B* records the lowest Chinese score (0.277), indicating superior end-to-end parsing accuracy in both languages. The same pattern holds for Text Edit: 0.058 in English (MonkeyOCR-3B) and 0.134 in Chinese (MonkeyOCR-3B*) are the best values in the table. For Formula Edit, MonkeyOCR-3B posts the lowest English score (0.238), though Mathpix leads in Chinese (0.454 vs 0.506). Its Formula CDM of 78.7 (English) and 51.4 (Chinese) is competitive with the strongest general VLMs, such as Qwen2.5-VL-7B at 79.0 (English). In Table TEDS, MonkeyOCR-3B takes the top scores of 80.2 (English) and 77.7 (Chinese) and also achieves the lowest Table Edit values (0.170 English, 0.253 Chinese), showcasing strong table parsing. For Read Order Edit, MonkeyOCR-3B* delivers the best Chinese score (0.183), while MinerU narrowly leads in English (0.079 vs MonkeyOCR-3B's 0.093).
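The "Edit" columns above are normalized edit distances (lower is better): the number of character-level insertions, deletions, and substitutions needed to turn the model's output into the ground truth, scaled to [0, 1]. As a minimal illustration of the underlying metric (not OmniDocBench's exact implementation):

```python
# Classic dynamic-programming Levenshtein distance, shown here only to
# illustrate what the benchmark's "Edit" scores measure.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_edit(pred, ref):
    # Normalize by the longer string so scores fall in [0, 1].
    return edit_distance(pred, ref) / max(len(pred), len(ref), 1)
```

A perfect transcription scores 0, and a completely wrong one approaches 1, which is why a drop from MinerU's 0.150 to MonkeyOCR-3B's 0.140 on English Overall Edit represents a real accuracy gain.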

Text Recognition Performance Across Nine PDF Page Types

The following table illustrates the text recognition performance of various models across nine PDF page types:

| Model Type | Models | Book | Slides | Financial Report | Textbook | Exam Paper | Magazine | Academic Papers | Notes | Newspaper | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | MinerU | 0.055 | 0.124 | 0.033 | 0.102 | 0.159 | 0.072 | 0.025 | 0.984 | 0.171 | 0.206 |
| Pipeline Tools | Marker | 0.074 | 0.340 | 0.089 | 0.319 | 0.452 | 0.153 | 0.059 | 0.651 | 0.192 | 0.274 |
| Pipeline Tools | Mathpix | 0.131 | 0.220 | 0.202 | 0.216 | 0.278 | 0.147 | 0.091 | 0.634 | 0.690 | 0.300 |
| Expert VLMs | GOT-OCR | 0.111 | 0.222 | 0.067 | 0.132 | 0.204 | 0.198 | 0.179 | 0.388 | 0.771 | 0.267 |
| Expert VLMs | Nougat | 0.734 | 0.958 | 1.000 | 0.820 | 0.930 | 0.830 | 0.214 | 0.991 | 0.871 | 0.806 |
| General VLMs | GPT4o | 0.157 | 0.163 | 0.348 | 0.187 | 0.281 | 0.173 | 0.146 | 0.607 | 0.751 | 0.316 |
| General VLMs | Qwen2.5-VL-7B | 0.148 | 0.053 | 0.111 | 0.137 | 0.189 | 0.117 | 0.134 | 0.204 | 0.706 | 0.205 |
| General VLMs | InternVL3-8B | 0.163 | 0.056 | 0.107 | 0.109 | 0.129 | 0.100 | 0.159 | 0.150 | 0.681 | 0.188 |
| Mix | MonkeyOCR-3B [Weight] | 0.046 | 0.120 | 0.024 | 0.100 | 0.129 | 0.086 | 0.024 | 0.643 | 0.131 | 0.155 |
| Mix | MonkeyOCR-3B* [Weight] | 0.054 | 0.203 | 0.038 | 0.112 | 0.138 | 0.111 | 0.032 | 0.194 | 0.136 | 0.120 |

For specific page types such as books, financial reports, textbooks, and exam papers, MonkeyOCR-3B achieves the lowest (or joint-lowest) edit distances: 0.046 on books, 0.024 on financial reports, 0.100 on textbooks, and 0.129 on exam papers (tied with InternVL3-8B). On overall performance, MonkeyOCR-3B* leads with an edit distance of 0.120, followed by MonkeyOCR-3B at 0.155, demonstrating strong text recognition across diverse document types.

MonkeyOCR Visualization Demo

To provide a more intuitive understanding of MonkeyOCR’s capabilities, the project team has launched an online demo platform at http://vlrlabmonkey.xyz:7685. You can follow these steps to experience MonkeyOCR’s document parsing power:

  1. Upload a PDF or image file: Click the upload button and select the document file you wish to parse.
  2. Parse the document: Click the “Parse” button, and MonkeyOCR will perform structure detection, content recognition, and relationship prediction on the input document, outputting the results in markdown format.
  3. Test prompts: You can select preset prompts and click the “Test by prompt” button to enable the model to recognize image content based on the selected prompt.

Through this demo platform, you can clearly observe how MonkeyOCR converts complex document content into structured markdown text, facilitating further editing and utilization.
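Because the demo emits plain markdown, the output is easy to post-process with a few lines of standard Python. As a sketch (assuming nothing about MonkeyOCR's output beyond it being markdown with `#`-style headings), here is a helper that splits the result into (heading, body) sections for further editing:

```python
# Split a markdown string into (heading, body) pairs.
# Works on any markdown that uses '#'-style headings, such as the
# parsing results produced by the demo above.
def split_sections(markdown_text):
    sections, heading, body = [], None, []
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            # Close out the previous section before starting a new one.
            if heading is not None or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body).strip()))
    return sections
```

Any text before the first heading is kept as a section with a `None` heading, so no content is lost.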

FAQs

What document types does MonkeyOCR support?

MonkeyOCR currently supports various Chinese and English document types, including books, slides, financial reports, textbooks, exam papers, magazines, academic papers, notes, and newspapers.

What are the advantages of MonkeyOCR over other document parsing tools?

MonkeyOCR adopts the unique SRR (Structure-Recognition-Relation) triplet paradigm, simplifying the multi-tool pipeline of traditional modular approaches and avoiding the inefficiency of large multimodal models when processing full-page documents. In terms of performance, MonkeyOCR achieves superior accuracy and efficiency compared to both pipeline-based tools and end-to-end models.

How can I download MonkeyOCR’s model weights?

You can download the model weight files for MonkeyOCR using the following commands:

pip install huggingface_hub
python download_model.py

What is the inference command for MonkeyOCR?

Ensure you are in the MonkeyOCR project root directory and run the following command to parse a specified PDF file:

python parse.py path/to/your.pdf

If you have downloaded the model weights and configuration file to a custom location, you can point the script at them explicitly:

python parse.py path/to/your.pdf -m model_weight/Recognition -c config.yaml

How do I start the Gradio demo for MonkeyOCR?

First, install Gradio and pdf2image in your environment using the following commands:

pip install gradio==5.23.3
pip install pdf2image==1.17.0

Then, launch the Gradio demo service with the following command:

python demo/demo_gradio.py

Does MonkeyOCR support handwritten documents?

Currently, MonkeyOCR does not support the parsing of handwritten documents. However, the project team is committed to enhancing this functionality in future updates.

Summary

MonkeyOCR, with its SRR triplet paradigm, stands out as a powerful document parsing tool in the digital era. It offers significant advantages in efficiency and accuracy, making it an excellent choice for document parsing tasks in academic research, business analysis, and daily office work. Through this article, I hope you have gained a comprehensive understanding of MonkeyOCR. If you are interested in document parsing technology, I encourage you to try this tool and experience its efficiency and effectiveness firsthand.

Should you encounter any issues while using MonkeyOCR or have any feedback or suggestions, you can contact the project team via the following email addresses:


  • Email: xbai@hust.edu.cn or ylliu@hust.edu.cn

I hope MonkeyOCR proves to be a valuable asset in your document parsing endeavors, enabling you to extract and utilize key information from documents more efficiently.