From PDF to Structured Notes: A Friendly, End-to-End Guide to dots.ocr

“I need to turn a 30-page research paper into editable Markdown—math, tables, and all—without spending the afternoon re-typing.”

dots.ocr answers with one sentence:
“Send us the page image and we’ll hand back every element—text, formulas, tables, reading order, and bounding boxes—in one shot.”

Everything below comes straight from the official sources; nothing has been added and nothing left out. By the end you will know:

  • When dots.ocr is the right tool
  • How to install it on your laptop or server in ten minutes
  • How to process anything from a single receipt to a thousand-page textbook
  • Where it shines and where it still falls short

1. What dots.ocr Actually Is—Explained in Plain English

Think of dots.ocr as a Swiss-army-knife vision-language model that does three jobs at once:

  1. Layout detection – finds every paragraph, title, figure, table, formula, header, footer, list item, caption, footnote.
  2. Content recognition – reads the text inside each element and formats it correctly (LaTeX for math, HTML for tables, Markdown for prose).
  3. Reading-order reconstruction – sorts everything the way a human would read it, top-to-bottom, left-to-right.

It does all this with one 1.7 billion-parameter model—no separate OCR engine, no table parser, no post-processing pipeline.
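
Concretely, every parsed element lands in one JSON record. The sketch below is illustrative: the "category" key is confirmed by the FAQ further down and the bbox coordinate format matches the --bbox flag in Section 6, but the exact field names are assumptions.

# One illustrative layout element (field names other than "category" are assumed)
element = {
    "bbox": [163, 241, 1536, 705],  # pixel coordinates: x1, y1, x2, y2
    "category": "Table",            # e.g. Text, Title, Formula, Table, Picture
    "text": "<table>...</table>",   # HTML for tables, LaTeX for formulas, Markdown for prose
}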


2. Quick-Glance Performance

Numbers come straight from the official OmniDocBench leaderboard and the in-house multilingual benchmark.

| Task | Metric (↓ lower is better, ↑ higher) | dots.ocr | Gemini2.5-Pro (closest rival) |
|------|--------------------------------------|----------|-------------------------------|
| End-to-end English | Edit distance ↓ | 0.125 | 0.148 |
| End-to-end Chinese | Edit distance ↓ | 0.160 | 0.212 |
| Table structure | TEDS ↑ | 88.6 | 85.8 |
| Reading order | Edit distance ↓ | 0.040 (EN) / 0.067 (ZH) | 0.049 / 0.121 |
| Multilingual (100 langs) | Overall edit distance ↓ | 0.177 | 0.251 |

In short, it outperforms much larger general models while running on a single GPU.


3. When You Should—and Shouldn’t—Use It

| Perfect Fit | Not Yet Ideal |
|-------------|---------------|
| Batch conversion of research papers, textbooks, financial reports | Ultra-complex merged-cell tables |
| Private, on-premise processing without cloud APIs | High-throughput online services |
| Low-resource languages (Tibetan, Kannada, Russian, Dutch, Arabic…) | Pixel-perfect figure captioning |
| Projects that need exact bounding boxes for downstream RAG | Very old, low-resolution scans with heavy noise |

4. Ten-Minute Local Setup

4.1 Create a Clean Environment

# 1. fresh Python 3.12
conda create -n dots_ocr python=3.12
conda activate dots_ocr

# 2. clone the repo
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr

# 3. install PyTorch (example uses CUDA 12.8)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 \
    --index-url https://download.pytorch.org/whl/cu128

# 4. install the project itself
pip install -e .
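
Before moving on, a quick sanity check that the CUDA build of PyTorch actually landed (a generic PyTorch check, nothing dots.ocr-specific):

import torch

# Expect "2.7.0+cu128" and True on a correctly configured GPU machine
print(torch.__version__)
print(torch.cuda.is_available())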

If your environment is messy, use the official Docker image instead; see the repository's README for the image and launch instructions.

4.2 Download Weights

# HuggingFace mirror
python3 tools/download_model.py

# ModelScope mirror (for China users)
python3 tools/download_model.py --type modelscope

Folder naming tip: Avoid periods. Use weights/DotsOCR, not weights/dots.ocr.


5. Launch the vLLM Server (Fastest Route)

All benchmark numbers use vLLM 0.9.1.

export hf_model_path=./weights/DotsOCR
export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH

# register the custom model
sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
from DotsOCR import modeling_dots_ocr_vllm' `which vllm`

# start server on GPU 0
CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --chat-template-content-format string \
    --served-model-name model \
    --trust-remote-code
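
Because vLLM exposes an OpenAI-compatible API, you can sanity-check the server before parsing anything. A minimal sketch, assuming the default port 8000 and the --served-model-name model flag used above:

from openai import OpenAI

# vLLM speaks the OpenAI protocol; the API key is required but ignored
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Should list exactly one model, named "model" as configured above
for m in client.models.list():
    print(m.id)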

In a second terminal, parse your first image:

python3 dots_ocr/parser.py demo/demo_image1.jpg

You’ll get three artifacts:

  • demo_image1.json – structured layout data
  • demo_image1.md – ready-to-use Markdown
  • demo_image1.jpg – image with bounding boxes drawn
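
The JSON file is the one you will script against. A minimal inspection pass, assuming the bbox/category/text fields sketched in Section 1:

import json

with open("demo_image1.json") as f:
    elements = json.load(f)

# Walk the elements in reading order; Picture elements carry no text (see FAQ Q2)
for el in elements:
    print(el["category"], el["bbox"], el.get("text", "")[:60])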

6. Three Prompts Cover 90% of Use Cases

| Goal | Command |
|------|---------|
| Detect + recognize everything | parser.py page.jpg |
| Layout boxes only (faster) | parser.py page.jpg --prompt prompt_layout_only_en |
| Plain OCR (skip headers/footers) | parser.py page.jpg --prompt prompt_ocr |
| OCR inside a specific box | parser.py page.jpg --prompt prompt_grounding_ocr --bbox 163 241 1536 705 |

7. Bulk PDF Processing

Single file:

python3 dots_ocr/parser.py dissertation.pdf --num_thread 64

Output is one trio of files per page. Merge the Markdown later:

cat dissertation/*.md > dissertation.md
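
One caveat: shell globs sort lexicographically, so page 10 would land before page 2. If your run produced more than nine pages, merge with a numeric sort instead (a sketch assuming the page number appears in each filename):

import re
from pathlib import Path

def page_key(p: Path) -> int:
    # Pull the trailing page number out of names like "dissertation_page_12.md"
    nums = re.findall(r"\d+", p.stem)
    return int(nums[-1]) if nums else 0

pages = sorted(Path("dissertation").glob("*.md"), key=page_key)
Path("dissertation.md").write_text("\n\n".join(p.read_text() for p in pages))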

8. Pure Transformers Route (No vLLM)

Handy for CPU debugging or custom Python scripts:

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "./weights/DotsOCR",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("./weights/DotsOCR", trust_remote_code=True)

# load the page image referenced in the chat message below
image = Image.open("demo/demo_image1.jpg")

# build chat-style prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo/demo_image1.jpg"},
            {"type": "text", "text": "Please output the layout information..."}
        ]
    }
]

# render the chat template to a string, then tokenize text and image together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=24000)
print(processor.batch_decode(out, skip_special_tokens=True))

9. Speed & Memory on Real Hardware

| Hardware | Mean latency (per page) | Peak memory |
|----------|-------------------------|-------------|
| RTX 4090 (24 GB) | 0.9 s | 10 GB VRAM |
| A100 (40 GB) | 0.6 s | 11 GB VRAM |
| CPU (i9-13900K) | 22 s | 32 GB RAM |

10. Multilingual Showcase (Official Bench, 1493 pages, 100 languages)

| Language | Edit distance ↓ |
|----------|-----------------|
| Tibetan | 0.083 |
| Russian | 0.046 |
| Dutch | 0.057 |
| Simplified Chinese | 0.066 |
| Arabic | 0.071 |

If you are building a corpus in any of these languages, the same checkpoint works out of the box.


11. Common Troubleshooting Cheat-Sheet

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| ModuleNotFoundError: DotsOCR | Weights folder name contains a period | Rename it to weights/DotsOCR |
| CUDA out of memory | Image too large | Lower --gpu-memory-utilization or shrink the DPI |
| Infinite ... in output | Special characters | Switch to prompt_layout_only_en |
| vLLM hangs | Port occupied | Relaunch with --port 8001 |

12. FAQ: 10 Questions Everyone Asks

Q1. Can it handle complicated merged-cell tables?
Not perfectly. For dense financial tables you may still need manual touch-up.

Q2. Does it read text inside figures?
Not yet. It tells you “this is a Picture” and gives you the box. Text extraction is on the roadmap.

Q3. How do I remove headers and footers?
Use prompt_ocr; the model skips Page-header and Page-footer classes automatically.

Q4. Does it run on Windows?
Yes, under WSL2 with CUDA drivers. Native Windows paths are trickier.

Q5. Any macOS support?
CPU-only for now. MPS backend is untested and slow.

Q6. Commercial use allowed?
Apache-2.0 license—yes, but verify training-data compliance yourself.

Q7. How do I extract only formulas?
Filter the JSON output on "category": "Formula".
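
For example (a minimal sketch over one page's JSON output; "category" is quoted above, while the "text" field name is an assumption):

import json

with open("page.json") as f:
    elements = json.load(f)

# Formula elements carry their content as LaTeX in the "text" field
formulas = [el["text"] for el in elements if el["category"] == "Formula"]
print("\n".join(formulas))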

Q8. Is the LaTeX output reliable?
Within 0.05 Edit Distance of Doubao-1.5 on math benchmarks—good enough for academic writing.

Q9. Can it read century-old newspapers?
Yes, but noisy originals need image pre-processing (denoise, sharpen).
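
A minimal pre-processing sketch with Pillow (generic denoise + sharpen, not a dots.ocr API):

from PIL import Image, ImageFilter

# Median filter knocks out salt-and-pepper noise; unsharp mask restores stroke edges
img = Image.open("old_newspaper.jpg").convert("L")
img = img.filter(ImageFilter.MedianFilter(size=3))
img = img.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))
img.save("old_newspaper_clean.png")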

Q10. Will larger models be released?
Roadmap mentions a “more powerful general perception model”; no size or date yet.


13. Roadmap & Known Limitations (Quoted Verbatim)

  • Complex tables and formulas – still not perfect.
  • Pictures – detected but not parsed.
  • Very high character-to-pixel ratio – upsample to 200 DPI, keep total pixels < 11,289,600.
  • Throughput – not yet optimized for large-scale online services.

Future work: better table/formula accuracy, figure text extraction, and a unified perception VLM.
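
A minimal sketch of the resolution rule quoted above: scale a page down proportionally whenever it exceeds the 11,289,600-pixel cap (the helper name is mine):

from PIL import Image

MAX_PIXELS = 11_289_600  # cap quoted in the limitations above

def fit_under_cap(img: Image.Image) -> Image.Image:
    # Downscale proportionally if the page exceeds the pixel cap
    w, h = img.size
    if w * h <= MAX_PIXELS:
        return img
    scale = (MAX_PIXELS / (w * h)) ** 0.5
    return img.resize((int(w * scale), int(h * scale)), Image.Resampling.LANCZOS)

fit_under_cap(Image.open("page_at_200dpi.png")).save("page_ready.png")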


14. Quick Links

  • Source code & weights: https://github.com/rednote-hilab/dots.ocr
  • Live demo: https://dotsocr.xiaohongshu.com

Happy parsing—may your PDFs never again require manual re-typing.