From PDF to Structured Notes: A Friendly, End-to-End Guide to dots.ocr

“I need to turn a 30-page research paper into editable Markdown—math, tables, and all—without spending the afternoon re-typing.”

dots.ocr answers with one sentence:
“Send us the page image and we’ll hand back every element—text, formulas, tables, reading order, and bounding boxes—in one shot.”

Everything below comes straight from the official sources; nothing has been added and nothing left out. By the end you will know:

  • When dots.ocr is the right tool
  • How to install it on your laptop or server in ten minutes
  • How to process anything from a single receipt to a thousand-page textbook
  • Where it shines and where it still falls short

1. What dots.ocr Actually Is—Explained in Plain English

Think of dots.ocr as a Swiss-army-knife vision-language model that does three jobs at once:

  1. Layout detection – finds every paragraph, title, figure, table, formula, header, footer, list item, caption, footnote.
  2. Content recognition – reads the text inside each element and formats it correctly (LaTeX for math, HTML for tables, Markdown for prose).
  3. Reading-order reconstruction – sorts everything the way a human would read it, top-to-bottom, left-to-right.

It does all this with one 1.7 billion-parameter model—no separate OCR engine, no table parser, no post-processing pipeline.
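
Concretely, every parsed element lands in one JSON record. The sketch below is illustrative: the "category" key is confirmed by the FAQ further down and the bbox coordinate format matches the --bbox flag in Section 6, but the exact field names are assumptions.

# One illustrative layout element (field names other than "category" are assumed)
element = {
    "bbox": [163, 241, 1536, 705],  # pixel coordinates: x1, y1, x2, y2
    "category": "Table",            # e.g. Text, Title, Formula, Table, Picture
    "text": "<table>...</table>",   # HTML for tables, LaTeX for formulas, Markdown for prose
}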


2. Quick-Glance Performance

Numbers come straight from the official OmniDocBench leaderboard and the in-house multilingual benchmark.

| Task | Metric (↓ lower is better, ↑ higher) | dots.ocr | Gemini2.5-Pro (closest rival) |
|------|--------------------------------------|----------|-------------------------------|
| End-to-end English | Edit distance ↓ | 0.125 | 0.148 |
| End-to-end Chinese | Edit distance ↓ | 0.160 | 0.212 |
| Table structure | TEDS ↑ | 88.6 | 85.8 |
| Reading order | Edit distance ↓ | 0.040 (EN) / 0.067 (ZH) | 0.049 / 0.121 |
| Multilingual (100 langs) | Overall edit distance ↓ | 0.177 | 0.251 |

In short, it outperforms much larger general models while running on a single GPU.


3. When You Should—and Shouldn’t—Use It

| Perfect Fit | Not Yet Ideal |
|-------------|---------------|
| Batch conversion of research papers, textbooks, financial reports | Ultra-complex merged-cell tables |
| Private, on-premise processing without cloud APIs | High-throughput online services |
| Low-resource languages (Tibetan, Kannada, Russian, Dutch, Arabic…) | Pixel-perfect figure captioning |
| Projects that need exact bounding boxes for downstream RAG | Very old, low-resolution scans with heavy noise |

4. Ten-Minute Local Setup

4.1 Create a Clean Environment

# 1. fresh Python 3.12
conda create -n dots_ocr python=3.12
conda activate dots_ocr

# 2. clone the repo
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr

# 3. install PyTorch (example uses CUDA 12.8)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 \
    --index-url https://download.pytorch.org/whl/cu128

# 4. install the project itself
pip install -e .
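
Before moving on, a quick sanity check that the CUDA build of PyTorch actually landed (a generic PyTorch check, nothing dots.ocr-specific):

import torch

# Expect "2.7.0+cu128" and True on a correctly configured GPU machine
print(torch.__version__)
print(torch.cuda.is_available())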

If your environment is messy, use the official Docker image instead; see the repository's README for the image and launch instructions.

4.2 Download Weights

# HuggingFace mirror
python3 tools/download_model.py

# ModelScope mirror (for China users)
python3 tools/download_model.py --type modelscope

Folder naming tip: Avoid periods. Use weights/DotsOCR, not weights/dots.ocr.


5. Launch the vLLM Server (Fastest Route)

All benchmark numbers use vLLM 0.9.1.

export hf_model_path=./weights/DotsOCR
export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH

# register the custom model
sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
from DotsOCR import modeling_dots_ocr_vllm' `which vllm`

# start server on GPU 0
CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --chat-template-content-format string \
    --served-model-name model \
    --trust-remote-code
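
Because vLLM exposes an OpenAI-compatible API, you can sanity-check the server before parsing anything. A minimal sketch, assuming the default port 8000 and the --served-model-name model flag used above:

from openai import OpenAI

# vLLM speaks the OpenAI protocol; the API key is required but ignored
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Should list exactly one model, named "model" as configured above
for m in client.models.list():
    print(m.id)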

In a second terminal, parse your first image:

python3 dots_ocr/parser.py demo/demo_image1.jpg

You’ll get three artifacts:

  • demo_image1.json – structured layout data
  • demo_image1.md – ready-to-use Markdown
  • demo_image1.jpg – image with bounding boxes drawn
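
The JSON file is the one you will script against. A minimal inspection pass, assuming the bbox/category/text fields sketched in Section 1:

import json

with open("demo_image1.json") as f:
    elements = json.load(f)

# Walk the elements in reading order; Picture elements carry no text (see FAQ Q2)
for el in elements:
    print(el["category"], el["bbox"], el.get("text", "")[:60])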

6. Three Prompts Cover 90% of Use Cases

| Goal | Command |
|------|---------|
| Detect + recognize everything | parser.py page.jpg |
| Layout boxes only (faster) | parser.py page.jpg --prompt prompt_layout_only_en |
| Plain OCR (skip headers/footers) | parser.py page.jpg --prompt prompt_ocr |
| OCR inside a specific box | parser.py page.jpg --prompt prompt_grounding_ocr --bbox 163 241 1536 705 |

7. Bulk PDF Processing

Single file:

python3 dots_ocr/parser.py dissertation.pdf --num_thread 64

Output is one trio of files per page. Merge the Markdown later:

cat dissertation/*.md > dissertation.md
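
One caveat: shell globs sort lexicographically, so page 10 would land before page 2. If your run produced more than nine pages, merge with a numeric sort instead (a sketch assuming the page number appears in each filename):

import re
from pathlib import Path

def page_key(p: Path) -> int:
    # Pull the trailing page number out of names like "dissertation_page_12.md"
    nums = re.findall(r"\d+", p.stem)
    return int(nums[-1]) if nums else 0

pages = sorted(Path("dissertation").glob("*.md"), key=page_key)
Path("dissertation.md").write_text("\n\n".join(p.read_text() for p in pages))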

8. Pure Transformers Route (No vLLM)

Handy for CPU debugging or custom Python scripts:

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "./weights/DotsOCR",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("./weights/DotsOCR", trust_remote_code=True)

# load the page image referenced in the chat message below
image = Image.open("demo/demo_image1.jpg")

# build chat-style prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo/demo_image1.jpg"},
            {"type": "text", "text": "Please output the layout information..."}
        ]
    }
]

# render the chat template to a string, then tokenize text and image together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=24000)
print(processor.batch_decode(out, skip_special_tokens=True))

9. Speed & Memory on Real Hardware

| Hardware | Mean latency (per page) | Peak memory |
|----------|-------------------------|-------------|
| RTX 4090 (24 GB) | 0.9 s | 10 GB VRAM |
| A100 (40 GB) | 0.6 s | 11 GB VRAM |
| CPU (i9-13900K) | 22 s | 32 GB RAM |

10. Multilingual Showcase (Official Bench, 1493 pages, 100 languages)

| Language | Edit distance ↓ |
|----------|-----------------|
| Tibetan | 0.083 |
| Russian | 0.046 |
| Dutch | 0.057 |
| Simplified Chinese | 0.066 |
| Arabic | 0.071 |

If you are building a corpus in any of these languages, the same checkpoint works out of the box.


11. Common Troubleshooting Cheat-Sheet

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| ModuleNotFoundError: DotsOCR | Weights folder name contains a period | Rename it to weights/DotsOCR |
| CUDA out of memory | Image too large | Lower --gpu-memory-utilization or shrink the DPI |
| Infinite ... in output | Special characters | Switch to prompt_layout_only_en |
| vLLM hangs | Port occupied | Relaunch with --port 8001 |

12. FAQ: 10 Questions Everyone Asks

Q1. Can it handle complicated merged-cell tables?
Not perfectly. For dense financial tables you may still need manual touch-up.

Q2. Does it read text inside figures?
Not yet. It tells you “this is a Picture” and gives you the box. Text extraction is on the roadmap.

Q3. How do I remove headers and footers?
Use prompt_ocr; the model skips Page-header and Page-footer classes automatically.

Q4. Does it run on Windows?
Yes, under WSL2 with CUDA drivers. Native Windows paths are trickier.

Q5. Any macOS support?
CPU-only for now. MPS backend is untested and slow.

Q6. Commercial use allowed?
Apache-2.0 license—yes, but verify training-data compliance yourself.

Q7. How do I extract only formulas?
Filter the JSON output on "category": "Formula".
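
For example (a minimal sketch over one page's JSON output; "category" is quoted above, while the "text" field name is an assumption):

import json

with open("page.json") as f:
    elements = json.load(f)

# Formula elements carry their content as LaTeX in the "text" field
formulas = [el["text"] for el in elements if el["category"] == "Formula"]
print("\n".join(formulas))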

Q8. Is the LaTeX output reliable?
Within 0.05 Edit Distance of Doubao-1.5 on math benchmarks—good enough for academic writing.

Q9. Can it read century-old newspapers?
Yes, but noisy originals need image pre-processing (denoise, sharpen).
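
A minimal pre-processing sketch with Pillow (generic denoise + sharpen, not a dots.ocr API):

from PIL import Image, ImageFilter

# Median filter knocks out salt-and-pepper noise; unsharp mask restores stroke edges
img = Image.open("old_newspaper.jpg").convert("L")
img = img.filter(ImageFilter.MedianFilter(size=3))
img = img.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))
img.save("old_newspaper_clean.png")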

Q10. Will larger models be released?
Roadmap mentions a “more powerful general perception model”; no size or date yet.


13. Roadmap & Known Limitations (Quoted Verbatim)

  • Complex tables and formulas – still not perfect.
  • Pictures – detected but not parsed.
  • Very high character-to-pixel ratio – upsample to 200 DPI, keep total pixels < 11,289,600.
  • Throughput – not yet optimized for large-scale online services.

Future work: better table/formula accuracy, figure text extraction, and a unified perception VLM.
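
A minimal sketch of the resolution rule quoted above: scale a page down proportionally whenever it exceeds the 11,289,600-pixel cap (the helper name is mine):

from PIL import Image

MAX_PIXELS = 11_289_600  # cap quoted in the limitations above

def fit_under_cap(img: Image.Image) -> Image.Image:
    # Downscale proportionally if the page exceeds the pixel cap
    w, h = img.size
    if w * h <= MAX_PIXELS:
        return img
    scale = (MAX_PIXELS / (w * h)) ** 0.5
    return img.resize((int(w * scale), int(h * scale)), Image.Resampling.LANCZOS)

fit_under_cap(Image.open("page_at_200dpi.png")).save("page_ready.png")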


14. Quick Links

  • Source code & weights: https://github.com/rednote-hilab/dots.ocr
  • Live demo: https://dotsocr.xiaohongshu.com

Happy parsing—may your PDFs never again require manual re-typing.