From PDF to Structured Notes: A Friendly, End-to-End Guide to dots.ocr
"I need to turn a 30-page research paper into editable Markdown, math, tables and all, without spending the afternoon re-typing."

dots.ocr answers with one sentence:

"Send us the page image and we'll hand back every element in one shot: text, formulas, tables, reading order, and bounding boxes."
Below is a walkthrough based entirely on the official sources, with nothing added and nothing left out. By the end you will know:

- When dots.ocr is the right tool
- How to install it on your laptop or server in ten minutes
- How to process anything from a single receipt to a thousand-page textbook
- Where it shines and where it still falls short
1. What dots.ocr Actually Is—Explained in Plain English
Think of dots.ocr as a Swiss-army-knife vision-language model that does three jobs at once:
- Layout detection – finds every paragraph, title, figure, table, formula, header, footer, list item, caption, and footnote.
- Content recognition – reads the text inside each element and formats it correctly (LaTeX for math, HTML for tables, Markdown for prose).
- Reading-order reconstruction – sorts everything the way a human would read it: top-to-bottom, left-to-right.
It does all this with one 1.7 billion-parameter model—no separate OCR engine, no table parser, no post-processing pipeline.
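To make the three jobs concrete, here is an illustrative sketch of a single parsed element. The `bbox`/`category`/`text` field names are an assumption about the shape of the JSON artifact shown in section 5; the values are invented.

```python
# One parsed element (illustrative; field names assumed, values invented).
element = {
    "bbox": [84, 132, 512, 190],  # job 1: layout detection, pixel coordinates
    "category": "Formula",        # one of the layout classes listed above
    "text": "E = mc^2",           # job 2: content recognition (LaTeX for math)
}
# Job 3: the elements arrive already sorted in human reading order.
```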
2. Quick-Glance Performance
The headline numbers come straight from the official OmniDocBench leaderboard and the in-house multilingual benchmark. In short, dots.ocr outperforms much larger general models while running on a single GPU.
3. When You Should—and Shouldn’t—Use It
4. Ten-Minute Local Setup
4.1 Create a Clean Environment
```bash
# 1. fresh Python 3.12
conda create -n dots_ocr python=3.12
conda activate dots_ocr

# 2. clone the repo
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr

# 3. install PyTorch (example uses CUDA 12.8)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 \
    --index-url https://download.pytorch.org/whl/cu128

# 4. install the project itself
pip install -e .
```
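Before moving on, a quick sanity check that the CUDA wheel installed correctly (nothing here is dots.ocr-specific):

```python
import torch

# Confirm the PyTorch build and that a GPU is visible.
print(torch.__version__)          # expect 2.7.0
print(torch.cuda.is_available())  # expect True on a CUDA machine
```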
If your environment is messy, use the official Docker image instead; the repository's README documents the exact pull and run commands.
4.2 Download Weights
```bash
# HuggingFace mirror
python3 tools/download_model.py

# ModelScope mirror (for China users)
python3 tools/download_model.py --type modelscope
```
Folder naming tip: avoid periods in the directory name. Use `weights/DotsOCR`, not `weights/dots.ocr`.
5. Launch the vLLM Server (Fastest Route)
All benchmark numbers use vLLM 0.9.1.
```bash
export hf_model_path=./weights/DotsOCR
export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH

# register the custom model by patching the vllm CLI entry script
sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
from DotsOCR import modeling_dots_ocr_vllm' `which vllm`

# start server on GPU 0
CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --chat-template-content-format string \
    --served-model-name model \
    --trust-remote-code
```
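Before sending any pages, you can confirm the server is alive. This is a minimal sketch against vLLM's OpenAI-compatible API; port 8000 is the vLLM default, so adjust if you changed it:

```python
import requests

# Listing models is a cheap liveness check for the vLLM server.
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())  # should list the model served as "model" above
```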
In a second terminal, parse your first image:
```bash
python3 dots_ocr/parser.py demo/demo_image1.jpg
```
You’ll get three artifacts:
- `demo_image1.json` – structured layout data
- `demo_image1.md` – ready-to-use Markdown
- `demo_image1.jpg` – the image with bounding boxes drawn
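A minimal sketch of consuming the JSON artifact programmatically, assuming the list-of-element-dicts shape illustrated in section 1:

```python
import json

# Walk the parsed elements in reading order and print a quick outline.
with open("demo_image1.json", encoding="utf-8") as f:
    elements = json.load(f)

for el in elements:
    x1, y1, x2, y2 = el["bbox"]
    snippet = el.get("text", "").replace("\n", " ")[:60]
    print(f'{el["category"]:12} [{x1},{y1},{x2},{y2}] {snippet}')
```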
6. Three Prompts Cover 90 % of Use Cases
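The prompt decides what comes back; the repository ships named prompt modes, including `prompt_ocr` (plain text, mentioned in the FAQ below). As a sketch of driving the vLLM server from section 5 with a prompt of your choice: the OpenAI-style payload below follows vLLM's standard API, and the prompt text is the abbreviated placeholder from section 8, so substitute the full official prompt string for the mode you want.

```python
import base64
import requests

# Encode the page image for the OpenAI-style vision payload.
with open("demo/demo_image1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "model",  # matches --served-model-name in section 5
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            # Swap in the full prompt string for the mode you want.
            {"type": "text", "text": "Please output the layout information..."},
        ],
    }],
    "max_tokens": 16384,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```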
7. Bulk PDF Processing
Single file:

```bash
python3 dots_ocr/parser.py dissertation.pdf --num_thread 64
```

Output is one trio of files per page. Merge the Markdown later:

```bash
cat dissertation/*.md > dissertation.md
```
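One caveat on that `cat`: a shell glob sorts `page_10` before `page_2`, so for documents past nine pages a numeric sort is safer. A minimal sketch in Python (the per-page filenames are an assumption; adjust the pattern to what the parser actually emits):

```python
import re
from pathlib import Path

def page_key(path: Path) -> int:
    """Sort key: the last number in the filename, i.e. the page number."""
    nums = re.findall(r"\d+", path.stem)
    return int(nums[-1]) if nums else 0

# Merge the per-page Markdown files in true page order.
pages = sorted(Path("dissertation").glob("*.md"), key=page_key)
merged = "\n\n".join(p.read_text(encoding="utf-8") for p in pages)
Path("dissertation.md").write_text(merged, encoding="utf-8")
```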
8. Pure Transformers Route (No vLLM)
Handy for CPU debugging or custom Python scripts:
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "./weights/DotsOCR",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("./weights/DotsOCR", trust_remote_code=True)

# build a chat-style prompt around the page image
image = Image.open("demo/demo_image1.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo/demo_image1.jpg"},
            {"type": "text", "text": "Please output the layout information..."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=24000)
# strip the prompt tokens so only the model's answer is decoded
answer = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(answer, skip_special_tokens=True)[0])
```
9. Speed & Memory on Real Hardware
10. Multilingual Showcase (Official Bench, 1493 pages, 100 languages)
If you are building a corpus in any of these languages, the same checkpoint works out of the box.
11. Common Troubleshooting Cheat-Sheet
12. FAQ: 10 Questions Everyone Asks
Q1. Can it handle complicated merged-cell tables?
Not perfectly. For dense financial tables you may still need manual touch-up.
Q2. Does it read text inside figures?
Not yet. It tells you “this is a Picture” and gives you the box. Text extraction is on the roadmap.
Q3. How do I remove headers and footers?
Use `prompt_ocr`; the model skips the Page-header and Page-footer classes automatically.
Q4. Does it run on Windows?
Yes, under WSL2 with CUDA drivers. Native Windows paths are trickier.
Q5. Any macOS support?
CPU-only for now. MPS backend is untested and slow.
Q6. Commercial use allowed?
Apache-2.0 license—yes, but verify training-data compliance yourself.
Q7. How do I extract only formulas?
Filter the JSON output on `"category": "Formula"`.
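For example, a minimal sketch against the per-page JSON (same element shape as in section 1):

```python
import json

# Collect every formula element from one parsed page.
with open("demo_image1.json", encoding="utf-8") as f:
    elements = json.load(f)

formulas = [el["text"] for el in elements if el["category"] == "Formula"]
print("\n".join(formulas))  # LaTeX, one formula per line
```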
Q8. Is the LaTeX output reliable?
Within 0.05 Edit Distance of Doubao-1.5 on math benchmarks—good enough for academic writing.
Q9. Can it read century-old newspapers?
Yes, but noisy originals need image pre-processing (denoise, sharpen).
Q10. Will larger models be released?
Roadmap mentions a “more powerful general perception model”; no size or date yet.
13. Roadmap & Known Limitations (Quoted Verbatim)
- Complex tables and formulas – still not perfect.
- Pictures – detected but not parsed.
- Very high character-to-pixel ratio – upsample to 200 DPI, keep total pixels < 11,289,600.
- Throughput – not yet optimized for large-scale online services.
Future work: better table/formula accuracy, figure text extraction, and a unified perception VLM.
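On the third limitation, here is a minimal sketch of the suggested pre-processing, assuming Pillow and a 72-DPI source scan. The 200-DPI target and the 11,289,600-pixel cap come from the list above; the filenames and resampling filter are my own choices:

```python
from PIL import Image

MAX_PIXELS = 11_289_600  # documented cap on total pixels

img = Image.open("scan.jpg")  # hypothetical low-DPI scan
scale = 200 / 72              # upsample a 72-DPI scan toward 200 DPI
w, h = int(img.width * scale), int(img.height * scale)
if w * h > MAX_PIXELS:        # stay under the pixel budget
    shrink = (MAX_PIXELS / (w * h)) ** 0.5
    w, h = int(w * shrink), int(h * shrink)
img.resize((w, h), Image.LANCZOS).save("scan_200dpi.png")
```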
14. Quick Links
- Source code & weights: https://github.com/rednote-hilab/dots.ocr
- Live demo: https://dotsocr.xiaohongshu.com
Happy parsing—may your PDFs never again require manual re-typing.