The Challenge of Modern Document Conversion
In our increasingly digital world, the ability to accurately convert physical documents into editable digital formats has become essential. From academic research papers and technical manuals to financial reports and legal documents, we regularly encounter materials that contain complex elements like multi-column layouts, structured tables, and mathematical formulas.
Traditional approaches to this problem have typically followed one of two paths:
- Pipeline methods that combine multiple specialized tools
- End-to-end models trained through knowledge distillation from larger models
Both approaches have significant limitations. Pipeline methods require stitching together different components for text recognition, table extraction, and formula parsing, often resulting in inconsistent output and error accumulation. Distillation-based methods, while more integrated, inherently limit the student model to the capabilities of the teacher model and can propagate errors through the system.
POINTS-Reader represents a fundamental shift in approach—a vision-language model specifically designed for document conversion that completely eliminates the need for distillation training while achieving state-of-the-art performance.
What Makes POINTS-Reader Different?
POINTS-Reader introduces a novel two-stage framework that generates high-quality training data without relying on existing models for distillation. This approach addresses the core limitations of previous methods:
- No performance ceiling: Unlike distillation, where student models can never exceed teacher capabilities
- No error propagation: Avoids inheriting and amplifying mistakes from teacher models
- Greater innovation potential: Enables true architectural and methodological advances
The model builds upon the POINTS1.5 architecture but replaces Qwen2.5-7B-Instruct with the more efficient Qwen2.5-3B-Instruct, maintaining performance while improving throughput. It uses a 600M parameter NaViT visual encoder, carefully chosen to balance capability with computational efficiency.
The Two-Stage Training Framework
Stage 1: Uniform Format Warm-up
The first stage addresses a fundamental challenge in document conversion: diverse elements require different output formats. POINTS-Reader establishes consistent formatting rules:
- Plain text: Rendered in Markdown format
- Tables: Represented using HTML, chosen for its ability to handle complex structures like merged cells
- Mathematical formulas: Expressed in LaTeX syntax, with separate inline and display notation
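To make this concrete, here is a fabricated miniature example of what converted output could look like for a page mixing all three element types (the exact inline math delimiters are an assumption on our part, not confirmed model output):

## 3. Results
The quarterly growth rate is computed as \( g = (v_t - v_{t-1}) / v_{t-1} \).
<table>
  <tr><th>Quarter</th><th>Revenue</th></tr>
  <tr><td>Q1</td><td>1.2M</td></tr>
  <tr><td>Q2</td><td>1.4M</td></tr>
</table>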
With these unified formats established, the system generates diverse synthetic data through a sophisticated process:
- Content generation: A large language model creates text based on carefully designed prompts across four categories:
  - Plain text only
  - Text with mathematical formulas
  - Text with tables
  - Multi-column layouts with tables
- Quality filtering: Rule-based filters ensure generated tables and formulas meet structural and syntactic requirements
- Rendering: Filtered content is converted to HTML and rendered into images using Chrome’s headless mode (see the rendering sketch below)
- Training: The resulting image-text pairs fine-tune a general vision-language model
This process generates over 200,000 samples per category, creating a diverse dataset that teaches the model to output document elements in a consistent format.
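The rendering step is straightforward to reproduce. Below is a minimal sketch, assuming a locally installed Chrome binary, that rasterizes an HTML file from Python; it illustrates the technique rather than reproducing the authors’ actual pipeline, and the window size and paths are placeholders.

import subprocess
from pathlib import Path

def render_html_to_png(html_file: str, out_png: str,
                       width: int = 1240, height: int = 1754) -> None:
    """Rasterize an HTML document with Chrome's headless mode.

    1240x1754 px roughly matches an A4 page at 150 DPI; adjust as needed.
    The binary may be named chrome, chromium, or google-chrome on your system.
    """
    subprocess.run([
        'google-chrome',
        '--headless',
        '--disable-gpu',
        f'--window-size={width},{height}',
        f'--screenshot={out_png}',
        Path(html_file).resolve().as_uri(),  # file:// URL of the page
    ], check=True)

render_html_to_png('sample.html', 'sample.png')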
Stage 2: Iterative Self-Improvement
While synthetic data enables large-scale training, its distribution often differs from real-world documents. The second stage bridges this gap through an iterative self-improvement process:
- Inference: The model from Stage 1 processes real documents from the DocMatrix dataset (over 2 million document images)
- Filtering: Multiple validation strategies automatically identify and retain high-quality outputs
- Retraining: The filtered data is used to train an improved model version
- Repetition: The process repeats through multiple iterations, progressively enhancing both model performance and data quality
This approach creates a virtuous cycle where each iteration produces better training data, which in turn trains a better model.
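In outline, the loop looks like the sketch below; the three callables stand in for components described above and are not actual POINTS-Reader APIs.

from typing import Callable, Iterable

def self_improve(model, documents: Iterable, infer: Callable,
                 passes_filters: Callable, train: Callable,
                 num_iterations: int = 3):
    """Schematic Stage 2 loop: annotate, filter, retrain, repeat."""
    for _ in range(num_iterations):
        # Inference: annotate real documents with the current model
        pairs = [(doc, infer(model, doc)) for doc in documents]
        # Filtering: keep only outputs that pass the validation checks
        clean = [(doc, out) for doc, out in pairs if passes_filters(out)]
        # Retraining: fine-tune on the filtered pairs to obtain a better model
        model = train(model, clean)
    return model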
Advanced Data Filtering Strategies
The success of the self-improvement stage hinges on effective filtering strategies that ensure only high-quality data is retained for training.
Text Filtering Using F1-Score
For plain text extraction, the system employs a sophisticated F1-score based approach:
- Traditional OCR (PaddleOCR) extracts reference text from images
- Both model predictions and references are normalized by removing non-alphanumeric characters and splitting on spaces
- The system counts occurrences of each token and calculates precision, recall, and F1-score
- Samples with F1-scores below a threshold (typically 0.9) are discarded
This approach effectively identifies and removes samples with hallucinations, repetitions, or omissions—common issues in complex layout parsing.
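A minimal sketch of this filter for English text follows; the helper names are ours, and the reference string is assumed to come from PaddleOCR as described above.

import re
from collections import Counter

def tokenize(text: str) -> list:
    # Normalize: replace non-alphanumeric characters with spaces, then split
    return re.sub(r'[^0-9a-zA-Z]+', ' ', text).lower().split()

def f1_score(prediction: str, reference: str) -> float:
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    if not pred or not ref:
        return 0.0
    overlap = sum((pred & ref).values())  # shared token occurrences
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def keep_sample(model_output: str, ocr_reference: str,
                threshold: float = 0.9) -> bool:
    # Discard samples whose token-level F1 falls below the threshold
    return f1_score(model_output, ocr_reference) >= threshold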
Structural Validation for Tables
Rather than relying on potentially unreliable table structure recognition models, POINTS-Reader implements a rule-based approach focused on structural validity:
- Verification of consistent cell counts across rows and columns
- Validation of HTML syntax compliance
- Removal of samples with invalid table structures
This method ensures that retained tables are structurally sound, even if the content isn’t explicitly validated.
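Using only Python’s standard library, a simplified version of such a check might look like the following; rowspan handling is omitted for brevity, so some valid merged-cell tables would be rejected.

from html.parser import HTMLParser

class TableChecker(HTMLParser):
    """Records the effective cell count of each table row, expanding
    colspan so horizontally merged cells are counted correctly."""
    def __init__(self):
        super().__init__()
        self.rows, self._cells, self._in_row = [], 0, False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._in_row, self._cells = True, 0
        elif tag in ('td', 'th') and self._in_row:
            span = dict(attrs).get('colspan') or '1'
            self._cells += int(span) if span.isdigit() else 1

    def handle_endtag(self, tag):
        if tag == 'tr' and self._in_row:
            self.rows.append(self._cells)
            self._in_row = False

def table_is_valid(table_html: str) -> bool:
    checker = TableChecker()
    checker.feed(table_html)
    # Structurally sound: at least one row, identical width for every row
    return bool(checker.rows) and len(set(checker.rows)) == 1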
Syntax Checking for Mathematical Formulas
For mathematical formulas, the filtering process focuses exclusively on syntactic correctness:
- Extraction of all formulas from model outputs
- LaTeX syntax validation
- Discarding of samples containing invalid formulas
This approach recognizes that while semantic validation would be ideal, syntactic correctness is a practical and effective filter for improving formula recognition.
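One way to approximate this check is with the pylatexenc package; the delimiters in the regular expression below are an assumption about the model’s output notation, not confirmed by the source.

import re
from pylatexenc.latexwalker import LatexWalker, LatexWalkerParseError

# Assumed inline \( ... \) and display \[ ... \] delimiters
FORMULA_PATTERN = re.compile(r'\\\((.+?)\\\)|\\\[(.+?)\\\]', re.DOTALL)

def formulas_are_valid(model_output: str) -> bool:
    for match in FORMULA_PATTERN.finditer(model_output):
        formula = match.group(1) or match.group(2)
        try:
            # Strict parsing raises on unbalanced braces, malformed macros, etc.
            LatexWalker(formula, tolerant_parsing=False).get_latex_nodes()
        except LatexWalkerParseError:
            return False
    return True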
Performance Excellence: Benchmark Results
POINTS-Reader demonstrates impressive performance across multiple benchmarks, outperforming both pipeline methods and other vision-language models.
OmniDocBench Results
On the comprehensive OmniDocBench evaluation, which assesses performance across text, formulas, tables, and reading order, POINTS-Reader achieves:
- Overall score: 0.259 (lower is better)
- Text extraction: 0.176 edit distance
- Formula recognition: 0.383 edit distance
- Table extraction: 0.335 edit distance
- Reading order: 0.144 edit distance
These results are particularly impressive when compared to other approaches:
- Vs. pipeline methods: POINTS-Reader outperforms MinerU (0.150), Marker (0.336), and Mathpix (0.191)
- Vs. general VLMs: It surpasses Qwen2.5-VL-72B (0.214) despite being significantly smaller
- Vs. expert OCR models: It exceeds GOT-OCR (0.287), Nougat (0.452), and Mistral OCR (0.268)
Specialized Strengths
POINTS-Reader shows particular excellence in table recognition, outperforming GOT-OCR by 0.197 on the table metric. This strength stems from its use of HTML for table representation, which better handles complex structures compared to the Markdown format used by many other approaches.
The model also supports both English and Chinese documents, with scores of 0.133 for English and 0.212 for Chinese on OmniDocBench—impressive results that demonstrate its multilingual capabilities.
Practical Implementation and Usage
Installation and Setup
POINTS-Reader requires the following environment configuration:
python==3.10.12
torch==2.5.1
transformers==4.55.2
cuda==12.1
First, install the WePOINTS library:
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
Basic Usage with Transformers
The model can be loaded and run with the Transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch

# Recommended prompt for optimal performance
prompt = (
    'Please extract all the text from the image with the following requirements:\n'
    '1. Return tables in HTML format.\n'
    '2. Return all other text in Markdown format.'
)

image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-Reader'

# Load the model, tokenizer, and image processor from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             torch_dtype=torch.float16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)

# A single user turn containing the image and the instruction
content = [
    dict(type='image', image=image_path),
    dict(type='text', text=prompt)
]
messages = [
    {
        'role': 'user',
        'content': content
    }
]

generation_config = {
    'max_new_tokens': 2048,
    'repetition_penalty': 1.05,
    'temperature': 0.7,
    'top_p': 0.8,
    'top_k': 20,
    'do_sample': True
}

response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
For issues with repetitive output, try increasing the input image resolution.
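With the Qwen2-VL image processor used above, input resolution is controlled through its min_pixels and max_pixels arguments; the values below are illustrative assumptions, not official recommendations, and should be tuned to your GPU memory.

# Raise max_pixels so dense pages are downsampled less aggressively
image_processor = Qwen2VLImageProcessor.from_pretrained(
    model_path,
    min_pixels=512 * 28 * 28,
    max_pixels=2560 * 28 * 28,
)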
High-Throughput Deployment with SGLang
For production environments requiring high throughput, POINTS-Reader supports deployment with SGLang:
python3 -m sglang.launch_server \
    --model-path tencent/POINTS-Reader \
    --tp-size 1 \
    --dp-size 1 \
    --chat-template points-v15-chat \
    --trust-remote-code \
    --port 8081
Once deployed, you can query the model using:
from typing import List
import requests
import json

def call_wepoints(messages: List[dict],
                  temperature: float = 0.0,
                  max_new_tokens: int = 2048,
                  repetition_penalty: float = 1.05,
                  top_p: float = 0.8,
                  top_k: int = 20,
                  do_sample: bool = True,
                  url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str:
    """Query the WePOINTS model for document conversion."""
    data = {
        'model': 'WePoints',
        'messages': messages,
        'max_new_tokens': max_new_tokens,
        'temperature': temperature,
        'repetition_penalty': repetition_penalty,
        'top_p': top_p,
        'top_k': top_k,
        'do_sample': do_sample,
    }
    response = requests.post(url, json=data)
    response = json.loads(response.text)
    return response['choices'][0]['message']['content']

prompt = (
    'Please extract all the text from the image with the following requirements:\n'
    '1. Return tables in HTML format.\n'
    '2. Return all other text in Markdown format.'
)
messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'text',
            'text': prompt
        },
        {
            'type': 'image_url',
            'image_url': {'url': '/path/to/image.jpg'}
        }
    ]
}]
response = call_wepoints(messages)
print(response)
Real-World Application Examples
POINTS-Reader excels at processing diverse document types:
Single-Column Documents with LaTeX Formulas
The model accurately captures both text and complex mathematical expressions, maintaining proper formatting and syntax.
Single-Column Documents with Tables
HTML table representation ensures complex structures with merged cells are properly preserved.
Multi-Column Documents with LaTeX Formulas
The model maintains reading order and formatting across multiple columns while accurately capturing mathematical notation.
Multi-Column Documents with Tables
Even in complex multi-column layouts, POINTS-Reader accurately identifies and extracts table structures while maintaining proper HTML formatting.
Technical Insights from Ablation Studies
Comprehensive experiments validate the design choices in POINTS-Reader’s architecture and training approach.
The Importance of Data Diversity
Incremental addition of different data categories consistently improves performance:
Training Data | Text↓ | Table↓ | Formula↓ | Order↓ | Overall↓
--- | --- | --- | --- | --- | ---
Baseline | 0.551 | 0.652 | 0.730 | 0.570 | 0.626
+ Text | 0.522 | 0.641 | 0.721 | 0.553 | 0.609
+ Formula | 0.513 | 0.640 | 0.600 | 0.530 | 0.571
+ Table | 0.495 | 0.590 | 0.595 | 0.523 | 0.551
+ Multi-Column | 0.485 | 0.572 | 0.511 | 0.471 | 0.510
Each category not only improves performance on its specific element type but also contributes to overall metrics, with multi-column data providing particularly significant gains in reading order accuracy.
The Diminishing Returns of Synthetic Data
Experiments reveal that model performance plateaus—and eventually declines—when synthetic data exceeds 800,000 samples. This underscores the fundamental distribution differences between synthetic and real-world documents and highlights the necessity of the iterative self-improvement stage.
Optimal Aspect Ratio Filtering
By filtering out images with abnormal aspect ratios (outside the 2/5 to 5/9 range), model performance improves significantly. This reflects the fact that real-world documents are dominated by standard page shapes such as A4, whose ISO 216 aspect ratio is 1:√2.
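As a sketch, such a filter is a one-liner; note that the source does not spell out whether the ratio is width/height or height/width, so the definition below is an assumption.

def aspect_ratio_ok(width: int, height: int) -> bool:
    # Keep only images whose ratio falls inside the reported 2/5 to 5/9 range
    # (ratio taken here as width / height)
    return 2 / 5 <= width / height <= 5 / 9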
Current Limitations and Future Directions
Despite its impressive capabilities, POINTS-Reader has certain limitations that represent opportunities for future development:
- Language support: Currently optimized for English and Chinese, with limited capabilities for other languages
- Handwritten text: Performance is suboptimal on handwritten content such as notes and receipts
- Image extraction: The model currently focuses on text, tables, and formulas, without support for extracting or localizing images within documents
- Complex layouts: Highly complex layouts (e.g., newspapers with irregular formatting) can still challenge the model
The development team has outlined several priorities for future work:
- Expansion of multilingual support to cover more languages
- Improved handling of handwritten content through specialized training data
- Extension of capabilities to include image extraction and localization
- Enhanced processing of extremely complex document layouts
Frequently Asked Questions
How does POINTS-Reader differ from traditional OCR tools?
Traditional OCR tools typically focus only on text recognition and struggle with complex elements like tables and formulas. POINTS-Reader is an end-to-end vision-language model that simultaneously processes text, tables, and formulas, outputting structured formats without requiring multiple specialized tools.
What output formats does POINTS-Reader support?
The model uses unified output formats: Markdown for plain text, HTML for tables, and LaTeX for mathematical formulas. This consistent approach simplifies downstream processing and integration.
How can I improve conversion accuracy for difficult documents?
For challenging documents, try increasing the input image resolution to provide more visual information to the model. Adjusting generation parameters like temperature, top_p, and top_k may also help with specific issues like repetition.
Does POINTS-Reader work with handwritten documents?
Current performance with handwritten documents is limited, as the training data primarily contains printed text. This is an area of active development for future versions.
What is the maximum document length POINTS-Reader can process?
The model supports a context length of 8192 tokens, which is sufficient for most single-page documents. For multi-page documents, we recommend processing one page at a time.
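For instance, a PDF can be split into page images with the third-party pdf2image package (which requires poppler) and each page converted individually; this is a workflow suggestion, not part of POINTS-Reader itself.

from pdf2image import convert_from_path

pages = convert_from_path('report.pdf', dpi=200)  # one PIL image per page
for i, page in enumerate(pages):
    page.save(f'page_{i}.png')
    # feed each page image to model.chat(...) as shown earlier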
How computationally intensive is POINTS-Reader?
The model is designed for practical deployment, with a 3B parameter language model and 600M parameter visual encoder. With optimization frameworks like SGLang, it achieves satisfactory throughput for production use.
Conclusion
POINTS-Reader represents a significant advancement in document conversion technology by completely eliminating the need for distillation training while achieving state-of-the-art performance. Its novel two-stage approach—combining unified format warm-up with iterative self-improvement—creates a powerful model that excels at extracting text, tables, and formulas from diverse document types.
The model’s strong performance across multiple benchmarks, support for both English and Chinese, and practical deployment characteristics make it a valuable tool for researchers, businesses, and developers working with document digitization and conversion.
As the team continues to address current limitations and expand capabilities, POINTS-Reader is poised to become an increasingly versatile solution for turning complex documents into structured, editable digital content.