The Challenge of Modern Document Conversion
In our increasingly digital world, the ability to accurately convert physical documents into editable digital formats has become essential. From academic research papers and technical manuals to financial reports and legal documents, we regularly encounter materials that contain complex elements like multi-column layouts, structured tables, and mathematical formulas.
Traditional approaches to this problem have typically followed one of two paths:
- Pipeline methods that combine multiple specialized tools
- End-to-end models trained through knowledge distillation from larger models
Both approaches have significant limitations. Pipeline methods require stitching together different components for text recognition, table extraction, and formula parsing, often resulting in inconsistent output and error accumulation. Distillation-based methods, while more integrated, inherently limit the student model to the capabilities of the teacher model and can propagate errors through the system.
POINTS-Reader represents a fundamental shift in approach—a vision-language model specifically designed for document conversion that completely eliminates the need for distillation training while achieving state-of-the-art performance.
What Makes POINTS-Reader Different?
POINTS-Reader introduces a novel two-stage framework that generates high-quality training data without relying on existing models for distillation. This approach addresses the core limitations of previous methods:
- No performance ceiling: Unlike distillation, where student models can never exceed teacher capabilities
- No error propagation: Avoids inheriting and amplifying mistakes from teacher models
- Greater innovation potential: Enables true architectural and methodological advances
The model builds upon the POINTS1.5 architecture but replaces Qwen2.5-7B-Instruct with the more efficient Qwen2.5-3B-Instruct, maintaining performance while improving throughput. It uses a 600M parameter NaViT visual encoder, carefully chosen to balance capability with computational efficiency.
The Two-Stage Training Framework
Stage 1: Uniform Format Warm-up
The first stage addresses a fundamental challenge in document conversion: diverse elements require different output formats. POINTS-Reader establishes consistent formatting rules:
- Plain text: Rendered in Markdown format
- Tables: Represented using HTML, chosen for its ability to handle complex structures like merged cells
- Mathematical formulas: Expressed in LaTeX syntax, with separate inline and display notation
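To make this concrete, here is a fabricated miniature example of what converted output could look like for a page mixing all three element types (the exact inline math delimiters are an assumption on our part, not confirmed model output):

## 3. Results
The quarterly growth rate is computed as \( g = (v_t - v_{t-1}) / v_{t-1} \).
<table>
  <tr><th>Quarter</th><th>Revenue</th></tr>
  <tr><td>Q1</td><td>1.2M</td></tr>
  <tr><td>Q2</td><td>1.4M</td></tr>
</table>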
With these unified formats established, the system generates diverse synthetic data through a sophisticated process:
- Content generation: A large language model creates text based on carefully designed prompts across four categories:
  - Plain text only
  - Text with mathematical formulas
  - Text with tables
  - Multi-column layouts with tables
- Quality filtering: Rule-based filters ensure generated tables and formulas meet structural and syntactic requirements
- Rendering: Filtered content is converted to HTML and rendered into images using Chrome’s headless mode (see the rendering sketch below)
- Training: The resulting image-text pairs fine-tune a general vision-language model
This process generates over 200,000 samples per category, creating a diverse dataset that teaches the model to output document elements in a consistent format.
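The rendering step is straightforward to reproduce. Below is a minimal sketch, assuming a locally installed Chrome binary, that rasterizes an HTML file from Python; it illustrates the technique rather than reproducing the authors’ actual pipeline, and the window size and paths are placeholders.

import subprocess
from pathlib import Path

def render_html_to_png(html_file: str, out_png: str,
                       width: int = 1240, height: int = 1754) -> None:
    """Rasterize an HTML document with Chrome's headless mode.

    1240x1754 px roughly matches an A4 page at 150 DPI; adjust as needed.
    The binary may be named chrome, chromium, or google-chrome on your system.
    """
    subprocess.run([
        'google-chrome',
        '--headless',
        '--disable-gpu',
        f'--window-size={width},{height}',
        f'--screenshot={out_png}',
        Path(html_file).resolve().as_uri(),  # file:// URL of the page
    ], check=True)

render_html_to_png('sample.html', 'sample.png')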
Stage 2: Iterative Self-Improvement
While synthetic data enables large-scale training, its distribution often differs from real-world documents. The second stage bridges this gap through an iterative self-improvement process:
- Inference: The model from Stage 1 processes real documents from the DocMatrix dataset (over 2 million document images)
- Filtering: Multiple validation strategies automatically identify and retain high-quality outputs
- Retraining: The filtered data is used to train an improved model version
- Repetition: The process repeats through multiple iterations, progressively enhancing both model performance and data quality
This approach creates a virtuous cycle where each iteration produces better training data, which in turn trains a better model.
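In outline, the loop looks like the sketch below; the three callables stand in for components described above and are not actual POINTS-Reader APIs.

from typing import Callable, Iterable

def self_improve(model, documents: Iterable, infer: Callable,
                 passes_filters: Callable, train: Callable,
                 num_iterations: int = 3):
    """Schematic Stage 2 loop: annotate, filter, retrain, repeat."""
    for _ in range(num_iterations):
        # Inference: annotate real documents with the current model
        pairs = [(doc, infer(model, doc)) for doc in documents]
        # Filtering: keep only outputs that pass the validation checks
        clean = [(doc, out) for doc, out in pairs if passes_filters(out)]
        # Retraining: fine-tune on the filtered pairs to obtain a better model
        model = train(model, clean)
    return model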
Advanced Data Filtering Strategies
The success of the self-improvement stage hinges on effective filtering strategies that ensure only high-quality data is retained for training.
Text Filtering Using F1-Score
For plain text extraction, the system employs a sophisticated F1-score based approach:
- Traditional OCR (PaddleOCR) extracts reference text from images
- Both model predictions and references are normalized by removing non-alphanumeric characters and splitting on spaces
- The system counts occurrences of each token and calculates precision, recall, and F1-score
- Samples with F1-scores below a threshold (typically 0.9) are discarded
This approach effectively identifies and removes samples with hallucinations, repetitions, or omissions—common issues in complex layout parsing.
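A minimal sketch of this filter for English text follows; the helper names are ours, and the reference string is assumed to come from PaddleOCR as described above.

import re
from collections import Counter

def tokenize(text: str) -> list:
    # Normalize: replace non-alphanumeric characters with spaces, then split
    return re.sub(r'[^0-9a-zA-Z]+', ' ', text).lower().split()

def f1_score(prediction: str, reference: str) -> float:
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    if not pred or not ref:
        return 0.0
    overlap = sum((pred & ref).values())  # shared token occurrences
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def keep_sample(model_output: str, ocr_reference: str,
                threshold: float = 0.9) -> bool:
    # Discard samples whose token-level F1 falls below the threshold
    return f1_score(model_output, ocr_reference) >= threshold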
Structural Validation for Tables
Rather than relying on potentially unreliable table structure recognition models, POINTS-Reader implements a rule-based approach focused on structural validity:
- Verification of consistent cell counts across rows and columns
- Validation of HTML syntax compliance
- Removal of samples with invalid table structures
This method ensures that retained tables are structurally sound, even if the content isn’t explicitly validated.
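Using only Python’s standard library, a simplified version of such a check might look like the following; rowspan handling is omitted for brevity, so some valid merged-cell tables would be rejected.

from html.parser import HTMLParser

class TableChecker(HTMLParser):
    """Records the effective cell count of each table row, expanding
    colspan so horizontally merged cells are counted correctly."""
    def __init__(self):
        super().__init__()
        self.rows, self._cells, self._in_row = [], 0, False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._in_row, self._cells = True, 0
        elif tag in ('td', 'th') and self._in_row:
            span = dict(attrs).get('colspan') or '1'
            self._cells += int(span) if span.isdigit() else 1

    def handle_endtag(self, tag):
        if tag == 'tr' and self._in_row:
            self.rows.append(self._cells)
            self._in_row = False

def table_is_valid(table_html: str) -> bool:
    checker = TableChecker()
    checker.feed(table_html)
    # Structurally sound: at least one row, identical width for every row
    return bool(checker.rows) and len(set(checker.rows)) == 1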
Syntax Checking for Mathematical Formulas
For mathematical formulas, the filtering process focuses exclusively on syntactic correctness:
- Extraction of all formulas from model outputs
- LaTeX syntax validation
- Discarding of samples containing invalid formulas
This approach recognizes that while semantic validation would be ideal, syntactic correctness is a practical and effective filter for improving formula recognition.
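One way to approximate this check is with the pylatexenc package; the delimiters in the regular expression below are an assumption about the model’s output notation, not confirmed by the source.

import re
from pylatexenc.latexwalker import LatexWalker, LatexWalkerParseError

# Assumed inline \( ... \) and display \[ ... \] delimiters
FORMULA_PATTERN = re.compile(r'\\\((.+?)\\\)|\\\[(.+?)\\\]', re.DOTALL)

def formulas_are_valid(model_output: str) -> bool:
    for match in FORMULA_PATTERN.finditer(model_output):
        formula = match.group(1) or match.group(2)
        try:
            # Strict parsing raises on unbalanced braces, malformed macros, etc.
            LatexWalker(formula, tolerant_parsing=False).get_latex_nodes()
        except LatexWalkerParseError:
            return False
    return True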
Performance Excellence: Benchmark Results
POINTS-Reader demonstrates impressive performance across multiple benchmarks, outperforming both pipeline methods and other vision-language models.
OmniDocBench Results
On the comprehensive OmniDocBench evaluation, which assesses performance across text, formulas, tables, and reading order, POINTS-Reader achieves:
- Overall score: 0.259 (lower is better)
- Text extraction: 0.176 edit distance
- Formula recognition: 0.383 edit distance
- Table extraction: 0.335 edit distance
- Reading order: 0.144 edit distance
These results are particularly impressive when compared to other approaches:
- Vs. pipeline methods: POINTS-Reader outperforms MinerU (0.150), Marker (0.336), and Mathpix (0.191)
- Vs. general VLMs: It surpasses Qwen2.5-VL-72B (0.214) despite being significantly smaller
- Vs. expert OCR models: It exceeds GOT-OCR (0.287), Nougat (0.452), and Mistral OCR (0.268)
Specialized Strengths
POINTS-Reader shows particular excellence in table recognition, outperforming GOT-OCR by 0.197 on the table metric. This strength stems from its use of HTML for table representation, which better handles complex structures compared to the Markdown format used by many other approaches.
The model also supports both English and Chinese documents, with scores of 0.133 for English and 0.212 for Chinese on OmniDocBench—impressive results that demonstrate its multilingual capabilities.
Practical Implementation and Usage
Installation and Setup
POINTS-Reader requires the following environment configuration:
python==3.10.12
torch==2.5.1
transformers==4.55.2
cuda==12.1
First, install the WePOINTS library:
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
Basic Usage with Transformers
The model can be loaded and run with the Transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch

# Recommended prompt for optimal performance
prompt = (
    'Please extract all the text from the image with the following requirements:\n'
    '1. Return tables in HTML format.\n'
    '2. Return all other text in Markdown format.'
)

image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-Reader'

# Load the model, tokenizer, and image processor from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             torch_dtype=torch.float16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)

# A single user turn containing the image and the instruction
content = [
    dict(type='image', image=image_path),
    dict(type='text', text=prompt)
]
messages = [
    {
        'role': 'user',
        'content': content
    }
]

generation_config = {
    'max_new_tokens': 2048,
    'repetition_penalty': 1.05,
    'temperature': 0.7,
    'top_p': 0.8,
    'top_k': 20,
    'do_sample': True
}

response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
For issues with repetitive output, try increasing the input image resolution.
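With the Qwen2-VL image processor used above, input resolution is controlled through its min_pixels and max_pixels arguments; the values below are illustrative assumptions, not official recommendations, and should be tuned to your GPU memory.

# Raise max_pixels so dense pages are downsampled less aggressively
image_processor = Qwen2VLImageProcessor.from_pretrained(
    model_path,
    min_pixels=512 * 28 * 28,
    max_pixels=2560 * 28 * 28,
)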
High-Throughput Deployment with SGLang
For production environments requiring high throughput, POINTS-Reader supports deployment with SGLang:
python3 -m sglang.launch_server \
    --model-path tencent/POINTS-Reader \
    --tp-size 1 \
    --dp-size 1 \
    --chat-template points-v15-chat \
    --trust-remote-code \
    --port 8081
Once deployed, you can query the model using:
from typing import List
import requests
import json

def call_wepoints(messages: List[dict],
                  temperature: float = 0.0,
                  max_new_tokens: int = 2048,
                  repetition_penalty: float = 1.05,
                  top_p: float = 0.8,
                  top_k: int = 20,
                  do_sample: bool = True,
                  url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str:
    """Query the WePOINTS model for document conversion."""
    data = {
        'model': 'WePoints',
        'messages': messages,
        'max_new_tokens': max_new_tokens,
        'temperature': temperature,
        'repetition_penalty': repetition_penalty,
        'top_p': top_p,
        'top_k': top_k,
        'do_sample': do_sample,
    }
    response = requests.post(url, json=data)
    response = json.loads(response.text)
    return response['choices'][0]['message']['content']

prompt = (
    'Please extract all the text from the image with the following requirements:\n'
    '1. Return tables in HTML format.\n'
    '2. Return all other text in Markdown format.'
)
messages = [{
    'role': 'user',
    'content': [
        {
            'type': 'text',
            'text': prompt
        },
        {
            'type': 'image_url',
            'image_url': {'url': '/path/to/image.jpg'}
        }
    ]
}]
response = call_wepoints(messages)
print(response)
Real-World Application Examples
POINTS-Reader excels at processing diverse document types:
Single-Column Documents with LaTeX Formulas
The model accurately captures both text and complex mathematical expressions, maintaining proper formatting and syntax.
Single-Column Documents with Tables
HTML table representation ensures complex structures with merged cells are properly preserved.
Multi-Column Documents with LaTeX Formulas
The model maintains reading order and formatting across multiple columns while accurately capturing mathematical notation.
Multi-Column Documents with Tables
Even in complex multi-column layouts, POINTS-Reader accurately identifies and extracts table structures while maintaining proper HTML formatting.
Technical Insights from Ablation Studies
Comprehensive experiments validate the design choices in POINTS-Reader’s architecture and training approach.
The Importance of Data Diversity
Incremental addition of different data categories consistently improves performance:
Training Data | Text↓ | Table↓ | Formula↓ | Order↓ | Overall↓
--- | --- | --- | --- | --- | ---
Baseline | 0.551 | 0.652 | 0.730 | 0.570 | 0.626
+ Text | 0.522 | 0.641 | 0.721 | 0.553 | 0.609
+ Formula | 0.513 | 0.640 | 0.600 | 0.530 | 0.571
+ Table | 0.495 | 0.590 | 0.595 | 0.523 | 0.551
+ Multi-Column | 0.485 | 0.572 | 0.511 | 0.471 | 0.510
Each category not only improves performance on its specific element type but also contributes to overall metrics, with multi-column data providing particularly significant gains in reading order accuracy.
The Diminishing Returns of Synthetic Data
Experiments reveal that model performance plateaus—and eventually declines—when synthetic data exceeds 800,000 samples. This underscores the fundamental distribution differences between synthetic and real-world documents and highlights the necessity of the iterative self-improvement stage.
Optimal Aspect Ratio Filtering
By filtering out images with abnormal aspect ratios (outside the 2/5 to 5/9 range), model performance improves significantly. This reflects the fact that real-world documents are dominated by standard page shapes such as A4, whose ISO 216 aspect ratio is 1:√2.
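As a sketch, such a filter is a one-liner; note that the source does not spell out whether the ratio is width/height or height/width, so the definition below is an assumption.

def aspect_ratio_ok(width: int, height: int) -> bool:
    # Keep only images whose ratio falls inside the reported 2/5 to 5/9 range
    # (ratio taken here as width / height)
    return 2 / 5 <= width / height <= 5 / 9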
Current Limitations and Future Directions
Despite its impressive capabilities, POINTS-Reader has certain limitations that represent opportunities for future development:
- Language support: Currently optimized for English and Chinese, with limited capabilities for other languages
- Handwritten text: Performance is suboptimal on handwritten content such as notes and receipts
- Image extraction: The model currently focuses on text, tables, and formulas, without support for extracting or localizing images within documents
- Complex layouts: Highly complex layouts (e.g., newspapers with irregular formatting) can still challenge the model
The development team has outlined several priorities for future work:
- Expansion of multilingual support to cover more languages
- Improved handling of handwritten content through specialized training data
- Extension of capabilities to include image extraction and localization
- Enhanced processing of extremely complex document layouts
Frequently Asked Questions
How does POINTS-Reader differ from traditional OCR tools?
Traditional OCR tools typically focus only on text recognition and struggle with complex elements like tables and formulas. POINTS-Reader is an end-to-end vision-language model that simultaneously processes text, tables, and formulas, outputting structured formats without requiring multiple specialized tools.
What output formats does POINTS-Reader support?
The model uses unified output formats: Markdown for plain text, HTML for tables, and LaTeX for mathematical formulas. This consistent approach simplifies downstream processing and integration.
How can I improve conversion accuracy for difficult documents?
For challenging documents, try increasing the input image resolution to provide more visual information to the model. Adjusting generation parameters like temperature, top_p, and top_k may also help with specific issues like repetition.
Does POINTS-Reader work with handwritten documents?
Current performance with handwritten documents is limited, as the training data primarily contains printed text. This is an area of active development for future versions.
What is the maximum document length POINTS-Reader can process?
The model supports a context length of 8192 tokens, which is sufficient for most single-page documents. For multi-page documents, we recommend processing one page at a time.
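For instance, a PDF can be split into page images with the third-party pdf2image package (which requires poppler) and each page converted individually; this is a workflow suggestion, not part of POINTS-Reader itself.

from pdf2image import convert_from_path

pages = convert_from_path('report.pdf', dpi=200)  # one PIL image per page
for i, page in enumerate(pages):
    page.save(f'page_{i}.png')
    # feed each page image to model.chat(...) as shown earlier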
How computationally intensive is POINTS-Reader?
The model is designed for practical deployment, with a 3B parameter language model and 600M parameter visual encoder. With optimization frameworks like SGLang, it achieves satisfactory throughput for production use.
Conclusion
POINTS-Reader represents a significant advancement in document conversion technology by completely eliminating the need for distillation training while achieving state-of-the-art performance. Its novel two-stage approach—combining unified format warm-up with iterative self-improvement—creates a powerful model that excels at extracting text, tables, and formulas from diverse document types.
The model’s strong performance across multiple benchmarks, support for both English and Chinese, and practical deployment characteristics make it a valuable tool for researchers, businesses, and developers working with document digitization and conversion.
As the team continues to address current limitations and expand capabilities, POINTS-Reader is poised to become an increasingly versatile solution for turning complex documents into structured, editable digital content.