DeepSeek-OCR 2: Visual Causal Flow – A New Chapter in Human-Like Visual Understanding
Core Question: How can traditional Vision-Language Models (VLMs) break free from rigid raster-scan limitations to achieve document understanding based on “Visual Causal Flow”?
In the rapidly evolving landscape of multimodal large models, we have grown accustomed to treating images as static 2D matrices, converting them into 1D token sequences for input into Large Language Models (LLMs). However, does the default “top-left to bottom-right” rigid processing really align with human intuition when reading complex documents? When facing academic PDFs containing formulas, tables, multi-column layouts, or complex logical structures, our gaze does not move mechanically line by line; rather, it follows a semantic causal flow—jumping, backtracking, and focusing as guided by logic.
The release of DeepSeek-OCR 2 aims to break this deadlock. It is not merely an iteration of OCR (Optical Character Recognition) tools, but a profound reconstruction of the vision encoder architecture. By introducing a novel design called DeepEncoder V2, DeepSeek-OCR 2 applies an LLM architecture to vision encoding for the first time, achieving dynamic token reordering based on image semantics. This article will deeply analyze the principles, architectural details, real-world performance, and how to deploy this technology in your projects.
Re-evaluating Visual Encoding: From Raster Scan to Causal Perception
Core Question: What are the fundamental flaws in traditional visual encoding when handling complex document layouts?
Most mainstream Vision-Language Models (VLMs) today follow a standard processing pipeline: an image is split into patches, features are extracted by a vision encoder (like CLIP ViT), and then flattened into a sequence based on a fixed spatial order (usually raster scan) to be fed into the LLM. While simple, this approach introduces a strong and unreasonable inductive bias for document understanding—it assumes the logical order of information in an image corresponds exactly to its spatial coordinates.
Imagine reading an academic paper with multi-column layouts, headers/footers, and interspersed chart captions. If you were forced to read from the very first character of the first line to the very last character of the last line without skipping, how confused would you be? Traditional models are stuck in this exact state. Raster scanning destroys the semantic logic of documents; for example, column relationships in tables often require horizontal reading, while superscript/subscript relationships in formulas violate line flow entirely.
This rigid processing leads to two main problems: high computational costs (often requiring thousands of visual tokens, e.g., 6000+, for spatial details, resulting in slow inference) and a lack of logical understanding (models struggle to capture non-linear reading orders, leading to poor performance in reading order detection and complex structure parsing).
DeepSeek-OCR 2’s core insight is: The processing order of visual tokens should be determined by the semantic logic of the image content itself, not rigid spatial coordinates. To achieve this, the DeepSeek team proposed a new paradigm—Visual Causal Flow.
DeepEncoder V2: Reshaping the Vision Encoder with LLM Architecture
Core Question: How does DeepEncoder V2 replace the CLIP component and implement causal attention mechanisms to enable visual reasoning?
The breakthrough of DeepSeek-OCR 2 lies in its entirely new encoder design—DeepEncoder V2. In the previous version of DeepSeek-OCR, the vision encoding part used the classic CLIP architecture. In the V2 version, the team made a bold decision: to replace CLIP with a compact LLM architecture (specifically instantiated as Qwen2-0.5B).
This replacement is not a simple “shell change” but aims to fundamentally alter the flow of information. Models like CLIP (ViT) typically use bidirectional attention, meaning every token can see global information. This is good for feature extraction but cannot generate sequences with causal logic.
DeepEncoder V2 employs a dual-stream attention mechanism, which is the most ingenious part of its architecture:
- Vision Tokens: These retain ViT-style bidirectional attention, so the original image features are extracted with global perception, capturing the overall context of the image. This is the foundation for feature quality.
- Causal Flow Tokens: These are newly introduced learnable query tokens. They use a causal attention mechanism similar to an LLM decoder, meaning each token can only see the tokens before it.
These two types of tokens are concatenated together via a specific Attention Mask. Vision tokens exist as a “prefix,” while causal flow tokens act as a “suffix.” In this architecture, causal flow tokens can “attend” to all vision tokens (via the design of the Attention Mask), but interact with each other sequentially in a unidirectional manner.
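To make the masking concrete, here is a minimal sketch of how such a block attention mask could be built. The function name is ours, and we assume the vision prefix does not attend back to the flow queries; the official implementation's exact masking may differ.

```python
import torch

def build_dual_stream_mask(num_vision: int, num_flow: int) -> torch.Tensor:
    """Boolean mask: True means the query row may attend to the key column."""
    n = num_vision + num_flow
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Vision tokens (prefix): full bidirectional attention among themselves.
    mask[:num_vision, :num_vision] = True
    # Causal flow tokens (suffix): attend to every vision token...
    mask[num_vision:, :num_vision] = True
    # ...but only causally (lower-triangular) to each other.
    mask[num_vision:, num_vision:] = torch.tril(
        torch.ones(num_flow, num_flow, dtype=torch.bool)
    )
    return mask

# Toy example: 4 vision tokens followed by 3 causal flow tokens.
print(build_dual_stream_mask(4, 3).int())
```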
Application Scenario & Value:
This design is like a human reader with cognitive abilities. The vision tokens are the entire page spread out before them (global view), and the causal flow tokens are their thought process. They don’t read mechanically but decide where to look first and next based on the guidance of visual information. Through this method, DeepEncoder V2 achieves “reordering” and “causalization” of visual information, performing a deep logic cleanup before passing the information to the backend LLM Decoder.
Visual Token Compression & Multi-Crop Strategy
Core Question: How does DeepEncoder V2 maintain high performance while keeping visual token counts extremely low?
In document understanding tasks, resolution is the lifeblood of accuracy but the nightmare of computational cost. DeepSeek-OCR 2 adopts a very pragmatic multi-crop strategy to balance both.
First, the model performs initial feature compression via an 80M parameter Vision Tokenizer (based on SAM-base and Conv layers). This module compresses the number of image tokens by 16x, significantly reducing the pressure on subsequent modules’ VRAM and computation.
Next, DeepEncoder V2 defines two cropping modes:
- Global View: Fixed resolution of 1024×1024, corresponding to 256 causal query tokens (Query_global). This ensures the model perceives the document's overall layout.
- Local Crops: Fixed resolution of 768×768. When the image is large, the system automatically crops 0 to 6 local views based on image dimensions. All local views share a set of 144 causal query tokens (Query_local).
The Wisdom of Token Budgeting:
Through this design, the total number of visual tokens input to the LLM by DeepSeek-OCR 2 is controlled between 256 and 1120.
- Minimum value 256: Corresponds to the global view only, matching the compression rate of DeepSeek-OCR.
- Maximum value 1120: Corresponds to 1 global view + 6 local crops (see the arithmetic sketch below). Notably, this matches the maximum visual token budget of Gemini-3 Pro, and compared to other open-source models that use 6000+ tokens, it is a more-than-fivefold reduction.
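To see where 256 and 1120 come from, here is the budget arithmetic spelled out as a small sketch. The assumption that each local crop contributes the full set of 144 query tokens follows from 256 + 6 × 144 = 1120.

```python
# Worked token arithmetic for the budget described above.
GLOBAL_TOKENS = 256   # Query_global: one 1024x1024 global view
LOCAL_TOKENS = 144    # Query_local: per 768x768 local crop (assumed per-crop)

def visual_token_budget(num_local_crops: int) -> int:
    assert 0 <= num_local_crops <= 6, "the model crops 0 to 6 local views"
    return GLOBAL_TOKENS + num_local_crops * LOCAL_TOKENS

print(visual_token_budget(0))  # 256  -> global view only (minimum)
print(visual_token_budget(6))  # 1120 -> global view + 6 local crops (maximum)
```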
This means that when processing HD PDFs or long images with dense information, we don’t need to worry about VRAM overflow, and we can achieve extremely high inference speeds while retaining sufficient local details to recognize dense text or tiny symbols.
Dual Causal Reasoning: Towards Genuine 2D Understanding
Core Question: How does this architecture transform 2D image understanding into two cascaded 1D causal reasoning processes?
The architecture of DeepSeek-OCR 2 is not just an upgrade of the vision encoder; it actually builds a system of dual cascaded causal reasoning.
- First Reasoning: DeepEncoder V2 (Reading Logic Reasoning). Using the causal flow tokens, the encoder reorders the originally unordered (or merely spatially sorted) vision tokens according to the semantic logic of the image. This is reasoning at the "reading logic" level: it determines the order in which information is fed forward, mimicking the fixation shifts of human eye movements, and answers the question of "what order to look in."
- Second Reasoning: DeepSeek-MoE Decoder (Visual Task Reasoning). The backend LLM decoder receives these causally ordered token sequences and performs standard autoregressive reasoning to generate the final text or structured data. This step answers the question of "how to understand what is seen."
Reflection & Unique Insight:
This design is highly inspiring. In the past, we tried to handle 2D structures directly through complex 2D positional encodings or Graph Neural Networks (GNNs). DeepSeek-OCR 2 proposes a new possibility: Perhaps we don’t need to force the model to understand 2D, but instead approximate 2D understanding through two orthogonal 1D causal reasoning passes.
The first 1D (Encoder) is responsible for “folding” 2D spatial information into a 1D time sequence that conforms to logic; the second 1D (Decoder) then performs semantic decoding on this time sequence. This decomposition not only simplifies training difficulty but also cleverly bridges the gap between 2D visual structures and 1D language models.
Practical Evaluation: Excellence on OmniDocBench v1.5
Core Question: How does the actual performance of DeepSeek-OCR 2 compare, and what are the improvements over the previous generation and competitors?
Whether the theory holds up ultimately depends on benchmark data. DeepSeek-OCR 2 has delivered a satisfying answer on the authoritative document parsing benchmark, OmniDocBench v1.5.
Core Metric Comparison
OmniDocBench v1.5 contains 1,355 Chinese and English document pages, covering complex types like magazines, academic papers, and research reports. DeepSeek-OCR 2 achieved an Overall score of 91.09%.
More importantly, this performance was achieved with an extremely low visual token budget (V-token_max = 1120).
| Metric | DeepSeek-OCR (9-crops) | DeepSeek-OCR 2 | Change |
|---|---|---|---|
| Overall ↑ | 87.36% | 91.09% | +3.73% |
| Text Edit Distance ↓ | 0.073 | 0.048 | -0.025 |
| Formula CDM ↑ | 84.14% | 90.31% | +6.17% |
| Table TEDS ↑ | 85.25% | 87.75% | +2.50% |
| Table TEDS-S ↑ | 89.01% | 92.06% | +3.05% |
| R-order Edit Distance ↓ | 0.085 | 0.057 | -0.028 |
Note: ↓ means lower is better, ↑ means higher is better. Edit Distance (ED) measures the difference between prediction and ground truth; lower is better.
Deep Dive: The Massive Leap in Reading Order
One of the most noteworthy metrics is the R-order Edit Distance (Reading Order Edit Distance), which dropped significantly from 0.085 to 0.057. This directly proves the effectiveness of the “Visual Causal Flow” in DeepEncoder V2. The model no longer outputs text blindly in spatial order but can recognize the document’s logical structure, such as reading the title first, then the body, and correctly handling dual-column layouts.
Even when compared to Gemini-3 Pro, DeepSeek-OCR 2 outperforms the competitor in overall document parsing Edit Distance (0.100 vs 0.115), while both use a similar visual token budget (1120). This means DeepSeek-OCR 2 achieves world-leading parsing accuracy at equal or even lower resource consumption.
Production Environment Validation: Significant Reduction in Repetition Rate
Beyond academic benchmarks, DeepSeek-OCR 2 has been validated in actual production environments. For OCR models serving LLMs, the biggest nightmare in production is often “repetitive generation,” where the model loops outputting the same text fragment, severely destroying the downstream LLM experience.
According to DeepSeek’s online data:
- Online User Log Images: DeepSeek-OCR had a repetition rate of 6.25%, while DeepSeek-OCR 2 dropped to 4.17%.
- PDF Pre-training Data: The repetition rate dropped from 3.69% to 2.88%.
This decline reflects the model's improved logical understanding of what it sees: once the model grasps where a document starts, where it ends, and how it is structured, the probability of falling into repetitive generation loops naturally decreases.
Installation & Deployment Guide: Running DeepSeek-OCR 2 from Scratch
Core Question: As a developer, how can I configure and run DeepSeek-OCR 2 in my local environment?
The open-source model for DeepSeek-OCR 2 has been released via Hugging Face. For optimal performance, the official recommendation is to use vLLM for inference, while also supporting the standard Transformers library.
Environment Preparation
First, you need to configure a CUDA-supported environment. The official test environment is CUDA 11.8 and PyTorch 2.6.0.
1. Get the Code
git clone https://github.com/deepseek-ai/DeepSeek-OCR-2.git
2. Create Conda Environment
conda create -n deepseek-ocr2 python=3.12.9 -y
conda activate deepseek-ocr2
3. Install Core Dependencies
Pay special attention to the version matching of PyTorch and vLLM. You need to download the specific version of the vLLM wheel package.
# Install PyTorch 2.6.0
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# Install vLLM (Adjust based on actual filename, need to download corresponding cu118 version from vLLM releases)
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
# Install other dependencies
pip install -r requirements.txt
# Install Flash Attention for acceleration
pip install flash-attn==2.7.3 --no-build-isolation
Note: If you wish to run both the vLLM and Transformers code in the same environment, you can temporarily ignore the dependency warning regarding transformers>=4.51.1; it usually does not affect functionality.
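After installation, an optional sanity check of our own can confirm that the key packages import and that CUDA is visible before you load the model:

```python
# Optional sanity check (not part of the official repository).
import torch
import vllm
import flash_attn  # only needed if you installed Flash Attention

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("vllm", vllm.__version__)
```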
High-Speed Inference with vLLM
vLLM is currently one of the highest throughput LLM inference engines, making it particularly suitable for DeepSeek-OCR 2, which needs to process large numbers of visual tokens. After entering the DeepSeek-OCR2-vllm directory, you can run the following scripts based on your needs:
Scenario 1: Image Streaming Output
When you need to process a single image and see results in real-time:
python run_dpsk_ocr2_image.py
Remember to modify INPUT_PATH and OUTPUT_PATH in config.py.
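For orientation, the relevant entries in config.py look roughly like this; the paths below are placeholders, and the actual file in the repository may expose additional options:

```python
# Hypothetical excerpt of DeepSeek-OCR2-vllm/config.py -- edit to match your setup.
INPUT_PATH = '/path/to/your_image.png'   # image to process
OUTPUT_PATH = '/path/to/output_dir/'     # where results are written
```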
Scenario 2: PDF Concurrent Processing
This is a strong suit of DeepSeek-OCR 2. It maintains concurrent throughput comparable to DeepSeek-OCR, making it ideal for batch converting PDFs to Markdown or structured data.
python run_dpsk_ocr2_pdf.py
Scenario 3: Benchmark Evaluation
If you need to evaluate the model’s performance on datasets like OmniDocBench:
python run_dpsk_ocr2_eval_batch.py
Flexible Development Using Transformers Library
If you are used to using the Hugging Face ecosystem for secondary development or integration, you can directly use the Transformers API.
from transformers import AutoModel, AutoTokenizer
import torch
import os
# Specify GPU
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'
# Load model and tokenizer, using Flash Attention 2 for acceleration
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name,
_attn_implementation='flash_attention_2',
trust_remote_code=True,
use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)
# Set prompt and image path
# <|grounding|> here acts as a special token prompting the model for structural understanding
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'
# Execute inference
# base_size=1024 (global view), image_size=768 (local view cropping), crop_mode=True (enable multi-crop)
res = model.infer(tokenizer, prompt=prompt, image_file=image_file,
output_path=output_path, base_size=1024, image_size=768,
crop_mode=True, save_results=True)
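Building on the snippet above, batch processing a folder of page images is a simple loop around the same model.infer call. This is a sketch of our own; the directory layout is illustrative.

```python
from pathlib import Path

# Reuses `model` and `tokenizer` from the snippet above.
image_dir = Path('pages/')   # illustrative folder of page images
output_path = 'outputs/'

for image_file in sorted(image_dir.glob('*.jpg')):
    # Same arguments as above: 1024 global view, 768 local crops, multi-crop on.
    model.infer(tokenizer,
                prompt="<image>\n<|grounding|>Convert the document to markdown. ",
                image_file=str(image_file),
                output_path=output_path,
                base_size=1024, image_size=768,
                crop_mode=True, save_results=True)
```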
Prompt Engineering & Practical Tips
Core Question: How can I write appropriate prompts to guide DeepSeek-OCR 2 to complete different visual tasks?
DeepSeek-OCR 2, as a multi-functional vision-language model, relies heavily on Prompt guidance for its output. Through carefully designed Prompts, you can make it perform tasks ranging from simple OCR to complex document structuring.
Document Structuring & Parsing
For PDF documents, we usually want Markdown output to preserve headers, lists, and table structures.
# Preserve layout and convert to Markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
The <|grounding|> token here plays a key role; it tells the model that it needs to perform structure-aware output based on visual positioning, not just extract text streams.
Pure Text Extraction
If you only care about text content and not layout, you can use the “Free OCR” mode, which is often faster and removes noise.
# Pure OCR
prompt = "<image>\nFree OCR. "
Chart & Formula Parsing
DeepSeek-OCR 2 has specialized optimizations for formulas and charts.
# Parse charts
prompt = "<image>\nParse the figure. "
# If the document contains formulas, combining with Markdown prompts usually works best
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
# The model will automatically attempt to convert formulas to LaTeX format
General Description & Localization
Besides documents, it can also handle general images.
# Detailed image description
prompt = "<image>\nDescribe this image in detail."
# Locate specific text (Referring Expression Comprehension)
prompt = "<image>\nLocate <|ref|>xxxx<|/ref|> in the image."
This “Ref” mode is powerful; you can ask the model to find and circle (if the model supports drawing output) or focus on specific text regions in the image.
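If you switch between these tasks often, a small helper of your own can keep the prompt strings in one place. This wrapper is ours, not part of the official API.

```python
# Convenience wrapper around the prompt variants shown above (our own helper).
TASK_PROMPTS = {
    'markdown': "<image>\n<|grounding|>Convert the document to markdown. ",
    'free_ocr': "<image>\nFree OCR. ",
    'figure':   "<image>\nParse the figure. ",
    'describe': "<image>\nDescribe this image in detail.",
}

def build_prompt(task: str = 'markdown', ref_text: str | None = None) -> str:
    """Return the prompt for a task; passing ref_text switches to localization mode."""
    if ref_text is not None:
        return f"<image>\nLocate <|ref|>{ref_text}<|/ref|> in the image."
    return TASK_PROMPTS[task]

print(build_prompt('free_ocr'))
print(build_prompt(ref_text='Table 1'))
```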
Future Outlook: Towards Native Multimodality
Core Question: What implications does the architecture of DeepSeek-OCR 2 have for the future of multimodal models?
The success of DeepEncoder V2 validates an exciting possibility: LLM architectures can serve as universal encoders for diverse modalities.
Currently, we use ViT for images, specialized acoustic models for audio, and Transformers for text. This creates a chasm between modalities. DeepSeek-OCR 2 proposes the vision of “Native Multimodality”: a unified Transformer architecture sharing projections, attention mechanisms, and FFNs. Different modalities (image, audio, text) only need different configurations of “learnable queries.”
This is like the same brain needing only different “glasses” (query vectors) to see images, hear sounds, and read text. The “Visual Causal Flow” query in DeepSeek-OCR 2 is a crucial step toward this future. Through this architecture, the model can seamlessly inherit various infrastructure optimizations from the LLM community, such as Mixture-of-Experts (MoE), Flash Attention, etc., avoiding reinventing the wheel for the visual modality.
Conclusion: The Intelligence of Causal Flow
DeepSeek-OCR 2 is not just a more powerful OCR tool; it acts more like an explorer mimicking human cognitive processes. It shows us that making machines “see” like humans involves not just increasing resolution or stacking parameters, but more importantly, letting machines learn to “think logically.”
Through the Visual Causal Flow of DeepEncoder V2, we see an elegant solution for mapping 2D image understanding to 1D causal logic. This model, which maintains extremely high token compression (1120 tokens) while achieving deep document understanding, will undoubtedly provide strong underlying support for large-scale document digitization, knowledge base construction, and RAG (Retrieval-Augmented Generation) systems.
For developers, now is the best time to experience this new architecture. Whether using vLLM for efficient data cleaning or integrating it into applications via Transformers, DeepSeek-OCR 2 demonstrates a high level of production readiness.
Practical Summary / Actionable Checklist
- Architecture Understanding: Remember DeepEncoder V2 = LLM architecture (Qwen2-0.5B) + dual-stream attention (bidirectional vision tokens + causal query tokens).
- Core Advantages: Extremely low visual token count (256-1120), fast inference, and strong reading-order logic.
- Installation Environment: CUDA 11.8, Torch 2.6.0, vLLM 0.8.5.
- Inference Modes:
  - High-concurrency PDF processing: use vLLM.
  - Single-image flexible development: use Transformers.
- Prompting Tips:
  - Document parsing: use <|grounding|> + Markdown conversion.
  - Pure text extraction: use Free OCR.
- Performance Benchmark: OmniDocBench v1.5 Overall 91.09%; production repetition rate as low as 4.17%.
One-Page Summary
| Item | DeepSeek-OCR 2 Feature |
|---|---|
| Core Innovation | DeepEncoder V2 (Visual Causal Flow) |
| Vision Encoder | Qwen2-0.5B (LLM architecture replacing CLIP) |
| Max Token Count | 1120 (Global + Local Crops) |
| Inference Engine | vLLM (Recommended), Transformers |
| Benchmark | OmniDocBench v1.5: 91.09% |
| Main Applications | PDF to Markdown, Complex Table Parsing, Formula Recognition |
| Open Source Link | GitHub: deepseek-ai/DeepSeek-OCR-2 |
Frequently Asked Questions (FAQ)
Q1: What is the biggest difference between DeepSeek-OCR 2 and DeepSeek-OCR (First Generation)?
A: The biggest difference lies in the architecture. DeepSeek-OCR 2 replaces the original CLIP component with DeepEncoder V2. The new encoder adopts an LLM-style architecture, introducing causal flow queries capable of dynamically reordering visual tokens based on image semantics, significantly improving reading order and complex document understanding.
Q2: I only have a 24GB GPU, can I run DeepSeek-OCR 2?
A: Yes. One of DeepSeek-OCR 2’s strengths is its extremely high visual token compression (max only 1120 tokens). Compared to other VLM models that often occupy 20GB+ of VRAM, it runs very easily on 24GB VRAM or even lower, especially with vLLM support.
Q3: How can I get the best performance and accuracy when processing multi-page PDFs?
A: It is recommended to use DeepSeek-OCR 2’s built-in vLLM inference script (run_dpsk_ocr2_pdf.py). It supports concurrent processing, fully utilizing the multi-crop strategy to capture document details while maintaining extremely high processing speeds. For prompts, we suggest using <|grounding|>Convert the document to markdown. to preserve structure.
Q4: What is “Visual Causal Flow,” and why is it important?
A: Visual Causal Flow means the model no longer reads images in a fixed left-to-right, top-to-bottom order, but like a human, determines the reading order based on the content’s logical relationships (e.g., columns in a table, structure of formulas). This allows the model to output text conforming to human cognitive logic when handling complex layouts, avoiding confusion.
Q5: Besides OCR, what else can DeepSeek-OCR 2 do?
A: In addition to text extraction, it can perform document structuring (to Markdown), chart parsing, general image description, and text-based localization tasks (Locate <|ref|>text<|/ref|>), possessing strong general visual understanding capabilities.
Q6: What should I do if I encounter a Flash Attention installation failure during setup?
A: Flash Attention is an optional acceleration component. If installation fails, try removing the --no-build-isolation parameter, or set _attn_implementation='eager' instead of flash_attention_2 in the inference code. While speed may decrease, basic functionality remains unaffected.
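For reference, the fallback load looks like the earlier Transformers snippet with only the attention implementation changed:

```python
from transformers import AutoModel

# Fallback when Flash Attention 2 cannot be installed: default eager attention.
model = AutoModel.from_pretrained('deepseek-ai/DeepSeek-OCR-2',
                                  _attn_implementation='eager',
                                  trust_remote_code=True,
                                  use_safetensors=True)
```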
Q7: Does the model support Chinese document parsing?
A: Yes. DeepSeek-OCR 2 performs excellently on OmniDocBench v1.5 (containing Chinese and English documents), and its training data includes a large amount of Chinese OCR data, offering high accuracy for Chinese typography and formula recognition.
Q8: How can I adjust the verbosity of the model’s output?
A: You can control this by adjusting the Prompt. For example, using “Describe this image in detail.” will yield longer descriptions, while “Free OCR.” will only output the most concise text content without structural tags.