DeepSeek-OCR 2, the latest model in the DeepSeek-OCR series, was released recently. The novelty lies not only in the model itself, but also in the redesigned vision encoder: DeepEncoder V2 introduces visual causal flow, which can dynamically order visual tokens. We will discuss this in detail later in the article. This article covers the most important aspects of the DeepSeek-OCR 2 paper and tries to understand how the architecture is built.
DeepSeek-OCR 2 was introduced by authors from DeepSeek in the paper titled DeepSeek-OCR 2: Visual Causal Flow. Along with understanding the theoretical parts of the paper, we will also go through the code from the Hugging Face repository. This will give us a better overview of the model architecture.
We will cover the following while discussing the DeepSeek-OCR 2 paper:
- The new components of the DeepSeek-OCR 2 model, e.g., the DeepEncoder V2.
- The overall architecture of the model.
- How DeepSeek-OCR 2 stacks up against other models.
- Covering the architecture code details from Hugging Face.
This is a follow-up to the previous article, where we discussed inference using DeepSeek-OCR 2. We covered the following in that article:
- Creating a CLI-runnable script for PDF and image OCR, supporting both BF16 and INT4 formats.
- A Gradio app with markdown rendering and carrying out OCR on PDF batches.
DeepEncoder V2: Visual Causal Flow
DeepSeek-OCR 2 introduces a fundamental shift in how visual information is processed inside a vision-language model. The new architecture does not treat the vision encoder as a static feature extractor. Rather, the model equips it with causal reasoning capabilities through a redesigned encoder known as DeepEncoder V2.
Traditional vision encoders flatten 2D image patches into a 1D sequence using a fixed raster-scan order (top-left to bottom-right). While effective for natural images, this introduces an artificial inductive bias for structured, text-heavy documents, where reading order and semantic layout matter more than spatial position. DeepEncoder V2 is capable of dynamically determining the order in which visual tokens should be read, based on image content.
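To see why raster-scan flattening is a poor fit for documents, consider a toy two-column layout. This is purely an illustration (not DeepSeek code): patch labels stand in for patch embeddings, and the two flattening orders diverge exactly where reading order matters.

```python
import numpy as np

# Toy illustration: a 3x2 grid of patch labels standing in for a
# two-column document. Left column = "L*", right column = "R*".
labels = np.array([
    ["L0", "R0"],
    ["L1", "R1"],
    ["L2", "R2"],
])

# Fixed raster-scan flattening interleaves the two columns...
raster = labels.flatten().tolist()

# ...while the natural reading order finishes the left column first.
reading = labels.flatten(order="F").tolist()

print(raster)   # ['L0', 'R0', 'L1', 'R1', 'L2', 'R2']
print(reading)  # ['L0', 'L1', 'L2', 'R0', 'R1', 'R2']
```

A fixed raster scan can never recover the second ordering; DeepEncoder V2's causal flow queries let the model learn it from content instead.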
The key idea is visual causal flow. DeepEncoder V2 augments standard visual tokens with a parallel set of learnable causal flow query tokens. These queries are processed autoregressively, just like tokens in a language model, while still having full access to all visual tokens. As a result, the encoder itself learns a semantic reading sequence before any decoding begins.
This design allows the encoder to:
- Reorder visual information dynamically
- Preserve global visual context
- Align visual token processing with the causal nature of LLM decoding
Unlike earlier approaches (e.g., CLIP-based encoders or bidirectional Q-formers), the causal flow queries impose a directional structure on visual understanding.
Overall Architecture of DeepSeek-OCR 2
At a high level, DeepSeek-OCR 2 retains the encoder-decoder structure.
The architecture consists of three major stages:
- Vision Tokenizer: First, a SAM-based vision tokenizer, which includes convolutional components, processes the input image. This stage performs a 16x visual token compression, significantly reducing computational cost while preserving spatial and semantic information.
- DeepEncoder V2 (LLM-style Vision Encoder): The model then passes the compressed visual tokens into DeepEncoder V2. Instead of a pure vision transformer, DeepEncoder V2 uses a decoder-only language model architecture with a custom attention mask:
- Visual tokens attend bidirectionally (ViT-style)
- Causal flow queries attend autoregressively (LLM-style). The authors use the Qwen2 500M decoder LM for this.
- Visual tokens are prepended as a prefix, while causal queries are appended as a suffix. Each causal query token can attend to all visual tokens and all previous causal queries, allowing progressive semantic reordering.
- MoE Decoder (DeepSeek-3B): The architecture then forwards the causal flow queries to the decoder, the DeepSeek-3B Mixture-of-Experts LLM from DeepSeek-OCR, with roughly 500M active parameters. This ensures that performance gains primarily come from encoder-side reasoning rather than brute-force decoding capacity.
This cascade creates a two-stage causal reasoning pipeline:
- Stage 1: Causal reasoning over visual perception (encoder)
- Stage 2: Causal reasoning over language and structure (decoder)
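To get a feel for what 16x visual token compression buys, here is a back-of-the-envelope token budget. The input resolution and patch size below are illustrative assumptions, not values taken from the DeepSeek-OCR 2 paper.

```python
# Back-of-the-envelope token budget for the 16x compression stage.
# The input resolution and patch size are illustrative assumptions,
# not values from the DeepSeek-OCR 2 paper.
image_size = 1024                               # assumed square input
patch_size = 16                                 # assumed ViT patch size

raw_tokens = (image_size // patch_size) ** 2    # 64 x 64 = 4096 patches
compressed_tokens = raw_tokens // 16            # 16x compression

print(raw_tokens, compressed_tokens)  # 4096 256
```

Under these assumed numbers, the decoder would see a few hundred visual tokens per page instead of several thousand, which is where most of the computational savings come from.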
How DeepSeek-OCR 2 Compares with Other OCR and VLM Models
One of the most striking aspects of DeepSeek-OCR 2 is that its performance gains do not come from increasing the number of visual tokens or scaling model size aggressively. Instead, it focuses on better token ordering and representation.

Compared to:
- Traditional OCR pipelines (PaddleOCR, MinerU, PP-Structure)
- End-to-end VLMs (GPT-4o, Gemini, Qwen-VL, InternVL)
- Previous DeepSeek-OCR versions
DeepSeek-OCR 2 achieves:
- Higher accuracy with fewer visual tokens
- Significantly improved reading order edit distance
- Lower repetition rates in production settings
This is particularly important for document OCR, where layout understanding (tables, formulas, multi-column text) often dominates overall performance. The model demonstrates that semantic ordering matters more than raw resolution or token count.
Another important distinction is architectural philosophy. While many VLMs rely on extremely large token budgets and heavy decoder-side reasoning, DeepSeek-OCR 2 places the main learning burden on the encoder.
I highly recommend going through Section 5 of the paper to understand how the model performs against others, which covers the benchmark in detail.
Understanding the Hugging Face Codebase
This section focuses on the actual Hugging Face implementation of DeepSeek-OCR 2 and how the ideas described in the paper materialize in code.
High-Level Structure
At the top level, the Hugging Face model entry point is:
- DeepseekOCR2ForCausalLM: part of modeling_deepseekocr2.py
- Backed by DeepseekOCR2Model: also part of modeling_deepseekocr2.py
The model integrates three major subsystems:
- SAM-based vision tokenizer
- DeepEncoder V2 (LLM-style encoder with causal flow)
- DeepSeek-V2 MoE language decoder
Crucially, there is no separate detection head or coordinate regression module anywhere in modeling_deepseekocr2.py.
DeepEncoder V2 in Code
DeepEncoder V2 is implemented by repurposing a Qwen2 decoder as a vision encoder, rather than designing a new vision transformer.
This happens in:
- build_qwen2_decoder_as_encoder: part of deepencoderv2.py
- Qwen2Decoder2Encoder: part of deepencoderv2.py
- CustomQwen2Decoder: part of deepencoderv2.py
Instead of cross-attention or projection-based fusion, the encoder is built by concatenating tokens and controlling attention via a custom attention mask.
In code:
- Visual tokens => token_type_ids = 0
- Causal flow queries => token_type_ids = 1
The attention mask logic enforces:
- Bidirectional attention among visual tokens
- Causal (autoregressive) attention among query tokens
- Queries can attend to all visual tokens and past queries
It is explicitly implemented in _create_custom_4d_mask() inside the modified Qwen2 forward pass.
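The three rules above can be sketched in a few lines. Note that this is a reconstruction from the paper's description, not the repo's actual _create_custom_4d_mask() code, and it uses numpy for clarity rather than the 4D torch mask the real implementation builds.

```python
import numpy as np

def make_hybrid_mask(token_type_ids):
    """Sketch of the hybrid attention rules (a reconstruction, not repo code).

    token_type_ids: sequence of 0 (visual token) or 1 (causal flow query).
    Returns a boolean (n, n) matrix; True means "row may attend to column".
    """
    ids = np.asarray(token_type_ids)
    is_query = ids.astype(bool)
    is_visual = ~is_query
    n = ids.shape[0]

    # Rule 1: visual tokens attend bidirectionally among themselves.
    visual_to_visual = is_visual[:, None] & is_visual[None, :]

    # Rule 2: queries attend to all visual tokens.
    query_to_visual = is_query[:, None] & is_visual[None, :]

    # Rule 3: queries attend causally (lower triangle) among themselves.
    causal = np.tril(np.ones((n, n), dtype=bool))
    query_to_query = is_query[:, None] & is_query[None, :] & causal

    return visual_to_visual | query_to_visual | query_to_query

# Three visual tokens followed by two causal flow queries.
mask = make_hybrid_mask([0, 0, 0, 1, 1])
print(mask.astype(int))
```

Running this, the first three rows (visual tokens) attend only to each other bidirectionally, while the last two rows (queries) attend to all visual tokens plus themselves and earlier queries, exactly the prefix/suffix behavior described above.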
Token Flow: Where “Visual Causal Flow” Actually Happens
The visual processing pipeline in code is:
- SAM ViT backbone: Produces spatial feature maps
- Downsampling + projection: Features are projected to the LLM embedding space (896 => 1280)
- Query injection: Learnable queries (query_768 or query_1024) are appended
- Causal reordering: Qwen2-style causal attention produces an ordered sequence
- Query-only output: Only the causal query outputs are forwarded to the decoder
In Qwen2Decoder2Encoder.forward():
x_combined = concat(visual_tokens, learnable_queries)
y = decoder(x_combined, token_type_ids)
y = y[:, -n_query:, :]  # keep only the causal flow queries (the suffix)
This exactly mirrors the paper’s claim: only the reordered query tokens are passed downstream, and the decoder never sees the raw spatial tokens.
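The shape bookkeeping of this prefix/suffix flow can be traced with a toy sketch. The numbers are assumptions suggested by the article (768 learnable queries, 1280-dim embeddings, and a few hundred compressed visual tokens), not values read out of the repository.

```python
import numpy as np

# Toy shapes for the query-only output. 768 queries and hidden size 1280
# follow the article's query_768 / 1280-dim projection; 256 visual tokens
# is an assumption for illustration.
batch, n_visual, n_query, hidden = 1, 256, 768, 1280

visual_tokens = np.zeros((batch, n_visual, hidden))
learnable_queries = np.zeros((batch, n_query, hidden))

# Visual prefix + causal query suffix, as in Qwen2Decoder2Encoder.forward().
x_combined = np.concatenate([visual_tokens, learnable_queries], axis=1)

# Stand-in for the encoder forward pass (identity here), then keep the suffix.
y = x_combined[:, -n_query:, :]

print(x_combined.shape)  # (1, 1024, 1280)
print(y.shape)           # (1, 768, 1280)
```

Whatever the exact token counts, the decoder only ever receives the n_query suffix positions, never the visual prefix.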
The Important Part: No Coordinate Decoder
“The model does not contain any coordinate decoder.”
There is no bounding-box head, no regression layer, no DETR-style decoder, and no spatial prediction module anywhere in the architecture.
Instead:
- Bounding boxes appear only in the generated text output
- Coordinates are emitted as normalized values inside special tokens like:
<|ref|>title<|/ref|><|det|>[[x1,y1,x2,y2], ...]<|/det|> - These are language tokens, not model outputs from a detection head
This is visible in:
- The regex-based post-processing (re_match, extract_coordinates_and_label)
- The bounding boxes are parsed from text strings, not tensors, in modeling_deepseekocr2.py
So the model learns to describe geometry, not predict it numerically.
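A simplified stand-in shows what "parsing geometry from text" looks like in practice. The repo's actual helpers are re_match and extract_coordinates_and_label; the regex pattern and sample string below are my own illustration, not copied from the repository.

```python
import ast
import re

# Illustrative stand-in for the repo's regex-based post-processing
# (the real helpers are re_match / extract_coordinates_and_label).
sample = "<|ref|>title<|/ref|><|det|>[[120, 80, 860, 140]]<|/det|>"

pattern = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"          # the region label
    r"<\|det\|>(?P<boxes>\[\[.*?\]\])<\|/det\|>"  # the box list, as text
)

match = pattern.search(sample)
label = match.group("label")
# Coordinates arrive as plain language tokens, so we parse the text literal.
boxes = ast.literal_eval(match.group("boxes"))

print(label, boxes)  # title [[120, 80, 860, 140]]
```

The key point survives any difference in the exact regex: the boxes exist only as generated text until post-processing turns them into numbers.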
One Important Nuance
DeepSeek-OCR 2 learns spatial localization implicitly, by conditioning the visual encoder and decoder on textual coordinate annotations during training, rather than via an explicit coordinate prediction module.
In other words:
- Coordinates are part of the language modeling objective
- The model learns spatial grounding as a sequence generation problem
- Visual causal flow helps the model decide which region to talk about next, not where to regress a box
This explains why:
- The encoder never outputs (x, y) tensors
- The decoder never performs geometric reasoning explicitly
- Layout understanding emerges from ordered perception + supervised text
Why This Matters Architecturally
This makes DeepSeek-OCR 2 fundamentally different from:
- DETR-style document parsers
- LayoutLM-style models
- Any OCR system with explicit detection heads
DeepSeek-OCR 2 treats document parsing as a causal language problem grounded in visual perception, not a detection problem with language attached.
What This Means Going Forward
DeepSeek-OCR 2 should not be viewed purely as an OCR model. While the paper evaluates it extensively on document understanding benchmarks, the authors’ true architectural contribution lies elsewhere: DeepEncoder V2.
At no point do the authors claim that DeepSeek-OCR 2 represents the final or optimal formulation of OCR. Instead, the model serves as a proof of concept for a more general idea:
a vision encoder that performs causal, ordered reasoning over visual inputs using an LLM-style attention mechanism.
DeepEncoder V2 is not intrinsically tied to text recognition or document layouts. It is, fundamentally, a general-purpose vision–language encoder that:
- Accepts dense visual tokens
- Introduces learnable causal queries
- Produces an ordered semantic representation through autoregressive attention
This makes OCR a convenient training domain, not a limiting one.
Why OCR is a Particularly Good Starting Point
Document OCR naturally benefits from:
- Strong latent reading order
- Explicit spatial supervision (via textual coordinates)
- Clear alignment between perception and language output
These properties make OCR an ideal environment to validate visual causal flow. However, nothing in the DeepEncoder V2 architecture restricts it to documents alone.
Implications for Future Applications
DeepEncoder V2 is trained end-to-end using only language supervision. This means, in principle, we can adapt the same architecture to other vision-language tasks simply by changing the training data:
- Image captioning / dense image description: Ordered causal queries can learn narrative flow across regions of an image.
- General object detection: The model can emit objects and regions as structured language (as in OCR), without introducing explicit detection heads.
- Scene understanding and grounded reasoning: We can train the causal flow to prioritize semantically important regions before fine-grained description.
- Document layout understanding beyond OCR: Tables, forms, and diagrams already benefit from the same ordered perception mechanism.
Crucially, we can frame all of these tasks as sequence generation problems. This allows spatial reasoning to emerge implicitly rather than being hard-coded through geometric heads.
What this Does Not Automatically Guarantee
It is important to be precise about what DeepEncoder V2 enables versus what it ensures.
- DeepEncoder V2 does not magically outperform specialized detectors without task-appropriate supervision.
- Spatial precision is still learned through language supervision, not explicit geometric loss functions.
- Performance in non-OCR domains depends heavily on how well the training data encodes spatial structure in text.
In other words, DeepEncoder V2 is architecture-general, not task-agnostic.
Takeaway
With all of the above in mind, we hope to see more generalized benchmarks for such architectures in the future.
At DebuggerCafe, we aim to explore some of these directions through open-ended coding experiments. Whether these experiments succeed or fail is secondary. What matters more is developing a deeper, hands-on understanding of what this new architectural idea can and cannot do, and where its strengths and limitations truly lie.
Progress, in this case, is less about chasing scores and more about stress-testing a new way of thinking about visual understanding.
Summary and Conclusion
In this article, we discussed the paper and Hugging Face code of DeepSeek-OCR 2 in detail. Starting from the general discussion of DeepEncoder V2 to understanding the “why and where” in code, we covered a lot. We will surely try to cover some coding and training experiments in the future.
If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.
You can contact me using the Contact section. You can also find me on LinkedIn, and X.