DeepSeek-OCR 2, the latest model in the DeepSeek-OCR series, was released recently. The novelty lies not only in the model itself, but also in the redesigned vision encoder: DeepEncoder V2 introduces visual causal flow, which can dynamically order visual tokens. We will discuss this in detail later in the article. This article covers the most important aspects of the DeepSeek-OCR 2 paper and tries to understand how the architecture is built.
DeepSeek-OCR 2 was introduced by authors from DeepSeek in the paper titled DeepSeek-OCR 2: Visual Causal Flow. Along with understanding the theoretical parts of the paper, we will also go through the code from the Hugging Face repository. This will give us a better overview of the model architecture.
We will cover the following while discussing the DeepSeek-OCR 2 paper:
- The new components of the DeepSeek-OCR 2 model, e.g., the DeepEncoder V2.
- The overall architecture of the model.
- How DeepSeek-OCR 2 stacks up against other models.
- Covering the architecture code details from Hugging Face.
This is a follow-up to the previous article, where we discussed inference using DeepSeek-OCR 2. We covered the following in that article:
- Creating a CLI-runnable script for PDF and image OCR, supporting both BF16 and INT4 formats.
- A Gradio app with markdown rendering and carrying out OCR on PDF batches.
DeepEncoder V2: Visual Causal Flow
DeepSeek-OCR 2 introduces a fundamental shift in how visual information is processed inside a vision-language model. The new architecture does not treat the vision encoder as a static feature extractor. Rather, the model equips it with causal reasoning capabilities through a redesigned encoder known as DeepEncoder V2.
Traditional vision encoders flatten 2D image patches into a 1D sequence using a fixed raster-scan order (top-left to bottom-right). While effective for natural images, this introduces an artificial inductive bias for structured, text-heavy documents, where reading order and semantic layout matter more than spatial position. DeepEncoder V2 is capable of dynamically determining the order in which visual tokens should be read, based on image content.
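To see why raster-scan flattening is a poor fit for documents, consider a toy two-column layout. This is purely an illustration (not DeepSeek code): patch labels stand in for patch embeddings, and the two flattening orders diverge exactly where reading order matters.

```python
import numpy as np

# Toy illustration: a 3x2 grid of patch labels standing in for a
# two-column document. Left column = "L*", right column = "R*".
labels = np.array([
    ["L0", "R0"],
    ["L1", "R1"],
    ["L2", "R2"],
])

# Fixed raster-scan flattening interleaves the two columns...
raster = labels.flatten().tolist()

# ...while the natural reading order finishes the left column first.
reading = labels.flatten(order="F").tolist()

print(raster)   # ['L0', 'R0', 'L1', 'R1', 'L2', 'R2']
print(reading)  # ['L0', 'L1', 'L2', 'R0', 'R1', 'R2']
```

A fixed raster scan can never recover the second ordering; DeepEncoder V2's causal flow queries let the model learn it from content instead.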
The key idea is visual causal flow. DeepEncoder V2 augments standard visual tokens with a parallel set of learnable causal flow query tokens. These queries are processed autoregressively, just like tokens in a language model, while still having full access to all visual tokens. As a result, the encoder itself learns a semantic reading sequence before any decoding begins.
This design allows the encoder to:
- Reorder visual information dynamically
- Preserve global visual context
- Align visual token processing with the causal nature of LLM decoding
Unlike earlier approaches (e.g., CLIP-based encoders or bidirectional Q-formers), the causal flow queries impose a directional structure on visual understanding.
Overall Architecture of DeepSeek-OCR 2
At a high level, DeepSeek-OCR 2 retains the encoder-decoder structure.
The architecture consists of three major stages:
- Vision Tokenizer: First, a SAM-based vision tokenizer, which includes convolutional components, processes the input image. This stage performs a 16x visual token compression, significantly reducing computational cost while preserving spatial and semantic information.
- DeepEncoder V2 (LLM-style Vision Encoder): The model then passes the compressed visual tokens into DeepEncoder V2. Instead of a pure vision transformer, DeepEncoder V2 uses a decoder-only language model architecture with a custom attention mask:
- Visual tokens attend bidirectionally (ViT-style)
- Causal flow queries attend autoregressively (LLM-style). The authors use the Qwen2 500M decoder LM for this.
- Visual tokens are prepended as a prefix, while causal queries are appended as a suffix. Each causal query token can attend to all visual tokens and all previous causal queries, allowing progressive semantic reordering.
- MoE Decoder (DeepSeek-3B): The architecture then forwards the causal flow queries to the decoder, the DeepSeek-3B Mixture-of-Experts LLM from DeepSeek-OCR, with roughly 500M active parameters. This ensures that performance gains primarily come from encoder-side reasoning rather than brute-force decoding capacity.
This cascade creates a two-stage causal reasoning pipeline:
- Stage 1: Causal reasoning over visual perception (encoder)
- Stage 2: Causal reasoning over language and structure (decoder)
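To get a feel for what 16x visual token compression buys, here is a back-of-the-envelope token budget. The input resolution and patch size below are illustrative assumptions, not values taken from the DeepSeek-OCR 2 paper.

```python
# Back-of-the-envelope token budget for the 16x compression stage.
# The input resolution and patch size are illustrative assumptions,
# not values from the DeepSeek-OCR 2 paper.
image_size = 1024                               # assumed square input
patch_size = 16                                 # assumed ViT patch size

raw_tokens = (image_size // patch_size) ** 2    # 64 x 64 = 4096 patches
compressed_tokens = raw_tokens // 16            # 16x compression

print(raw_tokens, compressed_tokens)  # 4096 256
```

Under these assumed numbers, the decoder would see a few hundred visual tokens per page instead of several thousand, which is where most of the computational savings come from.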
How DeepSeek-OCR 2 Compares with Other OCR and VLM Models
One of the most striking aspects of DeepSeek-OCR 2 is that its performance gains do not come from increasing the number of visual tokens or scaling model size aggressively. Instead, it focuses on better token ordering and representation.

Compared to:
- Traditional OCR pipelines (PaddleOCR, MinerU, PP-Structure)
- End-to-end VLMs (GPT-4o, Gemini, Qwen-VL, InternVL)
- Previous DeepSeek-OCR versions
DeepSeek-OCR 2 achieves:
- Higher accuracy with fewer visual tokens
- Significantly improved reading order edit distance
- Lower repetition rates in production settings
This is particularly important for document OCR, where layout understanding (tables, formulas, multi-column text) often dominates overall performance. The model demonstrates that semantic ordering matters more than raw resolution or token count.
Another important distinction is architectural philosophy. While many VLMs rely on extremely large token budgets and heavy decoder-side reasoning, DeepSeek-OCR 2 places the main learning burden on the encoder.
I highly recommend going through Section 5 of the paper to understand how the model performs against others, which covers the benchmark in detail.
Understanding the Hugging Face Codebase
This section focuses on the actual Hugging Face implementation of DeepSeek-OCR 2 and how the ideas described in the paper materialize in code.
High-Level Structure
At the top level, the Hugging Face model entry point is:
- DeepseekOCR2ForCausalLM: part of modeling_deepseekocr2.py
- Backed by DeepseekOCR2Model: also part of modeling_deepseekocr2.py
The model integrates three major subsystems:
- SAM-based vision tokenizer
- DeepEncoder V2 (LLM-style encoder with causal flow)
- DeepSeek-V2 MoE language decoder
Crucially, there is no separate detection head or coordinate regression module anywhere in modeling_deepseekocr2.py.
DeepEncoder V2 in Code
DeepEncoder V2 is implemented by repurposing a Qwen2 decoder as a vision encoder, rather than designing a new vision transformer.
This happens in:
- build_qwen2_decoder_as_encoder: part of deepencoderv2.py
- Qwen2Decoder2Encoder: part of deepencoderv2.py
- CustomQwen2Decoder: part of deepencoderv2.py
Instead of cross-attention or projection-based fusion, the encoder is built by concatenating tokens and controlling attention via a custom attention mask.
In code:
- Visual tokens => token_type_ids = 0
- Causal flow queries => token_type_ids = 1
The attention mask logic enforces:
- Bidirectional attention among visual tokens
- Causal (autoregressive) attention among query tokens
- Queries can attend to all visual tokens and past queries
It is explicitly implemented in _create_custom_4d_mask() inside the modified Qwen2 forward pass.
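The three rules above can be sketched in a few lines. Note that this is a reconstruction from the paper's description, not the repo's actual _create_custom_4d_mask() code, and it uses numpy for clarity rather than the 4D torch mask the real implementation builds.

```python
import numpy as np

def make_hybrid_mask(token_type_ids):
    """Sketch of the hybrid attention rules (a reconstruction, not repo code).

    token_type_ids: sequence of 0 (visual token) or 1 (causal flow query).
    Returns a boolean (n, n) matrix; True means "row may attend to column".
    """
    ids = np.asarray(token_type_ids)
    is_query = ids.astype(bool)
    is_visual = ~is_query
    n = ids.shape[0]

    # Rule 1: visual tokens attend bidirectionally among themselves.
    visual_to_visual = is_visual[:, None] & is_visual[None, :]

    # Rule 2: queries attend to all visual tokens.
    query_to_visual = is_query[:, None] & is_visual[None, :]

    # Rule 3: queries attend causally (lower triangle) among themselves.
    causal = np.tril(np.ones((n, n), dtype=bool))
    query_to_query = is_query[:, None] & is_query[None, :] & causal

    return visual_to_visual | query_to_visual | query_to_query

# Three visual tokens followed by two causal flow queries.
mask = make_hybrid_mask([0, 0, 0, 1, 1])
print(mask.astype(int))
```

Running this, the first three rows (visual tokens) attend only to each other bidirectionally, while the last two rows (queries) attend to all visual tokens plus themselves and earlier queries, exactly the prefix/suffix behavior described above.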
Token Flow: Where “Visual Causal Flow” Actually Happens
The visual processing pipeline in code is:
- SAM ViT backbone: Produces spatial feature maps
- Downsampling + projection: Features are projected to the LLM embedding space (896 => 1280)
- Query injection: Learnable queries (query_768 or query_1024) are appended
- Causal reordering: Qwen2-style causal attention produces an ordered sequence
- Query-only output: Only the causal query outputs are forwarded to the decoder
In Qwen2Decoder2Encoder.forward():
x_combined = concat(visual_tokens, learnable_queries)
y = decoder(x_combined, token_type_ids)
y = y[:, -n_query:, :]  # keep only the causal flow queries (the suffix)
This exactly mirrors the paper’s claim: only the reordered query tokens are passed downstream, and the decoder never sees the raw spatial tokens.
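The shape bookkeeping of this prefix/suffix flow can be traced with a toy sketch. The numbers are assumptions suggested by the article (768 learnable queries, 1280-dim embeddings, and a few hundred compressed visual tokens), not values read out of the repository.

```python
import numpy as np

# Toy shapes for the query-only output. 768 queries and hidden size 1280
# follow the article's query_768 / 1280-dim projection; 256 visual tokens
# is an assumption for illustration.
batch, n_visual, n_query, hidden = 1, 256, 768, 1280

visual_tokens = np.zeros((batch, n_visual, hidden))
learnable_queries = np.zeros((batch, n_query, hidden))

# Visual prefix + causal query suffix, as in Qwen2Decoder2Encoder.forward().
x_combined = np.concatenate([visual_tokens, learnable_queries], axis=1)

# Stand-in for the encoder forward pass (identity here), then keep the suffix.
y = x_combined[:, -n_query:, :]

print(x_combined.shape)  # (1, 1024, 1280)
print(y.shape)           # (1, 768, 1280)
```

Whatever the exact token counts, the decoder only ever receives the n_query suffix positions, never the visual prefix.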
The Important Part: No Coordinate Decoder
“The model does not contain any coordinate decoder.”
There is no bounding-box head, no regression layer, no DETR-style decoder, and no spatial prediction module anywhere in the architecture.
Instead:
- Bounding boxes appear only in the generated text output
- Coordinates are emitted as normalized values inside special tokens like:
<|ref|>title<|/ref|><|det|>[[x1,y1,x2,y2], ...]<|/det|> - These are language tokens, not model outputs from a detection head
This is visible in:
- The regex-based post-processing (re_match, extract_coordinates_and_label)
- The bounding boxes are parsed from text strings, not tensors, in modeling_deepseekocr2.py
So the model learns to describe geometry, not predict it numerically.
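A simplified stand-in shows what "parsing geometry from text" looks like in practice. The repo's actual helpers are re_match and extract_coordinates_and_label; the regex pattern and sample string below are my own illustration, not copied from the repository.

```python
import ast
import re

# Illustrative stand-in for the repo's regex-based post-processing
# (the real helpers are re_match / extract_coordinates_and_label).
sample = "<|ref|>title<|/ref|><|det|>[[120, 80, 860, 140]]<|/det|>"

pattern = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"          # the region label
    r"<\|det\|>(?P<boxes>\[\[.*?\]\])<\|/det\|>"  # the box list, as text
)

match = pattern.search(sample)
label = match.group("label")
# Coordinates arrive as plain language tokens, so we parse the text literal.
boxes = ast.literal_eval(match.group("boxes"))

print(label, boxes)  # title [[120, 80, 860, 140]]
```

The key point survives any difference in the exact regex: the boxes exist only as generated text until post-processing turns them into numbers.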
One Important Nuance
DeepSeek-OCR 2 learns spatial localization implicitly, by conditioning the visual encoder and decoder on textual coordinate annotations during training, rather than via an explicit coordinate prediction module.
In other words:
- Coordinates are part of the language modeling objective
- The model learns spatial grounding as a sequence generation problem
- Visual causal flow helps the model decide which region to talk about next, not where to regress a box
This explains why:
- The encoder never outputs (x, y) tensors
- The decoder never performs geometric reasoning explicitly
- Layout understanding emerges from ordered perception + supervised text
Why This Matters Architecturally
This makes DeepSeek-OCR 2 fundamentally different from:
- DETR-style document parsers
- LayoutLM-style models
- Any OCR system with explicit detection heads
DeepSeek-OCR 2 treats document parsing as a causal language problem grounded in visual perception, not a detection problem with language attached.
What This Means Going Forward
DeepSeek-OCR 2 should not be viewed purely as an OCR model. While the paper evaluates it extensively on document understanding benchmarks, the authors’ true architectural contribution lies elsewhere: DeepEncoder V2.
At no point do the authors claim that DeepSeek-OCR 2 represents the final or optimal formulation of OCR. Instead, the model serves as a proof of concept for a more general idea:
a vision encoder that performs causal, ordered reasoning over visual inputs using an LLM-style attention mechanism.
DeepEncoder V2 is not intrinsically tied to text recognition or document layouts. It is, fundamentally, a general-purpose vision–language encoder that:
- Accepts dense visual tokens
- Introduces learnable causal queries
- Produces an ordered semantic representation through autoregressive attention
This makes OCR a convenient training domain, not a limiting one.
Why OCR is a Particularly Good Starting Point
Document OCR naturally benefits from:
- Strong latent reading order
- Explicit spatial supervision (via textual coordinates)
- Clear alignment between perception and language output
These properties make OCR an ideal environment to validate visual causal flow. However, nothing in the DeepEncoder V2 architecture restricts it to documents alone.
Implications for Future Applications
DeepEncoder V2 is trained end-to-end using only language supervision. This means, in principle, we can adapt the same architecture to other vision-language tasks simply by changing the training data:
- Image captioning / dense image description: Ordered causal queries can learn narrative flow across regions of an image.
- General object detection: The model can emit objects and regions as structured language (as in OCR), without introducing explicit detection heads.
- Scene understanding and grounded reasoning: We can train the causal flow to prioritize semantically important regions before fine-grained description.
- Document layout understanding beyond OCR: Tables, forms, and diagrams already benefit from the same ordered perception mechanism.
Crucially, we can frame all of these tasks as sequence generation problems. This allows spatial reasoning to emerge implicitly rather than being hard-coded through geometric heads.
What this Does Not Automatically Guarantee
It is important to be precise about what DeepEncoder V2 enables versus what it ensures.
- DeepEncoder V2 does not magically outperform specialized detectors without task-appropriate supervision.
- Spatial precision is still learned through language supervision, not explicit geometric loss functions.
- Performance in non-OCR domains depends heavily on how well the training data encodes spatial structure in text.
In other words, DeepEncoder V2 is architecture-general, not task-agnostic.
Takeaway
With all of the above in mind, we hope to see more generalized benchmarks for such architectures in the future.
At DebuggerCafe, we aim to explore some of these directions through open-ended coding experiments. Whether these experiments succeed or fail is secondary. What matters more is developing a deeper, hands-on understanding of what this new architectural idea can and cannot do, and where its strengths and limitations truly lie.
Progress, in this case, is less about chasing scores and more about stress-testing a new way of thinking about visual understanding.
Summary and Conclusion
In this article, we discussed the paper and Hugging Face code of DeepSeek-OCR 2 in detail. Starting from the general discussion of DeepEncoder V2 to understanding the “why and where” in code, we covered a lot. We will surely try to cover some coding and training experiments in the future.
If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.
You can contact me using the Contact section. You can also find me on LinkedIn, and X.