
Support DeepSeek-OCR-2 in SGLang (OCR2 vision pipeline, tokenization alignment, and weight loading fixes) #17833 #17897

Merged
mickqian merged 14 commits into sgl-project:main from baonudesifeizhai:supportocr2
Jan 30, 2026

Conversation

@baonudesifeizhai
Contributor

@baonudesifeizhai baonudesifeizhai commented Jan 28, 2026

Motivation

#17833
  • Config mismatch: OCR2 models were detected as a DeepseekVL2Config lacking text_config, which broke model initialization. We now explicitly load and override with the DeepSeek OCR config so that the correct model class and fields are used.
  • MLA vs. non-MLA attention: OCR2's language config uses non-MLA attention, which crashed MLA-only code paths (e.g., with a ZeroDivisionError). We route OCR2 to DeepseekForCausalLM (non-MLA) to avoid MLA-specific assumptions.
  • Vision pipeline differences: OCR2 uses a SAM + Qwen2 decoder-as-encoder path and different projector dimensions. We added OCR2-specific branches for the SAM output channels, the Qwen2 decoder-as-encoder, and the projector dimensions.
  • Weight name mismatch: the weights of OCR2's Qwen2 submodule use different name prefixes and do not fit the stacked-params mapping. We added a flexible name mapping and direct loading for qwen2_model.*.
  • Multimodal token count mismatch: OCR2 does not use OCR1's per-row newline tokens. We added an ocr2_mode tokenization path so that token counts match the embeddings.
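
The weight-name item above can be illustrated with a small routing function. This is a minimal sketch, not the loader from this PR: `STACKED_PARAMS`, `route_weight`, and the example checkpoint names are hypothetical, and the only behavior taken from the description is that `qwen2_model.*` weights bypass the stacked-params mapping and are loaded directly.

```python
# Hypothetical sketch of weight-name routing; not SGLang's actual loader.
# Each entry: (fused parameter name, checkpoint shard name, shard id).
STACKED_PARAMS = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]


def route_weight(name: str):
    """Return (target_param_name, shard_id); shard_id is None for a direct load."""
    # Assumption: vision-side Qwen2 weights keep their checkpoint layout,
    # so they skip the stacked-params mapping entirely.
    if "qwen2_model." in name:
        return name, None
    # Language-model weights: fold per-shard names into the fused parameter.
    for fused, shard, shard_id in STACKED_PARAMS:
        if f".{shard}." in name:
            return name.replace(shard, fused), shard_id
    # Everything else is loaded under its own name.
    return name, None
```

With this routing, a name such as `model.layers.0.self_attn.q_proj.weight` maps to the fused `qkv_proj` parameter with shard id `"q"`, while any name containing `qwen2_model.` is returned unchanged.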

Why CustomQwen2Decoder and Qwen2Decoder2Encoder

  • OCR2 needs a decoder-as-encoder with a custom attention pattern: image tokens attend bidirectionally, while query tokens are causal but can attend to all image tokens.
  • SGLang's native Qwen2 path is optimized for standard autoregressive causal attention and does not accept token_type_ids or a custom 4D mask.
  • CustomQwen2Decoder wraps HF Qwen2 to inject the custom mask via token_type_ids.
  • Qwen2Decoder2Encoder creates the learnable query tokens (144 or 256, with interpolation for other sizes) and concatenates them with image features to produce OCR2 vision tokens.
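
The attention pattern described above can be sketched as a boolean mask built from token type ids. This is an illustration under stated assumptions, not the code in CustomQwen2Decoder: the function name `build_mixed_mask` and the encoding (0 for image tokens, 1 for query tokens) are hypothetical, and the sketch assumes that image tokens do not attend to query tokens.

```python
def build_mixed_mask(token_type_ids):
    """Build mask[i][j] = True if token i may attend to token j.

    Assumed encoding (hypothetical): 0 = image token, 1 = query token.
    """
    n = len(token_type_ids)
    mask = [[False] * n for _ in range(n)]
    for i, type_i in enumerate(token_type_ids):
        for j, type_j in enumerate(token_type_ids):
            if type_j == 0:
                # Image tokens are visible to every token: bidirectional
                # among images, and every query sees all images.
                mask[i][j] = True
            elif type_i == 1 and j <= i:
                # Query tokens see themselves and earlier queries (causal).
                mask[i][j] = True
    return mask


# Two image tokens followed by two query tokens.
for row in build_mixed_mask([0, 0, 1, 1]):
    print(["x" if allowed else "." for allowed in row])
```

In a real model the boolean grid would typically be expanded into a 4D additive mask (batch, heads, query, key); that step is omitted here.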

Modifications

Accuracy Tests

Benchmarking and Profiling

 python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-OCR-2 \
  --enable-multimodal \
  --host 0.0.0.0 \
  --port 30000
 curl -sS http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-OCR-2",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe the image." },
          { "type": "image_url", "image_url": { "url": "https://images.unsplash.com/photo-1522202176988-66273c2fd55f?w=1024" } }
        ]
      }
    ],
    "max_tokens": 256,
    "temperature": 0
  }'
{"id":"e46038e256c34da883c04b25798a0870","object":"chat.completion","created":1769633877,"model":"deepseek-ai/DeepSeek-OCR-2","choices":[{"index":0,"message":{"role":"assistant","content":"Four women seated around a wooden table in what appears to be a cozy café setting. They are engaged in conversation, each holding a laptop or tablet. The woman on the left is wearing a white sweater and has blonde hair; she holds a glass of water. Next to her, another woman with long brown hair, dressed in a light-colored top, also holds a laptop. Across from them, a woman with dark hair tied back, wearing glasses and a denim jacket over a red shirt, smiles while looking at her screen. To her right, another woman with short black hair, donning a patterned scarf and a red top, looks on with a smile. On the table, there are various items including notebooks, pens, a smartphone, and a small plant. The background suggests an indoor environment with soft lighting and neutral tones.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":1}],"usage":{"prompt_tokens":863,"total_tokens":1031,"completion_tokens":168,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
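
For scripted checks, the same request can be sent from Python. The following is a minimal sketch using only the standard library; `build_payload` and `chat` are hypothetical helper names, and the default base URL assumes the server was launched on port 30000 as in the command above.

```python
import json
import urllib.request


def build_payload(image_url, prompt="Describe the image.",
                  model="deepseek-ai/DeepSeek-OCR-2"):
    """Build the same chat-completions body as the curl request."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 256,
        "temperature": 0,
    }


def chat(payload, base_url="http://127.0.0.1:30000"):
    """POST the payload and return the assistant's text (needs a running server)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat(build_payload("<image URL>"))` against a running server should return the `content` string of the first choice, matching the structure of the response shown above.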

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions Bot added the documentation, Multi-modal, and deepseek labels Jan 28, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @baonudesifeizhai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for DeepSeek-OCR-2 within SGLang. It addresses several critical compatibility issues, including configuration mismatches, attention mechanism differences, and unique vision pipeline requirements. The changes ensure that DeepSeek-OCR-2 models can be correctly loaded, processed, and utilized, enabling advanced multimodal capabilities for OCR and document understanding.

Highlights

  • DeepSeek-OCR-2 Support: Full integration of DeepSeek-OCR-2, including its unique vision pipeline and tokenization, enabling its use within SGLang.
  • Configuration Overrides: Implemented explicit loading and overriding of DeepSeek OCR configurations to correctly initialize OCR2 models, addressing DeepseekVL2Config mismatches and ensuring proper model class and field recognition.
  • Attention Mechanism Handling: Routed OCR2 models to DeepseekForCausalLM to bypass MLA-specific assumptions and prevent crashes caused by non-MLA attention in OCR2's language configuration.
  • Custom Vision Pipeline: Introduced OCR2-specific vision processing, which includes a SAM + Qwen2 decoder-as-encoder path and adjusted projector dimensions to match the model's requirements.
  • Weight Loading Fixes: Added flexible weight mapping and direct loading for qwen2_model.* to resolve weight name mismatches encountered in OCR2's Qwen2 submodule.
  • Tokenization Alignment: Implemented ocr2_mode tokenization to align token counts with embeddings, addressing differences from OCR1's per-row newline token usage.
  • New Components: Introduced CustomQwen2Decoder for mixed causal masking and Qwen2Decoder2Encoder for generating learnable query tokens within the OCR2 vision encoder.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for the DeepSeek-OCR-2 model, which is a significant enhancement. The changes cover documentation, model configuration, the core model implementation, and multimodal processing. The implementation correctly addresses the differences between OCR-1 and OCR-2, including the vision pipeline, attention mechanisms, and tokenization. The code also includes necessary workarounds for handling inconsistencies in Hugging Face model configurations. My feedback primarily focuses on improving code maintainability by reducing duplication. I've suggested refactoring a duplicated model detection logic into a shared helper function and simplifying a function by removing repeated code blocks. Overall, the changes are solid and well-executed.

Comment on lines +15 to +22
_processor.ocr2_mode = (
    str(
        getattr(getattr(hf_config, "vision_config", None), "model_name", "")
    ).lower()
    == "deepencoderv2"
    or getattr(getattr(hf_config, "projector_config", None), "input_dim", None)
    == 896
)
Contributor


medium

This logic for detecting if the model is OCR-2 is duplicated in python/sglang/srt/models/deepseek_ocr.py. To improve maintainability and avoid potential inconsistencies, consider creating a shared helper function in python/sglang/srt/utils/hf_transformers_utils.py and using it in both places.

Comment on lines +1420 to +1424
self.is_ocr2 = (
    str(getattr(self.vision_config, "model_name", "")).lower()
    == "deepencoderv2"
    or getattr(self.projector_config, "input_dim", None) == 896
)
Contributor


medium

This logic for detecting if the model is OCR-2 is duplicated in python/sglang/srt/multimodal/processors/deepseek_ocr.py. To improve maintainability and avoid potential inconsistencies, consider creating a shared helper function in python/sglang/srt/utils/hf_transformers_utils.py and using it in both places.
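
One possible shape for the shared helper suggested in the two comments above is sketched below. The function name `is_deepseek_ocr2` is hypothetical, and the PR itself may have resolved the duplication differently; the two checks are the ones quoted in the reviewed snippets (the `deepencoderv2` model name and a projector `input_dim` of 896).

```python
from types import SimpleNamespace


def is_deepseek_ocr2(hf_config) -> bool:
    """Return True if the HF config looks like a DeepSeek-OCR-2 checkpoint.

    Hypothetical helper: it mirrors the two checks duplicated in the
    processor and the model code.
    """
    vision_config = getattr(hf_config, "vision_config", None)
    projector_config = getattr(hf_config, "projector_config", None)
    model_name = str(getattr(vision_config, "model_name", "")).lower()
    return (
        model_name == "deepencoderv2"
        or getattr(projector_config, "input_dim", None) == 896
    )


# Stand-in config objects for illustration only.
ocr2_like = SimpleNamespace(
    vision_config=SimpleNamespace(model_name="DeepEncoderV2"),
    projector_config=SimpleNamespace(input_dim=896),
)
other = SimpleNamespace(vision_config=None, projector_config=None)
print(is_deepseek_ocr2(ocr2_like), is_deepseek_ocr2(other))
```

Both call sites could then assign `_processor.ocr2_mode` and `self.is_ocr2` from this single function, so the detection rule changes in one place only.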

Comment on lines +1538 to +1570
if torch.sum(patches).item() != 0:
    local_features_1 = self.sam_model(patches)
    local_features_2 = self.qwen2_model(local_features_1)
    local_features = self.projector(local_features_2)

    global_features_1 = self.sam_model(image_ori)
    global_features_2 = self.qwen2_model(global_features_1)
    global_features = self.projector(global_features_2)

    local_features = local_features.view(
        -1, local_features.shape[-1]
    )
    global_features = global_features.view(
        -1, global_features.shape[-1]
    )
    global_local_features = torch.cat(
        [
            local_features,
            global_features,
            self.view_seperator[None, :],
        ],
        dim=0,
    )
else:
    global_features_1 = self.sam_model(image_ori)
    global_features_2 = self.qwen2_model(global_features_1)
    global_features = self.projector(global_features_2)
    global_features = global_features.view(
        -1, global_features.shape[-1]
    )
    global_local_features = torch.cat(
        [global_features, self.view_seperator[None, :]], dim=0
    )
Contributor


medium

There's some code duplication in the if/else block. The computation of global_features is repeated. You can refactor this by moving the common code outside the conditional block to improve readability and maintainability.

                    global_features_1 = self.sam_model(image_ori)
                    global_features_2 = self.qwen2_model(global_features_1)
                    global_features = self.projector(global_features_2)
                    global_features = global_features.view(
                        -1, global_features.shape[-1]
                    )

                    if torch.sum(patches).item() != 0:
                        local_features_1 = self.sam_model(patches)
                        local_features_2 = self.qwen2_model(local_features_1)
                        local_features = self.projector(local_features_2)
                        local_features = local_features.view(
                            -1, local_features.shape[-1]
                        )
                        global_local_features = torch.cat(
                            [
                                local_features,
                                global_features,
                                self.view_seperator[None, :],
                            ],
                            dim=0,
                        )
                    else:
                        global_local_features = torch.cat(
                            [global_features, self.view_seperator[None, :]], dim=0
                        )

@baonudesifeizhai
Contributor Author

[image] Works after refactoring.

@yhyang201
Collaborator

/tag-and-rerun-ci

@yhyang201
Collaborator

Awesome work!

@JustinTong0323
Collaborator

/rerun-failed-ci

## Launch server

```shell
python -m sglang.launch_server \
```
Collaborator


nit: use sglang serve in the future

@baonudesifeizhai
Contributor Author

Hmm, what about the test errors? Are they related?

@JustinTong0323
Collaborator

Hmm, what about the test errors? Are they related?

No, most of them are flaky.

@JustinTong0323
Collaborator

/tag-and-rerun-ci

@mickqian mickqian merged commit 84ab611 into sgl-project:main Jan 30, 2026
429 of 461 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Labels

deepseek, documentation, Multi-modal, run-ci

4 participants