Support DeepSeek-OCR-2 in SGLang (OCR2 vision pipeline, tokenization alignment, and weight loading fixes) #17897
Conversation
Summary of Changes

Hello @baonudesifeizhai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces comprehensive support for DeepSeek-OCR-2 within SGLang. It addresses several critical compatibility issues, including configuration mismatches, attention mechanism differences, and unique vision pipeline requirements. The changes ensure that DeepSeek-OCR-2 models can be correctly loaded, processed, and utilized, enabling advanced multimodal capabilities for OCR and document understanding.
Code Review
This pull request adds support for the DeepSeek-OCR-2 model, which is a significant enhancement. The changes cover documentation, model configuration, the core model implementation, and multimodal processing. The implementation correctly addresses the differences between OCR-1 and OCR-2, including the vision pipeline, attention mechanisms, and tokenization. The code also includes necessary workarounds for handling inconsistencies in Hugging Face model configurations. My feedback primarily focuses on improving code maintainability by reducing duplication. I've suggested refactoring a duplicated model detection logic into a shared helper function and simplifying a function by removing repeated code blocks. Overall, the changes are solid and well-executed.
```python
_processor.ocr2_mode = (
    str(
        getattr(getattr(hf_config, "vision_config", None), "model_name", "")
    ).lower()
    == "deepencoderv2"
    or getattr(getattr(hf_config, "projector_config", None), "input_dim", None)
    == 896
)
```
```python
self.is_ocr2 = (
    str(getattr(self.vision_config, "model_name", "")).lower()
    == "deepencoderv2"
    or getattr(self.projector_config, "input_dim", None) == 896
)
```
This logic for detecting if the model is OCR-2 is duplicated in `python/sglang/srt/multimodal/processors/deepseek_ocr.py`. To improve maintainability and avoid potential inconsistencies, consider creating a shared helper function in `python/sglang/srt/utils/hf_transformers_utils.py` and using it in both places.
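For concreteness, a minimal sketch of such a shared helper (hypothetical name; derived directly from the two duplicated checks above):

```python
def is_deepseek_ocr2(hf_config) -> bool:
    """Return True if the config describes a DeepSeek-OCR-2 checkpoint.

    Mirrors the duplicated checks: the vision encoder is DeepEncoderV2,
    or the projector input dim is 896.
    """
    vision_config = getattr(hf_config, "vision_config", None)
    projector_config = getattr(hf_config, "projector_config", None)
    model_name = str(getattr(vision_config, "model_name", "")).lower()
    input_dim = getattr(projector_config, "input_dim", None)
    return model_name == "deepencoderv2" or input_dim == 896
```

Both call sites would then reduce to a single `is_deepseek_ocr2(hf_config)` call (the model class passing its own config object).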
```python
if torch.sum(patches).item() != 0:
    local_features_1 = self.sam_model(patches)
    local_features_2 = self.qwen2_model(local_features_1)
    local_features = self.projector(local_features_2)

    global_features_1 = self.sam_model(image_ori)
    global_features_2 = self.qwen2_model(global_features_1)
    global_features = self.projector(global_features_2)

    local_features = local_features.view(
        -1, local_features.shape[-1]
    )
    global_features = global_features.view(
        -1, global_features.shape[-1]
    )
    global_local_features = torch.cat(
        [
            local_features,
            global_features,
            self.view_seperator[None, :],
        ],
        dim=0,
    )
else:
    global_features_1 = self.sam_model(image_ori)
    global_features_2 = self.qwen2_model(global_features_1)
    global_features = self.projector(global_features_2)
    global_features = global_features.view(
        -1, global_features.shape[-1]
    )
    global_local_features = torch.cat(
        [global_features, self.view_seperator[None, :]], dim=0
    )
```
There's some code duplication in the if/else block: the computation of `global_features` is repeated. You can refactor this by moving the common code outside the conditional block to improve readability and maintainability:
```python
global_features_1 = self.sam_model(image_ori)
global_features_2 = self.qwen2_model(global_features_1)
global_features = self.projector(global_features_2)
global_features = global_features.view(
    -1, global_features.shape[-1]
)
if torch.sum(patches).item() != 0:
    local_features_1 = self.sam_model(patches)
    local_features_2 = self.qwen2_model(local_features_1)
    local_features = self.projector(local_features_2)
    local_features = local_features.view(
        -1, local_features.shape[-1]
    )
    global_local_features = torch.cat(
        [
            local_features,
            global_features,
            self.view_seperator[None, :],
        ],
        dim=0,
    )
else:
    global_local_features = torch.cat(
        [global_features, self.view_seperator[None, :]], dim=0
    )
```
/tag-and-rerun-ci
Awesome work!
/rerun-failed-ci
> ## Launch server
>
> ```shell
> python -m sglang.launch_server \
> ```
nit: use `sglang serve` in the future
Hmm, what about the test errors? Are they related?
No, most of them are flaky.
/tag-and-rerun-ci |

Motivation
#17833
**Config mismatch:** OCR2 models were detected as `DeepseekVL2Config`, which lacks `text_config` and broke model init. We explicitly load/override to the DeepSeek OCR config to ensure the correct model class and fields.
**MLA vs. non-MLA attention:** OCR2's language config uses non-MLA attention, which caused MLA-only code paths to crash (e.g., with a ZeroDivisionError). We route OCR2 to `DeepseekForCausalLM` (non-MLA) to avoid MLA-specific assumptions.
**Vision pipeline differences:** OCR2 uses a SAM + Qwen2 decoder-as-encoder path and different projector dims. We added OCR2-specific branches (SAM output channels, Qwen2 decoder-as-encoder, projector dims).
**Weight name mismatch:** OCR2's Qwen2 submodule weights use different name prefixes and don't fit the stacked-params mapping. We added flexible mapping and direct loading for `qwen2_model.*` (sketched below).
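A rough sketch of that direct-loading branch (names and structure are assumptions, not the PR's actual loader):

```python
import torch

def try_load_ocr2_qwen2_weight(
    name: str, loaded_weight: torch.Tensor, params_dict: dict
) -> bool:
    """Route qwen2_model.* weights directly to the submodule's parameters,
    bypassing the stacked-params mapping used for the language model.
    Returns True if the weight was handled here."""
    marker = "qwen2_model."
    if marker not in name:
        return False  # not an OCR2 Qwen2 weight; use the standard path
    # Strip any checkpoint-specific prefix so the name matches the module.
    target = name[name.index(marker):]
    if target not in params_dict:
        return False
    params_dict[target].data.copy_(loaded_weight)
    return True
```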
**Multimodal token count mismatch:** OCR2 doesn't use OCR1's per-row newline tokens. We added `ocr2_mode` tokenization to align token counts with the embeddings (illustrated below).
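To illustrate the count mismatch with a simplified sketch (assumed row/column grid layout; not the PR's actual tokenizer code):

```python
def num_image_placeholder_tokens(rows: int, cols: int, ocr2_mode: bool) -> int:
    # OCR-1 appends one newline token after each row of vision tokens;
    # OCR-2 produces embeddings only, so the placeholder count must equal
    # the raw grid size for tokens and embeddings to line up.
    return rows * cols if ocr2_mode else rows * (cols + 1)

# Example: a 16x16 grid yields 272 placeholders in OCR-1 mode (16 * 17)
# but 256 in ocr2_mode (16 * 16).
```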
Why `CustomQwen2Decoder` and `Qwen2Decoder2Encoder`
OCR2 needs a decoder-as-encoder with a custom attention pattern: image tokens attend bidirectionally, while query tokens are causal but can attend to all image tokens.
SGLang's native Qwen2 path is optimized for standard autoregressive causal attention and does not accept `token_type_ids` or a custom 4D mask.
`CustomQwen2Decoder` wraps HF Qwen2 to inject the custom mask via `token_type_ids` (a sketch of the mask pattern follows).
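The mask pattern can be pictured with a small, self-contained sketch (hypothetical helper; the real implementation routes this through HF Qwen2's attention):

```python
import torch

def build_ocr2_attention_mask(token_types: torch.Tensor) -> torch.Tensor:
    """token_types: (seq_len,) with 0 = image token, 1 = query token.
    Returns a boolean (seq_len, seq_len) mask where mask[i, j] = True
    means position i may attend to position j."""
    seq_len = token_types.shape[0]
    is_image = token_types == 0
    # Start from the standard causal mask (query tokens stay causal).
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Image tokens attend bidirectionally to all image tokens.
    mask |= is_image[:, None] & is_image[None, :]
    # Query tokens may additionally attend to every image token.
    mask |= ~is_image[:, None] & is_image[None, :]
    return mask

# Example: 3 image tokens followed by 2 query tokens.
# mask = build_ocr2_attention_mask(torch.tensor([0, 0, 0, 1, 1]))
```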
`Qwen2Decoder2Encoder` creates the learnable query tokens (144/256, with interpolation for other sizes; see the sketch below) and concatenates them with image features to produce OCR2 vision tokens.
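And a minimal sketch of interpolating the query tokens to other sizes (assumptions: the tokens live on a square grid and bicubic resampling is acceptable; the PR may use a different scheme):

```python
import torch
import torch.nn.functional as F

def interpolate_query_tokens(query_tokens: torch.Tensor, target_len: int) -> torch.Tensor:
    """Resize (n, hidden) learnable query tokens to (target_len, hidden)
    by treating them as a square 2D grid and resampling."""
    n, hidden = query_tokens.shape
    src = int(n ** 0.5)          # e.g. 144 -> 12, 256 -> 16
    dst = int(target_len ** 0.5)
    assert src * src == n and dst * dst == target_len, "square grids assumed"
    grid = query_tokens.view(src, src, hidden).permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(dst, dst), mode="bicubic", align_corners=False)
    return grid.squeeze(0).permute(1, 2, 0).reshape(dst * dst, hidden)

# Example: grow 144 query tokens to 256.
# q256 = interpolate_query_tokens(torch.randn(144, 1024), 256)
```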
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci