Support DeepSeek-OCR-2 in SGLang (OCR2 vision pipeline, tokenization alignment, and weight loading fixes) #17897
Conversation
Summary of Changes

Hello @baonudesifeizhai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces comprehensive support for DeepSeek-OCR-2 within SGLang. It addresses several critical compatibility issues, including configuration mismatches, attention mechanism differences, and unique vision pipeline requirements. The changes ensure that DeepSeek-OCR-2 models can be correctly loaded, processed, and utilized, enabling advanced multimodal capabilities for OCR and document understanding.
Code Review
This pull request adds support for the DeepSeek-OCR-2 model, which is a significant enhancement. The changes cover documentation, model configuration, the core model implementation, and multimodal processing. The implementation correctly addresses the differences between OCR-1 and OCR-2, including the vision pipeline, attention mechanisms, and tokenization. The code also includes necessary workarounds for handling inconsistencies in Hugging Face model configurations. My feedback primarily focuses on improving code maintainability by reducing duplication. I've suggested refactoring a duplicated model detection logic into a shared helper function and simplifying a function by removing repeated code blocks. Overall, the changes are solid and well-executed.
```python
_processor.ocr2_mode = (
    str(
        getattr(getattr(hf_config, "vision_config", None), "model_name", "")
    ).lower()
    == "deepencoderv2"
    or getattr(getattr(hf_config, "projector_config", None), "input_dim", None)
    == 896
)
```
```python
self.is_ocr2 = (
    str(getattr(self.vision_config, "model_name", "")).lower()
    == "deepencoderv2"
    or getattr(self.projector_config, "input_dim", None) == 896
)
```
This logic for detecting if the model is OCR-2 is duplicated in `python/sglang/srt/multimodal/processors/deepseek_ocr.py`. To improve maintainability and avoid potential inconsistencies, consider creating a shared helper function in `python/sglang/srt/utils/hf_transformers_utils.py` and using it in both places.
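For concreteness, a minimal sketch of such a shared helper (hypothetical name; derived directly from the two duplicated checks above):

```python
def is_deepseek_ocr2(hf_config) -> bool:
    """Return True if the config describes a DeepSeek-OCR-2 checkpoint.

    Mirrors the duplicated checks: the vision encoder is DeepEncoderV2,
    or the projector input dim is 896.
    """
    vision_config = getattr(hf_config, "vision_config", None)
    projector_config = getattr(hf_config, "projector_config", None)
    model_name = str(getattr(vision_config, "model_name", "")).lower()
    input_dim = getattr(projector_config, "input_dim", None)
    return model_name == "deepencoderv2" or input_dim == 896
```

Both call sites would then reduce to a single `is_deepseek_ocr2(hf_config)` call (the model class passing its own config object).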
```python
if torch.sum(patches).item() != 0:
    local_features_1 = self.sam_model(patches)
    local_features_2 = self.qwen2_model(local_features_1)
    local_features = self.projector(local_features_2)

    global_features_1 = self.sam_model(image_ori)
    global_features_2 = self.qwen2_model(global_features_1)
    global_features = self.projector(global_features_2)

    local_features = local_features.view(
        -1, local_features.shape[-1]
    )
    global_features = global_features.view(
        -1, global_features.shape[-1]
    )
    global_local_features = torch.cat(
        [
            local_features,
            global_features,
            self.view_seperator[None, :],
        ],
        dim=0,
    )
else:
    global_features_1 = self.sam_model(image_ori)
    global_features_2 = self.qwen2_model(global_features_1)
    global_features = self.projector(global_features_2)
    global_features = global_features.view(
        -1, global_features.shape[-1]
    )
    global_local_features = torch.cat(
        [global_features, self.view_seperator[None, :]], dim=0
    )
```
There's some code duplication in the if/else block: the computation of `global_features` is repeated. You can refactor this by moving the common code outside the conditional block to improve readability and maintainability:
```python
global_features_1 = self.sam_model(image_ori)
global_features_2 = self.qwen2_model(global_features_1)
global_features = self.projector(global_features_2)
global_features = global_features.view(
    -1, global_features.shape[-1]
)
if torch.sum(patches).item() != 0:
    local_features_1 = self.sam_model(patches)
    local_features_2 = self.qwen2_model(local_features_1)
    local_features = self.projector(local_features_2)
    local_features = local_features.view(
        -1, local_features.shape[-1]
    )
    global_local_features = torch.cat(
        [
            local_features,
            global_features,
            self.view_seperator[None, :],
        ],
        dim=0,
    )
else:
    global_local_features = torch.cat(
        [global_features, self.view_seperator[None, :]], dim=0
    )
```
/tag-and-rerun-ci
Awesome work!
/rerun-failed-ci
> ## Launch server
>
> ```shell
> python -m sglang.launch_server \
> ```
nit: use `sglang serve` in the future
Hmm, what about the test errors? Are they related?
No, most of them are flaky.
/tag-and-rerun-ci |

Motivation
#17833
**Config mismatch:** OCR2 models were detected as `DeepseekVL2Config`, which lacks `text_config` and broke model init. We explicitly load/override to the DeepSeek OCR config to ensure the correct model class and fields.
**MLA vs. non-MLA attention:** OCR2's language config uses non-MLA attention, which caused MLA-only code paths to crash (e.g., with a ZeroDivisionError). We route OCR2 to `DeepseekForCausalLM` (non-MLA) to avoid MLA-specific assumptions.
**Vision pipeline differences:** OCR2 uses a SAM + Qwen2 decoder-as-encoder path and different projector dims. We added OCR2-specific branches (SAM output channels, Qwen2 decoder-as-encoder, projector dims).
**Weight name mismatch:** OCR2's Qwen2 submodule weights use different name prefixes and don't fit the stacked-params mapping. We added flexible mapping and direct loading for `qwen2_model.*` (sketched below).
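A rough sketch of that direct-loading branch (names and structure are assumptions, not the PR's actual loader):

```python
import torch

def try_load_ocr2_qwen2_weight(
    name: str, loaded_weight: torch.Tensor, params_dict: dict
) -> bool:
    """Route qwen2_model.* weights directly to the submodule's parameters,
    bypassing the stacked-params mapping used for the language model.
    Returns True if the weight was handled here."""
    marker = "qwen2_model."
    if marker not in name:
        return False  # not an OCR2 Qwen2 weight; use the standard path
    # Strip any checkpoint-specific prefix so the name matches the module.
    target = name[name.index(marker):]
    if target not in params_dict:
        return False
    params_dict[target].data.copy_(loaded_weight)
    return True
```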
**Multimodal token count mismatch:** OCR2 doesn't use OCR1's per-row newline tokens. We added `ocr2_mode` tokenization to align token counts with the embeddings (illustrated below).
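To illustrate the count mismatch with a simplified sketch (assumed row/column grid layout; not the PR's actual tokenizer code):

```python
def num_image_placeholder_tokens(rows: int, cols: int, ocr2_mode: bool) -> int:
    # OCR-1 appends one newline token after each row of vision tokens;
    # OCR-2 produces embeddings only, so the placeholder count must equal
    # the raw grid size for tokens and embeddings to line up.
    return rows * cols if ocr2_mode else rows * (cols + 1)

# Example: a 16x16 grid yields 272 placeholders in OCR-1 mode (16 * 17)
# but 256 in ocr2_mode (16 * 16).
```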
Why `CustomQwen2Decoder` and `Qwen2Decoder2Encoder`
OCR2 needs a decoder-as-encoder with a custom attention pattern: image tokens attend bidirectionally, while query tokens are causal but can attend to all image tokens.
SGLang's native Qwen2 path is optimized for standard autoregressive causal attention and does not accept `token_type_ids` or a custom 4D mask.
`CustomQwen2Decoder` wraps HF Qwen2 to inject the custom mask via `token_type_ids` (a sketch of the mask pattern follows).
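The mask pattern can be pictured with a small, self-contained sketch (hypothetical helper; the real implementation routes this through HF Qwen2's attention):

```python
import torch

def build_ocr2_attention_mask(token_types: torch.Tensor) -> torch.Tensor:
    """token_types: (seq_len,) with 0 = image token, 1 = query token.
    Returns a boolean (seq_len, seq_len) mask where mask[i, j] = True
    means position i may attend to position j."""
    seq_len = token_types.shape[0]
    is_image = token_types == 0
    # Start from the standard causal mask (query tokens stay causal).
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Image tokens attend bidirectionally to all image tokens.
    mask |= is_image[:, None] & is_image[None, :]
    # Query tokens may additionally attend to every image token.
    mask |= ~is_image[:, None] & is_image[None, :]
    return mask

# Example: 3 image tokens followed by 2 query tokens.
# mask = build_ocr2_attention_mask(torch.tensor([0, 0, 0, 1, 1]))
```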
`Qwen2Decoder2Encoder` creates the learnable query tokens (144/256, with interpolation for other sizes; see the sketch below) and concatenates them with image features to produce OCR2 vision tokens.
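And a minimal sketch of interpolating the query tokens to other sizes (assumptions: the tokens live on a square grid and bicubic resampling is acceptable; the PR may use a different scheme):

```python
import torch
import torch.nn.functional as F

def interpolate_query_tokens(query_tokens: torch.Tensor, target_len: int) -> torch.Tensor:
    """Resize (n, hidden) learnable query tokens to (target_len, hidden)
    by treating them as a square 2D grid and resampling."""
    n, hidden = query_tokens.shape
    src = int(n ** 0.5)          # e.g. 144 -> 12, 256 -> 16
    dst = int(target_len ** 0.5)
    assert src * src == n and dst * dst == target_len, "square grids assumed"
    grid = query_tokens.view(src, src, hidden).permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(dst, dst), mode="bicubic", align_corners=False)
    return grid.squeeze(0).permute(1, 2, 0).reshape(dst * dst, hidden)

# Example: grow 144 query tokens to 256.
# q256 = interpolate_query_tokens(torch.randn(144, 1024), 256)
```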
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci