[GLM-OCR] Support GLM-OCR Model #17582
Conversation
Summary of Changes
Hello @zRzRzRzRzRzRzR, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces comprehensive support for the GLM-OCR model, a multimodal model designed for accurate and fast document parsing with robust optical character recognition. The changes encompass the integration of its vision and language components, including specialized attention mechanisms and weight-loading procedures. It also prepares the model for efficient inference by enabling speculative decoding and leveraging Flash Attention 3, while adjusting core library dependencies to meet its requirements.
Code Review
This pull request introduces support for the GLM-OCR model, including updates to model configurations, attention mechanisms, and the addition of new model files. The changes involve integrating GLM-OCR into the existing multimodal framework, adjusting transformer version requirements, and refining attention logic for specific model types. Overall, the changes aim to expand the range of supported multimodal language models.
I was unable to create individual review comments, so my feedback is listed below, grouped by file.
python/sglang/srt/models/glm_ocr.py (167-238)
The GlmOcrVisionModel class inherits from Glm4vVisionModel but then largely re-implements the __init__ method, duplicating significant portions of code from the parent class. While super().__init__ is called, many attributes are then re-initialized immediately after. This approach reduces code reusability and makes maintenance harder. It would be more efficient to only override or add specific attributes/logic that differ from the parent class, or to reconsider inheritance if the __init__ logic is fundamentally different.
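A minimal sketch of the suggested shape (the constructor signature and attribute names below are illustrative, not taken from the actual diff):

```python
class GlmOcrVisionModel(Glm4vVisionModel):
    def __init__(self, vision_config, quant_config=None, prefix=""):
        # let the parent build the shared structure once
        super().__init__(vision_config, quant_config=quant_config, prefix=prefix)
        # then override only what genuinely differs, e.g. a GLM-OCR-specific
        # norm layer (hypothetical attribute, for illustration only)
        self.post_layernorm = GlmOcrRMSNorm(vision_config.hidden_size)
```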
python/sglang/srt/layers/attention/vision.py (817)
There is a typo in the attribute name: it should be self.qk_normalization_by_head_size instead of self.qk_norm_by_head_size, matching the attribute defined in the __init__ method. As written, the condition never holds, so the intended normalization is never applied.
if self.qk_normalization_by_head_size:
python/sglang/srt/models/glm4_moe_lite.py (911-912)
The post_load_weights method is now conditionally called based on getattr(self.config, "mla", False). However, the original comment indicated that DeepseekV2AttentionMLA (which this class uses) always requires post_load_weights to populate packed weights, regardless of whether config.mla is explicitly set. This conditional call might prevent necessary weight processing if config.mla is not explicitly True but DeepseekV2AttentionMLA is still in use, potentially leading to runtime errors or incorrect behavior during CUDA graph capture.
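One hedged sketch of a safer call site (this assumes the decoder layers are reachable as self.model.layers and that DeepseekV2AttentionMLA is importable here; both are assumptions, not verified against the diff):

```python
# call post_load_weights whenever MLA attention is actually in use,
# not only when config.mla is explicitly set to True
uses_mla = getattr(self.config, "mla", False) or any(
    isinstance(layer.self_attn, DeepseekV2AttentionMLA)
    for layer in self.model.layers
)
if uses_mla:
    self.post_load_weights()
```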
python/sglang/srt/models/glm4v.py (761-762)
The condition if name.startswith("lm_head.") and not self.pp_group.is_last_rank: continue was removed. This means lm_head weights will now be processed on all ranks, even if they are not the last rank in a pipeline parallel setup. If PPMissingLayer is still used for lm_head on non-last ranks, loading weights for it is unnecessary and could lead to unexpected behavior or errors, as PPMissingLayer is typically an empty placeholder.
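Restoring the removed guard, as quoted above, would avoid touching the placeholder:

```python
# skip lm_head weights on non-last pipeline ranks, where lm_head is a
# PPMissingLayer placeholder with no real parameters to load
if name.startswith("lm_head.") and not self.pp_group.is_last_rank:
    continue
```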
python/sglang/srt/models/glm_ocr.py (279-326)
The GlmOcrForConditionalGeneration class inherits from Glm4vForConditionalGeneration but largely re-implements the __init__ method, duplicating significant portions of code from the parent class. While super().__init__ is called, many attributes are then re-initialized immediately after. This reduces code reusability and makes maintenance harder. It would be more efficient to only override or add specific attributes/logic that differ from the parent class, or to reconsider inheritance if the __init__ logic is fundamentally different.
python/sglang/srt/models/glm_ocr.py (159-160)
Similar to previous comments, GlmOcrVisionPatchEmbed is an empty subclass of Glm4vVisionPatchEmbed. This creates an alias without adding or modifying any functionality. Consider if this level of abstraction is necessary or if a direct use of Glm4vVisionPatchEmbed would suffice, or add a comment explaining the purpose.
python/sglang/srt/models/glm_ocr.py (82-83)
Similar to GlmOcrRMSNorm, GlmOcrVisionMLP is an empty subclass of Glm4vVisionMLP. This creates an alias without adding or modifying any functionality. Consider if this level of abstraction is necessary or if a direct use of Glm4vVisionMLP would suffice, or add a comment explaining the purpose.
python/sglang/srt/models/glm_ocr.py (109)
The parameter qk_normalization_by_head_size is hardcoded to True. It is generally better practice to make such configuration values configurable, ideally passed through the vision_config object, to allow for greater flexibility and easier experimentation with different model settings.
qk_normalization_by_head_size=getattr(vision_config, "qk_normalization_by_head_size", True),
python/sglang/srt/configs/utils.py (18-19)
The AutoImageProcessor.register call is commented out and replaced with pass. This effectively disables the registration of customized Hugging Face image processors. If this change is intentional and permanent, the commented-out line should be removed to avoid confusion. If it's a temporary workaround, a clear comment explaining the reason and expected resolution would be beneficial for maintainability.
# AutoImageProcessor.register(config, None, image_processor, None, exist_ok=True)
# TODO: Re-enable or remove this line with an explanation if permanently disabled.
python/sglang/srt/models/glm_ocr.py (163-164)
Similar to previous comments, GlmOcrVisionPatchMerger is an empty subclass of Glm4vPatchMerger. This creates an alias without adding or modifying any functionality. Consider if this level of abstraction is necessary or if a direct use of Glm4vPatchMerger would suffice, or add a comment explaining the purpose.
python/sglang/srt/models/glm_ocr.py (78-79)
The class GlmOcrRMSNorm is defined as an empty subclass of Glm4vRMSNorm. This creates an alias without adding or modifying any functionality. If the intention is purely for renaming or type hinting, a type alias might be more appropriate. If it's a placeholder for future extensions, a comment should clarify this. Otherwise, it introduces unnecessary abstraction.
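If the class exists purely for naming, the alias the comment suggests is a one-liner:

```python
# module-level alias: keeps the GLM-OCR name without an empty subclass
GlmOcrRMSNorm = Glm4vRMSNorm
```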
python/sglang/srt/models/glm_ocr.py (17)
There is a typo in the comment's URL path. "GlmOcr" should be "glm_ocr" to correctly reflect the typical naming convention for Hugging Face model directories.
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/glm_ocr/modular_glm_ocr.py
python/sglang/srt/models/glm_ocr.py (320)
The expression "mrope_section" in self.config.rope_scaling assumes that self.config.rope_scaling is always a dictionary. If self.config.rope_scaling could be None or another non-dictionary type, this line would raise an AttributeError or TypeError. It's safer to check if self.config.rope_scaling is a dictionary before attempting to access its keys.
self.is_mrope_enabled = isinstance(self.config.rope_scaling, dict) and "mrope_section" in self.config.rope_scaling
python/sglang/srt/models/glm_ocr.py (441)
Using print(params_dict.keys()) for debugging information is generally discouraged in production code. It's better to use the logger (e.g., logger.error or logger.debug) for consistent logging practices and to allow for configurable log levels.
logger.error(f"Parameter {name} not found in params_dict. Available keys: {params_dict.keys()}")
python/sglang/srt/models/glm_ocr_nextn.py (51-55)
This code overrides quant_config to None if it's modelopt_fp4. This seems to be a specific workaround for incompatibility. A comment explaining the reason for this override and the implications for modelopt_fp4 GLM-OCR models would greatly improve code clarity and maintainability.
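A hedged sketch of the requested clarification (the detection condition below is illustrative; the actual check in the diff may differ):

```python
# modelopt_fp4 is assumed incompatible with the NextN draft model's
# weight layout, so fall back to an unquantized draft head; the concrete
# reason should be confirmed and documented by the authors
if quant_config is not None and quant_config.get_name() == "modelopt_fp4":
    quant_config = None
```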
python/sglang/srt/models/glm_ocr_nextn.py (105-106)
The with get_global_expert_distribution_recorder().disable_this_region(): context manager is used without an accompanying comment explaining why expert distribution recording needs to be disabled in this specific region. Adding a brief explanation would improve code clarity for future maintainers.
python/sglang/srt/models/glm_ocr_nextn.py (126)
Explicitly calling nn.Module.__init__(self) when inheriting from GlmOcrForConditionalGeneration (which itself inherits from nn.Module) is unusual. Typically, super().__init__() is used to ensure proper initialization of the entire inheritance chain. While it might not cause an immediate issue, it could bypass important initialization logic in GlmOcrForConditionalGeneration if that class were to add its own __init__ method later.
super().__init__()
python/sglang/srt/models/glm_ocr_nextn.py (142-143)
The num_fused_shared_experts is hardcoded to 1 when disable_shared_experts_fusion is false. This value might need to be dynamically determined from the model's configuration, similar to how n_shared_experts is used in Glm4MoeLiteSparseMoeBlock, to ensure flexibility and correctness across different model variants.
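One possible shape, assuming the config exposes an n_shared_experts field the way Glm4MoeLiteSparseMoeBlock consumes it, and following the global_server_args_dict pattern used by other sglang MoE models (both are assumptions):

```python
# derive the fused shared-expert count from the config instead of
# hardcoding it to 1 (field name assumed, mirroring n_shared_experts usage)
self.num_fused_shared_experts = (
    0
    if global_server_args_dict["disable_shared_experts_fusion"]
    else getattr(config, "n_shared_experts", 1)
)
```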
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
/tag-and-rerun-ci
/rerun-failed-ci
/rerun-failed-ci
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@zRzRzRzRzRzRzR I found no end-to-end guide on how to run optimized inference with GLM-OCR in either sglang or vllm. Can you provide a short guide on how to get the best out of GLM-OCR via sglang?
Support for the GLM-OCR model; transformers PR here.
Requires transformers>=5.0.0.dev0 (but not 5.0.0), so the minimum transformers requirement in glm4v was also changed.