[GLM-OCR] Support GLM-OCR Model#17582

Merged
Kangyan-Zhou merged 43 commits into sgl-project:main from zRzRzRzRzRzRzR:glm
Jan 27, 2026
Conversation

@zRzRzRzRzRzRzR
Contributor

Support for the GLM-OCR model; transformers PR here.
Requires transformers>=5.0.0.dev0 (the released 5.0.0 is not compatible), so the minimum requirement in glm4v was also updated.

@github-actions bot added the labels documentation (Improvements or additions to documentation) and Multi-modal (multi-modal language model) on Jan 22, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @zRzRzRzRzRzRzR, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for the GLM-OCR model, a multimodal model designed for accurate and fast document parsing with robust optical character recognition. The changes encompass the integration of its vision and language components, including specialized attention mechanisms and weight loading procedures. It also prepares the model for efficient inference by enabling speculative decoding and leveraging Flash Attention 3, while adjusting core library dependencies to meet its requirements.

Highlights

  • GLM-OCR Model Support: Added comprehensive support for the GLM-OCR model, including its vision and language components, enabling accurate and fast document parsing with robust optical character recognition.
  • Custom Vision Attention Normalization: Implemented a new qk_normalization_by_head_size mechanism within the VisionAttention layer, specifically tailored for GLM-OCR's vision transformer to enhance attention handling.
  • GLM4-MoE-Lite Attention Refactoring: Introduced a dedicated Glm4MoeLiteAttention class, incorporating QKV parallel linear, RadixAttention, and a QK normalization mechanism, alongside significant refactoring of the glm4_moe_lite.py module.
  • Speculative Decoding (NextN) for GLM-OCR: Added full support for speculative decoding (NextN) for GLM-OCR, which allows for more efficient text generation by predicting future tokens.
  • Transformer Version Requirement Update: Updated the minimum required transformers library version to 5.0.0dev0 to align with GLM-OCR's specific dependencies, explicitly noting that version 5.0.0 is not compatible.
  • Flash Attention 3 Integration: Configured GLM-OCR models to default to using Flash Attention 3, leveraging its performance benefits for faster processing.
  • Image Processor Compatibility: Integrated GLM-OCR with the existing Glm4vImageProcessor, ensuring seamless multimodal processing for the new model.



Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the GLM-OCR model, including updates to model configurations, attention mechanisms, and the addition of new model files. The changes involve integrating GLM-OCR into the existing multimodal framework, adjusting transformer version requirements, and refining attention logic for specific model types. Overall, the changes aim to expand the range of supported multimodal language models.

I am having trouble creating individual review comments, so my feedback is collected below.

python/sglang/srt/models/glm_ocr.py (167-238)

high

The GlmOcrVisionModel class inherits from Glm4vVisionModel but then largely re-implements the __init__ method, duplicating significant portions of code from the parent class. While super().__init__ is called, many attributes are then re-initialized immediately after. This approach reduces code reusability and makes maintenance harder. It would be more efficient to only override or add specific attributes/logic that differ from the parent class, or to reconsider inheritance if the __init__ logic is fundamentally different.
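The reviewer's suggestion can be sketched with hypothetical stand-in classes (ParentVisionModel/ChildVisionModel are illustrative, not the real Glm4vVisionModel API): call super().__init__() once and then touch only the attributes that genuinely differ.

```python
# Hypothetical stand-ins for Glm4vVisionModel / GlmOcrVisionModel, used only
# to illustrate the pattern the review suggests.

class ParentVisionModel:
    def __init__(self, hidden_size: int):
        self.hidden_size = hidden_size     # shared setup, inherited as-is
        self.qk_norm_by_head_size = False  # the one attribute that differs

class ChildVisionModel(ParentVisionModel):
    def __init__(self, hidden_size: int):
        super().__init__(hidden_size)
        # Override only what differs; do not re-initialize the rest.
        self.qk_norm_by_head_size = True

model = ChildVisionModel(1024)
```

This keeps the child's `__init__` as a small delta over the parent instead of a near-copy of it.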

python/sglang/srt/layers/attention/vision.py (817)

high

There is a typo in the attribute name. It should be self.qk_normalization_by_head_size instead of self.qk_norm_by_head_size to match the attribute defined in the __init__ method. This will cause the condition to always be false, preventing the intended normalization from being applied.

            if self.qk_normalization_by_head_size:

python/sglang/srt/models/glm4_moe_lite.py (911-912)

high

The post_load_weights method is now conditionally called based on getattr(self.config, "mla", False). However, the original comment indicated that DeepseekV2AttentionMLA (which this class uses) always requires post_load_weights to populate packed weights, regardless of whether config.mla is explicitly set. This conditional call might prevent necessary weight processing if config.mla is not explicitly True but DeepseekV2AttentionMLA is still in use, potentially leading to runtime errors or incorrect behavior during CUDA graph capture.
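A minimal repro of the concern, with dummy stand-ins (Config and Model here are illustrative, not the sglang classes): when the `mla` attribute is simply absent from the config, the gated call never runs and the packed weights are never populated.

```python
class Config:
    """Stand-in config with no `mla` attribute set."""

class Model:
    def __init__(self, config):
        self.config = config
        self.weights_packed = False

    def post_load_weights(self):
        # In the real model this populates the packed weights that the MLA
        # attention path needs before CUDA graph capture.
        self.weights_packed = True

    def load_weights_gated(self):
        # Behavior in this PR: only runs when config.mla is truthy.
        if getattr(self.config, "mla", False):
            self.post_load_weights()

m = Model(Config())
m.load_weights_gated()
# m.weights_packed is still False: the packing step was silently skipped.
```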

python/sglang/srt/models/glm4v.py (761-762)

high

The condition if name.startswith("lm_head.") and not self.pp_group.is_last_rank: continue was removed. This means lm_head weights will now be processed on all ranks, even if they are not the last rank in a pipeline parallel setup. If PPMissingLayer is still used for lm_head on non-last ranks, loading weights for it is unnecessary and could lead to unexpected behavior or errors, as PPMissingLayer is typically an empty placeholder.
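The effect of the removed guard can be sketched with a dummy pipeline-parallel group (PPGroup and filter_weights are illustrative names, not the sglang API):

```python
class PPGroup:
    def __init__(self, is_last_rank: bool):
        self.is_last_rank = is_last_rank

def filter_weights(weights, pp_group):
    """Skip lm_head weights on non-last PP ranks, mirroring the removed check."""
    kept = []
    for name, tensor in weights:
        if name.startswith("lm_head.") and not pp_group.is_last_rank:
            continue  # lm_head exists only on the last pipeline stage
        kept.append((name, tensor))
    return kept

weights = [("lm_head.weight", None), ("model.layers.0.qkv.weight", None)]
```

With the guard removed, a non-last rank would try to load `lm_head.weight` into a `PPMissingLayer` placeholder.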

python/sglang/srt/models/glm_ocr.py (279-326)

high

The GlmOcrForConditionalGeneration class inherits from Glm4vForConditionalGeneration but largely re-implements the __init__ method, duplicating significant portions of code from the parent class. While super().__init__ is called, many attributes are then re-initialized immediately after. This reduces code reusability and makes maintenance harder. It would be more efficient to only override or add specific attributes/logic that differ from the parent class, or to reconsider inheritance if the __init__ logic is fundamentally different.

python/sglang/srt/models/glm_ocr.py (159-160)

medium

Similar to previous comments, GlmOcrVisionPatchEmbed is an empty subclass of Glm4vVisionPatchEmbed. This creates an alias without adding or modifying any functionality. Consider if this level of abstraction is necessary or if a direct use of Glm4vVisionPatchEmbed would suffice, or add a comment explaining the purpose.

python/sglang/srt/models/glm_ocr.py (82-83)

medium

Similar to GlmOcrRMSNorm, GlmOcrVisionMLP is an empty subclass of Glm4vVisionMLP. This creates an alias without adding or modifying any functionality. Consider if this level of abstraction is necessary or if a direct use of Glm4vVisionMLP would suffice, or add a comment explaining the purpose.

python/sglang/srt/models/glm_ocr.py (109)

medium

The parameter qk_normalization_by_head_size is hardcoded to True. It is generally better practice to make such configuration values configurable, ideally passed through the vision_config object, to allow for greater flexibility and easier experimentation with different model settings.

            qk_normalization_by_head_size=getattr(vision_config, "qk_normalization_by_head_size", True),

python/sglang/srt/configs/utils.py (18-19)

medium

The AutoImageProcessor.register call is commented out and replaced with pass. This effectively disables the registration of customized Hugging Face image processors. If this change is intentional and permanent, the commented-out line should be removed to avoid confusion. If it's a temporary workaround, a clear comment explaining the reason and expected resolution would be beneficial for maintainability.

    # AutoImageProcessor.register(config, None, image_processor, None, exist_ok=True)
    # TODO: Re-enable or remove this line with an explanation if permanently disabled.

python/sglang/srt/models/glm_ocr.py (163-164)

medium

Similar to previous comments, GlmOcrVisionPatchMerger is an empty subclass of Glm4vPatchMerger. This creates an alias without adding or modifying any functionality. Consider if this level of abstraction is necessary or if a direct use of Glm4vPatchMerger would suffice, or add a comment explaining the purpose.

python/sglang/srt/models/glm_ocr.py (78-79)

medium

The class GlmOcrRMSNorm is defined as an empty subclass of Glm4vRMSNorm. This creates an alias without adding or modifying any functionality. If the intention is purely for renaming or type hinting, a type alias might be more appropriate. If it's a placeholder for future extensions, a comment should clarify this. Otherwise, it introduces unnecessary abstraction.
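Both alternatives the review mentions can be sketched as follows (Glm4vRMSNorm is replaced by a dummy class purely for illustration):

```python
class Glm4vRMSNorm:  # dummy stand-in for the real layer
    def __init__(self, dim: int):
        self.dim = dim

# Alternative 1: a plain alias -- no new class object, intent is explicit.
GlmOcrRMSNorm = Glm4vRMSNorm

# Alternative 2: keep the empty subclass, but document why it exists.
class GlmOcrRMSNormPlaceholder(Glm4vRMSNorm):
    """Reserved for GLM-OCR-specific RMSNorm changes; currently identical."""
```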

python/sglang/srt/models/glm_ocr.py (17)

medium

There is a typo in the comment's URL path. "GlmOcr" should be "glm_ocr" to correctly reflect the typical naming convention for Hugging Face model directories.

# https://github.com/huggingface/transformers/blob/main/src/transformers/models/glm_ocr/modular_glm_ocr.py

python/sglang/srt/models/glm_ocr.py (320)

medium

The expression "mrope_section" in self.config.rope_scaling assumes that self.config.rope_scaling is always a dictionary. If self.config.rope_scaling could be None or another non-dictionary type, this line would raise an AttributeError or TypeError. It's safer to check if self.config.rope_scaling is a dictionary before attempting to access its keys.

        self.is_mrope_enabled = isinstance(self.config.rope_scaling, dict) and "mrope_section" in self.config.rope_scaling
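A minimal repro of why the `isinstance` guard matters:

```python
def mrope_enabled(rope_scaling):
    """Safe check: `"x" in None` raises TypeError, so test the type first."""
    return isinstance(rope_scaling, dict) and "mrope_section" in rope_scaling

# Without the guard, rope_scaling=None would fail:
#   "mrope_section" in None  ->  TypeError: argument of type 'NoneType'
#                                is not iterable
```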

python/sglang/srt/models/glm_ocr.py (441)

medium

Using print(params_dict.keys()) for debugging information is generally discouraged in production code. It's better to use the logger (e.g., logger.error or logger.debug) for consistent logging practices and to allow for configurable log levels.

                    logger.error(f"Parameter {name} not found in params_dict. Available keys: {params_dict.keys()}")

python/sglang/srt/models/glm_ocr_nextn.py (51-55)

medium

This code overrides quant_config to None if it's modelopt_fp4. This seems to be a specific workaround for incompatibility. A comment explaining the reason for this override and the implications for modelopt_fp4 GLM-OCR models would greatly improve code clarity and maintainability.

python/sglang/srt/models/glm_ocr_nextn.py (105-106)

medium

The with get_global_expert_distribution_recorder().disable_this_region(): context manager is used without an accompanying comment explaining why expert distribution recording needs to be disabled in this specific region. Adding a brief explanation would improve code clarity for future maintainers.

python/sglang/srt/models/glm_ocr_nextn.py (126)

medium

Explicitly calling nn.Module.__init__(self) when inheriting from GlmOcrForConditionalGeneration (which itself inherits from nn.Module) is unusual. Typically, super().__init__() is used to ensure proper initialization of the entire inheritance chain. While it might not cause an immediate issue, it could bypass important initialization logic in GlmOcrForConditionalGeneration if that class were to add its own __init__ method later.

        super().__init__()

python/sglang/srt/models/glm_ocr_nextn.py (142-143)

medium

The num_fused_shared_experts is hardcoded to 1 when disable_shared_experts_fusion is false. This value might need to be dynamically determined from the model's configuration, similar to how n_shared_experts is used in Glm4MoeLiteSparseMoeBlock, to ensure flexibility and correctness across different model variants.
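One way to read the value from the config instead of hardcoding it, as a sketch (the field name `n_shared_experts` comes from the review; the fallback of 1 and the function name are assumptions for illustration):

```python
class DummyConfig:  # stand-in for the model config
    n_shared_experts = 2

def num_fused_shared_experts(config, disable_shared_experts_fusion: bool) -> int:
    """Derive the fused-shared-expert count from the config."""
    if disable_shared_experts_fusion:
        return 0
    # Fall back to 1 when the config does not declare shared experts
    # (the fallback value is an assumption, not taken from the PR).
    return getattr(config, "n_shared_experts", 1)
```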

@JustinTong0323
Collaborator

/tag-and-rerun-ci

@JustinTong0323
Collaborator

/rerun-failed-ci

@JustinTong0323
Collaborator

/rerun-failed-ci

@Kangyan-Zhou Kangyan-Zhou merged commit 7106f6c into sgl-project:main Jan 27, 2026
189 of 203 checks passed
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@liyucheng09

@zRzRzRzRzRzRzR I found no end-to-end guide on how to run optimized inference with GLM-OCR in either sglang or vllm. Can you provide a short guide on how to get the best out of GLM-OCR via sglang?

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>

Labels

documentation Improvements or additions to documentation high priority Multi-modal multi-modal language model run-ci
