[GLM-OCR] Support GLM-OCR Model #17582
Conversation
Summary of Changes
Hello @zRzRzRzRzRzRzR, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces comprehensive support for the GLM-OCR model, a multimodal model designed for accurate and fast document parsing with robust optical character recognition. The changes encompass the integration of its vision and language components, including specialized attention mechanisms and weight-loading procedures. It also prepares the model for efficient inference by enabling speculative decoding and leveraging Flash Attention 3, while adjusting core library dependencies to meet its requirements.
Code Review
This pull request introduces support for the GLM-OCR model, including updates to model configurations, attention mechanisms, and the addition of new model files. The changes involve integrating GLM-OCR into the existing multimodal framework, adjusting transformer version requirements, and refining attention logic for specific model types. Overall, the changes aim to expand the range of supported multimodal language models.
I was unable to create individual review comments, so my feedback is listed below, grouped by file.
python/sglang/srt/models/glm_ocr.py (167-238)
The GlmOcrVisionModel class inherits from Glm4vVisionModel but then largely re-implements the __init__ method, duplicating significant portions of code from the parent class. While super().__init__ is called, many attributes are then re-initialized immediately after. This approach reduces code reusability and makes maintenance harder. It would be more efficient to only override or add specific attributes/logic that differ from the parent class, or to reconsider inheritance if the __init__ logic is fundamentally different.
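A minimal sketch of the suggested shape (the constructor signature and attribute names below are illustrative, not taken from the actual diff):

```python
class GlmOcrVisionModel(Glm4vVisionModel):
    def __init__(self, vision_config, quant_config=None, prefix=""):
        # let the parent build the shared structure once
        super().__init__(vision_config, quant_config=quant_config, prefix=prefix)
        # then override only what genuinely differs, e.g. a GLM-OCR-specific
        # norm layer (hypothetical attribute, for illustration only)
        self.post_layernorm = GlmOcrRMSNorm(vision_config.hidden_size)
```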
python/sglang/srt/layers/attention/vision.py (817)
There is a typo in the attribute name: it should be self.qk_normalization_by_head_size instead of self.qk_norm_by_head_size, matching the attribute defined in the __init__ method. As written, the condition never holds, so the intended normalization is never applied.
if self.qk_normalization_by_head_size:
python/sglang/srt/models/glm4_moe_lite.py (911-912)
The post_load_weights method is now conditionally called based on getattr(self.config, "mla", False). However, the original comment indicated that DeepseekV2AttentionMLA (which this class uses) always requires post_load_weights to populate packed weights, regardless of whether config.mla is explicitly set. This conditional call might prevent necessary weight processing if config.mla is not explicitly True but DeepseekV2AttentionMLA is still in use, potentially leading to runtime errors or incorrect behavior during CUDA graph capture.
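One hedged sketch of a safer call site (this assumes the decoder layers are reachable as self.model.layers and that DeepseekV2AttentionMLA is importable here; both are assumptions, not verified against the diff):

```python
# call post_load_weights whenever MLA attention is actually in use,
# not only when config.mla is explicitly set to True
uses_mla = getattr(self.config, "mla", False) or any(
    isinstance(layer.self_attn, DeepseekV2AttentionMLA)
    for layer in self.model.layers
)
if uses_mla:
    self.post_load_weights()
```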
python/sglang/srt/models/glm4v.py (761-762)
The condition if name.startswith("lm_head.") and not self.pp_group.is_last_rank: continue was removed. This means lm_head weights will now be processed on all ranks, even if they are not the last rank in a pipeline parallel setup. If PPMissingLayer is still used for lm_head on non-last ranks, loading weights for it is unnecessary and could lead to unexpected behavior or errors, as PPMissingLayer is typically an empty placeholder.
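Restoring the removed guard, as quoted above, would avoid touching the placeholder:

```python
# skip lm_head weights on non-last pipeline ranks, where lm_head is a
# PPMissingLayer placeholder with no real parameters to load
if name.startswith("lm_head.") and not self.pp_group.is_last_rank:
    continue
```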
python/sglang/srt/models/glm_ocr.py (279-326)
The GlmOcrForConditionalGeneration class inherits from Glm4vForConditionalGeneration but largely re-implements the __init__ method, duplicating significant portions of code from the parent class. While super().__init__ is called, many attributes are then re-initialized immediately after. This reduces code reusability and makes maintenance harder. It would be more efficient to only override or add specific attributes/logic that differ from the parent class, or to reconsider inheritance if the __init__ logic is fundamentally different.
python/sglang/srt/models/glm_ocr.py (159-160)
Similar to previous comments, GlmOcrVisionPatchEmbed is an empty subclass of Glm4vVisionPatchEmbed. This creates an alias without adding or modifying any functionality. Consider if this level of abstraction is necessary or if a direct use of Glm4vVisionPatchEmbed would suffice, or add a comment explaining the purpose.
python/sglang/srt/models/glm_ocr.py (82-83)
Similar to GlmOcrRMSNorm, GlmOcrVisionMLP is an empty subclass of Glm4vVisionMLP. This creates an alias without adding or modifying any functionality. Consider if this level of abstraction is necessary or if a direct use of Glm4vVisionMLP would suffice, or add a comment explaining the purpose.
python/sglang/srt/models/glm_ocr.py (109)
The parameter qk_normalization_by_head_size is hardcoded to True. It is generally better practice to make such configuration values configurable, ideally passed through the vision_config object, to allow for greater flexibility and easier experimentation with different model settings.
qk_normalization_by_head_size=getattr(vision_config, "qk_normalization_by_head_size", True),
python/sglang/srt/configs/utils.py (18-19)
The AutoImageProcessor.register call is commented out and replaced with pass. This effectively disables the registration of customized Hugging Face image processors. If this change is intentional and permanent, the commented-out line should be removed to avoid confusion. If it's a temporary workaround, a clear comment explaining the reason and expected resolution would be beneficial for maintainability.
# AutoImageProcessor.register(config, None, image_processor, None, exist_ok=True)
# TODO: Re-enable or remove this line with an explanation if permanently disabled.
python/sglang/srt/models/glm_ocr.py (163-164)
Similar to previous comments, GlmOcrVisionPatchMerger is an empty subclass of Glm4vPatchMerger. This creates an alias without adding or modifying any functionality. Consider if this level of abstraction is necessary or if a direct use of Glm4vPatchMerger would suffice, or add a comment explaining the purpose.
python/sglang/srt/models/glm_ocr.py (78-79)
The class GlmOcrRMSNorm is defined as an empty subclass of Glm4vRMSNorm. This creates an alias without adding or modifying any functionality. If the intention is purely for renaming or type hinting, a type alias might be more appropriate. If it's a placeholder for future extensions, a comment should clarify this. Otherwise, it introduces unnecessary abstraction.
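If the class exists purely for naming, the alias the comment suggests is a one-liner:

```python
# module-level alias: keeps the GLM-OCR name without an empty subclass
GlmOcrRMSNorm = Glm4vRMSNorm
```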
python/sglang/srt/models/glm_ocr.py (17)
There is a typo in the comment's URL path. "GlmOcr" should be "glm_ocr" to correctly reflect the typical naming convention for Hugging Face model directories.
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/glm_ocr/modular_glm_ocr.py
python/sglang/srt/models/glm_ocr.py (320)
The expression "mrope_section" in self.config.rope_scaling assumes that self.config.rope_scaling is always a dictionary. If self.config.rope_scaling could be None or another non-dictionary type, this line would raise an AttributeError or TypeError. It's safer to check if self.config.rope_scaling is a dictionary before attempting to access its keys.
self.is_mrope_enabled = isinstance(self.config.rope_scaling, dict) and "mrope_section" in self.config.rope_scaling
python/sglang/srt/models/glm_ocr.py (441)
Using print(params_dict.keys()) for debugging information is generally discouraged in production code. It's better to use the logger (e.g., logger.error or logger.debug) for consistent logging practices and to allow for configurable log levels.
logger.error(f"Parameter {name} not found in params_dict. Available keys: {params_dict.keys()}")
python/sglang/srt/models/glm_ocr_nextn.py (51-55)
This code overrides quant_config to None if it's modelopt_fp4. This seems to be a specific workaround for incompatibility. A comment explaining the reason for this override and the implications for modelopt_fp4 GLM-OCR models would greatly improve code clarity and maintainability.
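A hedged sketch of the requested clarification (the detection condition below is illustrative; the actual check in the diff may differ):

```python
# modelopt_fp4 is assumed incompatible with the NextN draft model's
# weight layout, so fall back to an unquantized draft head; the concrete
# reason should be confirmed and documented by the authors
if quant_config is not None and quant_config.get_name() == "modelopt_fp4":
    quant_config = None
```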
python/sglang/srt/models/glm_ocr_nextn.py (105-106)
The with get_global_expert_distribution_recorder().disable_this_region(): context manager is used without an accompanying comment explaining why expert distribution recording needs to be disabled in this specific region. Adding a brief explanation would improve code clarity for future maintainers.
python/sglang/srt/models/glm_ocr_nextn.py (126)
Explicitly calling nn.Module.__init__(self) when inheriting from GlmOcrForConditionalGeneration (which itself inherits from nn.Module) is unusual. Typically, super().__init__() is used to ensure proper initialization of the entire inheritance chain. While it might not cause an immediate issue, it could bypass important initialization logic in GlmOcrForConditionalGeneration if that class were to add its own __init__ method later.
super().__init__()
python/sglang/srt/models/glm_ocr_nextn.py (142-143)
The num_fused_shared_experts is hardcoded to 1 when disable_shared_experts_fusion is false. This value might need to be dynamically determined from the model's configuration, similar to how n_shared_experts is used in Glm4MoeLiteSparseMoeBlock, to ensure flexibility and correctness across different model variants.
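One possible shape, assuming the config exposes an n_shared_experts field the way Glm4MoeLiteSparseMoeBlock consumes it, and following the global_server_args_dict pattern used by other sglang MoE models (both are assumptions):

```python
# derive the fused shared-expert count from the config instead of
# hardcoding it to 1 (field name assumed, mirroring n_shared_experts usage)
self.num_fused_shared_experts = (
    0
    if global_server_args_dict["disable_shared_experts_fusion"]
    else getattr(config, "n_shared_experts", 1)
)
```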
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
/tag-and-rerun-ci
/rerun-failed-ci
/rerun-failed-ci
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@zRzRzRzRzRzRzR I found no end-to-end guide on how to run optimized inference with GLM-OCR in either sglang or vllm. Can you provide a short guide on how to get the best out of GLM-OCR via sglang?
Support for the GLM-OCR model; transformers PR here.
Requires transformers>=5.0.0.dev0 (but not 5.0.0), so the minimum transformers requirement in glm4v was also changed.