[GLM-Image] Add batch > 1 support and fix configuration defaults#43342
zucchini-nlp merged 68 commits into huggingface:main
Conversation
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Pull request overview
This PR adds comprehensive batch processing support (batch_size > 1) for the GLM-Image model, enabling efficient parallel image generation. Previously, the processor explicitly rejected batch sizes greater than 1. The implementation introduces two new tracking tensors (images_per_sample and num_source_images_per_sample) to manage packed image grids across batch samples, updates the RoPE position encoding computation to work per-sample, and modifies the generation utilities to correctly expand inputs for beam search.
Changes:
- Removed batch size restriction in processor and added per-sample image tracking
- Updated position ID computation to handle batches with independent per-sample caching
- Modified beam search expansion logic to correctly handle packed visual inputs across batch samples
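For intuition, the bookkeeping these tracking tensors carry can be sketched in plain Python (a simplified assumption of the counting logic, with one target grid per sample as in text-to-image; not the PR's actual code, which stores `torch.long` tensors):

```python
# Hypothetical sketch of the two per-sample tracking quantities.
def count_images_per_sample(num_source_images_per_sample, num_target_grids=1):
    """Each sample contributes its source images plus its target grid(s).

    Returns (images_per_sample, num_source_images_per_sample) as plain lists.
    """
    num_source = list(num_source_images_per_sample)
    images_per_sample = [n + num_target_grids for n in num_source]
    return images_per_sample, num_source

# Batch of three samples with 2, 0, and 1 source images respectively.
totals, sources = count_images_per_sample([2, 0, 1])
```

With counts like these, downstream code can split a packed `image_grid_thw` back into per-sample chunks.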
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| tests/models/glm_image/test_modeling_glm_image.py | Adds test_batch_consistency to verify batch and single processing produce identical predictions |
| src/transformers/models/glm_image/processing_glm_image.py | Removes batch size restriction, adds image counting per sample, and implements per-sample prompt/grid construction |
| src/transformers/models/glm_image/modular_glm_image.py | Updates get_rope_index() for per-sample position IDs, modifies forward() to split packed grids, updates generation utilities for batch support |
| src/transformers/models/glm_image/modeling_glm_image.py | Auto-generated from modular file with matching changes for batch support |
| src/transformers/models/glm_image/image_processing_glm_image_fast.py | Removes min_pixels/max_pixels class attributes and simplifies initialization |
| src/transformers/models/glm_image/configuration_glm_image.py | Reorders tie_word_embeddings parameter and removes it from super().init() call |
```python
num_target_grids = all_target_grids[0].shape[0]
image_inputs["images_per_sample"] = torch.tensor(
    [n + num_target_grids for n in images_per_sample], dtype=torch.long
```
This line assumes all samples have the same number of target grids by using all_target_grids[0].shape[0]. If different samples could have different numbers of target grids (e.g., due to different is_text_to_image settings), this would cause incorrect counting. Consider validating that all samples have the same num_target_grids, or handle varying target grid counts per sample.
Suggested change:
```python
target_grids_per_sample = [grids.shape[0] for grids in all_target_grids]
image_inputs["images_per_sample"] = torch.tensor(
    [n_source + n_target for n_source, n_target in zip(images_per_sample, target_grids_per_sample)],
    dtype=torch.long,
```
```python
num_target_grids = all_target_grids[0].shape[0]
image_inputs["images_per_sample"] = torch.tensor(
    [n + num_target_grids for n in images_per_sample], dtype=torch.long
)
```
This line assumes all samples have the same number of target grids by using all_target_grids[0].shape[0]. If different samples could have different numbers of target grids (e.g., due to different is_text_to_image settings), this would cause incorrect counting. Consider validating that all samples have the same num_target_grids, or handle varying target grid counts per sample.
Suggested change:
```python
num_target_grids_per_sample = [g.shape[0] for g in all_target_grids]
if len(set(num_target_grids_per_sample)) == 1:
    num_target_grids = num_target_grids_per_sample[0]
    images_per_sample_with_targets = [n + num_target_grids for n in images_per_sample]
else:
    images_per_sample_with_targets = [n + t for n, t in zip(images_per_sample, num_target_grids_per_sample)]
image_inputs["images_per_sample"] = torch.tensor(images_per_sample_with_targets, dtype=torch.long)
```
```python
dict_to_expand[key] = _repeat_interleave_samples(
    dict_to_expand[key], lengths=lengths, repeat_times=expand_size
)
```
When splitting pixel_values for beam search expansion, if sum(source_image_nums) == 0 (no source images), the pixel_values tensor is not handled. While the code checks if sum > 0 before processing, it doesn't explicitly handle the else case where pixel_values should remain unchanged or be set appropriately. Consider adding an explicit else clause or handling for the case where there are no source images.
Suggested change:
```python
else:
    # No source images: leave pixel_values unchanged
    dict_to_expand[key] = dict_to_expand[key]
```
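For intuition, here is a plain-list sketch of what a per-sample repeat-interleave must do for packed inputs during beam expansion (the real `_repeat_interleave_samples` operates on tensors; this list version is an illustrative assumption):

```python
def repeat_interleave_samples(packed, lengths, repeat_times):
    """Split a packed sequence into per-sample chunks by `lengths`,
    then repeat each chunk `repeat_times` times, preserving sample order."""
    chunks, offset = [], 0
    for n in lengths:
        chunks.append(packed[offset:offset + n])
        offset += n
    out = []
    for chunk in chunks:
        for _ in range(repeat_times):
            out.extend(chunk)
    return out

# Two samples packed together: sample 0 has 2 rows, sample 1 has 1 row.
expanded = repeat_interleave_samples(["a0", "a1", "b0"], lengths=[2, 1], repeat_times=2)
```

A flat `repeat_interleave` over rows would interleave rows from different samples; splitting by per-sample lengths first keeps each sample's rows contiguous.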
```python
# Before:
super().__init__(
    tie_word_embeddings=tie_word_embeddings, ignore_keys_at_rope_validation={"mrope_section"}, **kwargs
)
# After:
super().__init__(ignore_keys_at_rope_validation={"mrope_section"}, **kwargs)
```
The super().init() call no longer passes tie_word_embeddings. This means the parent class PreTrainedConfig won't receive this parameter. Verify that the parent class correctly handles tie_word_embeddings through **kwargs, or explicitly pass it if needed.
Suggested change:
```python
super().__init__(
    tie_word_embeddings=tie_word_embeddings,
    ignore_keys_at_rope_validation={"mrope_section"},
    **kwargs,
)
```
```python
patch_size = 14
temporal_patch_size = 2
merge_size = 2
min_pixels = None
max_pixels = None
valid_kwargs = GlmImageImageProcessorKwargs
```
The removal of min_pixels and max_pixels as class attributes could be a breaking change for users who directly access these attributes (e.g., processor.min_pixels). Consider documenting this change or maintaining backward compatibility by keeping them as None defaults in the class definition.
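One way to keep compatibility (a sketch of the option suggested above, with illustrative default values; not the PR's code) is to expose the old names as read-only properties derived from the `size` dict:

```python
class ToyFastImageProcessor:
    """Toy shim: removed min_pixels/max_pixels attributes kept as properties."""
    def __init__(self, size=None):
        # Illustrative defaults; the real processor defines its own.
        self.size = size or {"shortest_edge": 3136, "longest_edge": 1003520}

    @property
    def min_pixels(self):
        return self.size.get("shortest_edge")

    @property
    def max_pixels(self):
        return self.size.get("longest_edge")

proc = ToyFastImageProcessor()
```

Old call sites reading `processor.min_pixels` keep working while the canonical state lives in `size`.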
```python
def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        position_ids (`torch.Tensor`, *optional*):
            Deprecated and unused.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
```
The unused position_ids parameter has been removed from the function signature. However, the docstring still references this parameter in the Args section. The docstring should be updated to reflect this change.
Done
@zucchini-nlp updated, PTAL
zucchini-nlp left a comment:
Thanks, last round of comments and we're good!
```python
# Fallback for text-to-image or cases without cached decode positions
# Use simple incremental positions
start_pos = cache_position[0].item()
position_ids = torch.arange(
    start_pos, start_pos + seq_length, device=inputs_embeds.device, dtype=torch.long
)
position_ids = position_ids.unsqueeze(0).repeat(3, batch_size, 1)
```
nit: same as `cache_position[None, None, :].repeat(3, batch_size, 1)`
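The equivalence behind the nit, replayed with plain lists in place of the torch ops (an illustrative sketch, assuming `cache_position` already holds `start_pos .. start_pos + seq_length - 1`):

```python
def incremental_positions(start_pos, seq_length, batch_size):
    # Original: build range(start, start + seq_len), then tile to (3, batch, seq).
    row = list(range(start_pos, start_pos + seq_length))
    return [[row[:] for _ in range(batch_size)] for _ in range(3)]

def from_cache_position(cache_position, batch_size):
    # Suggested: cache_position already holds those positions,
    # so tiling it directly yields the same (3, batch, seq) layout.
    return [[list(cache_position) for _ in range(batch_size)] for _ in range(3)]

cache_position = [5, 6, 7]
a = incremental_positions(5, 3, 2)
b = from_cache_position(cache_position, 2)
```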
```python
self,
pixel_values: torch.FloatTensor,
image_grid_thw: torch.LongTensor | None = None,
return_dict: bool | None = None,
```
No, we don't need `return_dict` as an explicit arg here. We can just pass it over to `model.model.get_image_features`, which handles `return_dict` via a decorator.
Can you make a pass-through like in GLM4V?
See `src/transformers/models/glm4v/modeling_glm4v.py`, lines 1416 to 1429 at be0115e.
```python
@unittest.skip(
    reason="GLM-Image processor adds image placeholder tokens which makes sequence length depend on image size"
)
def test_kwargs_overrides_default_tokenizer_kwargs(self):
    pass

@unittest.skip(
```
This should be handled already by tests, since we have many models that expand text with placeholders.
We need to make sure the test has `self.image_token` and override the sizes in the image processor to tiny values. It will produce a small number of placeholder tokens in that case.
Like this: `tests/models/glm4v/test_processor_glm4v.py`, lines 39 to 51 at be0115e.
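Why tiny image-processor sizes keep the test short: the number of placeholder tokens per image grows with the patch grid. A sketch of the count, assuming the Qwen2-VL-style formula of total patches divided by spatial merging (an assumption, not taken from the GLM-Image source):

```python
def num_placeholder_tokens(grid_t, grid_h, grid_w, merge_size=2):
    # Total patches in the (t, h, w) grid, merged merge_size x merge_size spatially.
    return (grid_t * grid_h * grid_w) // (merge_size ** 2)

# A tiny 1 x 4 x 4 grid yields only 4 placeholder tokens,
# while a 1 x 64 x 64 grid would yield 1024.
tiny = num_placeholder_tokens(1, 4, 4)
large = num_placeholder_tokens(1, 64, 64)
```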
Force-pushed 6e69aad to ee4a877
zucchini-nlp left a comment:
Great work, thanks! One last comment: I see that you added an `mrope_section` property in the config. If a field has a default value, we prefer to set it during `__init__`.
```python
@property
def mrope_section(self) -> list[int]:
    """Return mrope_section from rope_parameters for vLLM MROPE detection."""
    if self.rope_parameters is not None:
        return self.rope_parameters.get("mrope_section", [8, 12, 12])
    return [8, 12, 12]
```
Not sure about this one. We always store all rope params in `self.rope_parameters`, so if `mrope_section` has a default value, it needs to be set in the config's `__init__`.
Sadly, I found this `mrope_section` useless for vLLM MROPE detection; deleted it just now...
run-slow: glm_image
This comment contains models: ["models/glm_image"]
CI Results: Model CI Report: ❌ Failed tests
@zucchini-nlp thanks!
run-slow: glm_image
This comment contains models: ["models/glm_image"]
CI Results: Model CI Report: ❌ Failed tests
[For maintainers] Suggested jobs to run (before merge): run-slow: glm_image
run-slow: glm_image |
This comment contains models: ["models/glm_image"]
CI Results: ✅ No failing test specific to this PR 🎉!
What does this PR do?
This PR adds full batch processing support (batch_size > 1) for the GLM-Image model, fixes padding direction for autoregressive generation, and aligns configuration defaults with the official model.
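The padding-direction fix matters because decoder-only generation continues from the last position of each row. A plain-Python sketch of the difference (illustrative, not the processor's code):

```python
# Sketch: why autoregressive decoding wants left padding.
# With right padding, the "last" position of a short row holds a pad token,
# so generation would continue from pad instead of the real last token.
PAD = 0

def pad_batch(rows, side):
    width = max(len(r) for r in rows)
    out = []
    for r in rows:
        pad = [PAD] * (width - len(r))
        out.append(pad + r if side == "left" else r + pad)
    return out

rows = [[7, 8, 9], [5]]
left = pad_batch(rows, "left")
right = pad_batch(rows, "right")
# Only with left padding does the final column hold each row's true last token.
last_left = [r[-1] for r in left]
last_right = [r[-1] for r in right]
```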
Needs to be used with the Diffusers GLM Image Batch Support PR.
Problem
- `GlmImageProcessor` explicitly rejected batch_size > 1
- `pad_token_id` and `eos_token_id` were not set in `GlmImageTextConfig`, and `max_position_embeddings` didn't match the official config.json
1. Batch support

Processor changes (processing_glm_image.py):
- Added `images_per_sample` tensor to track the number of grids per sample
- Added `num_source_images_per_sample` tensor to distinguish source images from target grids

Model changes (modeling_glm_image.py via modular_glm_image.py):
- Updated `get_rope_index()` to compute position IDs per sample with batch support
- Changed `_cached_decode_position_ids` shape from `[3, max_len]` to `[batch, 3, max_len]`
- Updated `forward()` to properly handle packed `image_grid_thw` by splitting it per sample
- Updated `_expand_inputs_for_generation()` and `prepare_inputs_for_generation()` for beam search compatibility

2. Left padding for autoregressive generation
Processor changes (processing_glm_image.py):
- Left padding is used when `padding=True` is specified

3. Configuration alignment with official model
Config changes (modular_glm_image.py → configuration_glm_image.py):
- Added `pad_token_id=167841` to `GlmImageTextConfig`
- Added `eos_token_id=16385` to `GlmImageTextConfig`
- Changed `max_position_embeddings` from 32768 to 131072 to match the official config.json

Tests
All 64 modeling tests pass. Added `test_batch_consistency` to verify that batched and single-sample processing produce identical predictions.

Breaking changes
None. All new parameters are optional and backward compatible.
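The batch-consistency check follows a generic pattern; a toy sketch with a hypothetical deterministic "model" (not the real test, which compares GLM-Image predictions):

```python
PAD = 0

def predict(sample):
    # Hypothetical deterministic model: next token = last non-pad token + 1.
    non_pad = [t for t in sample if t != PAD]
    return non_pad[-1] + 1

def predict_batched(samples):
    # Left-pad to a common length, then run each row, as batched inference would.
    width = max(len(s) for s in samples)
    padded = [[PAD] * (width - len(s)) + s for s in samples]
    return [predict(row) for row in padded]

samples = [[1, 2, 3], [9]]
single = [predict(s) for s in samples]
batched = predict_batched(samples)
```

If padding is handled correctly, `single` and `batched` agree for every sample regardless of batch composition.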
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.