[GLM-Image] Add batch > 1 support and fix configuration defaults#43342
zucchini-nlp merged 68 commits into huggingface:main
Conversation
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Pull request overview
This PR adds comprehensive batch processing support (batch_size > 1) for the GLM-Image model, enabling efficient parallel image generation. Previously, the processor explicitly rejected batch sizes greater than 1. The implementation introduces two new tracking tensors (images_per_sample and num_source_images_per_sample) to manage packed image grids across batch samples, updates the RoPE position encoding computation to work per-sample, and modifies the generation utilities to correctly expand inputs for beam search.
Changes:
- Removed batch size restriction in processor and added per-sample image tracking
- Updated position ID computation to handle batches with independent per-sample caching
- Modified beam search expansion logic to correctly handle packed visual inputs across batch samples
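For intuition, the bookkeeping these tracking tensors carry can be sketched in plain Python (a simplified assumption of the counting logic, with one target grid per sample as in text-to-image; not the PR's actual code, which stores `torch.long` tensors):

```python
# Hypothetical sketch of the two per-sample tracking quantities.
def count_images_per_sample(num_source_images_per_sample, num_target_grids=1):
    """Each sample contributes its source images plus its target grid(s).

    Returns (images_per_sample, num_source_images_per_sample) as plain lists.
    """
    num_source = list(num_source_images_per_sample)
    images_per_sample = [n + num_target_grids for n in num_source]
    return images_per_sample, num_source

# Batch of three samples with 2, 0, and 1 source images respectively.
totals, sources = count_images_per_sample([2, 0, 1])
```

With counts like these, downstream code can split a packed `image_grid_thw` back into per-sample chunks.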
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| tests/models/glm_image/test_modeling_glm_image.py | Adds test_batch_consistency to verify batch and single processing produce identical predictions |
| src/transformers/models/glm_image/processing_glm_image.py | Removes batch size restriction, adds image counting per sample, and implements per-sample prompt/grid construction |
| src/transformers/models/glm_image/modular_glm_image.py | Updates get_rope_index() for per-sample position IDs, modifies forward() to split packed grids, updates generation utilities for batch support |
| src/transformers/models/glm_image/modeling_glm_image.py | Auto-generated from modular file with matching changes for batch support |
| src/transformers/models/glm_image/image_processing_glm_image_fast.py | Removes min_pixels/max_pixels class attributes and simplifies initialization |
| src/transformers/models/glm_image/configuration_glm_image.py | Reorders tie_word_embeddings parameter and removes it from super().init() call |
```python
num_target_grids = all_target_grids[0].shape[0]
image_inputs["images_per_sample"] = torch.tensor(
    [n + num_target_grids for n in images_per_sample], dtype=torch.long
```
This line assumes all samples have the same number of target grids by using all_target_grids[0].shape[0]. If different samples could have different numbers of target grids (e.g., due to different is_text_to_image settings), this would cause incorrect counting. Consider validating that all samples have the same num_target_grids, or handle varying target grid counts per sample.
Suggested change:
```python
target_grids_per_sample = [grids.shape[0] for grids in all_target_grids]
image_inputs["images_per_sample"] = torch.tensor(
    [n_source + n_target for n_source, n_target in zip(images_per_sample, target_grids_per_sample)],
    dtype=torch.long,
```
```python
num_target_grids = all_target_grids[0].shape[0]
image_inputs["images_per_sample"] = torch.tensor(
    [n + num_target_grids for n in images_per_sample], dtype=torch.long
)
```
This line assumes all samples have the same number of target grids by using all_target_grids[0].shape[0]. If different samples could have different numbers of target grids (e.g., due to different is_text_to_image settings), this would cause incorrect counting. Consider validating that all samples have the same num_target_grids, or handle varying target grid counts per sample.
Suggested change:
```python
num_target_grids_per_sample = [g.shape[0] for g in all_target_grids]
if len(set(num_target_grids_per_sample)) == 1:
    num_target_grids = num_target_grids_per_sample[0]
    images_per_sample_with_targets = [n + num_target_grids for n in images_per_sample]
else:
    images_per_sample_with_targets = [n + t for n, t in zip(images_per_sample, num_target_grids_per_sample)]
image_inputs["images_per_sample"] = torch.tensor(images_per_sample_with_targets, dtype=torch.long)
```
```python
dict_to_expand[key] = _repeat_interleave_samples(
    dict_to_expand[key], lengths=lengths, repeat_times=expand_size
)
```
When splitting pixel_values for beam search expansion, if sum(source_image_nums) == 0 (no source images), the pixel_values tensor is not handled. While the code checks if sum > 0 before processing, it doesn't explicitly handle the else case where pixel_values should remain unchanged or be set appropriately. Consider adding an explicit else clause or handling for the case where there are no source images.
Suggested change:
```python
else:
    # No source images: leave pixel_values unchanged
    dict_to_expand[key] = dict_to_expand[key]
```
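For intuition, here is a plain-list sketch of what a per-sample repeat-interleave must do for packed inputs during beam expansion (the real `_repeat_interleave_samples` operates on tensors; this list version is an illustrative assumption):

```python
def repeat_interleave_samples(packed, lengths, repeat_times):
    """Split a packed sequence into per-sample chunks by `lengths`,
    then repeat each chunk `repeat_times` times, preserving sample order."""
    chunks, offset = [], 0
    for n in lengths:
        chunks.append(packed[offset:offset + n])
        offset += n
    out = []
    for chunk in chunks:
        for _ in range(repeat_times):
            out.extend(chunk)
    return out

# Two samples packed together: sample 0 has 2 rows, sample 1 has 1 row.
expanded = repeat_interleave_samples(["a0", "a1", "b0"], lengths=[2, 1], repeat_times=2)
```

A flat `repeat_interleave` over rows would interleave rows from different samples; splitting by per-sample lengths first keeps each sample's rows contiguous.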
```python
# Before:
super().__init__(
    tie_word_embeddings=tie_word_embeddings, ignore_keys_at_rope_validation={"mrope_section"}, **kwargs
)
# After:
super().__init__(ignore_keys_at_rope_validation={"mrope_section"}, **kwargs)
```
The super().init() call no longer passes tie_word_embeddings. This means the parent class PreTrainedConfig won't receive this parameter. Verify that the parent class correctly handles tie_word_embeddings through **kwargs, or explicitly pass it if needed.
Suggested change:
```python
super().__init__(
    tie_word_embeddings=tie_word_embeddings,
    ignore_keys_at_rope_validation={"mrope_section"},
    **kwargs,
)
```
```python
patch_size = 14
temporal_patch_size = 2
merge_size = 2
min_pixels = None
max_pixels = None
valid_kwargs = GlmImageImageProcessorKwargs
```
The removal of min_pixels and max_pixels as class attributes could be a breaking change for users who directly access these attributes (e.g., processor.min_pixels). Consider documenting this change or maintaining backward compatibility by keeping them as None defaults in the class definition.
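One way to keep compatibility (a sketch of the option suggested above, with illustrative default values; not the PR's code) is to expose the old names as read-only properties derived from the `size` dict:

```python
class ToyFastImageProcessor:
    """Toy shim: removed min_pixels/max_pixels attributes kept as properties."""
    def __init__(self, size=None):
        # Illustrative defaults; the real processor defines its own.
        self.size = size or {"shortest_edge": 3136, "longest_edge": 1003520}

    @property
    def min_pixels(self):
        return self.size.get("shortest_edge")

    @property
    def max_pixels(self):
        return self.size.get("longest_edge")

proc = ToyFastImageProcessor()
```

Old call sites reading `processor.min_pixels` keep working while the canonical state lives in `size`.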
```python
def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        position_ids (`torch.Tensor`, *optional*):
            Deprecated and unused.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
```
The unused position_ids parameter has been removed from the function signature. However, the docstring still references this parameter in the Args section. The docstring should be updated to reflect this change.
Done
@zucchini-nlp updated, PTAL
zucchini-nlp left a comment:
Thanks, last round of comments and we're good!
```python
# Fallback for text-to-image or cases without cached decode positions
# Use simple incremental positions
start_pos = cache_position[0].item()
position_ids = torch.arange(
    start_pos, start_pos + seq_length, device=inputs_embeds.device, dtype=torch.long
)
position_ids = position_ids.unsqueeze(0).repeat(3, batch_size, 1)
```
nit: same as `cache_position[None, None, :].repeat(3, batch_size, 1)`
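The equivalence behind the nit, replayed with plain lists in place of the torch ops (an illustrative sketch, assuming `cache_position` already holds `start_pos .. start_pos + seq_length - 1`):

```python
def incremental_positions(start_pos, seq_length, batch_size):
    # Original: build range(start, start + seq_len), then tile to (3, batch, seq).
    row = list(range(start_pos, start_pos + seq_length))
    return [[row[:] for _ in range(batch_size)] for _ in range(3)]

def from_cache_position(cache_position, batch_size):
    # Suggested: cache_position already holds those positions,
    # so tiling it directly yields the same (3, batch, seq) layout.
    return [[list(cache_position) for _ in range(batch_size)] for _ in range(3)]

cache_position = [5, 6, 7]
a = incremental_positions(5, 3, 2)
b = from_cache_position(cache_position, 2)
```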
```python
self,
pixel_values: torch.FloatTensor,
image_grid_thw: torch.LongTensor | None = None,
return_dict: bool | None = None,
```
No, we don't need `return_dict` as an explicit arg here. We can just pass it over to `model.model.get_image_features`, which handles `return_dict` via a decorator.
Can you make a pass-through like in GLM4V?
See `src/transformers/models/glm4v/modeling_glm4v.py`, lines 1416 to 1429 at be0115e.
```python
@unittest.skip(
    reason="GLM-Image processor adds image placeholder tokens which makes sequence length depend on image size"
)
def test_kwargs_overrides_default_tokenizer_kwargs(self):
    pass

@unittest.skip(
```
This should be handled already by tests, since we have many models that expand text with placeholders.
We need to make sure the test has `self.image_token` and override the sizes in the image processor to tiny values. It will produce a small number of placeholder tokens in that case.
Like this: `tests/models/glm4v/test_processor_glm4v.py`, lines 39 to 51 at be0115e.
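Why tiny image-processor sizes keep the test short: the number of placeholder tokens per image grows with the patch grid. A sketch of the count, assuming the Qwen2-VL-style formula of total patches divided by spatial merging (an assumption, not taken from the GLM-Image source):

```python
def num_placeholder_tokens(grid_t, grid_h, grid_w, merge_size=2):
    # Total patches in the (t, h, w) grid, merged merge_size x merge_size spatially.
    return (grid_t * grid_h * grid_w) // (merge_size ** 2)

# A tiny 1 x 4 x 4 grid yields only 4 placeholder tokens,
# while a 1 x 64 x 64 grid would yield 1024.
tiny = num_placeholder_tokens(1, 4, 4)
large = num_placeholder_tokens(1, 64, 64)
```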
Force-pushed 6e69aad to ee4a877
zucchini-nlp left a comment:
Great work, thanks! One last comment: I see that you added an `mrope_section` property in the config. If a field has a default value, we prefer to set it during `__init__`.
```python
@property
def mrope_section(self) -> list[int]:
    """Return mrope_section from rope_parameters for vLLM MROPE detection."""
    if self.rope_parameters is not None:
        return self.rope_parameters.get("mrope_section", [8, 12, 12])
    return [8, 12, 12]
```
Not sure about this one. We always store all rope params in `self.rope_parameters`, so if `mrope_section` has a default value, it needs to be set in the config's `__init__`.
Sadly, I found this `mrope_section` useless for vLLM MROPE detection; deleted it just now...
run-slow: glm_image
This comment contains models: ["models/glm_image"]
CI Results: Model CI Report: ❌ Failed tests
@zucchini-nlp thanks!
run-slow: glm_image
This comment contains models: ["models/glm_image"]
CI Results: Model CI Report: ❌ Failed tests
[For maintainers] Suggested jobs to run (before merge): run-slow: glm_image
run-slow: glm_image |
This comment contains models: ["models/glm_image"]
CI Results: ✅ No failing test specific to this PR 🎉!
What does this PR do?
This PR adds full batch processing support (batch_size > 1) for the GLM-Image model, fixes padding direction for autoregressive generation, and aligns configuration defaults with the official model.
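The padding-direction fix matters because decoder-only generation continues from the last position of each row. A plain-Python sketch of the difference (illustrative, not the processor's code):

```python
# Sketch: why autoregressive decoding wants left padding.
# With right padding, the "last" position of a short row holds a pad token,
# so generation would continue from pad instead of the real last token.
PAD = 0

def pad_batch(rows, side):
    width = max(len(r) for r in rows)
    out = []
    for r in rows:
        pad = [PAD] * (width - len(r))
        out.append(pad + r if side == "left" else r + pad)
    return out

rows = [[7, 8, 9], [5]]
left = pad_batch(rows, "left")
right = pad_batch(rows, "right")
# Only with left padding does the final column hold each row's true last token.
last_left = [r[-1] for r in left]
last_right = [r[-1] for r in right]
```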
Needs to be used with the Diffusers GLM Image Batch Support PR.
Problem
- `GlmImageProcessor` explicitly rejected batch_size > 1
- `pad_token_id` and `eos_token_id` were not set in `GlmImageTextConfig`, and `max_position_embeddings` didn't match the official config.json
1. Batch support

Processor changes (processing_glm_image.py):
- Added `images_per_sample` tensor to track the number of grids per sample
- Added `num_source_images_per_sample` tensor to distinguish source images from target grids

Model changes (modeling_glm_image.py via modular_glm_image.py):
- Updated `get_rope_index()` to compute position IDs per sample with batch support
- Changed `_cached_decode_position_ids` shape from `[3, max_len]` to `[batch, 3, max_len]`
- Updated `forward()` to properly handle packed `image_grid_thw` by splitting it per sample
- Updated `_expand_inputs_for_generation()` and `prepare_inputs_for_generation()` for beam search compatibility

2. Left padding for autoregressive generation
Processor changes (processing_glm_image.py):
- Left padding is used when `padding=True` is specified

3. Configuration alignment with official model
Config changes (modular_glm_image.py → configuration_glm_image.py):
- Added `pad_token_id=167841` to `GlmImageTextConfig`
- Added `eos_token_id=16385` to `GlmImageTextConfig`
- Changed `max_position_embeddings` from 32768 to 131072 to match the official config.json

Tests
All 64 modeling tests pass. Added `test_batch_consistency` to verify that batched and single-sample processing produce identical predictions.

Breaking changes
None. All new parameters are optional and backward compatible.
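The batch-consistency check follows a generic pattern; a toy sketch with a hypothetical deterministic "model" (not the real test, which compares GLM-Image predictions):

```python
PAD = 0

def predict(sample):
    # Hypothetical deterministic model: next token = last non-pad token + 1.
    non_pad = [t for t in sample if t != PAD]
    return non_pad[-1] + 1

def predict_batched(samples):
    # Left-pad to a common length, then run each row, as batched inference would.
    width = max(len(s) for s in samples)
    padded = [[PAD] * (width - len(s)) + s for s in samples]
    return [predict(row) for row in padded]

samples = [[1, 2, 3], [9]]
single = [predict(s) for s in samples]
batched = predict_batched(samples)
```

If padding is handled correctly, `single` and `batched` agree for every sample regardless of batch composition.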
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.