Fix non FA2 tests after FA2 installed in CI docker image #40430
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
model = Glm4vForConditionalGeneration.from_pretrained(
    "THUDM/GLM-4.1V-9B-Thinking", dtype=torch.float16, device_map="auto"
)
questions = ["Describe this video."] * 2
Don't use batch 2, otherwise it OOMs. It's fine to simply test batch 1 here; the goal is to check that video works.
Hmm, we have the same for images at
transformers/tests/models/glm4v/test_modeling_glm4v.py
Lines 314 to 370 in 8828b2e
Would it be possible to have something similar for videos? Fine with this as well, though; just not sure whether batching videos could hit different issues than batching images.
I will just leave it as it is. I don't have enough bandwidth, and it never passed anyway.
model_id = "mistralai/Mistral-7B-v0.1"
EXPECTED_COMPLETIONS = [
    "This is a nice place. This is a nice place. This is a nice place. This is",
    "scenery, scenery, scenery, scenery, scenery,",
Due to the 800 --> 682 change below.
if attn_implementation in ["flex_attention", "eager"]:
    input_text = input_text[:1]
eager still OOMs with 682, so just make it batch size 1.
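For reference, the batch-size reduction being discussed can be sketched roughly like this; the prompt strings and the `attn_implementation` value below are illustrative placeholders, not the actual test inputs:

```python
# Rough sketch of the discussed workaround: for attention implementations that
# OOM at the longer sequence length, drop to batch size 1.
input_text = [
    "This is a nice place.",          # first prompt (kept)
    "The scenery here is stunning.",  # second prompt (dropped for eager/flex)
]

attn_implementation = "eager"  # assumed value, for illustration only
if attn_implementation in ["flex_attention", "eager"]:
    # These implementations run out of memory at length 682, so keep only
    # the first prompt (batch size 1).
    input_text = input_text[:1]
```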
@pytest.mark.flash_attn_test
def test_model_600m_long_prompt(self):
    EXPECTED_OUTPUT_TOKEN_IDS = [306, 338]
    EXPECTED_OUTPUT_TOKEN_IDS = [198, 198]
This test never ran before. The sdpa version of this test already uses 198.
Maybe move it into the same test and parametrize instead? Looks like another parity check between sdpa and flash that was hidden.
Look at the two tests: one does more work than the other, and the loading is also different (4-bit or not). I will simply keep them as they are.
I see, makes sense, I didn't look in detail myself ^^
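For reference, the parametrization suggested above could look roughly like this. The expected token ids and the model-loading step are placeholders, and, as noted, the real pair of tests also differs in 4-bit vs. regular loading, which this sketch does not handle:

```python
# Hedged sketch of folding the sdpa and flash-attention variants into one
# parametrized test, as suggested above. Expected ids and loading details
# are illustrative, not taken from the actual tests.
import pytest

# hypothetical map of attention implementation -> expected first tokens
EXPECTED_OUTPUT_TOKEN_IDS = {
    "sdpa": [198, 198],
    "flash_attention_2": [198, 198],
}

@pytest.mark.parametrize("attn_implementation", ["sdpa", "flash_attention_2"])
def test_model_600m_long_prompt(attn_implementation):
    expected = EXPECTED_OUTPUT_TOKEN_IDS[attn_implementation]
    # ... load the model with attn_implementation=..., generate on the long
    # prompt, then compare the leading generated tokens against `expected`.
    # Placeholder assertion so the sketch is self-contained:
    assert len(expected) == 2
```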
vasqu
left a comment
LGTM overall! Just some smaller things, and good to see that we now run these hidden tests.
[For maintainers] Suggested jobs to run (before merge): run-slow: glm4v, mistral, qwen2, qwen3
What does this PR do?