[Mistral&Mixtral] Add sliding window for sdpa #29407
ehuaa wants to merge 21 commits into huggingface:main from ehuaa:add_sliding_window_for_sdpa
Conversation
…https://github.com/ehuaa/transformers into add_sliding_window_for_sdpa
Thanks! Let's throw in a generation test as well and we should be good to go! 🤗
Ok, and the flash vs. sdpa test I submitted above does not pass. Have you debugged it? I'm also curious why it fails here.
No, I have not debugged it, and I won't have the bandwidth. Do you need help on this? cc @younesbelkada, I think this is pretty important.
As for the generation test you mentioned above, I think test_model_7b_long_prompt_sdpa is enough; it already covers generation with sdpa and the sliding window.
And I see that Gemma has a similar sdpa logits test (https://github.com/huggingface/transformers/blob/main/tests/models/gemma/test_modeling_gemma.py#L471), like the one I committed. I think that test passes, so maybe it can help with the debugging.
ArthurZucker left a comment
Late but glad we waited!
The _prepare_4d_causal_attention_mask_for_sdpa does not seem to fare well with sliding_window when there is no mask. Let's add one more full generation test similar to test_model_7b_logits_long_with_sdpa_and_flash2, but actually generating!
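Something along these lines could serve as that extra generation test; this is only a rough sketch under assumptions from this thread (the test name, max_new_tokens=20, greedy decoding, and bfloat16 are illustrative choices here, not taken from the PR):

```python
import torch
from transformers import MistralForCausalLM


def test_model_7b_long_prompt_generation_sdpa_vs_flash2():
    # Prompt longer than the 4096-token sliding window, as in the logits test above.
    input_ids = [1] + [306, 338] * 2048

    generations = {}
    for impl in ("flash_attention_2", "sdpa"):
        model = MistralForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-v0.1",
            device_map="auto",
            attn_implementation=impl,
            torch_dtype=torch.bfloat16,
        )
        input_tensor = torch.tensor([input_ids]).to(model.model.embed_tokens.weight.device)
        # Greedy decoding so both backends should produce identical continuations.
        out = model.generate(input_tensor, max_new_tokens=20, do_sample=False)
        # Only keep the newly generated tokens for the comparison.
        generations[impl] = out[0, len(input_ids):].tolist()
        del model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    assert generations["sdpa"] == generations["flash_attention_2"]
```

Greedy decoding keeps the run deterministic, so any divergence between the two backends shows up directly in the generated tokens.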
```python
model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto", attn_implementation="flash_attention_2"
)
input_ids = torch.tensor([input_ids]).to(model.model.embed_tokens.weight.device)
with torch.no_grad():
    out = model(input_ids).logits.cpu()

input_ids = [1] + [306, 338] * 2048
model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto", attn_implementation="sdpa"
)
```
Suggested change:

```diff
  model = MistralForCausalLM.from_pretrained(
-     "mistralai/Mistral-7B-v0.1", device_map="auto", attn_implementation="flash_attention_2"
+     "mistralai/Mistral-7B-v0.1", device_map="auto", attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16
  )
  input_ids = torch.tensor([input_ids]).to(model.model.embed_tokens.weight.device)
  with torch.no_grad():
      out = model(input_ids).logits.cpu()
  input_ids = [1] + [306, 338] * 2048
  model = MistralForCausalLM.from_pretrained(
-     "mistralai/Mistral-7B-v0.1", device_map="auto", attn_implementation="sdpa"
+     "mistralai/Mistral-7B-v0.1", device_map="auto", attn_implementation="sdpa", torch_dtype=torch.bfloat16
  )
```
I am getting an error because by default it seems to be float32.
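For context: the FlashAttention kernels only support fp16 and bf16 inputs, so loading the model in the default float32 would be expected to fail on the flash_attention_2 path; passing torch_dtype=torch.bfloat16 as suggested also keeps both runs in the same dtype, which makes the logits comparison meaningful.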
```python
with torch.no_grad():
    out = model(input_ids).logits.cpu()

input_ids = [1] + [306, 338] * 2048
model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto", attn_implementation="sdpa"
)
input_ids = torch.tensor([input_ids]).to(model.model.embed_tokens.weight.device)
with torch.no_grad():
    out1 = model(input_ids).logits.cpu()
torch.testing.assert_close(out.mean(-1), out1.mean(-1), atol=1e-2, rtol=1e-2)
```
Let's make sure we test all logits, not just the mean.
Suggested change:

```diff
- torch.testing.assert_close(out.mean(-1), out1.mean(-1), atol=1e-2, rtol=1e-2)
+ torch.testing.assert_close(out, out1, atol=1e-4, rtol=1e-4)
```
with this, the test is failing:

```
>       torch.testing.assert_close(out, out1, atol=1e-4, rtol=1e-4)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 90967735 / 131104000 (69.4%)
E       Greatest absolute difference: 0.328125 at index (0, 2310, 338) (up to 0.0001 allowed)
E       Greatest relative difference: 114689.0 at index (0, 1267, 4581) (up to 0.0001 allowed)
```

```python
    (batch_size, seq_length),
    inputs_embeds,
    past_key_values_length,
    sliding_window=self.config.sliding_window if is_torch_version_greater_or_equal_than_2_2_0 else None,
```
The issue here is that _prepare_4d_causal_attention_mask_for_sdpa seems to return None if attention_mask is None (which is the case in the test), while if we actually want to use the sliding window we need to return the full causal mask. cc @fxmarty
@fxmarty if you want to take over in a new PR, this is fairly important IMO
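For reference, a minimal repro of the behaviour described above might look like the following; this is a sketch based on this thread (the shapes and the 4096 sliding window are illustrative assumptions), not code from the PR:

```python
import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask_for_sdpa

# Illustrative shapes: a sequence one token longer than the sliding window,
# so the sliding-window part of the mask would actually matter.
batch_size, seq_length, sliding_window = 1, 4097, 4096
inputs_embeds = torch.zeros(batch_size, seq_length, 8)  # stand-in for real embeddings

mask = _prepare_4d_causal_attention_mask_for_sdpa(
    None,                        # attention_mask is None, as in the failing test
    (batch_size, seq_length),
    inputs_embeds,
    0,                           # past_key_values_length
    sliding_window=sliding_window,
)
# Per the comment above, this reportedly comes back as None, i.e. SDPA falls back
# to a plain causal mask, while a sliding-window causal mask is what is needed.
print(mask)
```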
This PR will solve #28980
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing as #30127 was merged and takes inspiration from this PR.
@ArthurZucker Arthur has reviewed this before, but my git change log got messed up, so I opened a new PR instead. I uploaded a new test comparing sliding-window flash attention vs. sdpa for checking.
Supersedes #29220