
Add sliding window attention to sdpa in mistral #28980

@ehuaa


Feature request

https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L1006-L1023

In the code linked above, the latest version of transformers cannot use the sliding-window feature of the Mistral model with the SDPA attention implementation.
I suspect the reason is the one you mentioned here:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L687-L688
This PyTorch issue prevented SDPA from working with a custom attn_mask such as a sliding-window attention mask:
pytorch/pytorch#112577

This issue has been fixed since torch 2.2.0, which was released two weeks ago, so could you add this feature back to the SDPA path in Mistral?
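To make the request concrete, here is a minimal pure-Python sketch of what SDPA does with a custom boolean attn_mask (True = attend): masked positions are set to -inf before the softmax, so they receive zero attention weight. This is only an illustrative stand-in for `torch.nn.functional.scaled_dot_product_attention`, not the library code.

```python
import math

def masked_softmax(scores, mask):
    # Replace disallowed positions with -inf, then softmax the row.
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    mx = max(masked)
    exps = [math.exp(s - mx) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# A sliding-window mask lets the current token see only the last few positions;
# masked positions get zero weight and the rest renormalize.
weights = masked_softmax([1.0, 1.0, 1.0, 1.0], [False, False, True, True])
# -> [0.0, 0.0, 0.5, 0.5]
```

Before the torch fix, passing such a custom mask to SDPA on some backends triggered the bug tracked in the issue above, which is why the sliding-window path was disabled.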

Motivation

I cannot use sliding window with SDPA right now. My GPU is a V100, so I cannot use FlashAttention-2.

Your contribution

I think we can pass a sliding_window param to the _prepare_4d_causal_attention_mask_for_sdpa function.
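A rough sketch of the intended mask semantics, assuming the helper gains a sliding_window keyword like its non-SDPA counterpart. The real helper lives in transformers.modeling_attn_mask_utils; the function below is a hypothetical stand-in that only illustrates how the parameter would change the mask, not the actual library implementation.

```python
def prepare_causal_mask_sketch(seq_len, sliding_window=None):
    """Return a [seq_len, seq_len] boolean mask (True = attend).

    Hypothetical stand-in for the proposed sliding_window plumbing in
    _prepare_4d_causal_attention_mask_for_sdpa (batch/head dims omitted).
    """
    def visible(i, j):
        if j > i:                       # causal: no attending to the future
            return False
        if sliding_window is not None:  # sliding window: limited lookback
            return i - j < sliding_window
        return True
    return [[visible(i, j) for j in range(seq_len)] for i in range(seq_len)]

# Without a window the mask is plain causal; with one, distant positions drop out.
full = prepare_causal_mask_sketch(4)
windowed = prepare_causal_mask_sketch(4, sliding_window=2)
```

The model's forward pass would then forward `self.config.sliding_window` into the mask-preparation call on the SDPA path, mirroring what the eager path already does.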


Labels

Good Second Issue: issues that are more difficult to do than "Good First" issues - give it a try if you want!
