use_flash_attention_2=True for Llama2 breaks generation #26697

@markovalexander

Description

System Info

  • transformers version: 4.34.0
  • Platform: Linux-5.15.0-1042-oracle-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.2
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

text models: @ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Loading Llama-2 with use_flash_attention_2=True completely breaks generation: the output no longer matches what the same checkpoint produces with the default attention implementation.

[Screenshot: generation output with use_flash_attention_2=True diverges from the default-attention output for the same prompt]
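
Since the screenshot isn't recoverable, here is a minimal sketch of the comparison it showed, assuming meta-llama/Llama-2-7b-hf, an illustrative prompt, and greedy decoding (none of these details are confirmed by the original report):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"   # assumed; the checkpoint in the screenshot is unknown
prompt = "The capital of France is"      # assumed illustrative prompt

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Baseline: default (eager) attention implementation.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
baseline = tokenizer.decode(
    model.generate(**inputs, max_new_tokens=32, do_sample=False)[0]
)

# Free the baseline model before loading the second copy.
del model
torch.cuda.empty_cache()

# Same checkpoint with Flash Attention 2 enabled (transformers 4.34 kwarg).
model_fa2 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, use_flash_attention_2=True
).to("cuda")
fa2 = tokenizer.decode(
    model_fa2.generate(**inputs, max_new_tokens=32, do_sample=False)[0]
)

print("baseline:", baseline)
print("fa2:     ", fa2)  # on 4.34.0 this output diverges from the baseline
```

With greedy decoding the two outputs should be token-for-token identical, so any divergence points at the Flash Attention 2 code path rather than sampling noise.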

Expected behavior

Generations with and without use_flash_attention_2=True should match.
