`scaled_dot_product_attention` behaves differently between v2.0 and v2.1

### 🐛 Describe the bug

With torch v2.1, `scaled_dot_product_attention` on `GPU` returns `nan` for any query position whose entire attention mask row is filled with a large negative value (e.g. `torch.finfo(q.dtype).min`, used to mean "attend to nothing"). On `CPU`, it does not return `nan`.

With torch v2.0, there is no `nan` on either `CPU` or `GPU`, and the values match those from `v2.1 + CPU`.

I understand that it doesn't really make sense for a sequence to have no position to attend to. However, I am wondering **whether this `nan` value in torch v2.1 is intentional or unexpected**.
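To make the degenerate case concrete, here is a minimal pure-Python sketch of what a softmax does with a fully masked row. It is only an illustration of the arithmetic, not a description of what the PyTorch CPU or CUDA kernels actually do internally. The `softmax` helper and the `scores` values are hypothetical, and `-sys.float_info.max` stands in for `torch.finfo(dtype).min` in float64.

```python
import math
import sys

def softmax(row):
    """Numerically stable softmax over a list of floats (subtracts the row max)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for one query against three keys.
scores = [0.3, -1.2, 0.8]

# Largest-magnitude finite negative float64 (analogue of torch.finfo(dtype).min).
finite_min = -sys.float_info.max

# Finite "min" mask: the small scores are absorbed by rounding, so all entries
# end up equal and the softmax degenerates to uniform weights.
weights_finite = softmax([s + finite_min for s in scores])

# -inf mask: the row max is -inf, and (-inf) - (-inf) is NaN, so every weight is NaN.
weights_inf = softmax([s + float("-inf") for s in scores])

print(weights_finite)
print(weights_inf)
```

Under these assumptions, a finite minimum yields uniform attention weights, while a true `-inf` row yields `nan` everywhere. Either outcome is possible for a fully masked row depending on how a given implementation handles the mask, which is why the question of intended behavior matters.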

This causes issues in the `falcon` implementation in `transformers` when left padding is used.

### Reproduce
(running with torch v2.1)
```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

a = 3
b = 4

q = torch.randn(size=(1, 1, a, b))
k = torch.randn(size=(1, 1, a, b))
v = torch.randn(size=(1, 1, a, b))

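def check(q, k, v, device):
    # Move the inputs to the device under test.
    q = q.to(device)
    k = k.to(device)
    v = v.to(device)

    # Float masks are added to the attention scores. The first query row is
    # fully masked with the most negative finite value of the dtype.
    neg_value = torch.finfo(q.dtype).min
    mask = [[neg_value, neg_value, neg_value], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    mask = torch.tensor([[mask]]).to(device)  # shape (1, 1, a, a)

    o = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
    print(o)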

check(q, k, v, "cpu")
check(q, k, v, "cuda")
```

### Outputs

- with torch v2.0 (both `CPU` and `GPU`) or torch v2.1 (`CPU`)
```
tensor([[[[ 0.1210,  0.3627, -0.9969, -0.6149],
          [ 0.1295,  0.4572, -1.0491, -0.6166],
          [ 0.1095,  0.3819, -0.7369, -0.8267]]]])
```
- torch v2.1 (`GPU`)
```
tensor([[[[    nan,     nan,     nan,     nan],
          [ 0.1295,  0.4572, -1.0491, -0.6166],
          [ 0.1095,  0.3819, -0.7369, -0.8267]]]], device='cuda:0')
```

### Versions

Collecting environment information...
PyTorch version: 2.1.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Home
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.8.16 (default, Jun 12 2023, 21:00:42) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22621-SP0
Is CUDA available: True
CUDA runtime version: 11.6.112
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU
Nvidia driver version: 517.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2400
DeviceID=CPU0
Family=198
L2CacheSize=11776
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2400
Name=12th Gen Intel(R) Core(TM) i7-12800H
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.4
[pip3] torch==2.1.0+cu118
[pip3] torchaudio==2.1.0+cu118
[pip3] torchvision==0.16.0+cu118
[conda] numpy                     1.24.4                   pypi_0    pypi
[conda] torch                     2.1.0+cu118              pypi_0    pypi
[conda] torchaudio                2.1.0+cu118              pypi_0    pypi
[conda] torchvision               0.16.0+cu118             pypi_0    pypi