Description
🐛 Describe the bug
With torch v2.1, scaled_dot_product_attention on GPU returns nan for a query position whose attention-mask row consists entirely of large negative values (e.g. torch.finfo(q.dtype).min, used to mean that no position should be attended to). On CPU, it does not return nan.
With torch v2.0, neither CPU nor GPU returns nan, and the values match those produced by v2.1 on CPU.
I understand that it doesn't really make sense for a sequence position to have nothing to attend to. However, I am wondering whether this nan in torch v2.1 is intentional or unexpected.
This causes issues for the Falcon implementation in transformers when left padding is used.
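To illustrate why a fully masked row is a numerical edge case, here is a minimal sketch using plain torch.softmax on CPU. It does not reproduce or explain the GPU kernel's behavior; it only contrasts the two outcomes a fully masked row can produce. The names scores, masked_min, masked_inf, p_min and p_inf are made up for this example.

```python
import torch

# Hypothetical attention scores for one query position (illustration only).
scores = torch.tensor([0.3, -1.2, 2.0])
neg_value = torch.finfo(scores.dtype).min

# Adding finfo.min swamps the scores: every entry rounds to finfo.min,
# so softmax sees a constant row and returns equal weights.
masked_min = scores + neg_value
p_min = torch.softmax(masked_min, dim=-1)

# With -inf instead, the row maximum is -inf and the softmax
# normalization becomes 0 / 0, which yields nan.
masked_inf = scores + float("-inf")
p_inf = torch.softmax(masked_inf, dim=-1)

print(p_min)
print(p_inf)
```

Equal weights mean the output for that position is simply the mean of the value rows, which would be consistent with the non-nan result reported for CPU. An implementation that effectively treats the row as containing no valid entries would instead produce nan.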
Reproduce
(running with torch v2.1)
import torch
from transformers import FalconModel
from torch.nn import functional as F

torch.manual_seed(0)

a = 3
b = 4

q = torch.randn(size=(1, 1, a, b))
k = torch.randn(size=(1, 1, a, b))
v = torch.randn(size=(1, 1, a, b))

def check(q, k, v, device):
    q = q.to(device)
    k = k.to(device)
    v = v.to(device)

    neg_value = torch.finfo(q.dtype).min
    mask = [[neg_value, neg_value, neg_value], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    mask = torch.tensor([[mask]]).to(device)

    o = F.scaled_dot_product_attention(q, k, v, mask, 0.0, is_causal=False)
    print(o)

check(q, k, v, "cpu")
check(q, k, v, "cuda")
Outputs
- with torch v2.0 (both CPU and GPU) or torch v2.1 (CPU)
tensor([[[[ 0.1210,  0.3627, -0.9969, -0.6149],
          [ 0.1295,  0.4572, -1.0491, -0.6166],
          [ 0.1095,  0.3819, -0.7369, -0.8267]]]])
- torch v2.1 (GPU)
tensor([[[[    nan,     nan,     nan,     nan],
          [ 0.1295,  0.4572, -1.0491, -0.6166],
          [ 0.1095,  0.3819, -0.7369, -0.8267]]]], device='cuda:0')
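One possible workaround on the caller's side, sketched here under assumptions, is to detect mask rows that are entirely torch.finfo(dtype).min and reset them to 0.0 before calling scaled_dot_product_attention, so that no query position is fully masked. The helper unmask_fully_masked_rows is hypothetical and written for this example; it is not claimed to be what PyTorch or transformers does.

```python
import torch
from torch.nn import functional as F


def unmask_fully_masked_rows(mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: reset rows of an additive float mask that are
    entirely finfo.min to 0.0, so every query attends to something."""
    min_value = torch.finfo(mask.dtype).min
    # True for each query row whose mask entries are all finfo.min.
    fully_masked = (mask == min_value).all(dim=-1, keepdim=True)
    return mask.masked_fill(fully_masked, 0.0)


torch.manual_seed(0)
q = torch.randn(1, 1, 3, 4)
k = torch.randn(1, 1, 3, 4)
v = torch.randn(1, 1, 3, 4)

neg_value = torch.finfo(q.dtype).min
# Same mask layout as in the reproduction: the first query row is fully masked.
mask = torch.tensor([[[[neg_value] * 3, [1.0] * 3, [1.0] * 3]]])

safe_mask = unmask_fully_masked_rows(mask)
out = F.scaled_dot_product_attention(
    q, k, v, attn_mask=safe_mask, dropout_p=0.0, is_causal=False
)
print(torch.isnan(out).any())
```

With this change the first row attends according to its raw scores, so its values will differ from the uniform-attention result shown above for CPU. That is only acceptable if the outputs at such positions (for example, left-padding tokens) are discarded downstream.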
Versions
Collecting environment information...
PyTorch version: 2.1.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Home
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.8.16 (default, Jun 12 2023, 21:00:42) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22621-SP0
Is CUDA available: True
CUDA runtime version: 11.6.112
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU
Nvidia driver version: 517.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture=9
CurrentClockSpeed=2400
DeviceID=CPU0
Family=198
L2CacheSize=11776
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2400
Name=12th Gen Intel(R) Core(TM) i7-12800H
ProcessorType=3
Revision=
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.4
[pip3] torch==2.1.0+cu118
[pip3] torchaudio==2.1.0+cu118
[pip3] torchvision==0.16.0+cu118
[conda] numpy 1.24.4 pypi_0 pypi
[conda] torch 2.1.0+cu118 pypi_0 pypi
[conda] torchaudio 2.1.0+cu118 pypi_0 pypi
[conda] torchvision 0.16.0+cu118 pypi_0 pypi