## 🐛 Bug

The matrix multiplication operators (`bmm`, `matmul`, `mm`) return incorrect results on an RTX 3090.
## To Reproduce

Minimal code sample:

```python
import torch

device = torch.device("cuda:0")
cur_mat = torch.eye(3).to(device).unsqueeze(0)
cur_vec = torch.as_tensor([-0.6660, -0.2958, 8.6392]).view(3, 1).to(device).unsqueeze(0)
print("Mat")
print(cur_mat)
print("Vec")
print(cur_vec)
print("----------")
print("CPU: torch.bmm   : ", end="")
print(cur_mat.cpu().bmm(cur_vec.cpu()).view(3))
print("GPU: torch.bmm   : ", end="")
print(cur_mat.bmm(cur_vec).view(3))
print("GPU: torch.matmul: ", end="")
print(cur_mat.matmul(cur_vec).view(3))
print("GPU: torch.mm    : ", end="")
print(cur_mat[0].mm(cur_vec[0]).view(3))
```
Output:

```
Mat
tensor([[[1., 0., 0.],
         [0., 1., 0.],
         [0., 0., 1.]]], device='cuda:0')
Vec
tensor([[[-0.6660],
         [-0.2958],
         [ 8.6392]]], device='cuda:0')
----------
CPU: torch.bmm   : tensor([-0.6660, -0.2958,  8.6392])
GPU: torch.bmm   : tensor([-0.6660, -0.2959,  8.6406], device='cuda:0')
GPU: torch.matmul: tensor([-0.6660, -0.2959,  8.6406], device='cuda:0')
GPU: torch.mm    : tensor([-0.6660, -0.2959,  8.6406], device='cuda:0')
```
## Expected behavior

Because `cur_mat` is an identity matrix, the output vector should be unchanged.
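A possible explanation (my assumption, not yet verified): on Ampere GPUs such as the 3090, PyTorch 1.9 runs float32 matmuls through TF32 by default, which keeps only a 10-bit mantissa. If that is the cause, disabling TF32 should reproduce the CPU result:

```python
import torch

# Hypothesis (assumption, not confirmed): PyTorch 1.9 enables TF32 for
# float32 matmuls on Ampere GPUs by default. Turning it off should force
# full-precision fp32 matmuls.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    cur_mat = torch.eye(3, device=device).unsqueeze(0)
    cur_vec = torch.tensor([-0.6660, -0.2958, 8.6392], device=device).view(1, 3, 1)
    # With TF32 disabled this should match the CPU result exactly.
    print(cur_mat.bmm(cur_vec).view(3))
```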
Environment
- PyTorch Version (e.g., 1.0): '1.9.0+cu111'
- GPU: 3090
- OS (e.g., Linux):ubuntu 18.04
- How you installed PyTorch (
conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version:3.8.5
- CUDA/cuDNN version:11.4
Details:

```
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.2
Libc version: glibc-2.27
Python version: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-77-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.4.48
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090
Nvidia driver version: 470.42.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] numpydoc==1.1.0
[pip3] pytorch3d==0.4.0
[pip3] torch==1.9.0+cu111
[pip3] torch-tb-profiler==0.2.1
[pip3] torchaudio==0.9.0
[pip3] torchgeometry==0.1.2
[pip3] torchsearchsorted==1.1
[pip3] torchvision==0.10.0+cu111
[conda] blas 1.0 mkl defaults
[conda] cudatoolkit 11.0.221 h6bb024c_0 defaults
[conda] mkl 2020.2 256 defaults
[conda] mkl-service 2.3.0 py38he904b0f_0 defaults
[conda] mkl_fft 1.2.0 py38h23d657b_0 defaults
[conda] mkl_random 1.1.1 py38h0573a6f_0 defaults
[conda] numpy 1.19.2 py38h54aff64_0 defaults
[conda] numpy-base 1.19.2 py38hfa32c7d_0 defaults
[conda] numpydoc 1.1.0 pyhd3eb1b0_1 defaults
[conda] pytorch3d 0.4.0 dev_0 <develop>
[conda] torch 1.9.0+cu111 pypi_0 pypi
[conda] torch-tb-profiler 0.2.1 pypi_0 pypi
[conda] torchaudio 0.9.0 pypi_0 pypi
[conda] torchgeometry 0.1.2 pypi_0 pypi
[conda] torchsearchsorted 1.1 pypi_0 pypi
[conda] torchvision 0.10.0+cu111 pypi_0 pypi
```
## Additional context

Only the 3090 exhibits this problem. Testing shows that PyTorch 1.7.1+cu110 returns the correct result for the example above. However, when given a tensor with a large batch size, the `bmm` operator is also unable to return the correct result.
Code sample (test data: `tt_dict.pkl`):

```python
import torch
import pickle as pkl

device = torch.device("cuda:0")
with open("tt_dict.pkl", "rb") as f:
    tt = pkl.load(f)
live_vps = tt["live_vps"].to(device)
batch_Rmats = tt["batch_Rmats"].to(device)
tt = batch_Rmats.permute(0, 2, 1)
print("Output:")
print("-----------")
print("bmm with large batch size:")
rott_vps = torch.bmm(live_vps, tt)
print(rott_vps[2, 0])
print("-----------")
print("bmm with mini batch size:")
print((live_vps[2:3, 0:1]).bmm(tt[2:3]))
print("-----------")
print("mm:")
print((live_vps[2, 0:1]).mm(tt[2]))
print("-----------")
print("cpu:")
print(((live_vps[2, 0:1]).cpu()).mm(tt[2].cpu()))
```
Output:

```
Output:
-----------
bmm with large batch size:
tensor([-0.6660, -0.2957,  8.6406], device='cuda:0')
-----------
bmm with mini batch size:
tensor([[[-0.6660, -0.2958,  8.6392]]], device='cuda:0')
-----------
mm:
tensor([[-0.6660, -0.2958,  8.6392]], device='cuda:0')
-----------
cpu:
tensor([[-0.6660, -0.2958,  8.6392]])
```

Note that the last element differs only in the large-batch `bmm` case (8.6406 vs. 8.6392).
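As a back-of-the-envelope check (my own arithmetic, not from the logs above): the size of the discrepancy is consistent with TF32 rounding, whose 10-bit mantissa bounds the relative rounding error at roughly 2^-11 ≈ 4.9e-4, rather than with an outright wrong answer:

```python
# Relative error between the differing GPU value and the CPU reference,
# using the numbers printed above. TF32 keeps a 10-bit mantissa, so its
# per-value rounding error is bounded by about 2**-11.
cpu_val, gpu_val = 8.6392, 8.6406
rel_err = abs(gpu_val - cpu_val) / abs(cpu_val)
tf32_bound = 2.0 ** -11
print(f"relative error = {rel_err:.2e}, TF32 bound = {tf32_bound:.2e}")
# relative error ≈ 1.6e-04, i.e. within the TF32 rounding bound
```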
cc @ngimel @zasdfgbnm @ptrblck