## 🐛 Bug

The matrix multiplication operators (`bmm`, `matmul`, `mm`) return incorrect results on an RTX 3090.
## To Reproduce

Minimal code sample:

```python
import torch

device = torch.device("cuda:0")
cur_mat = torch.eye(3).to(device).unsqueeze(0)
cur_vec = torch.as_tensor([-0.6660, -0.2958, 8.6392]).view(3, 1).to(device).unsqueeze(0)
print("Mat")
print(cur_mat)
print("Vec")
print(cur_vec)
print("----------")
print("CPU: torch.bmm   : ", end="")
print(cur_mat.cpu().bmm(cur_vec.cpu()).view(3))
print("GPU: torch.bmm   : ", end="")
print(cur_mat.bmm(cur_vec).view(3))
print("GPU: torch.matmul: ", end="")
print(cur_mat.matmul(cur_vec).view(3))
print("GPU: torch.mm    : ", end="")
print(cur_mat[0].mm(cur_vec[0]).view(3))
```
Output:

```
Mat
tensor([[[1., 0., 0.],
         [0., 1., 0.],
         [0., 0., 1.]]], device='cuda:0')
Vec
tensor([[[-0.6660],
         [-0.2958],
         [ 8.6392]]], device='cuda:0')
----------
CPU: torch.bmm   : tensor([-0.6660, -0.2958,  8.6392])
GPU: torch.bmm   : tensor([-0.6660, -0.2959,  8.6406], device='cuda:0')
GPU: torch.matmul: tensor([-0.6660, -0.2959,  8.6406], device='cuda:0')
GPU: torch.mm    : tensor([-0.6660, -0.2959,  8.6406], device='cuda:0')
```
## Expected behavior

Because `cur_mat` is an identity matrix, the output vector should be unchanged.
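A possible explanation (my assumption, not yet verified): on Ampere GPUs such as the 3090, PyTorch 1.9 runs float32 matmuls through TF32 by default, which keeps only a 10-bit mantissa. If that is the cause, disabling TF32 should reproduce the CPU result:

```python
import torch

# Hypothesis (assumption, not confirmed): PyTorch 1.9 enables TF32 for
# float32 matmuls on Ampere GPUs by default. Turning it off should force
# full-precision fp32 matmuls.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    cur_mat = torch.eye(3, device=device).unsqueeze(0)
    cur_vec = torch.tensor([-0.6660, -0.2958, 8.6392], device=device).view(1, 3, 1)
    # With TF32 disabled this should match the CPU result exactly.
    print(cur_mat.bmm(cur_vec).view(3))
```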
Environment
- PyTorch Version (e.g., 1.0): '1.9.0+cu111'
- GPU: 3090
- OS (e.g., Linux):ubuntu 18.04
- How you installed PyTorch (
conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version:3.8.5
- CUDA/cuDNN version:11.4
Details:

```
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.2
Libc version: glibc-2.27
Python version: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-77-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.4.48
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090
Nvidia driver version: 470.42.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] numpydoc==1.1.0
[pip3] pytorch3d==0.4.0
[pip3] torch==1.9.0+cu111
[pip3] torch-tb-profiler==0.2.1
[pip3] torchaudio==0.9.0
[pip3] torchgeometry==0.1.2
[pip3] torchsearchsorted==1.1
[pip3] torchvision==0.10.0+cu111
[conda] blas 1.0 mkl defaults
[conda] cudatoolkit 11.0.221 h6bb024c_0 defaults
[conda] mkl 2020.2 256 defaults
[conda] mkl-service 2.3.0 py38he904b0f_0 defaults
[conda] mkl_fft 1.2.0 py38h23d657b_0 defaults
[conda] mkl_random 1.1.1 py38h0573a6f_0 defaults
[conda] numpy 1.19.2 py38h54aff64_0 defaults
[conda] numpy-base 1.19.2 py38hfa32c7d_0 defaults
[conda] numpydoc 1.1.0 pyhd3eb1b0_1 defaults
[conda] pytorch3d 0.4.0 dev_0 <develop>
[conda] torch 1.9.0+cu111 pypi_0 pypi
[conda] torch-tb-profiler 0.2.1 pypi_0 pypi
[conda] torchaudio 0.9.0 pypi_0 pypi
[conda] torchgeometry 0.1.2 pypi_0 pypi
[conda] torchsearchsorted 1.1 pypi_0 pypi
[conda] torchvision 0.10.0+cu111 pypi_0 pypi
```
## Additional context

Only the 3090 exhibits this problem. Testing shows that PyTorch 1.7.1+cu110 returns the correct result for the example above. However, when given a tensor with a large batch size, the `bmm` operator is also unable to return the correct result.
Code sample (test data: `tt_dict.pkl`):

```python
import torch
import pickle as pkl

device = torch.device("cuda:0")
with open("tt_dict.pkl", "rb") as f:
    tt = pkl.load(f)
live_vps = tt["live_vps"].to(device)
batch_Rmats = tt["batch_Rmats"].to(device)
tt = batch_Rmats.permute(0, 2, 1)
print("Output:")
print("-----------")
print("bmm with large batch size:")
rott_vps = torch.bmm(live_vps, tt)
print(rott_vps[2, 0])
print("-----------")
print("bmm with mini batch size:")
print((live_vps[2:3, 0:1]).bmm(tt[2:3]))
print("-----------")
print("mm:")
print((live_vps[2, 0:1]).mm(tt[2]))
print("-----------")
print("cpu:")
print(((live_vps[2, 0:1]).cpu()).mm(tt[2].cpu()))
```
Output:

```
Output:
-----------
bmm with large batch size:
tensor([-0.6660, -0.2957,  8.6406], device='cuda:0')
-----------
bmm with mini batch size:
tensor([[[-0.6660, -0.2958,  8.6392]]], device='cuda:0')
-----------
mm:
tensor([[-0.6660, -0.2958,  8.6392]], device='cuda:0')
-----------
cpu:
tensor([[-0.6660, -0.2958,  8.6392]])
```

Note that the last element differs only in the large-batch `bmm` case (8.6406 vs. 8.6392).
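As a back-of-the-envelope check (my own arithmetic, not from the logs above): the size of the discrepancy is consistent with TF32 rounding, whose 10-bit mantissa bounds the relative rounding error at roughly 2^-11 ≈ 4.9e-4, rather than with an outright wrong answer:

```python
# Relative error between the differing GPU value and the CPU reference,
# using the numbers printed above. TF32 keeps a 10-bit mantissa, so its
# per-value rounding error is bounded by about 2**-11.
cpu_val, gpu_val = 8.6392, 8.6406
rel_err = abs(gpu_val - cpu_val) / abs(cpu_val)
tf32_bound = 2.0 ** -11
print(f"relative error = {rel_err:.2e}, TF32 bound = {tf32_bound:.2e}")
# relative error ≈ 1.6e-04, i.e. within the TF32 rounding bound
```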
cc @ngimel @zasdfgbnm @ptrblck