🐛 Bug
We observed a large increase in inference latency after adding `torch.inverse()` to the code path. After investigating and comparing against moving the op to the CPU, we found a huge performance difference for this op on GPU vs. CPU.
The matrix size in our case is 4x4, which is small for the GPU, but `torch.inverse()` should be using the MAGMA library, which has heuristics to move the op to the CPU. We didn't see any MAGMA invocations either.
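As a quick sanity check on whether MAGMA is present at all, the build flag can be queried (a minimal sketch; note that `torch.cuda.has_magma` only reports whether the build links MAGMA, not whether a given call actually dispatches to it):

```python
import torch

# Reports whether this PyTorch build was compiled with MAGMA,
# the library used for GPU linear-algebra ops like torch.inverse().
print("MAGMA available:", torch.cuda.has_magma)
```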
Based on @ptrblck's suggestion, we tried `inverse` on a batched matrix to see if that made any difference, but it didn't result in any improvement either.
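For reference, the batched variant we mean looks like this (a sketch; `torch.inverse` accepts input of shape `(..., n, n)` and inverts the whole batch in one call, and the `+ 4 * I` shift is just an assumption to keep the random matrices well-conditioned):

```python
import torch

# A batch of 1000 well-conditioned 4x4 matrices (shifted toward the identity).
batch = torch.randn(1000, 4, 4) + 4 * torch.eye(4)

# One call inverts the entire batch instead of looping over matrices.
inv = torch.inverse(batch)
print(inv.shape)  # same shape as the input batch
```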
Performance difference (note the ratio is GPU time over CPU time):
```
╰─ python torch_inverse_exp.py
GPU Time: 8.78560 ms
CPU Time: 0.07149 ms
gpu/cpu: 122.90x
```
We increased the matrix dimensions to 1024x1024 but even then GPU is slower than CPU:
```
╰─ python torch_inverse_exp.py
GPU Time: 17.25289 ms
CPU Time: 9.54982 ms
gpu/cpu: 1.81x
```
To Reproduce
Here is the snippet of code which can be run to reproduce the performance difference:
```python
import torch
import time

a = torch.randn(4, 4).cuda()
torch.cuda.synchronize()
num_iter = 100

def test1():
    # Time torch.inverse() directly on the GPU tensor.
    s = time.time()
    for i in range(num_iter):
        x = torch.inverse(a)
    torch.cuda.synchronize()
    e = time.time()
    gpu_time = (e - s) / num_iter
    return gpu_time

def test2():
    # Move to CPU, invert there, and move the result back to the GPU.
    s = time.time()
    for i in range(num_iter):
        b = a.to('cpu')
        d = torch.inverse(b)
        y = d.to('cuda')
    torch.cuda.synchronize()
    e = time.time()
    cpu_time = (e - s) / num_iter
    return cpu_time

gpu_time = test1()
print("GPU Time: {:.5f} ms".format(gpu_time * 1000))
cpu_time = test2()
print("CPU Time: {:.5f} ms".format(cpu_time * 1000))
print("gpu/cpu: {:.2f}x".format(gpu_time / cpu_time))
```
Run: `python torch_inverse_exp.py`
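An alternative timing approach using CUDA events, which measures on the device timeline instead of the host clock, can be sketched like this (`time_gpu_op` is a hypothetical helper, guarded so it only runs when CUDA is available):

```python
import torch

def time_gpu_op(fn, iters=100):
    """Time a CUDA op with torch.cuda.Event to avoid host-side clock skew.

    Returns average milliseconds per iteration. Assumes fn launches work
    on the current CUDA device.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time is in ms

if torch.cuda.is_available():
    a = torch.randn(4, 4, device='cuda')
    print("GPU Time: {:.5f} ms".format(time_gpu_op(lambda: torch.inverse(a))))
```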
Environment
Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).
You can get the script and run it with:
```shell
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
```
Collecting environment information...
PyTorch version: 1.5.1 (pip package)
Is debug build: No
CUDA used to build PyTorch: 10.1 & 10.2 (NGC 20.03 container)
OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
CMake version: version 3.17.3
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN RTX
GPU 1: TITAN RTX
Nvidia driver version: 440.100
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] numpydoc==1.1.0
[pip3] torch==1.5.1
[pip3] torchvision==0.6.0a0+35d732a
[pip3] torchviz==0.0.1
[conda] _pytorch_select 0.2 gpu_0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.1.0 py37h23d657b_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] numpy 1.18.5 py37ha1c710e_0
[conda] numpy-base 1.18.5 py37hde5b4d6_0
[conda] numpydoc 1.1.0 py_0
[conda] pytorch 1.5.1 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] torchvision 0.6.1 py37_cu101 pytorch
[conda] torchviz 0.0.1 pypi_0 pypi
```
@ptrblck @ngimel for visibility
cc @ngimel @vincentqb @vishwakftw @ssnl @jianyuh @VitalyFedyunin