
torch.inverse() performing very poorly on GPU vs CPU #42265

@sandeepkumar-skb


🐛 Bug

We observed a big increase in inference latency after adding torch.inverse() to the code path. After investigating and comparing against running the op on CPU, we found a huge performance difference for this op on GPU vs CPU.
The matrix size in our case is 4x4, which is small for the GPU, but torch.inverse() should be using the MAGMA library, which has heuristics to move the op to the CPU. We also didn't see any MAGMA invocations.

Based on @ptrblck's suggestion we tried inverse on a batched matrix to see if that made any difference, but that didn't result in any improvement either.
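For reference, a minimal sketch of the batched variant we tried (the batch size of 1000 is illustrative; it falls back to CPU when CUDA is unavailable so the snippet stays runnable):

```python
import time

import torch

# Assumption for illustration: a batch of 1000 4x4 matrices, inverted in
# a single torch.inverse() call (one batched kernel launch).
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1000, 4, 4, device=device)

if device == "cuda":
    torch.cuda.synchronize()
start = time.time()
x = torch.inverse(a)  # batched inverse over the leading dimension
if device == "cuda":
    torch.cuda.synchronize()
elapsed_ms = (time.time() - start) * 1000

print("batched inverse on {}: {:.5f} ms".format(device, elapsed_ms))
assert x.shape == a.shape
```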

Performance Difference:

╰─ python torch_inverse_exp.py
GPU Time: 8.78560 ms
CPU Time: 0.07149 ms
gpu/cpu: 122.90x

We increased the matrix dimensions to 1024x1024, but even then the GPU is slower than the CPU:

╰─ python torch_inverse_exp.py
GPU Time: 17.25289 ms
CPU Time: 9.54982 ms
gpu/cpu: 1.81x

To Reproduce

Here is a snippet of code that can be run to reproduce the performance difference:

import time

import torch

a = torch.randn(4, 4).cuda()
torch.cuda.synchronize()

num_iter = 100

def test1():
    # Time torch.inverse() directly on the GPU tensor.
    s = time.time()
    for i in range(num_iter):
        x = torch.inverse(a)
    torch.cuda.synchronize()
    e = time.time()
    gpu_time = (e - s) / num_iter
    return gpu_time

def test2():
    # Move the tensor to CPU, invert there, and copy the result back to GPU.
    s = time.time()
    for i in range(num_iter):
        b = a.to('cpu')
        d = torch.inverse(b)
        y = d.to('cuda')
    torch.cuda.synchronize()
    e = time.time()
    cpu_time = (e - s) / num_iter
    return cpu_time

gpu_time = test1()
print("GPU Time: {:.5f} ms".format(gpu_time * 1000))

cpu_time = test2()
print("CPU Time: {:.5f} ms".format(cpu_time * 1000))

print("gpu/cpu: {:.2f}x".format(gpu_time / cpu_time))

Run: python <script_name>
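As a sanity check on the timing methodology, the same measurement can be sketched with CUDA events instead of time.time() plus synchronize; this is an alternative approach, not part of the original repro, and it is only meaningful when a GPU is present:

```python
import torch

# Alternative timing sketch (assumption: same 4x4 matrix and 100 iterations
# as the repro). CUDA events measure elapsed GPU time in milliseconds.
if torch.cuda.is_available():
    a = torch.randn(4, 4, device="cuda")
    num_iter = 100

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(num_iter):
        x = torch.inverse(a)
    end.record()
    torch.cuda.synchronize()  # wait until the end event has been recorded

    print("GPU Time: {:.5f} ms".format(start.elapsed_time(end) / num_iter))
```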

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Collecting environment information...
PyTorch version: 1.5.1 (pip package)
Is debug build: No
CUDA used to build PyTorch: 10.1 & 10.2 (NGC 20.03 container)

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
CMake version: version 3.17.3

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN RTX
GPU 1: TITAN RTX

Nvidia driver version: 440.100
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] numpydoc==1.1.0
[pip3] torch==1.5.1
[pip3] torchvision==0.6.0a0+35d732a
[pip3] torchviz==0.0.1
[conda] _pytorch_select           0.2                       gpu_0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.1.243             h6bb024c_0
[conda] mkl                       2020.1                      217
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.1.0            py37h23d657b_0
[conda] mkl_random                1.1.1            py37h0573a6f_0
[conda] numpy                     1.18.5           py37ha1c710e_0
[conda] numpy-base                1.18.5           py37hde5b4d6_0
[conda] numpydoc                  1.1.0                      py_0
[conda] pytorch                   1.5.1           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] torchvision               0.6.1                py37_cu101    pytorch
[conda] torchviz                  0.0.1                    pypi_0    pypi

@ptrblck @ngimel for visibility

cc @ngimel @vincentqb @vishwakftw @ssnl @jianyuh @VitalyFedyunin

Labels

module: cuda, module: linear algebra, module: performance, triaged
