🐛 Bug
We observed a large increase in inference latency after adding `torch.inverse()` to the code path. After investigating and comparing against moving the op to the CPU, we found a huge performance difference for this op on GPU vs. CPU.
The matrix size in our case is 4x4, which is small for the GPU, but `torch.inverse()` should be using the MAGMA library, which has heuristics to move the op to the CPU. We didn't see any MAGMA invocations either.
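As a quick sanity check on whether MAGMA is present at all, the build flag can be queried (a minimal sketch; note that `torch.cuda.has_magma` only reports whether the build links MAGMA, not whether a given call actually dispatches to it):

```python
import torch

# Reports whether this PyTorch build was compiled with MAGMA,
# the library used for GPU linear-algebra ops like torch.inverse().
print("MAGMA available:", torch.cuda.has_magma)
```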
Based on @ptrblck's suggestion, we tried `inverse` on a batched matrix to see if that made any difference, but it didn't result in any improvement either.
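For reference, the batched variant we mean looks like this (a sketch; `torch.inverse` accepts input of shape `(..., n, n)` and inverts the whole batch in one call, and the `+ 4 * I` shift is just an assumption to keep the random matrices well-conditioned):

```python
import torch

# A batch of 1000 well-conditioned 4x4 matrices (shifted toward the identity).
batch = torch.randn(1000, 4, 4) + 4 * torch.eye(4)

# One call inverts the entire batch instead of looping over matrices.
inv = torch.inverse(batch)
print(inv.shape)  # same shape as the input batch
```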
Performance difference (note the ratio is GPU time over CPU time):
```
╰─ python torch_inverse_exp.py
GPU Time: 8.78560 ms
CPU Time: 0.07149 ms
gpu/cpu: 122.90x
```
We increased the matrix dimensions to 1024x1024 but even then GPU is slower than CPU:
```
╰─ python torch_inverse_exp.py
GPU Time: 17.25289 ms
CPU Time: 9.54982 ms
gpu/cpu: 1.81x
```
To Reproduce
Here is the snippet of code which can be run to reproduce the performance difference:
```python
import torch
import time

a = torch.randn(4, 4).cuda()
torch.cuda.synchronize()
num_iter = 100

def test1():
    # Time torch.inverse() directly on the GPU tensor.
    s = time.time()
    for i in range(num_iter):
        x = torch.inverse(a)
    torch.cuda.synchronize()
    e = time.time()
    gpu_time = (e - s) / num_iter
    return gpu_time

def test2():
    # Move to CPU, invert there, and move the result back to the GPU.
    s = time.time()
    for i in range(num_iter):
        b = a.to('cpu')
        d = torch.inverse(b)
        y = d.to('cuda')
    torch.cuda.synchronize()
    e = time.time()
    cpu_time = (e - s) / num_iter
    return cpu_time

gpu_time = test1()
print("GPU Time: {:.5f} ms".format(gpu_time * 1000))
cpu_time = test2()
print("CPU Time: {:.5f} ms".format(cpu_time * 1000))
print("gpu/cpu: {:.2f}x".format(gpu_time / cpu_time))
```
Run: `python torch_inverse_exp.py`
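An alternative timing approach using CUDA events, which measures on the device timeline instead of the host clock, can be sketched like this (`time_gpu_op` is a hypothetical helper, guarded so it only runs when CUDA is available):

```python
import torch

def time_gpu_op(fn, iters=100):
    """Time a CUDA op with torch.cuda.Event to avoid host-side clock skew.

    Returns average milliseconds per iteration. Assumes fn launches work
    on the current CUDA device.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time is in ms

if torch.cuda.is_available():
    a = torch.randn(4, 4, device='cuda')
    print("GPU Time: {:.5f} ms".format(time_gpu_op(lambda: torch.inverse(a))))
```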
Environment
Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).
You can get the script and run it with:
```shell
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
```
Collecting environment information...
PyTorch version: 1.5.1 (pip package)
Is debug build: No
CUDA used to build PyTorch: 10.1 & 10.2 (NGC 20.03 container)
OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
CMake version: version 3.17.3
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN RTX
GPU 1: TITAN RTX
Nvidia driver version: 440.100
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] numpydoc==1.1.0
[pip3] torch==1.5.1
[pip3] torchvision==0.6.0a0+35d732a
[pip3] torchviz==0.0.1
[conda] _pytorch_select 0.2 gpu_0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.1.0 py37h23d657b_0
[conda] mkl_random 1.1.1 py37h0573a6f_0
[conda] numpy 1.18.5 py37ha1c710e_0
[conda] numpy-base 1.18.5 py37hde5b4d6_0
[conda] numpydoc 1.1.0 py_0
[conda] pytorch 1.5.1 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] torchvision 0.6.1 py37_cu101 pytorch
[conda] torchviz 0.0.1 pypi_0 pypi
```
@ptrblck @ngimel for visibility
cc @ngimel @vincentqb @vishwakftw @ssnl @jianyuh @VitalyFedyunin