Hello,
I mentioned this issue earlier in #2365 as a performance issue; that bug was fixed, but performance is still a problem. A sparse matrix-vector multiply in CSR format is over 200 times slower the first time it is computed, and multiplying a CSC-format sparse matrix by a vector from the left-hand side is over 200 times slower every time. Internally, CUDA should perform the same operation in both cases, and there should be no need to re-sort the entries twice.
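The "same operation" claim follows from a format duality, sketched here in plain NumPy (this is only an illustration, not CuPy/cuSPARSE internals): the CSC arrays of A are exactly the CSR arrays of A.T, so a left multiply x @ A on a CSC matrix is the same computation as a CSR right multiply with A.T.

```python
import numpy as np

def csr_matvec(indptr, indices, data, x):
    """Plain CSR kernel: y = A @ x, one row at a time."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        start, stop = indptr[i], indptr[i + 1]
        y[i] = data[start:stop] @ x[indices[start:stop]]
    return y

A = np.array([[1., 0., 2.],
              [0., 3., 0.],
              [4., 0., 5.]])
x = np.array([1., 2., 3.])

# The CSC arrays of A are the CSR arrays of A.T; build them from A.T.
At = A.T
indptr = np.cumsum([0] + [np.count_nonzero(At[i]) for i in range(3)])
indices = np.concatenate([np.flatnonzero(At[i]) for i in range(3)])
data = np.concatenate([At[i, np.flatnonzero(At[i])] for i in range(3)])

# The left multiply x @ A, computed through the CSR kernel on A.T:
print(np.allclose(csr_matvec(indptr, indices, data, x), x @ A))  # True
```

So a CSC left multiply can reuse the CSR right-multiply kernel unchanged, which is why no re-sorting should be required.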
The attached sparse_test.py performs the same computation in several ways, comparing SciPy against the CuPy CSC/CSR formats, using a sparse matrix of dimension 10k by 10k with 10M non-zero elements. It gives the following output:
all cpu-gpu close?: True True
all lm/rm close? : True True
all 1st/2nd close?: True True
time scipy lm/rm :0.00446,0.00461
wall time cupy lm (1st and 2nd time) :0.0686, 0.00024
wall time cupy rm (1st and 2nd time) :0.0594, 0.061
cuda time cupy lm (1st and 2nd time):0.0594, 0.000229
cuda time cupy rm (1st and 2nd time):0.0593, 0.061
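The shape of the correctness check is roughly the following SciPy-only sketch (reduced sizes for illustration; the attached sparse_test.py uses the full 10k by 10k matrix and additionally times the CuPy versions on the GPU):

```python
import numpy as np
import scipy.sparse as sp

n = 1000
A_csr = sp.random(n, n, density=0.01, format='csr', random_state=0)
A_csc = A_csr.tocsc()
x = np.ones(n)

# rm: A @ x (right multiply), lm: x @ A (left multiply), in both formats
rm_csr, rm_csc = A_csr @ x, A_csc @ x
lm_csr, lm_csc = x @ A_csr, x @ A_csc

# The products must agree across formats; any timing gap between them
# is then purely a storage-format / sorting effect.
print(np.allclose(rm_csr, rm_csc), np.allclose(lm_csr, lm_csc))  # True True
```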
The performance issue goes away if I modify cupyx/scipy/sparse/coo.py as attached, setting _has_canonical_format=True after sorting. I am not sure whether this has other unintended consequences, but I have not observed any issues in my code.
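The idea behind the one-line change, shown in isolation on a toy class (this is not CuPy's actual code, just the flag-caching pattern): once the entries have been put in canonical order, record that fact so later calls skip the redundant re-sort.

```python
class COOLike:
    """Toy COO container illustrating the canonical-format flag.
    Only sorting is shown; duplicate merging is omitted for brevity."""

    def __init__(self, row, col, data):
        self.row, self.col, self.data = row, col, data
        self._has_canonical_format = False
        self.sort_count = 0  # track how often we actually sort

    def sum_duplicates(self):
        if self._has_canonical_format:
            return  # already in canonical order; nothing to do
        order = sorted(range(len(self.data)),
                       key=lambda i: (self.row[i], self.col[i]))
        self.row = [self.row[i] for i in order]
        self.col = [self.col[i] for i in order]
        self.data = [self.data[i] for i in order]
        self.sort_count += 1
        self._has_canonical_format = True  # the proposed fix: remember we sorted

m = COOLike([1, 0], [0, 1], [2.0, 3.0])
m.sum_duplicates()
m.sum_duplicates()  # second call is now a no-op
print(m.sort_count)  # 1
```

Without setting the flag, every conversion that checks canonical form would re-sort the already-sorted entries, which matches the repeated ~0.06 s cost seen above.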
sparse_test.py.txt
coo.py.txt
Configuration: Ubuntu 18.04, CUDA 10.1, RTX Titan GPU
CuPy Version : 7.0.0
CUDA Root : /usr/local/cuda-10.1
CUDA Build Version : 10010
CUDA Driver Version : 10010
CUDA Runtime Version : 10010
cuBLAS Version : 10201
cuFFT Version : 10101
cuRAND Version : 10101
cuSOLVER Version : (10, 2, 0)
cuSPARSE Version : 10300
NVRTC Version : (10, 1)
cuDNN Build Version : 7605
cuDNN Version : 7605
NCCL Build Version : 2402
NCCL Runtime Version : 2402