I have been doing some profiling comparing CuPy's sparse matrix dot() with PyTorch's mm(), and I'm getting some very surprising results. I noticed that while CuPy is using cuSPARSE, PyTorch appears to be using only Thrust/CUB to do the multiply.
Strangely, I'm finding that it's the cuSPARSE csrgemm_kernel from CuPy that dominates the runtime of the multiply, taking 40ms. The end-to-end matrix multiply takes upwards of 70ms for CuPy, while the PyTorch implementation takes a little over 2ms end-to-end.
I'm attaching the nvvp file with my profiling results. I'm pretty surprised. The algorithm being profiled is a simple Multinomial Naive Bayes using the 20-newsgroups dataset from scikit-learn. I trained both algorithms twice in order to eliminate JIT compilation and CUDA context creation from the comparison.
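Roughly, the kind of comparison I'm running looks like the sketch below. This is a simplified reconstruction rather than my exact benchmark: the CountVectorizer/LabelBinarizer preprocessing and the choice of X.T @ Y as the timed product are illustrative, and on the PyTorch side I multiply against a dense right-hand operand since torch.sparse.mm takes a sparse lhs and a dense rhs, whereas the CuPy side is CSR x CSR (which is what routes through the cusparse csrgemm path).

```python
# Simplified sketch of the comparison -- not the exact benchmark that was profiled.
import time

import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cpsp
import torch
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer

data = fetch_20newsgroups(subset="train")
X = CountVectorizer().fit_transform(data.data).astype(np.float32)   # (n_samples, n_features) CSR
Y = LabelBinarizer().fit_transform(data.target).astype(np.float32)  # (n_samples, n_classes) dense
XT = X.T.tocsr()                                                     # (n_features, n_samples) CSR


def time_gpu(fn, sync, reps=2):
    """Run fn reps times; the first call absorbs JIT / context / handle
    creation, and only the last call is timed (with a device sync on both sides)."""
    for _ in range(reps - 1):
        fn()
    sync()
    start = time.perf_counter()
    out = fn()
    sync()
    return out, (time.perf_counter() - start) * 1000.0


# --- CuPy: CSR x CSR, dispatched to cusparse csrgemm ---
XT_cp = cpsp.csr_matrix(XT)
Y_cp = cpsp.csr_matrix(sp.csr_matrix(Y))
_, ms = time_gpu(lambda: XT_cp.dot(Y_cp), cp.cuda.Device().synchronize)
print(f"cupy csr.dot(csr):              {ms:8.2f} ms")

# --- PyTorch: sparse COO x dense ---
XT_coo = XT.tocoo()
indices = np.vstack([XT_coo.row, XT_coo.col]).astype(np.int64)
XT_th = torch.sparse_coo_tensor(indices, XT_coo.data, XT_coo.shape,
                                device="cuda").coalesce()
Y_th = torch.as_tensor(Y, device="cuda")
_, ms = time_gpu(lambda: torch.sparse.mm(XT_th, Y_th), torch.cuda.synchronize)
print(f"torch.sparse.mm(sparse, dense): {ms:8.2f} ms")
```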
>>> cupy.show_config()
CuPy Version : 6.5.0
CUDA Root : /usr/local/cuda
CUDA Build Version : 10000
CUDA Driver Version : 10000
CUDA Runtime Version : 10000
cuDNN Build Version : 7600
cuDNN Version : 7600
NCCL Build Version : 2406
NCCL Runtime Version : 2406