I have been doing some profiling comparing CuPy's sparse matrix dot() with PyTorch's mm(), and I'm getting some very surprising results. I noticed that while CuPy is using cuSPARSE, PyTorch appears to be using only Thrust/CUB to do the multiply.
Strangely, I'm finding that it's the cuSPARSE csrgemm_kernel from CuPy that dominates the runtime of the multiply, taking 40ms. The end-to-end matrix multiply takes upwards of 70ms for CuPy, while the PyTorch implementation takes a little over 2ms end-to-end.
I'm attaching the nvvp file with my profiling results. I'm pretty surprised. The algorithm being profiled is a simple Multinomial Naive Bayes using the 20-newsgroups dataset from scikit-learn. I trained both algorithms twice in order to eliminate JIT compilation and CUDA context creation from the comparison.
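Roughly, the kind of comparison I'm running looks like the sketch below. This is a simplified reconstruction rather than my exact benchmark: the CountVectorizer/LabelBinarizer preprocessing and the choice of X.T @ Y as the timed product are illustrative, and on the PyTorch side I multiply against a dense right-hand operand since torch.sparse.mm takes a sparse lhs and a dense rhs, whereas the CuPy side is CSR x CSR (which is what routes through the cusparse csrgemm path).

```python
# Simplified sketch of the comparison -- not the exact benchmark that was profiled.
import time

import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cpsp
import torch
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer

data = fetch_20newsgroups(subset="train")
X = CountVectorizer().fit_transform(data.data).astype(np.float32)   # (n_samples, n_features) CSR
Y = LabelBinarizer().fit_transform(data.target).astype(np.float32)  # (n_samples, n_classes) dense
XT = X.T.tocsr()                                                     # (n_features, n_samples) CSR


def time_gpu(fn, sync, reps=2):
    """Run fn reps times; the first call absorbs JIT / context / handle
    creation, and only the last call is timed (with a device sync on both sides)."""
    for _ in range(reps - 1):
        fn()
    sync()
    start = time.perf_counter()
    out = fn()
    sync()
    return out, (time.perf_counter() - start) * 1000.0


# --- CuPy: CSR x CSR, dispatched to cusparse csrgemm ---
XT_cp = cpsp.csr_matrix(XT)
Y_cp = cpsp.csr_matrix(sp.csr_matrix(Y))
_, ms = time_gpu(lambda: XT_cp.dot(Y_cp), cp.cuda.Device().synchronize)
print(f"cupy csr.dot(csr):              {ms:8.2f} ms")

# --- PyTorch: sparse COO x dense ---
XT_coo = XT.tocoo()
indices = np.vstack([XT_coo.row, XT_coo.col]).astype(np.int64)
XT_th = torch.sparse_coo_tensor(indices, XT_coo.data, XT_coo.shape,
                                device="cuda").coalesce()
Y_th = torch.as_tensor(Y, device="cuda")
_, ms = time_gpu(lambda: torch.sparse.mm(XT_th, Y_th), torch.cuda.synchronize)
print(f"torch.sparse.mm(sparse, dense): {ms:8.2f} ms")
```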
>>> cupy.show_config()
CuPy Version : 6.5.0
CUDA Root : /usr/local/cuda
CUDA Build Version : 10000
CUDA Driver Version : 10000
CUDA Runtime Version : 10000
cuDNN Build Version : 7600
cuDNN Version : 7600
NCCL Build Version : 2406
NCCL Runtime Version : 2406