Sparse matrix multiply orders of magnitude slower than PyTorch #2664

Description

@cjnolet

I have been profiling CuPy's sparse matrix dot() against PyTorch's sparse mm(), and I'm getting some very surprising results. I noticed that CuPy uses cuSPARSE, while PyTorch appears to use only Thrust/CUB for the multiply.

Strangely, I'm finding that the cuSPARSE csrgemm_kernel invoked by CuPy dominates the runtime of the multiply, taking ~40ms. The end-to-end matrix multiply takes upwards of 70ms in CuPy, while the PyTorch implementation completes end-to-end in a little over 2ms.
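For reference, a minimal sketch of the kind of timing harness used for this comparison. This CPU-side version uses SciPy's CSR format so it runs anywhere; the GPU measurements substitute cupyx.scipy.sparse (CuPy) and torch.sparse.mm (PyTorch) for the product, and the matrix shapes/densities here are illustrative, not the actual 20-newsgroups operands:

```python
import time
import numpy as np
import scipy.sparse as sp

def time_spgemm(a, b, repeats=5):
    """Time a sparse-sparse matrix product; return the best wall-clock run in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        _ = a @ b  # the CSR gemm being profiled
        best = min(best, time.perf_counter() - start)
    return best

# Random CSR operands roughly shaped like a document-term problem
# (documents x vocabulary) times (vocabulary x classes).
rng = np.random.default_rng(0)
a = sp.random(2000, 10000, density=0.01, format="csr", random_state=rng)
b = sp.random(10000, 20, density=0.05, format="csr", random_state=rng)

elapsed = time_spgemm(a, b)
print(f"CSR gemm: {elapsed * 1e3:.2f} ms")
```

On GPU backends, the body of the timed loop also needs a device synchronization (e.g. a stream sync) before stopping the clock, otherwise only the kernel launch is measured.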

I'm attaching the nvvp file with my profiling results. I'm pretty surprised. The algorithm being profiled is a simple Multinomial Naive Bayes on the 20-newsgroups dataset from scikit-learn. I trained each implementation twice to exclude JIT compilation and CUDA context creation from the comparison.
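To show why this workload is dominated by a single sparse gemm: the fit step of Multinomial Naive Bayes reduces to one product of a one-hot class-membership matrix with the document-term matrix. A minimal SciPy/NumPy sketch of that reduction (illustrative only, not the profiled CuPy or PyTorch code):

```python
import numpy as np
import scipy.sparse as sp

def mnb_fit(X, y, alpha=1.0):
    """Multinomial NB fit: per-class feature counts via one sparse matmul."""
    classes = np.unique(y)
    # One-hot class membership, shape (n_classes, n_samples).
    Y = sp.csr_matrix(
        (np.ones(len(y)), (np.searchsorted(classes, y), np.arange(len(y)))),
        shape=(len(classes), X.shape[0]),
    )
    # The dominant sparse-sparse product: (n_classes x n_samples) @ (n_samples x n_features).
    feature_count = np.asarray((Y @ X).todense())
    smoothed = feature_count + alpha  # Laplace smoothing
    log_prob = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    return classes, log_prob

# Toy document-term counts: 4 docs, 3 terms, 2 classes.
X = sp.csr_matrix(np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1], [0, 1, 2]]))
classes, log_prob = mnb_fit(X, np.array([0, 0, 1, 1]))
print(log_prob.shape)  # (n_classes, n_features)
```

With 20-newsgroups the document-term matrix is large and very sparse, so nearly all of the fit time lands in that one Y @ X call, which is why the csrgemm kernel shows up as the hot spot in the profile.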

>>> cupy.show_config()
CuPy Version          : 6.5.0
CUDA Root             : /usr/local/cuda
CUDA Build Version    : 10000
CUDA Driver Version   : 10000
CUDA Runtime Version  : 10000
cuDNN Build Version   : 7600
cuDNN Version         : 7600
NCCL Build Version    : 2406
NCCL Runtime Version  : 2406
