CUB-based CSR sparse matrix vector multiply #2698
Conversation
Here is a benchmark for randomly distributed non-zero values. The last value in the output is the acceleration factor.

```python
from time import time
import cupy as cp
import numpy as np
import scipy.stats
from math import ceil
from cupyx.scipy.sparse import csr_matrix, csc_matrix
from cupy.cuda.cub import device_csrmv

d = cp.cuda.Device()
for dtype in [np.float32, np.float64, np.complex64, np.complex128]:
    for shape in [(100, 100), (1000, 1000), (10000, 10000)]:
        for percent_nnz in [0.1, 1, 5, 10]:  # different levels of sparsity (% of non-zero values)
            mask = cp.random.randn(*shape) > scipy.stats.norm.ppf(1 - percent_nnz / 100)
            while mask.sum() == 0:
                # may need to retry to get at least one non-zero entry for small sizes
                mask = cp.random.randn(*shape) > scipy.stats.norm.ppf(1 - percent_nnz / 100)
            # print("actual percent nnz = {}".format(mask.sum() / mask.size))
            # mask a dense matrix of the desired density to create a sparse matrix
            A = csr_matrix(cp.arange(mask.size, dtype=dtype).reshape(mask.shape) * mask)
            x = cp.ones(A.shape[1], dtype=A.dtype)
            if x.dtype.kind == 'c':
                x = x + 1j * x
            # time a single cuSPARSE run to choose reps so each loop lasts ~1 second
            cp.cuda.cub_enabled = False
            tstart = time()
            y = A * x
            d.synchronize()
            reps = max(ceil(1.0 / (time() - tstart)), 4)
            # disable CUB to test using cuSPARSE
            cp.cuda.cub_enabled = False
            tstart = time()
            for n in range(reps):
                y = A * x
            d.synchronize()
            duration_cusparse = (time() - tstart) / reps * 1000  # ms
            # enable CUB
            cp.cuda.cub_enabled = True
            tstart = time()
            for n in range(reps):
                y2 = A * x
            d.synchronize()
            duration_cub = (time() - tstart) / reps * 1000  # ms
            print(f"{np.dtype(dtype).name}, shape={shape}, % nnz={percent_nnz}, "
                  f"cusparse: {duration_cusparse} ms, cub: {duration_cub} ms, "
                  f"accel. ratio = {duration_cusparse/duration_cub}")
            cp.testing.assert_allclose(y, y2, atol=1e-5, rtol=1e-5)
```
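The density construction used above can be sanity-checked on the CPU with NumPy and SciPy alone. This is just an illustrative sketch (not part of the PR) confirming that thresholding Gaussian samples at `norm.ppf(1 - p/100)` yields approximately p% non-zero entries:

```python
import numpy as np
import scipy.stats
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
percent_nnz = 5  # target density in percent
shape = (1000, 1000)

# P(randn > ppf(1 - p/100)) == p/100, so each entry of the mask
# is True with probability percent_nnz %
mask = rng.standard_normal(shape) > scipy.stats.norm.ppf(1 - percent_nnz / 100)
A = csr_matrix(np.arange(mask.size, dtype=np.float64).reshape(shape) * mask)

actual = 100 * A.nnz / (shape[0] * shape[1])
print(f"requested {percent_nnz}% nnz, got {actual:.2f}%")
```

Note that `csr_matrix` drops explicit zeros when built from a dense array (the `arange` value at index 0), so `A.nnz` can differ from `mask.sum()` by one.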
@jakirkham @mrocklin @pentschev @anaruse You guys might be interested?
cupy/cuda/cub.pyx (outdated):

```
ws = ndarray(ws_size, numpy.int8)
ws_ptr = <void*>ws.data.ptr
```

Just a nitpick: I noticed that in other modules like cufft and cudnn, we simply allocate raw buffers for temporary workspaces. I already added a fix for other CUB functions in #2682, could you do it here too?

```
ws = memory.alloc(ws_size)
ws_ptr = <void*>ws.ptr
```

You may need to cimport the memory module depending on which PR is merged first 😛

#2682 is merged, so memory is imported and we just need to change these two lines.
@cjnolet this reminds me of a similar conversation we had last week; maybe this interests you too.

I know there were some significant performance improvements in cuSOLVER starting with CUDA 10.1; not sure if cuSPARSE had anything similar though. What version of CUDA did you use to compute the benchmarks, @grlee77?

The numbers above are from 10.0, but I think I saw similar timings for 10.1 on another computer. I can try with the latest release and see if it makes a difference.

It would be nice to see if there's any improvement, but I can't guarantee there really is one. And of course, only if you have the chance to do that.
cuSPARSE is moving toward the new generic APIs starting with CUDA 10.1.

Thanks @anaruse! What do you think would be needed to adopt the new generic APIs in CuPy?

Those new generic cuSPARSE APIs do look quite flexible. I guess the main barrier currently is that they are still marked as a "preview feature", with a lack of Windows support and the possibility of API changes in a future release.

I suppose we want to wait until the API is stable?

I don't see any problem with merging this now and comparing it with cuSPARSE performance later. The changes are quite small, so it won't be difficult to switch if cuSPARSE outperforms it.

Jenkins, test this please
Successfully created a job for commit 18cf698:

Jenkins CI test (for commit 18cf698, target branch master) failed with status FAILURE.

Force-pushed 2df870f to 6c167ef.
On a P100 + CUDA 9.2:

Once the comments are addressed, could you please squash the commits into one?
- add device_csrmv to cub.pyx
- enable CUB-based __mul__ for csr_matrix
- make CUB workspace allocation consistent with master branch
- conditional CUB-related import in cupyx/scipy/sparse/csr.py

Co-Authored-By: Leo Fang <leofang@bnl.gov>
Force-pushed 3e61ca3 to a6d85b7.
Okay, done.

Also, I did end up benchmarking on 10.1 update 2 and got a very similar result to that for 10.0.

Jenkins, test this please

Successfully created a job for commit a6d85b7:

Jenkins CI test (for commit a6d85b7, target branch master) succeeded!
When adding tests for this PR (see #3428) I noticed that cuSPARSE internally uses

Update: looks like cuSPARSE will be on par with CUB after #3430 is merged. It would be great if the test script above could be rerun with that change.
Update: CUB still wins for small nnz?! 2080 Ti + CUDA 10.0:

float32, shape=(100, 100), % nnz=0.1, cusparse: 0.03288137027866406 ms, cub: 0.016180590167437812 ms, accel. ratio = 2.032149009300988
float32, shape=(100, 100), % nnz=1, cusparse: 0.035954668359292445 ms, cub: 0.017097071107115128 ms, accel. ratio = 2.1029723824643582
float32, shape=(100, 100), % nnz=5, cusparse: 0.035025128362892274 ms, cub: 0.01702057667875161 ms, accel. ratio = 2.057810908758306
float32, shape=(100, 100), % nnz=10, cusparse: 0.0379940815697684 ms, cub: 0.01978315164313934 ms, accel. ratio = 1.920527237273869
float32, shape=(1000, 1000), % nnz=0.1, cusparse: 0.03830483873376732 ms, cub: 0.01982111028995662 ms, accel. ratio = 1.9325274000002122
float32, shape=(1000, 1000), % nnz=1, cusparse: 0.039223355823491575 ms, cub: 0.01976819310947474 ms, accel. ratio = 1.9841649465014648
float32, shape=(1000, 1000), % nnz=5, cusparse: 0.038054679209698676 ms, cub: 0.02019308933486832 ms, accel. ratio = 1.8845397342935506
float32, shape=(1000, 1000), % nnz=10, cusparse: 0.03805819548404062 ms, cub: 0.019400639561633154 ms, accel. ratio = 1.9616979823337777
float32, shape=(10000, 10000), % nnz=0.1, cusparse: 0.03747650272938428 ms, cub: 0.019134078895189484 ms, accel. ratio = 1.9586259121575107
float32, shape=(10000, 10000), % nnz=1, cusparse: 0.03788194718976511 ms, cub: 0.027648207916994325 ms, accel. ratio = 1.370141142727753
float32, shape=(10000, 10000), % nnz=5, cusparse: 0.09055808186531067 ms, cub: 0.09002039829889934 ms, accel. ratio = 1.0059729081027395
float32, shape=(10000, 10000), % nnz=10, cusparse: 0.15603062472765958 ms, cub: 0.160061105897155 ms, accel. ratio = 0.9748191095712837
float64, shape=(100, 100), % nnz=0.1, cusparse: 0.03594571030634855 ms, cub: 0.01725200063277966 ms, accel. ratio = 2.0835676436302646
float64, shape=(100, 100), % nnz=1, cusparse: 0.03518878749769574 ms, cub: 0.017290551064896513 ms, accel. ratio = 2.0351455176657987
float64, shape=(100, 100), % nnz=5, cusparse: 0.03814147143669144 ms, cub: 0.01998262603672679 ms, accel. ratio = 1.9087316835429864
float64, shape=(100, 100), % nnz=10, cusparse: 0.03856049715476521 ms, cub: 0.020988052812750357 ms, accel. ratio = 1.8372593922262048
float64, shape=(1000, 1000), % nnz=0.1, cusparse: 0.0394091264238102 ms, cub: 0.019844570850355053 ms, accel. ratio = 1.9858895776073233
float64, shape=(1000, 1000), % nnz=1, cusparse: 0.03946489153546851 ms, cub: 0.02008244811890185 ms, accel. ratio = 1.9651434577004416
float64, shape=(1000, 1000), % nnz=5, cusparse: 0.03950237239531929 ms, cub: 0.020149570183465946 ms, accel. ratio = 1.9604573217017602
float64, shape=(1000, 1000), % nnz=10, cusparse: 0.03942697505616797 ms, cub: 0.019986594722439282 ms, accel. ratio = 1.9726709628980794
float64, shape=(10000, 10000), % nnz=0.1, cusparse: 0.03773242503673107 ms, cub: 0.01926751466126771 ms, accel. ratio = 1.9583441715283718
float64, shape=(10000, 10000), % nnz=1, cusparse: 0.04608750803590281 ms, cub: 0.046091190176120594 ms, accel. ratio = 0.9999201118434192
float64, shape=(10000, 10000), % nnz=5, cusparse: 0.1424588768730336 ms, cub: 0.1474947950958666 ms, accel. ratio = 0.9658569767186711
float64, shape=(10000, 10000), % nnz=10, cusparse: 0.25253528501929307 ms, cub: 0.26703462368104514 ms, accel. ratio = 0.9457024019511772
complex64, shape=(100, 100), % nnz=0.1, cusparse: 0.035181638686929466 ms, cub: 0.017354517521590788 ms, accel. ratio = 2.0272323124604257
complex64, shape=(100, 100), % nnz=1, cusparse: 0.03498031130607578 ms, cub: 0.01751938246325218 ms, accel. ratio = 1.9966634885361292
complex64, shape=(100, 100), % nnz=5, cusparse: 0.037810314525732185 ms, cub: 0.020132326646416492 ms, accel. ratio = 1.878089660961384
complex64, shape=(100, 100), % nnz=10, cusparse: 0.037990097683660184 ms, cub: 0.020220227748248353 ms, accel. ratio = 1.8788165077394448
complex64, shape=(1000, 1000), % nnz=0.1, cusparse: 0.03949731360081808 ms, cub: 0.02026617429539406 ms, accel. ratio = 1.9489279537971171
complex64, shape=(1000, 1000), % nnz=1, cusparse: 0.0394272004836946 ms, cub: 0.020245663069758283 ms, accel. ratio = 1.9474393280103781
complex64, shape=(1000, 1000), % nnz=5, cusparse: 0.03983381247427596 ms, cub: 0.02030750622949472 ms, accel. ratio = 1.961531466449639
complex64, shape=(1000, 1000), % nnz=10, cusparse: 0.039254328835754956 ms, cub: 0.02023206979532953 ms, accel. ratio = 1.9402033125061984
complex64, shape=(10000, 10000), % nnz=0.1, cusparse: 0.03802776336669922 ms, cub: 0.01952286922570431 ms, accel. ratio = 1.947857301458071
complex64, shape=(10000, 10000), % nnz=1, cusparse: 0.046301805056058444 ms, cub: 0.04304097248957708 ms, accel. ratio = 1.0757611266165286
complex64, shape=(10000, 10000), % nnz=5, cusparse: 0.140159780328924 ms, cub: 0.14214515686035156 ms, accel. ratio = 0.9860327528895124
complex64, shape=(10000, 10000), % nnz=10, cusparse: 0.2610818626954383 ms, cub: 0.2682221304510058 ms, accel. ratio = 0.973379274321767
complex128, shape=(100, 100), % nnz=0.1, cusparse: 0.034599749841422674 ms, cub: 0.016676154092093495 ms, accel. ratio = 2.074803917638727
complex128, shape=(100, 100), % nnz=1, cusparse: 0.03305511217458699 ms, cub: 0.01641655825430638 ms, accel. ratio = 2.0135226679389997
complex128, shape=(100, 100), % nnz=5, cusparse: 0.036014381307840994 ms, cub: 0.01920604234972773 ms, accel. ratio = 1.8751589032266998
complex128, shape=(100, 100), % nnz=10, cusparse: 0.036298141609359486 ms, cub: 0.01925474010565548 ms, accel. ratio = 1.8851535471360652
complex128, shape=(1000, 1000), % nnz=0.1, cusparse: 0.037915798615741826 ms, cub: 0.01939381296295239 ms, accel. ratio = 1.9550461112609268
complex128, shape=(1000, 1000), % nnz=1, cusparse: 0.03746877170982968 ms, cub: 0.019219932719555684 ms, accel. ratio = 1.9494746551170998
complex128, shape=(1000, 1000), % nnz=5, cusparse: 0.038534655231107016 ms, cub: 0.019247462905959165 ms, accel. ratio = 2.002064137979265
complex128, shape=(1000, 1000), % nnz=10, cusparse: 0.03745986285214932 ms, cub: 0.019266286985205938 ms, accel. ratio = 1.944321855109588
complex128, shape=(10000, 10000), % nnz=0.1, cusparse: 0.038168942968999975 ms, cub: 0.01963576359146691 ms, accel. ratio = 1.9438481621151216
complex128, shape=(10000, 10000), % nnz=1, cusparse: 0.07437754280959503 ms, cub: 0.06749961949601958 ms, accel. ratio = 1.1018957345971563
complex128, shape=(10000, 10000), % nnz=5, cusparse: 0.267100457700423 ms, cub: 0.25017767990191364 ms, accel. ratio = 1.067643035962058
complex128, shape=(10000, 10000), % nnz=10, cusparse: 0.48387912382562476 ms, cub: 0.45938951423369256 ms, accel. ratio = 1.0533090304265724
I did not know cuSPARSE used CUB internally. If I had, I probably wouldn't have bothered to try it separately. Another complicating factor is that it seems that
I was curious whether CUB would also improve the performance of sparse matrix-vector multiplication. It seems that for fewer than ~1,000,000 non-zero entries it is faster; at sizes larger than that, it is basically tied with cuSPARSE. It is a bit disappointing that the >3-fold gains at smaller sizes don't hold up for the larger problems of interest, although I would be curious whether this is also true for higher-end NVIDIA cards. (I tested with a GTX 1080 Ti.)

The main limitation is that CUB only has this one sparse matrix function, so it is limited to CSR only. Aside from performance, I think the other feature it has over cuSPARSE is that it can be used for additional dtypes such as `int` that are not supported by cuSPARSE. This can be done with the undocumented `device_csrmv` function here, but the end-user interface is via the existing `__mul__` method of `csr_matrix` and is thus limited to floating-point types. I will post a benchmark below.
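As background (not from the PR itself): the kernel being dispatched here computes the standard CSR matrix-vector product, y[i] = Σ data[k] · x[indices[k]] over the non-zeros of row i. A minimal NumPy reference of that computation, checked against SciPy on the CPU, illustrates what both CUB's csrmv and cuSPARSE's routine are doing; the helper name `csr_matvec` is made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_matvec(data, indices, indptr, x):
    """Reference CSR SpMV: for each row i, dot the stored values in
    data[indptr[i]:indptr[i+1]] with the gathered entries of x."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(data, x))
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        y[i] = data[start:end] @ x[indices[start:end]]
    return y

# check against scipy on a small random ~10%-dense matrix
rng = np.random.default_rng(0)
dense = (rng.random((50, 40)) > 0.9) * rng.random((50, 40))
A = csr_matrix(dense)
x = rng.random(40)
y = csr_matvec(A.data, A.indices, A.indptr, x)
assert np.allclose(y, A @ x)
```

Nothing in this reference restricts the dtype to floating point, which is the flexibility mentioned above: the same loop works unchanged for integer `data` and `x`.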