CUB-based CSR sparse matrix vector multiply #2698
Conversation
Here is a benchmark for randomly distributed non-zero values. The last value in the output is the acceleration factor.

```python
from time import time
import cupy as cp
import numpy as np
import scipy.stats
from math import ceil
from cupyx.scipy.sparse import csr_matrix, csc_matrix
from cupy.cuda.cub import device_csrmv

d = cp.cuda.Device()
for dtype in [np.float32, np.float64, np.complex64, np.complex128]:
    for shape in [(100, 100), (1000, 1000), (10000, 10000)]:
        for percent_nnz in [0.1, 1, 5, 10]:  # different levels of sparsity (% of non-zero values)
            mask = cp.random.randn(*shape) > scipy.stats.norm.ppf(1 - percent_nnz / 100)
            while mask.sum() == 0:
                # may need to retry to get at least one non-zero entry for small sizes
                mask = cp.random.randn(*shape) > scipy.stats.norm.ppf(1 - percent_nnz / 100)
            # print("actual percent nnz = {}".format(mask.sum() / mask.size))
            # mask a dense matrix of the desired density to create a sparse matrix
            A = csr_matrix(cp.arange(mask.size, dtype=dtype).reshape(mask.shape) * mask)
            x = cp.ones(A.shape[1], dtype=A.dtype)
            if x.dtype.kind == 'c':
                x = x + 1j * x
            # time a single cuSPARSE run to choose reps so each loop lasts ~1 second
            cp.cuda.cub_enabled = False
            tstart = time()
            y = A * x
            d.synchronize()
            reps = max(ceil(1.0 / (time() - tstart)), 4)
            # disable CUB to test using cuSPARSE
            cp.cuda.cub_enabled = False
            tstart = time()
            for n in range(reps):
                y = A * x
            d.synchronize()
            duration_cusparse = (time() - tstart) / reps * 1000  # ms
            # enable CUB
            cp.cuda.cub_enabled = True
            tstart = time()
            for n in range(reps):
                y2 = A * x
            d.synchronize()
            duration_cub = (time() - tstart) / reps * 1000  # ms
            print(f"{np.dtype(dtype).name}, shape={shape}, % nnz={percent_nnz}, "
                  f"cusparse: {duration_cusparse} ms, cub: {duration_cub} ms, "
                  f"accel. ratio = {duration_cusparse/duration_cub}")
            cp.testing.assert_allclose(y, y2, atol=1e-5, rtol=1e-5)
```
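The density construction used above can be sanity-checked on the CPU with NumPy and SciPy alone. This is just an illustrative sketch (not part of the PR) confirming that thresholding Gaussian samples at `norm.ppf(1 - p/100)` yields approximately p% non-zero entries:

```python
import numpy as np
import scipy.stats
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
percent_nnz = 5  # target density in percent
shape = (1000, 1000)

# P(randn > ppf(1 - p/100)) == p/100, so each entry of the mask
# is True with probability percent_nnz %
mask = rng.standard_normal(shape) > scipy.stats.norm.ppf(1 - percent_nnz / 100)
A = csr_matrix(np.arange(mask.size, dtype=np.float64).reshape(shape) * mask)

actual = 100 * A.nnz / (shape[0] * shape[1])
print(f"requested {percent_nnz}% nnz, got {actual:.2f}%")
```

Note that `csr_matrix` drops explicit zeros when built from a dense array (the `arange` value at index 0), so `A.nnz` can differ from `mask.sum()` by one.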
@jakirkham @mrocklin @pentschev @anaruse You guys might be interested?
cupy/cuda/cub.pyx (outdated):

```
ws = ndarray(ws_size, numpy.int8)
ws_ptr = <void*>ws.data.ptr
```

Just a nitpick: I noticed that in other modules like cufft and cudnn, we simply allocate raw buffers for temporary workspaces. I already added a fix for other CUB functions in #2682, could you do it here too?

```
ws = memory.alloc(ws_size)
ws_ptr = <void*>ws.ptr
```

You may need to cimport the memory module depending on which PR is merged first 😛

#2682 is merged, so memory is imported and we just need to change these two lines.
@cjnolet this reminds me of a similar conversation we had last week; maybe this interests you too.

I know there were some significant performance improvements in cuSOLVER starting with CUDA 10.1; not sure if cuSPARSE had anything similar though. What version of CUDA did you use to compute the benchmarks, @grlee77?

The numbers above are from 10.0, but I think I saw similar timings for 10.1 on another computer. I can try with the latest release and see if it makes a difference.

It would be nice to see if there's any improvement, but I can't guarantee there really is one. And of course, only if you have the chance to do that.
cuSPARSE is moving toward the new generic APIs starting with CUDA 10.1.

Thanks @anaruse! What do you think would be needed to adopt the new generic APIs in CuPy?

Those new generic cuSPARSE APIs do look quite flexible. I guess the main barrier currently is that they are still marked as a "preview feature", with a lack of Windows support and the possibility of API changes in a future release.

I suppose we want to wait until the API is stable?

I don't see any problem with merging this now and comparing it with cuSPARSE performance later. The changes are quite small, so it won't be difficult to switch if cuSPARSE outperforms it.

Jenkins, test this please
Successfully created a job for commit 18cf698:

Jenkins CI test (for commit 18cf698, target branch master) failed with status FAILURE.

Force-pushed 2df870f to 6c167ef.
On a P100 + CUDA 9.2:

Once the comments are addressed, could you please squash the commits into one?
- add device_csrmv to cub.pyx
- enable CUB-based __mul__ for csr_matrix
- make CUB workspace allocation consistent with master branch
- conditional CUB-related import in cupyx/scipy/sparse/csr.py

Co-Authored-By: Leo Fang <leofang@bnl.gov>
Force-pushed 3e61ca3 to a6d85b7.
Okay, done.

Also, I did end up benchmarking on 10.1 update 2 and got a very similar result to that for 10.0.

Jenkins, test this please

Successfully created a job for commit a6d85b7:

Jenkins CI test (for commit a6d85b7, target branch master) succeeded!
When adding tests for this PR (see #3428) I noticed that cuSPARSE internally uses

Update: looks like cuSPARSE will be on par with CUB after #3430 is merged. It would be great if the test script above could be rerun with that change.
Update: CUB still wins for small nnz?! 2080 Ti + CUDA 10.0:

float32, shape=(100, 100), % nnz=0.1, cusparse: 0.03288137027866406 ms, cub: 0.016180590167437812 ms, accel. ratio = 2.032149009300988
float32, shape=(100, 100), % nnz=1, cusparse: 0.035954668359292445 ms, cub: 0.017097071107115128 ms, accel. ratio = 2.1029723824643582
float32, shape=(100, 100), % nnz=5, cusparse: 0.035025128362892274 ms, cub: 0.01702057667875161 ms, accel. ratio = 2.057810908758306
float32, shape=(100, 100), % nnz=10, cusparse: 0.0379940815697684 ms, cub: 0.01978315164313934 ms, accel. ratio = 1.920527237273869
float32, shape=(1000, 1000), % nnz=0.1, cusparse: 0.03830483873376732 ms, cub: 0.01982111028995662 ms, accel. ratio = 1.9325274000002122
float32, shape=(1000, 1000), % nnz=1, cusparse: 0.039223355823491575 ms, cub: 0.01976819310947474 ms, accel. ratio = 1.9841649465014648
float32, shape=(1000, 1000), % nnz=5, cusparse: 0.038054679209698676 ms, cub: 0.02019308933486832 ms, accel. ratio = 1.8845397342935506
float32, shape=(1000, 1000), % nnz=10, cusparse: 0.03805819548404062 ms, cub: 0.019400639561633154 ms, accel. ratio = 1.9616979823337777
float32, shape=(10000, 10000), % nnz=0.1, cusparse: 0.03747650272938428 ms, cub: 0.019134078895189484 ms, accel. ratio = 1.9586259121575107
float32, shape=(10000, 10000), % nnz=1, cusparse: 0.03788194718976511 ms, cub: 0.027648207916994325 ms, accel. ratio = 1.370141142727753
float32, shape=(10000, 10000), % nnz=5, cusparse: 0.09055808186531067 ms, cub: 0.09002039829889934 ms, accel. ratio = 1.0059729081027395
float32, shape=(10000, 10000), % nnz=10, cusparse: 0.15603062472765958 ms, cub: 0.160061105897155 ms, accel. ratio = 0.9748191095712837
float64, shape=(100, 100), % nnz=0.1, cusparse: 0.03594571030634855 ms, cub: 0.01725200063277966 ms, accel. ratio = 2.0835676436302646
float64, shape=(100, 100), % nnz=1, cusparse: 0.03518878749769574 ms, cub: 0.017290551064896513 ms, accel. ratio = 2.0351455176657987
float64, shape=(100, 100), % nnz=5, cusparse: 0.03814147143669144 ms, cub: 0.01998262603672679 ms, accel. ratio = 1.9087316835429864
float64, shape=(100, 100), % nnz=10, cusparse: 0.03856049715476521 ms, cub: 0.020988052812750357 ms, accel. ratio = 1.8372593922262048
float64, shape=(1000, 1000), % nnz=0.1, cusparse: 0.0394091264238102 ms, cub: 0.019844570850355053 ms, accel. ratio = 1.9858895776073233
float64, shape=(1000, 1000), % nnz=1, cusparse: 0.03946489153546851 ms, cub: 0.02008244811890185 ms, accel. ratio = 1.9651434577004416
float64, shape=(1000, 1000), % nnz=5, cusparse: 0.03950237239531929 ms, cub: 0.020149570183465946 ms, accel. ratio = 1.9604573217017602
float64, shape=(1000, 1000), % nnz=10, cusparse: 0.03942697505616797 ms, cub: 0.019986594722439282 ms, accel. ratio = 1.9726709628980794
float64, shape=(10000, 10000), % nnz=0.1, cusparse: 0.03773242503673107 ms, cub: 0.01926751466126771 ms, accel. ratio = 1.9583441715283718
float64, shape=(10000, 10000), % nnz=1, cusparse: 0.04608750803590281 ms, cub: 0.046091190176120594 ms, accel. ratio = 0.9999201118434192
float64, shape=(10000, 10000), % nnz=5, cusparse: 0.1424588768730336 ms, cub: 0.1474947950958666 ms, accel. ratio = 0.9658569767186711
float64, shape=(10000, 10000), % nnz=10, cusparse: 0.25253528501929307 ms, cub: 0.26703462368104514 ms, accel. ratio = 0.9457024019511772
complex64, shape=(100, 100), % nnz=0.1, cusparse: 0.035181638686929466 ms, cub: 0.017354517521590788 ms, accel. ratio = 2.0272323124604257
complex64, shape=(100, 100), % nnz=1, cusparse: 0.03498031130607578 ms, cub: 0.01751938246325218 ms, accel. ratio = 1.9966634885361292
complex64, shape=(100, 100), % nnz=5, cusparse: 0.037810314525732185 ms, cub: 0.020132326646416492 ms, accel. ratio = 1.878089660961384
complex64, shape=(100, 100), % nnz=10, cusparse: 0.037990097683660184 ms, cub: 0.020220227748248353 ms, accel. ratio = 1.8788165077394448
complex64, shape=(1000, 1000), % nnz=0.1, cusparse: 0.03949731360081808 ms, cub: 0.02026617429539406 ms, accel. ratio = 1.9489279537971171
complex64, shape=(1000, 1000), % nnz=1, cusparse: 0.0394272004836946 ms, cub: 0.020245663069758283 ms, accel. ratio = 1.9474393280103781
complex64, shape=(1000, 1000), % nnz=5, cusparse: 0.03983381247427596 ms, cub: 0.02030750622949472 ms, accel. ratio = 1.961531466449639
complex64, shape=(1000, 1000), % nnz=10, cusparse: 0.039254328835754956 ms, cub: 0.02023206979532953 ms, accel. ratio = 1.9402033125061984
complex64, shape=(10000, 10000), % nnz=0.1, cusparse: 0.03802776336669922 ms, cub: 0.01952286922570431 ms, accel. ratio = 1.947857301458071
complex64, shape=(10000, 10000), % nnz=1, cusparse: 0.046301805056058444 ms, cub: 0.04304097248957708 ms, accel. ratio = 1.0757611266165286
complex64, shape=(10000, 10000), % nnz=5, cusparse: 0.140159780328924 ms, cub: 0.14214515686035156 ms, accel. ratio = 0.9860327528895124
complex64, shape=(10000, 10000), % nnz=10, cusparse: 0.2610818626954383 ms, cub: 0.2682221304510058 ms, accel. ratio = 0.973379274321767
complex128, shape=(100, 100), % nnz=0.1, cusparse: 0.034599749841422674 ms, cub: 0.016676154092093495 ms, accel. ratio = 2.074803917638727
complex128, shape=(100, 100), % nnz=1, cusparse: 0.03305511217458699 ms, cub: 0.01641655825430638 ms, accel. ratio = 2.0135226679389997
complex128, shape=(100, 100), % nnz=5, cusparse: 0.036014381307840994 ms, cub: 0.01920604234972773 ms, accel. ratio = 1.8751589032266998
complex128, shape=(100, 100), % nnz=10, cusparse: 0.036298141609359486 ms, cub: 0.01925474010565548 ms, accel. ratio = 1.8851535471360652
complex128, shape=(1000, 1000), % nnz=0.1, cusparse: 0.037915798615741826 ms, cub: 0.01939381296295239 ms, accel. ratio = 1.9550461112609268
complex128, shape=(1000, 1000), % nnz=1, cusparse: 0.03746877170982968 ms, cub: 0.019219932719555684 ms, accel. ratio = 1.9494746551170998
complex128, shape=(1000, 1000), % nnz=5, cusparse: 0.038534655231107016 ms, cub: 0.019247462905959165 ms, accel. ratio = 2.002064137979265
complex128, shape=(1000, 1000), % nnz=10, cusparse: 0.03745986285214932 ms, cub: 0.019266286985205938 ms, accel. ratio = 1.944321855109588
complex128, shape=(10000, 10000), % nnz=0.1, cusparse: 0.038168942968999975 ms, cub: 0.01963576359146691 ms, accel. ratio = 1.9438481621151216
complex128, shape=(10000, 10000), % nnz=1, cusparse: 0.07437754280959503 ms, cub: 0.06749961949601958 ms, accel. ratio = 1.1018957345971563
complex128, shape=(10000, 10000), % nnz=5, cusparse: 0.267100457700423 ms, cub: 0.25017767990191364 ms, accel. ratio = 1.067643035962058
complex128, shape=(10000, 10000), % nnz=10, cusparse: 0.48387912382562476 ms, cub: 0.45938951423369256 ms, accel. ratio = 1.0533090304265724
I did not know cuSPARSE used CUB internally. If I had, I probably wouldn't have bothered to try it separately. Another complicating factor is that it seems that
I was curious whether CUB would also improve the performance of sparse matrix-vector multiplication. It seems that for fewer than ~1,000,000 non-zero entries it is faster; at sizes larger than that, it is basically tied with cuSPARSE. It is a bit disappointing that the >3-fold gains at smaller sizes don't hold up for the larger problems of interest, although I would be curious whether this is also true for higher-end NVIDIA cards. (I tested with a GTX 1080 Ti.)

The main limitation is that CUB only has this one sparse matrix function, so it is limited to CSR only. Aside from performance, I think the other feature it has over cuSPARSE is that it can be used for additional dtypes such as `int` that are not supported by cuSPARSE. This can be done with the undocumented `device_csrmv` function here, but the end-user interface is via the existing `__mul__` method of `csr_matrix` and is thus limited to floating-point types. I will post a benchmark below.
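As background (not from the PR itself): the kernel being dispatched here computes the standard CSR matrix-vector product, y[i] = Σ data[k] · x[indices[k]] over the non-zeros of row i. A minimal NumPy reference of that computation, checked against SciPy on the CPU, illustrates what both CUB's csrmv and cuSPARSE's routine are doing; the helper name `csr_matvec` is made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_matvec(data, indices, indptr, x):
    """Reference CSR SpMV: for each row i, dot the stored values in
    data[indptr[i]:indptr[i+1]] with the gathered entries of x."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows, dtype=np.result_type(data, x))
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        y[i] = data[start:end] @ x[indices[start:end]]
    return y

# check against scipy on a small random ~10%-dense matrix
rng = np.random.default_rng(0)
dense = (rng.random((50, 40)) > 0.9) * rng.random((50, 40))
A = csr_matrix(dense)
x = rng.random(40)
y = csr_matvec(A.data, A.indices, A.indptr, x)
assert np.allclose(y, A @ x)
```

Nothing in this reference restricts the dtype to floating point, which is the flexibility mentioned above: the same loop works unchanged for integer `data` and `x`.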