Speed up CSR SpMV by orders of magnitude #3430
Conversation
|
TODO: compare with CUB SpMV, as I suspect they should be on par. |
|
Benchmarking (look at the last line in the output). Script:

```python
import numpy as np
import cupy as cp
import cupyx
from cupyx.time import repeat
import scipy.sparse

# switch CUB on and off here
cp.cuda.cub_enabled = True

# matrix dimensions
M = np.int64(1e6)
N = M
nnz = M * 100
Krow = np.random.randint(0, high=M, size=nnz//2, dtype='l')
Kcol = np.random.randint(0, high=M, size=nnz//2, dtype='l')
# make duplicates
Krow = np.concatenate((Krow, Krow))
Kcol = np.concatenate((Kcol, Kcol))
Kval = np.random.rand(nnz).astype('float32')

# use scipy
S = scipy.sparse.csr_matrix((Kval.ravel(), (Krow.ravel(), Kcol.ravel())), shape=(M, N))
ST = scipy.sparse.csc_matrix((Kval.ravel(), (Kcol.ravel(), Krow.ravel())), shape=(N, M))
# vector to multiply
x = np.random.rand(M).astype('float32')

#print("left multiply (CSR)")
Sx = S * x
time_scipy = repeat(S.__mul__, (x,), n_repeat=100).cpu_times.mean()
#print("right multiply (CSC)")
xS = x * ST
time_scipyrm = repeat(x.__mul__, (ST,), n_repeat=100).cpu_times.mean()
print("allclose CPU:", np.allclose(Sx, xS))

# use cupy
Krow = cp.asarray(Krow)
Kcol = cp.asarray(Kcol)
Kval = cp.asarray(Kval)
x = cp.asarray(x)
S = cupyx.scipy.sparse.coo_matrix((Kval.ravel(), (Krow.ravel(), Kcol.ravel())), shape=(M, N))
S = cupyx.scipy.sparse.csr_matrix(S)
ST = cupyx.scipy.sparse.coo_matrix((Kval.ravel(), (Kcol.ravel(), Krow.ravel())), shape=(N, M))
ST = cupyx.scipy.sparse.csc_matrix(ST)

#print("left multiply by csr")
cSx = S * x
time_cupy1 = repeat(S.__mul__, (x,), n_repeat=100).gpu_times.mean()
#print("right multiply by csc")
cxS = x * ST
time_cupy1rm = repeat(x.__mul__, (ST,), n_repeat=100).gpu_times.mean()

xS = cp.asarray(xS)
Sx = cp.asarray(Sx)
print("allclose GPU:", cp.allclose(cSx, cxS))
print("allclose GPU-CPU: (rm)", cp.allclose(cSx, xS), "(lm)", cp.allclose(cSx, Sx))
print("time scipy: lm:", time_scipy, "rm:", time_scipyrm)
print("time cupy: lm:", time_cupy1, "rm:", time_cupy1rm)
```

Before this PR:

```
time cupy: lm: 0.0012003951942920685 rm: 1.6158148022460939 # CUB is on
time cupy: lm: 0.0012251673579216002 rm: 1.8342621813964843 # CUB is off
```

With this PR (>1300x speedup!):

```
time cupy: lm: 0.0011969430339336395 rm: 0.0012273990404605866 # CUB is on
time cupy: lm: 0.0012167888033390046 rm: 0.0012395257604122165 # CUB is off
```
|
|
See the benchmark with this PR added against CUB here: #2698 (comment). |
cupyx/scipy/sparse/coo.py
```diff
         self._has_canonical_format = True
         return
-    keys = cupy.stack([self.col, self.row])
+    keys = cupy.stack([self.row, self.col])
```
|
Are you sure about this change? The description you quoted states:
"the index arrays are first sorted by row indices and then within the same row by column indices"
Since lexsort uses the last row as the primary sort key, didn't the already existing code match that description?
In either case, it seems it can be verified empirically against SciPy. If all test cases pass either way it seems that would indicate a hole in our test coverage!
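For reference, a quick NumPy check of the sort-key convention under discussion (`cupy.lexsort` follows NumPy's): the last key passed to `lexsort` is the primary one, so stacking `[col, row]` gives row-major order.

```python
import numpy as np

# lexsort treats the LAST key as the primary sort key
row = np.array([1, 0, 1, 0])
col = np.array([1, 1, 0, 0])
order = np.lexsort((col, row))  # sort by row first, then by col within a row
# entries visited in (row, col) order: (0,0), (0,1), (1,0), (1,1)
print(order)  # -> [3 1 2 0]
```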
|
I need your help, Gregory. I've been looking into this for a few hours, and I am super confused now.
It seems the change at this line affects neither speed nor correctness, at least with my test script. (And FYI, the whole test suite tests/cupyx_tests/scipy_tests/sparse_tests/ passed either way!) The only thing that matters (for speed) is to set _has_canonical_format=True, which makes sense because when calling sum_duplicates() both CSC & CSR convert to COO (that's why changes in coo.py affect CSR), and this flag provides a fast path. Does that mean keys, lexsort, etc. here are simply dead code? I am lost...
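For comparison, this is the SciPy behavior the flag mirrors (a small sketch using SciPy's attribute names): `sum_duplicates()` on a COO matrix sums repeated entries, sorts the indices, and sets the canonical-format flag, so later conversions can skip the work.

```python
import numpy as np
import scipy.sparse

row = np.array([0, 0, 1])
col = np.array([0, 0, 1])
data = np.array([1.0, 2.0, 3.0])
A = scipy.sparse.coo_matrix((data, (row, col)), shape=(2, 2))

A.sum_duplicates()  # merges the two (0, 0) entries and sorts the indices
print(A.has_canonical_format)  # SciPy sets the flag here
print(A.toarray())
```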
|
I haven't looked at it closely, but was just wondering what motivated the change. Can try to take a look tomorrow.
|
I'm now inclined to think this was simply an oversight in previous PRs and that we don't need to change this line. After coo.sum_duplicates(), _has_canonical_format should always be True (in line with SciPy). The only concern is that cusparse.coosort() sorts by row by default, which doesn't sound right for tocsc(), so I will address both issues.
|
The csc format is stored in column-major format, whereas csr is in row-major,
so I think a CSC matrix is stored internally as a transpose of a CSR matrix.
|
A CSR/CSC matrix internally stores 3 arrays: data of size nnz, indices of size nnz, and indptr of size m+1/n+1. This is the same in both cuSPARSE and SciPy. A transpose operation would need to update indices and indptr.
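A small SciPy illustration of the three arrays, and of the observation that CSC storage of the transpose is bit-for-bit the CSR storage of the original (which is why `.T` can simply reinterpret the arrays):

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 3, 0]]))
print(A.data)     # nnz values
print(A.indices)  # column index of each value
print(A.indptr)   # row pointers, size m+1

# transposing reinterprets the same three arrays as CSC storage
AT = A.T
print(AT.format)  # 'csc'
print(AT.toarray())
```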
|
Revisiting this line, interestingly I found SciPy actually does what @smarkesini did (sort by col first): https://github.com/scipy/scipy/blob/adc4f4f7bab120ccfab9383aba272954a0a12fb0/scipy/sparse/coo.py#L543. I wonder why no test suite has caught this deviation so far (and how to write one)...
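One plausible reason no test catches the difference (my speculation, sketched below): duplicates share the exact same (row, col) pair, so they end up adjacent under either sort order, and summing duplicates then yields the same matrix either way. A test on the resulting values alone cannot distinguish the two orders.

```python
import numpy as np

row = np.array([1, 0, 1, 0, 1])
col = np.array([0, 1, 1, 0, 0])  # entries 0 and 4 are duplicates at (1, 0)

by_row_major = np.lexsort((col, row))  # primary key: row
by_col_major = np.lexsort((row, col))  # primary key: col
# the orders differ, but the two (1, 0) duplicates are adjacent in both,
# which is all that summing duplicates needs for correctness
print(by_row_major)
print(by_col_major)
```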
|
Wow, not sure how this got overlooked. That is an amazing improvement for very little change! |
|
I don't know why I didn't look at this in depth when the issue was created. |
```diff
         cusparse.xcoo2csr(
             handle, x.col.data.ptr, x.nnz, n,
             indptr.data.ptr, cusparse.CUSPARSE_INDEX_BASE_ZERO)
```
|
This works for coo2csc according to the documentation of cusparse<t>coo2csr():
It can also be used to convert the array containing the uncompressed column indices (corresponding to COO format) into an array of column pointers (corresponding to CSC format).
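What this compression step amounts to, sketched in plain NumPy (an illustration, not the cuSPARSE kernel): turn a sorted index array into a pointer array of size n+1. Run it on sorted row indices and you get CSR row pointers; run it on sorted column indices and you get CSC column pointers, which is why the one routine serves both conversions.

```python
import numpy as np

def compress_indices(idx, n):
    """Turn sorted COO indices (row OR column) into a pointer array of size n+1."""
    return np.concatenate(([0], np.cumsum(np.bincount(idx, minlength=n))))

col = np.array([0, 0, 1, 2, 2, 2])  # sorted COO column indices
print(compress_indices(col, 3))  # -> [0 2 3 6], a valid CSC indptr
```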
|
I made the conversion from COO to CSC/CSR symmetric. This requires |
```diff
         shape = self.shape[1], self.shape[0]
-        return csc.csc_matrix(
+        trans = csc.csc_matrix(
             (self.data, self.indices, self.indptr), shape=shape, copy=copy)
+        trans._has_canonical_format = self._has_canonical_format
+        return trans
```
|
I am still concerned with this transpose operation (and also for its csc counterpart). If I understand it correctly, self.indices and self.indptr should be updated during transpose. Am I missing something?
|
I think for the transpose operation as used in e.g. csrmv in cupy/cusparse.py ... cusparse.spMV(), you compute the multiplication directly with the transposed matrix using a _transpose_flag, without actually computing the transpose.
|
It seems SciPy does this too. I don't understand...
|
We left comments simultaneously, @smarkesini! It's good to know _transpose_flag -- I wasn't aware of that. But the transpose method here is for high-level operations, say, b=a.T.
|
Generally the sparse matrix operations follow the sparse BLAS conventions, with transpose operations computed internally, so to get b=a.T you can just update the flag without sorting everything again; then as long as any operation applied to b is aware of the flag everything works.
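SciPy behaves this way at the storage level, which is a useful check here: `b = a.T` just reinterprets the CSR arrays as CSC with no sort and no re-compression, and subsequent operations dispatch on the format. A quick check:

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.random(4, 3, density=0.5, format='csr', random_state=0)
B = A.T  # no index shuffling: CSR arrays reused as CSC

print(B.format)  # 'csc'
x = np.ones(4)
# B behaves as the transpose in every operation
print(np.allclose(B @ x, A.toarray().T @ x))
```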
|
I dug deeper into SciPy, and it turns out that for CSR:
- `tocsc` will ensure the indices are sorted, but the outcome is not necessarily in canonical form (i.e., it can still have repeated indices);
- `transpose` ensures nothing.

This observation also holds vice versa for CSC. So, the change here in csr.py is not necessary, and csc.py will need to be fixed.
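The first point can be checked directly in SciPy: build a CSR matrix with a repeated index, convert it, and the duplicate survives (only densifying sums it).

```python
import numpy as np
import scipy.sparse

# CSR with a duplicate (0, 1) entry -- sorted indices, but not canonical
data = np.array([1.0, 2.0])
indices = np.array([1, 1])
indptr = np.array([0, 2, 2])
A = scipy.sparse.csr_matrix((data, indices, indptr), shape=(2, 2))

B = A.tocsc()
print(B.nnz)       # duplicate indices survive the conversion
print(B.toarray()) # duplicates are summed only when densifying
```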
|
so to get b=a.T you can just update the flag without sorting everything again; then as long as any operation applied to b is aware of the flag everything works
Right, that's what I expected, but I dunno where this flag is...😅
|
I guess shape might be used for this purpose...?
|
Hi @emcastillo @grlee77 @smarkesini @cjnolet This is ready for review. The performance improvement is the same as reported earlier. I left one open question, though, which doesn't seem to be relevant to this PR (not sure why). |
```diff
         _call_cusparse(
             'csr2csc', x.dtype,
             handle, n, m, nnz, x.data.data.ptr,
             x.indptr.data.ptr, x.indices.data.ptr,
             data.data.ptr, indices.data.ptr, indptr.data.ptr,
             cusparse.CUSPARSE_ACTION_NUMERIC,
             cusparse.CUSPARSE_INDEX_BASE_ZERO)
```
|
...and this works for csc2csr according to the documentation of cusparse<t>csr2csc():
Notice that this routine can also be used to convert a matrix in CSC format into a matrix in CSR format.
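A plain-NumPy sketch of the index shuffle csr2csc performs (an illustration, not the cuSPARSE kernel): feeding the same routine CSC arrays, with indptr read as column pointers, yields CSR, which is the symmetry the cuSPARSE note describes.

```python
import numpy as np

def csr2csc(m, n, data, indices, indptr):
    """Convert CSR arrays to CSC arrays (plain NumPy sketch)."""
    rows = np.repeat(np.arange(m), np.diff(indptr))  # expand row pointers to row indices
    order = np.lexsort((rows, indices))              # reorder entries column-major
    new_indptr = np.concatenate(
        ([0], np.cumsum(np.bincount(indices, minlength=n))))
    return data[order], rows[order], new_indptr

# [[1, 0, 2], [0, 3, 0]] in CSR form
data, rows, indptr = csr2csc(
    2, 3,
    np.array([1.0, 2.0, 3.0]), np.array([0, 2, 1]), np.array([0, 2, 3]))
print(data, rows, indptr)  # CSC arrays of the same matrix
```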
|
I think this can be treated as a compatibility bug and be tested. Will add tests later. |
|
Done. |
|
Jenkins, test this please |
|
Successfully created a job for commit 38abc5d: |
|
Jenkins CI test (for commit 38abc5d, target branch master) succeeded! |
|
Thanks all! 🙏 |
```diff
             cupyx.scipy.sparse.csc_matrix: Converted matrix.

         """
-        return self.T.tocsr().T
```
|
Leaving a note to my future self: This change is in line with SciPy that the conversion functions should be made symmetric (in this case tocsr vs tocsc) whenever a low-level library call is available to carry out the task. In this case, this change avoids two (unnecessary) constructor calls by not doing .T.
Main credit goes to @smarkesini (from his #2790).
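In SciPy terms, the change replaces the roundabout self.T.tocsr().T with a direct low-level conversion. Both paths produce the same matrix; the direct one just skips the two extra constructor calls, as this illustrative check shows:

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.random(5, 4, density=0.3, format='csr', random_state=1)

direct = A.tocsc()          # single low-level conversion
roundabout = A.T.tocsr().T  # transpose, convert, transpose back

print(roundabout.format)  # 'csc'
print(np.allclose(direct.toarray(), roundabout.toarray()))
```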
Close #2790. Close #2664.
I don't know why this simple but great change was buried unnoticed 😢 Explained by @smarkesini in #2365 (comment):
UPDATE: While performance improvement was the original concern in this PR, note that this is also a compatibility bug -- In SciPy it is guaranteed that a conversion from COO to CSR/CSC will always lead to a canonical format, because the duplicates are summed and the indices are sorted.
and in #2365 (comment). As for the performance boost: as mentioned in the discussion there, the speed-up is multiple orders of magnitude, which I verified. I'll double-check my benchmark script and post the result below.