Speed up CSR SpMV by orders of magnitude #3430
Conversation
|
TODO: compare with CUB SpMV, as I suspect they should be on par. |
|
Benchmarking (look at the last line in the output). Script:

```python
import numpy as np
import cupy as cp
import cupyx
from cupyx.time import repeat
import scipy.sparse

# switch CUB on and off here
cp.cuda.cub_enabled = True

# matrix dimensions
M = np.int64(1e6)
N = M
nnz = M * 100
Krow = np.random.randint(0, high=M, size=nnz//2, dtype='l')
Kcol = np.random.randint(0, high=M, size=nnz//2, dtype='l')
# make duplicates
Krow = np.concatenate((Krow, Krow))
Kcol = np.concatenate((Kcol, Kcol))
Kval = np.random.rand(nnz).astype('float32')

# use scipy
S = scipy.sparse.csr_matrix((Kval.ravel(), (Krow.ravel(), Kcol.ravel())), shape=(M, N))
ST = scipy.sparse.csc_matrix((Kval.ravel(), (Kcol.ravel(), Krow.ravel())), shape=(N, M))
# vector to multiply
x = np.random.rand(M).astype('float32')

#print("left multiply (CSR)")
Sx = S * x
time_scipy = repeat(S.__mul__, (x,), n_repeat=100).cpu_times.mean()
#print("right multiply (CSC)")
xS = x * ST
time_scipyrm = repeat(x.__mul__, (ST,), n_repeat=100).cpu_times.mean()
print("allclose CPU:", np.allclose(Sx, xS))

# use cupy
Krow = cp.asarray(Krow)
Kcol = cp.asarray(Kcol)
Kval = cp.asarray(Kval)
x = cp.asarray(x)
S = cupyx.scipy.sparse.coo_matrix((Kval.ravel(), (Krow.ravel(), Kcol.ravel())), shape=(M, N))
S = cupyx.scipy.sparse.csr_matrix(S)
ST = cupyx.scipy.sparse.coo_matrix((Kval.ravel(), (Kcol.ravel(), Krow.ravel())), shape=(N, M))
ST = cupyx.scipy.sparse.csc_matrix(ST)

#print("left multiply by csr")
cSx = S * x
time_cupy1 = repeat(S.__mul__, (x,), n_repeat=100).gpu_times.mean()
#print("right multiply by csc")
cxS = x * ST
time_cupy1rm = repeat(x.__mul__, (ST,), n_repeat=100).gpu_times.mean()

xS = cp.asarray(xS)
Sx = cp.asarray(Sx)
print("allclose GPU:", cp.allclose(cSx, cxS))
print("allclose GPU-CPU: (rm)", cp.allclose(cSx, xS), "(lm)", cp.allclose(cSx, Sx))
print("time scipy: lm:", time_scipy, "rm:", time_scipyrm)
print("time cupy: lm:", time_cupy1, "rm:", time_cupy1rm)
```

Before this PR:

```
time cupy: lm: 0.0012003951942920685 rm: 1.6158148022460939 # CUB is on
time cupy: lm: 0.0012251673579216002 rm: 1.8342621813964843 # CUB is off
```

With this PR (>1300x speedup!):

```
time cupy: lm: 0.0011969430339336395 rm: 0.0012273990404605866 # CUB is on
time cupy: lm: 0.0012167888033390046 rm: 0.0012395257604122165 # CUB is off
```
|
|
See the benchmark with this PR added against CUB here: #2698 (comment). |
cupyx/scipy/sparse/coo.py
```diff
         self._has_canonical_format = True
         return
-    keys = cupy.stack([self.col, self.row])
+    keys = cupy.stack([self.row, self.col])
```
|
Are you sure about this change? The description you quoted states:
"the index arrays are first sorted by row indices and then within the same row by column indices"
Since lexsort uses the last row as the primary sort key, didn't the already existing code match that description?
In either case, it seems it can be verified empirically against SciPy. If all test cases pass either way it seems that would indicate a hole in our test coverage!
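For reference, a quick NumPy check of the sort-key convention under discussion (`cupy.lexsort` follows NumPy's): the last key passed to `lexsort` is the primary one, so stacking `[col, row]` gives row-major order.

```python
import numpy as np

# lexsort treats the LAST key as the primary sort key
row = np.array([1, 0, 1, 0])
col = np.array([1, 1, 0, 0])
order = np.lexsort((col, row))  # sort by row first, then by col within a row
# entries visited in (row, col) order: (0,0), (0,1), (1,0), (1,1)
print(order)  # -> [3 1 2 0]
```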
|
I need your help, Gregory. I've been looking into this for a few hours, and I am super confused now.
It seems the change at this line affects neither speed nor correctness, at least with my test script. (And FYI, the whole test suite tests/cupyx_tests/scipy_tests/sparse_tests/ passed either way!) The only thing that matters (for speed) is to set _has_canonical_format=True, which makes sense because when calling sum_duplicates() both CSC & CSR convert to COO (that's why changes in coo.py affect CSR), and this flag provides a fast path. Does that mean keys, lexsort, etc. here are simply dead code? I am lost...
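For comparison, this is the SciPy behavior the flag mirrors (a small sketch using SciPy's attribute names): `sum_duplicates()` on a COO matrix sums repeated entries, sorts the indices, and sets the canonical-format flag, so later conversions can skip the work.

```python
import numpy as np
import scipy.sparse

row = np.array([0, 0, 1])
col = np.array([0, 0, 1])
data = np.array([1.0, 2.0, 3.0])
A = scipy.sparse.coo_matrix((data, (row, col)), shape=(2, 2))

A.sum_duplicates()  # merges the two (0, 0) entries and sorts the indices
print(A.has_canonical_format)  # SciPy sets the flag here
print(A.toarray())
```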
|
I haven't looked at it closely, but was just wondering what motivated the change. Can try to take a look tomorrow.
|
I'm now inclined to think this was simply an oversight in previous PRs and that we don't need to change this line. After coo.sum_duplicates(), _has_canonical_format should always be True (in line with SciPy). The only concern is that cusparse.coosort() sorts by row by default, which doesn't sound right for tocsc(), so I will address both issues.
|
The csc format is stored in column-major format, whereas csr is in row-major,
so I think a CSC matrix is stored internally as a transpose of a CSR matrix.
|
A CSR/CSC matrix internally stores 3 arrays: data of size nnz, indices of size nnz, and indptr of size m+1/n+1. This is the same in both cuSPARSE and SciPy. A transpose operation would need to update indices and indptr.
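A small SciPy illustration of the three arrays, and of the observation that CSC storage of the transpose is bit-for-bit the CSR storage of the original (which is why `.T` can simply reinterpret the arrays):

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 3, 0]]))
print(A.data)     # nnz values
print(A.indices)  # column index of each value
print(A.indptr)   # row pointers, size m+1

# transposing reinterprets the same three arrays as CSC storage
AT = A.T
print(AT.format)  # 'csc'
print(AT.toarray())
```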
|
Revisiting this line, interestingly I found SciPy actually does what @smarkesini did (sort by col first): https://github.com/scipy/scipy/blob/adc4f4f7bab120ccfab9383aba272954a0a12fb0/scipy/sparse/coo.py#L543. I wonder why no test suite has caught this deviation so far (and how to write one)...
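One plausible reason no test catches the difference (my speculation, sketched below): duplicates share the exact same (row, col) pair, so they end up adjacent under either sort order, and summing duplicates then yields the same matrix either way. A test on the resulting values alone cannot distinguish the two orders.

```python
import numpy as np

row = np.array([1, 0, 1, 0, 1])
col = np.array([0, 1, 1, 0, 0])  # entries 0 and 4 are duplicates at (1, 0)

by_row_major = np.lexsort((col, row))  # primary key: row
by_col_major = np.lexsort((row, col))  # primary key: col
# the orders differ, but the two (1, 0) duplicates are adjacent in both,
# which is all that summing duplicates needs for correctness
print(by_row_major)
print(by_col_major)
```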
|
Wow, not sure how this got overlooked. That is an amazing improvement for very little change! |
|
I don't know why I didn't look at this in depth when the issue was created. |
```diff
         cusparse.xcoo2csr(
             handle, x.col.data.ptr, x.nnz, n,
             indptr.data.ptr, cusparse.CUSPARSE_INDEX_BASE_ZERO)
```
|
This works for coo2csc according to the documentation of cusparse<t>coo2csr():
It can also be used to convert the array containing the uncompressed column indices (corresponding to COO format) into an array of column pointers (corresponding to CSC format).
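What this compression step amounts to, sketched in plain NumPy (an illustration, not the cuSPARSE kernel): turn a sorted index array into a pointer array of size n+1. Run it on sorted row indices and you get CSR row pointers; run it on sorted column indices and you get CSC column pointers, which is why the one routine serves both conversions.

```python
import numpy as np

def compress_indices(idx, n):
    """Turn sorted COO indices (row OR column) into a pointer array of size n+1."""
    return np.concatenate(([0], np.cumsum(np.bincount(idx, minlength=n))))

col = np.array([0, 0, 1, 2, 2, 2])  # sorted COO column indices
print(compress_indices(col, 3))  # -> [0 2 3 6], a valid CSC indptr
```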
|
I made the conversion from COO to CSC/CSR symmetric. This requires |
```diff
         shape = self.shape[1], self.shape[0]
-        return csc.csc_matrix(
+        trans = csc.csc_matrix(
             (self.data, self.indices, self.indptr), shape=shape, copy=copy)
+        trans._has_canonical_format = self._has_canonical_format
+        return trans
```
|
I am still concerned with this transpose operation (and also for its csc counterpart). If I understand it correctly, self.indices and self.indptr should be updated during transpose. Am I missing something?
|
I think for the transpose operation as used in e.g. csrmv in cupy/cusparse.py ... cusparse.spMV(), you compute the multiplication directly with the transposed matrix using a _transpose_flag, without actually computing the transpose.
|
It seems SciPy does this too. I don't understand...
|
We left comments simultaneously, @smarkesini! It's good to know _transpose_flag -- I wasn't aware of that. But the transpose method here is for high-level operations, say, b=a.T.
|
Generally the sparse matrix operations follow the sparse BLAS conventions, with transpose operations computed internally, so to get b=a.T you can just update the flag without sorting everything again; then as long as any operation applied to b is aware of the flag everything works.
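SciPy behaves this way at the storage level, which is a useful check here: `b = a.T` just reinterprets the CSR arrays as CSC with no sort and no re-compression, and subsequent operations dispatch on the format. A quick check:

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.random(4, 3, density=0.5, format='csr', random_state=0)
B = A.T  # no index shuffling: CSR arrays reused as CSC

print(B.format)  # 'csc'
x = np.ones(4)
# B behaves as the transpose in every operation
print(np.allclose(B @ x, A.toarray().T @ x))
```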
|
I dug deeper into SciPy, and it turns out that for CSR:
- `tocsc` will ensure the indices are sorted, but the outcome is not necessarily in canonical form (i.e., it can still have repeated indices);
- `transpose` ensures nothing.

This observation also holds vice versa for CSC. So, the change here in csr.py is not necessary, and csc.py will need to be fixed.
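The first point can be checked directly in SciPy: build a CSR matrix with a repeated index, convert it, and the duplicate survives (only densifying sums it).

```python
import numpy as np
import scipy.sparse

# CSR with a duplicate (0, 1) entry -- sorted indices, but not canonical
data = np.array([1.0, 2.0])
indices = np.array([1, 1])
indptr = np.array([0, 2, 2])
A = scipy.sparse.csr_matrix((data, indices, indptr), shape=(2, 2))

B = A.tocsc()
print(B.nnz)       # duplicate indices survive the conversion
print(B.toarray()) # duplicates are summed only when densifying
```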
|
so to get b=a.T you can just update the flag without sorting everything again; then as long as any operation applied to b is aware of the flag everything works
Right, that's what I expected, but I dunno where this flag is...😅
|
I guess shape might be used for this purpose...?
|
Hi @emcastillo @grlee77 @smarkesini @cjnolet This is ready for review. The performance improvement is the same as reported earlier. I left one open question, though, which doesn't seem to be relevant to this PR (not sure why). |
```diff
         _call_cusparse(
             'csr2csc', x.dtype,
             handle, n, m, nnz, x.data.data.ptr,
             x.indptr.data.ptr, x.indices.data.ptr,
             data.data.ptr, indices.data.ptr, indptr.data.ptr,
             cusparse.CUSPARSE_ACTION_NUMERIC,
             cusparse.CUSPARSE_INDEX_BASE_ZERO)
```
|
...and this works for csc2csr according to the documentation of cusparse<t>csr2csc():
Notice that this routine can also be used to convert a matrix in CSC format into a matrix in CSR format.
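A plain-NumPy sketch of the index shuffle csr2csc performs (an illustration, not the cuSPARSE kernel): feeding the same routine CSC arrays, with indptr read as column pointers, yields CSR, which is the symmetry the cuSPARSE note describes.

```python
import numpy as np

def csr2csc(m, n, data, indices, indptr):
    """Convert CSR arrays to CSC arrays (plain NumPy sketch)."""
    rows = np.repeat(np.arange(m), np.diff(indptr))  # expand row pointers to row indices
    order = np.lexsort((rows, indices))              # reorder entries column-major
    new_indptr = np.concatenate(
        ([0], np.cumsum(np.bincount(indices, minlength=n))))
    return data[order], rows[order], new_indptr

# [[1, 0, 2], [0, 3, 0]] in CSR form
data, rows, indptr = csr2csc(
    2, 3,
    np.array([1.0, 2.0, 3.0]), np.array([0, 2, 1]), np.array([0, 2, 3]))
print(data, rows, indptr)  # CSC arrays of the same matrix
```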
|
I think this can be treated as a compatibility bug and be tested. Will add tests later. |
|
Done. |
|
Jenkins, test this please |
|
Successfully created a job for commit 38abc5d: |
|
Jenkins CI test (for commit 38abc5d, target branch master) succeeded! |
|
Thanks all! 🙏 |
```diff
             cupyx.scipy.sparse.csc_matrix: Converted matrix.

         """
-        return self.T.tocsr().T
```
|
Leaving a note to my future self: This change is in line with SciPy that the conversion functions should be made symmetric (in this case tocsr vs tocsc) whenever a low-level library call is available to carry out the task. In this case, this change avoids two (unnecessary) constructor calls by not doing .T.
Main credit goes to @smarkesini (from his #2790).
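In SciPy terms, the change replaces the roundabout self.T.tocsr().T with a direct low-level conversion. Both paths produce the same matrix; the direct one just skips the two extra constructor calls, as this illustrative check shows:

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.random(5, 4, density=0.3, format='csr', random_state=1)

direct = A.tocsc()          # single low-level conversion
roundabout = A.T.tocsr().T  # transpose, convert, transpose back

print(roundabout.format)  # 'csc'
print(np.allclose(direct.toarray(), roundabout.toarray()))
```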
Close #2790. Close #2664.
I don't know why this simple but great change was buried unnoticed 😢 Explained by @smarkesini in #2365 (comment):
UPDATE: While performance improvement was the original concern in this PR, note that this is also a compatibility bug -- In SciPy it is guaranteed that a conversion from COO to CSR/CSC will always lead to a canonical format, because the duplicates are summed and the indices are sorted.
and in #2365 (comment). As for the performance boost: as mentioned in the discussion there, the speed-up is multiple orders of magnitude, which I verified. I'll double-check my benchmark script and post the result below.