Skip to content

Speed up CSR SpMV by orders of magnitude#3430

Merged
mergify[bot] merged 11 commits intocupy:masterfrom
leofang:coo
Jun 15, 2020
Merged

Speed up CSR SpMV by orders of magnitude#3430
mergify[bot] merged 11 commits intocupy:masterfrom
leofang:coo

Conversation

@leofang
Copy link
Copy Markdown
Member

@leofang leofang commented Jun 11, 2020

Main credit goes to @smarkesini (from his #2790).

Close #2790. Close #2664.

I don't know why this simple but great change was buried unnoticed 😢 Explained by @smarkesini in #2365 (comment):

I suspect this issue has something to do with '_has_canonical_format' not being set even after calling coosort. There are several places where to set this flag, and I am not sure if the following is the best place, anyway I was focusing on csr and modified the last 3 lines in cupyx/scipy/sparse/coo.py 'tocsr' function as described below and got better performance for my problem (SpMV with a csr matrix ), >100x speedup on the first iteration.

UPDATE: While performance improvement was the original concern in this PR, note that this is also a compatibility bug -- In SciPy it is guaranteed that a conversion from COO to CSR/CSC will always lead to a canonical format, because the duplicates are summed and the indices are sorted.

and in #2365 (comment):

following the definition in there coo2csr, this function converts the array containing the uncompressed row indices (corresponding to COO format) into an array of compressed row pointers (corresponding to CSR format). In the cuda definition: "Sparse matrices in CSR format are assumed to be stored in row-major CSR format, in other words, the index arrays are first sorted by row indices and then within the same row by column indices. It is assumed that each pair of row and column indices appears only once."

As for performance boost, as mentioned in the discussion there the speed-up is multiple orders of magnitude, which I verified. I'll double check my benchmark script and post the result below.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Jun 11, 2020

TODO: compare with CUB SpMV, as I suspect they should be on par.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Jun 11, 2020

Benchmarking: (look at the last line in the output)

script: (click me)
import numpy as np
import cupy as cp
import cupyx
from cupyx.time import repeat
import scipy.sparse


# switch CUB on and off here
cp.cuda.cub_enabled = True

# matrix dimensions
M = np.int64(1e6)
N = M
nnz = M * 100


Krow=np.random.randint(0, high=M, size=nnz//2, dtype='l')
Kcol=np.random.randint(0, high=M, size=nnz//2, dtype='l')
# make duplicates
Krow=np.concatenate((Krow, Krow))
Kcol=np.concatenate((Kcol, Kcol))

Kval=np.random.rand(nnz).astype('float32')

# use scipy
S=scipy.sparse.csr_matrix((Kval.ravel(),    (Krow.ravel(), Kcol.ravel())), shape=(M,N))
ST=scipy.sparse.csc_matrix((Kval.ravel(),   (Kcol.ravel(), Krow.ravel())), shape=(N,M))

# vector to multiply
x=np.random.rand(M).astype('float32')

#print("left multiply (CSR)")
Sx = S * x
time_scipy = repeat(S.__mul__, (x,), n_repeat=100).cpu_times.mean()

#print("right multiply (CSC)")
xS = x * ST
time_scipyrm = repeat(x.__mul__, (ST,), n_repeat=100).cpu_times.mean()

print("allclose CPU:", np.allclose(Sx, xS)) #, "time:", time_scipy, time_scipyrm)

# use cupy
Krow=cp.asarray(Krow)
Kcol=cp.asarray(Kcol)
Kval=cp.asarray(Kval)
x   =cp.asarray(x)

S=cupyx.scipy.sparse.coo_matrix((Kval.ravel(),(Krow.ravel(), Kcol.ravel())), shape=(M,N))
S=cupyx.scipy.sparse.csr_matrix(S)
ST=cupyx.scipy.sparse.coo_matrix((Kval.ravel(), (Kcol.ravel(), Krow.ravel())), shape=(N,M))
ST=cupyx.scipy.sparse.csc_matrix(ST)

#print("left multiply by csr")
cSx = S * x
time_cupy1 = repeat(S.__mul__, (x,), n_repeat=100).gpu_times.mean()

#print("right multiply by csc")
cxS = x * ST
time_cupy1rm = repeat(x.__mul__, (ST,), n_repeat=100).gpu_times.mean()

xS=cp.asarray(xS)
Sx=cp.asarray(Sx)

print("allclose GPU:", cp.allclose(cSx, cxS))
print("allclose GPU-CPU: (rm) ", cp.allclose(cSx, xS),'(lm)', cp.allclose(cSx, Sx))


print("time scipy: lm:", time_scipy, "rm:", time_scipyrm)
print("time cupy: lm:", time_cupy1, "rm:", time_cupy1rm)

Before this PR:

time cupy: lm: 0.0012003951942920685 rm: 1.6158148022460939    # CUB is on
time cupy: lm: 0.0012251673579216002 rm: 1.8342621813964843    # CUB is off

With this PR (>1300x speedup!):

time cupy: lm: 0.0011969430339336395 rm: 0.0012273990404605866    # CUB is on
time cupy: lm: 0.0012167888033390046 rm: 0.0012395257604122165    # CUB is off

@grlee77 @cjnolet You might be interested.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Jun 11, 2020

See the benchmark with this PR added against CUB here: #2698 (comment).

self._has_canonical_format = True
return
keys = cupy.stack([self.col, self.row])
keys = cupy.stack([self.row, self.col])
Copy link
Copy Markdown
Member

@grlee77 grlee77 Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about this change? The description you quoted states:
"the index arrays are first sorted by row indices and then within the same row by column indices"

Since lexsort uses the last row as the primary sort key, didn't the already existing code match that description?

In either case, it seems it can be verified empirically against SciPy. If all test cases pass either way it seems that would indicate a hole in our test coverage!

Copy link
Copy Markdown
Member Author

@leofang leofang Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need your help, Gregory. I've been looking into this for a few hours, and I am super confused now.

It seems the change at this line affects neither speed nor correctness, at least with my test script. (And FYI, the whole test suite tests/cupyx_tests/scipy_tests/sparse_tests/ passed either way!) The only thing that matters (for speed) is to set _has_canonical_format=True, which makes sense because when calling sum_duplicates() both CSC & CSR converts to COO (that's why changes in coo.py affects CSR), and this flag would provide a fast path. Does it mean keys, lexsort, etc here are simply dead code? I am lost...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at it closely, but was just wandering what motivated the change. Ca n try to take a look tomorrow

Copy link
Copy Markdown
Member Author

@leofang leofang Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now incline that this was simply an overlook of previous PRs and that we don't need to change this line. After coo.sum_duplicates(), _has_canonical_format should always be True (in line with SciPy). The only concern is that cusparse.coosort() sorts rows by default, and for tocsc() it doesn't sound right, so I will address both issues.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The csc format is stored in column-major format, whereas csr is in row-major,
so I think a CSC matrix is stored internally as a transpose of a CSR matrix.

Copy link
Copy Markdown
Member Author

@leofang leofang Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CSR/CSC matrix internally stores 3 arrays: data of size nnz, indices of size nnz, and indptr of size m+1/n+1. This is the same in both cuSPASE and SciPy. A transpose operation would need to update indices and indptr.

Copy link
Copy Markdown
Member Author

@leofang leofang Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisiting this line, interestingly I found SciPy actually does what @smarkesini did (sort by col first): https://github.com/scipy/scipy/blob/adc4f4f7bab120ccfab9383aba272954a0a12fb0/scipy/sparse/coo.py#L543. I wonder why no test suite has caught this deviation so far (and how to write one)...

@grlee77
Copy link
Copy Markdown
Member

grlee77 commented Jun 11, 2020

Wow, not sure how this got overlooked. That is an amazing improvement for very little change!

@kmaehashi kmaehashi added cat:performance Performance in terms of speed or memory consumption takeover Pull-requests taken over from other contributor labels Jun 12, 2020
@kmaehashi kmaehashi added this to the v8.0.0b4 milestone Jun 12, 2020
@emcastillo
Copy link
Copy Markdown
Member

I don't know why I didn't look this in depth when the issue was created.
Sorry about it @smarkesini and thanks @leofang for bringing this back!

Comment on lines +847 to +849
cusparse.xcoo2csr(
handle, x.col.data.ptr, x.nnz, n,
indptr.data.ptr, cusparse.CUSPARSE_INDEX_BASE_ZERO)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for coo2csc according to the documentation of cusparse<t>coo2csr():

It can also be used to convert the array containing the uncompressed column indices (corresponding to COO format) into an array of column pointers (corresponding to CSC format).

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Jun 12, 2020

I made the conversion from COO to CSC/CSR symmetric. This requires coosort to sort by row for CSR and by column for CSC, respectively.

Comment on lines 297 to +301
shape = self.shape[1], self.shape[0]
return csc.csc_matrix(
trans = csc.csc_matrix(
(self.data, self.indices, self.indptr), shape=shape, copy=copy)
trans._has_canonical_format = self._has_canonical_format
return trans
Copy link
Copy Markdown
Member Author

@leofang leofang Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still concerned with this transpose operation (and also for its csc counterpart). If I understand it correctly, self.indices and self.indptr should be updated during transpose. Am I missing something?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the transpose operation as used in e.g. csrmv in cupy/cusparse.py ...cusparse.spMV(), you compute the multiplication directly with the transposed matrix using a _transpose_flag - without actually computing the transpose.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems SciPy does this too. I don't understand...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We left comments simultaneously, @smarkesini! It's good to know _transpose_flag -- I wasn't aware of that. But the transpose method here is for high-level operations, say, b=a.T.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the sparse matrix operations follow the sparse BLAS conventions, with transpose operations computed internally, so to get b=a.T you can just update the flag without sorting everything again; then as long as any operation applied to b is aware of the flag everything works.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug deeper into SciPy, and it turns out that for CSR:

  1. tocsc will ensure the indices to be sorted, but the outcome is not necessary in the canonical form (i.e., can have repeated indices)
  2. transpose will ensure nothing.

This observation also holds vice versa for CSC. So, the change here in csr.py is not necessary, and csc.py will need to be fixed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so to get b=a.T you can just update the flag without sorting everything again; then as long as any operation applied to b is aware of the flag everything works

Right, that's what I expected, but I dunno where this flag is...😅

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess shape might be used for this purpose...?

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Jun 12, 2020

Hi @emcastillo @grlee77 @smarkesini @cjnolet This is ready for review. The performance improvement is same as reported earlier. I left one question, though, which doesn't seem to be relevant to this PR (not sure why).

Comment on lines +907 to +913
_call_cusparse(
'csr2csc', x.dtype,
handle, n, m, nnz, x.data.data.ptr,
x.indptr.data.ptr, x.indices.data.ptr,
data.data.ptr, indices.data.ptr, indptr.data.ptr,
cusparse.CUSPARSE_ACTION_NUMERIC,
cusparse.CUSPARSE_INDEX_BASE_ZERO)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and this works for csc2csr according to the documentation of cusparse<t>csr2csc():

Notice that this routine can also be used to convert a matrix in CSC format into a matrix in CSR format.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Jun 13, 2020

I think this can be treated as a compatibility bug and get tested. Will add tests later.

@leofang
Copy link
Copy Markdown
Member Author

leofang commented Jun 13, 2020

Done.

@emcastillo
Copy link
Copy Markdown
Member

Jenkins, test this please

@pfn-ci-bot
Copy link
Copy Markdown
Collaborator

Successfully created a job for commit 38abc5d:

@emcastillo emcastillo added the st:test-and-merge (deprecated) Ready to merge after test pass. label Jun 15, 2020
Copy link
Copy Markdown
Member

@emcastillo emcastillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chainer-ci
Copy link
Copy Markdown
Member

Jenkins CI test (for commit 38abc5d, target branch master) succeeded!

@mergify mergify bot merged commit a105263 into cupy:master Jun 15, 2020
@leofang leofang deleted the coo branch June 15, 2020 04:12
@leofang
Copy link
Copy Markdown
Member Author

leofang commented Jun 15, 2020

Thanks all! 🙏

cupyx.scipy.sparse.csc_matrix: Converted matrix.

"""
return self.T.tocsr().T
Copy link
Copy Markdown
Member Author

@leofang leofang Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a note to my future self: This change is in line with SciPy that the conversion functions should be made symmetric (in this case tocsr vs tocsc) whenever a low-level library call is available to carry out the task. In this case, this change avoids two (unnecessary) constructor calls by not doing .T.

@leofang leofang mentioned this pull request Jun 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cat:performance Performance in terms of speed or memory consumption st:test-and-merge (deprecated) Ready to merge after test pass. takeover Pull-requests taken over from other contributor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

slow SpMV Sparse matrix multiply orders of magnitude slower than PyTorch

7 participants