Skip to content

Sparse array reductions#9342

Merged
jrbourbeau merged 11 commits intodask:mainfrom
ian-r-rose:sparse-array-reductions
Aug 12, 2022
Merged

Sparse array reductions#9342
jrbourbeau merged 11 commits intodask:mainfrom
ian-r-rose:sparse-array-reductions

Conversation

@ian-r-rose
Copy link
Collaborator

This avoids creating dense auxiliary sparse.COO arrays in reductions, which will in general not have the same fill_value as actually sparse arrays, preventing basic arithmetic between them.

This works, but I'm a bit worried that it will cause problems for other array implementations like cupy, and it may be better to go through the dispatch mechanism.

@github-actions github-actions bot added the array label Aug 2, 2022
@ian-r-rose
Copy link
Collaborator Author

Okay, as I feared, this does indeed make cupy unhappy -- I'll work on pushing some of this into the dispatch system

@github-actions github-actions bot added the dispatch Related to `Dispatch` extension objects label Aug 2, 2022
Copy link
Collaborator Author

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think this is close-to-ready for some eyes.

I've added a new set of dispatches for numel and nannumel, since we want different behaviors for different backends, and doing tricky hasattr checks was feeling unsustainable. I haven't touched cupyx or scipy.sparse yet, they are not really covered by the existing test suite. But the GPU CI bot seems reasonably happy with this.

I'd especially appreciate some thoughts from anyone on the @dask/gpu team as to whether this is likely to cause any problems for cupy.

)
from dask.array.creation import arange, diagonal

# Keep empty_lookup here for backwards compatibility
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty_lookup didn't seem to be used anywhere, and that has been the case for over a year.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cleaning up @ian-r-rose. Seems reasonable to remove.

I didn't see any reference to empty_lookup over in the cudf repo but cc @pentschev @jakirkham just in case for visibility

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am not aware of any usages of empty_lookup. Searching RAPIDS doesn't turn anything up either

)


@pytest.mark.xfail(reason="upstream change", strict=False)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests were xfailed over four years ago -- it doesn't seem that sparse.COO is going to support mixed concatenation with numpy ndarrays, so I don't see the point in keeping these around

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair point. If this isn't going to be supported I agree we can remove these.

@hameerabbasi, just checking, do you think sparse.COO will support mixed concatenation with NumPy arrays?

return _numel(x, coerce_ndarray=False, **kwargs)


def _numel(x, coerce_ndarray: bool, **kwargs):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly the same as the existing numel function, but has an additional argument in coerce_ndarray, which, if true, forces the result to be an ndarray. If false, it lets np.full_like determine the output type, which will usually be the same as what we give it. We want coerce_ndarray for sparse arrays, and we don't want it for masked arrays, cupy, etc.

This is all private, so I think it's fairly safe (though I suppose I could obfuscate the kwarg name a bit more)

@ian-r-rose ian-r-rose marked this pull request as ready for review August 2, 2022 23:16
@pentschev
Copy link
Member

Thanks for the ping @ian-r-rose . As you noted yourself, unfortunately CuPy Sparse is largely uncovered at the moment, I may be mistaken but I think it's usage is fairly limited today, which is also why it didn't get much attention lately.

In briefly trying to increase coverage, I found that Dask arrays backed by cupyx.scipy.sparse.coo_matrix don't respect the chunktype, even if we pass meta=cupyx.scipy.sparse.coo_matrix((0,0)), and CSR matrices don't seem to respect chunktypes for matrices larger than 2 dimensions. I don't have enough bandwidth now to look further, @jakirkham is this something you would be interested/have bandwidth to look at? If not, then I'd suggest this PR may go in even without further CuPy testing.

@jrbourbeau jrbourbeau self-requested a review August 4, 2022 16:12
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ian-r-rose! Overall this looks great.

@jakirkham do you have any thoughts on Peter's comment here #9342 (comment)?

It looks like this also closes #8280?

)


@pytest.mark.xfail(reason="upstream change", strict=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair point. If this isn't going to be supported I agree we can remove these.

@hameerabbasi, just checking, do you think sparse.COO will support mixed concatenation with NumPy arrays?

)
from dask.array.creation import arange, diagonal

# Keep empty_lookup here for backwards compatibility
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cleaning up @ian-r-rose. Seems reasonable to remove.

I didn't see any reference to empty_lookup over in the cudf repo but cc @pentschev @jakirkham just in case for visibility

@ian-r-rose
Copy link
Collaborator Author

It looks like this also closes #8280?

I haven't really tackled that issue here -- it seems more involved and may require actual work on the algorithm (based on the WIP commit linked by the user)

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ian-r-rose, I'll plan to merge this tomorrow if there are no further comments on #9342 (comment)

it seems more involved and may require actual work on the algorithm

I was just going off the reproducer in the OP passing with the changes in this PR

@ian-r-rose
Copy link
Collaborator Author

Just added one more commit adding some additional test coverage, in case you have a few minutes @jrbourbeau

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ian-r-rose!

@jrbourbeau jrbourbeau merged commit 8b95f98 into dask:main Aug 12, 2022
@jakirkham
Copy link
Member

Sorry for the lack of reply here. Responded in one thread above where I was pinged.

IIUC the other question is whether we want to support sparse CuPy matrices with numel? As CuPy sparse matrices are fairly similar to their SciPy sparse matrix counterparts, would ask whether SciPy sparse matrices are supported? If not, then wouldn't worry about CuPy. If someone asks about numel for sparse matrices, we can worry about it then :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

array dispatch Related to `Dispatch` extension objects

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sparse .var() getting wrong fill values

4 participants