improve support for np.matrix / scipy.sparse.spmatrix Array-blocks #7468

Closed · ryan-williams wants to merge 9 commits into dask:main
Conversation
> **Member:** Hi @ryan-williams - since numpy no longer recommends using …
- Closes #xxxx (there's probably a few relevant issues, and it might be worth filing one or two just describing the bugs being fixed…)
- `black dask` / `flake8 dask`

Usually Arrays' blocks materialize as `ndarray`s, but in some cases they'll be `np.matrix` or `scipy.sparse.spmatrix` instances. Some operations work as expected in these cases today, but this PR improves a few that don't (`.A` and `.sum`; concatenation/stacking of such blocks is also improved).

## Array.A

`.A` generally coerces an array-like object to an actual `ndarray`. In the case of a Dask Array, `_meta` and the blocks' types should become `ndarray` when `.A` is applied.

The current behavior returns the Array unchanged. This is fine for `ndarray`-blocks, but incorrect for `matrix`/`spmatrix`-blocks, where `_meta` (and the blocks themselves) remain `matrix`/`spmatrix` (instead of `ndarray`).

The new behavior is that `_meta` and the blocks become `ndarray`s.

## Array.sum

`Array.sum` crashes in the presence of `matrix`/`spmatrix` blocks today, with an error about `keepdims`. This is because `matrix`/`spmatrix` don't conform to `ndarray`'s `sum()` signature: they lack a `keepdims` param. Even when a `sum` call doesn't pass `keepdims` directly, Dask still uses it under the hood while reducing a (potentially high-dimensional) Array over a few dimensions: blocks are all summed with `keepdims=True` (to preserve their dimensionality), those block-sums are combined with `keepdims=True`, and the appropriate dimensions are collapsed in a final concatenation post-compute pass (I think this is how it works, but a few details may be off).

## An anndata use-case for `sum`ing `spmatrix`-block Arrays

It would be useful to perform `sum`s over sparse blocks in particular. As an example, Anndata frequently builds its `X` array from huge CSR or CSC matrices in HDF5 or Zarr files, and using `spmatrix` blocks is important when Dask-ifying that loading and the downstream operations (e.g. summing across one axis or the other).

## Adding `keepdims` to `matrix`/`spmatrix`

Many months ago I tried to fix this by adding the `keepdims` param to `matrix` and `spmatrix`. The existing behavior of those classes makes it tricky to do this without breaking back-compat, but celsiustx/numpy and celsiustx/scipy have implementations of it that I think may have a shot at being upstreamed; some day I'll give that a go, but waiting on that (and assuming those changes are accepted) to make this not crash in Dask is not ideal. Using those numpy+scipy forks in Dask (and Anndata) in the meantime also required an unwieldy Docker build + dev environment.
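To make the signature mismatch concrete, here's a minimal illustration in plain numpy/scipy (independent of Dask and of this PR's changes):

```python
import numpy as np
from scipy import sparse

a = np.arange(6).reshape(2, 3)

# ndarray.sum supports keepdims, preserving dimensionality:
assert a.sum(axis=0, keepdims=True).shape == (1, 3)

# np.matrix.sum and spmatrix.sum take no keepdims param, so the
# keepdims=True calls that Dask makes per-block raise a TypeError:
for x in (np.matrix(a), sparse.csr_matrix(a)):
    try:
        x.sum(axis=0, keepdims=True)
    except TypeError as e:
        print(type(x).__name__, "-", e)
```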
## Work-around in Dask (this PR)

After that attempt to solve the issue upstream, I got more comfortable with the existing block-type special-casing in Dask, and decided a work-around hack here is reasonable: a couple pieces of `compute` machinery check for `matrix`/`spmatrix` `_meta` and patch some of the reduction flows accordingly. It's not ideal, but it seems like the best option to me atm.

## Preserving `spmatrix` subtypes

When Dask `compute`s a CSR or CSC, it ends up concat'ing blocks along axis 1 and then along axis 0. `scipy.sparse`'s CSR and CSC each coerce to COO when stacked along their more-awkward axis (1 for CSR, 0 for CSC). The result is that you'd always get a COO back after a `compute` on an Array that quacked like a CSR/CSC.

I decided it's worth the hit to do an in-memory `tocsr()`/`tocsc()` in Dask to avoid that surprise.

That does create a different (but less bad, I think?) inconsistency with `scipy.sparse`: e.g. `da.hstack` can actually return a CSR from a bunch of hstacked CSRs (where `scipy.sparse` would coerce to COO). I think it's better to err on the side of returning a more "precise" `spmatrix` type than to lose that info where the non-Dask version would retain it, but I'm definitely open to discussion there.
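The coercion described above is easy to see with `scipy.sparse` alone (a sketch, not Dask code; exact formats may vary slightly across scipy versions):

```python
import numpy as np
from scipy import sparse

csr = sparse.csr_matrix(np.eye(3))

# Stacking CSRs along axis 0 has a fast path that preserves CSR:
print(sparse.vstack([csr, csr]).format)  # csr

# Stacking CSRs along axis 1 (their "awkward" axis) typically
# falls back to COO:
h = sparse.hstack([csr, csr])
print(h.format)

# An explicit tocsr() recovers the compressed format, at the cost of
# an in-memory conversion (the trade-off this PR opts into):
print(h.tocsr().format)  # csr
```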
## pydata/sparse vs. scipy.sparse

I'm also aware that pydata/sparse may be the recommended alternative to `scipy.sparse`. It's not been clear to me yet that just dropping the latter and moving to the former would solve these issues, and in any case, if there's enough `matrix`/`spmatrix` usage still floating around, this work-around may be worth having anyway; I'm definitely interested in folks' thoughts there. In particular, I'm watching pydata/sparse#443 (thanks @ivirshup for bringing it to my attention, and cc @GenevieveBuckley as well for visibility here).
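For reference, the blockwise reduction flow described under Array.sum above can be sketched in plain numpy. This is a simplified, hypothetical model of what Dask does (the real task graph differs), but it shows why a per-block `keepdims` is load-bearing:

```python
import numpy as np

# Two row-blocks of a (4, 3) array, as Dask might chunk it.
blocks = [np.arange(6).reshape(2, 3), np.arange(6, 12).reshape(2, 3)]

# Per-block partial sums keep their dimensionality via keepdims=True...
partials = [b.sum(axis=0, keepdims=True) for b in blocks]  # each (1, 3)

# ...so they can be concatenated back along the reduced axis...
combined = np.concatenate(partials, axis=0)  # shape (2, 3)

# ...and collapsed in a final pass. Blocks whose type lacks keepdims
# support (matrix/spmatrix) break at the first step.
print(combined.sum(axis=0))  # [18 22 26]
```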