
improve support for np.matrix / scipy.sparse.spmatrix Array-blocks #7468

Closed
ryan-williams wants to merge 9 commits into dask:main from celsiustx:sum2

Conversation

@ryan-williams
Contributor

  • Closes #xxxx (there's probably a few relevant issues, and it might be worth filing one or two just describing the bugs being fixed…)
  • Tests added / passed
  • Passes black dask / flake8 dask

Usually an Array's blocks materialize as ndarrays, but in some cases they are np.matrix or scipy.sparse.spmatrix instances. Some operations already work as expected with such blocks, but this PR improves a few that don't (.A and .sum; concatenation/stacking of such blocks is also improved).

Array.A

.A generally coerces an array-like object to be an actual ndarray. In the case of a Dask Array, _meta and the blocks' types should become ndarray when .A is applied.

The current behavior returns the Array, unchanged. This is fine for ndarray-blocks, but incorrect for matrix/spmatrix-blocks, where _meta (and the blocks themselves) remain as matrix/spmatrix (instead of ndarray).

The new behavior is that _meta and the blocks become ndarrays.
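For context, here's what .A does on these block types outside Dask (a minimal sketch using numpy and scipy directly): in both cases it densifies/coerces to a plain ndarray, which is the behavior this PR mirrors for _meta and the blocks.

```python
import numpy as np
import scipy.sparse as sp

m = np.matrix([[1, 2], [3, 4]])
s = sp.csr_matrix(m)

# On np.matrix, .A is an alias for np.asarray(self): a plain ndarray.
assert type(m.A) is np.ndarray

# On spmatrix, .A is an alias for .toarray(): a dense ndarray.
assert type(s.toarray()) is np.ndarray
assert (m.A == s.toarray()).all()
```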

Array.sum

Array.sum crashes in the presence of matrix/spmatrix blocks today:

da \
.ones((5, 5), chunks=(2, 2)) \
.map_blocks(np.matrix) \
.sum(axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dask/array/core.py", line 2184, in sum
    return sum(
  File "dask/array/reductions.py", line 347, in sum
    result = reduction(
  File "dask/array/reductions.py", line 166, in reduction
    tmp = blockwise(
  File "dask/array/blockwise.py", line 279, in blockwise
    meta = compute_meta(func, dtype, *args[::2], **kwargs)
  File "dask/array/utils.py", line 145, in compute_meta
    meta = func(*args_meta, **kwargs_meta)
  File "<__array_function__ internals>", line 5, in sum
  File "/Users/ryan/.pyenv/versions/dask-3.8.1/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2247, in sum
    return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
  File "/Users/ryan/.pyenv/versions/dask-3.8.1/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 83, in _wrapreduction
    return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
TypeError: sum() got an unexpected keyword argument 'keepdims'

keepdims

This is because matrix/spmatrix don't conform to ndarray's sum() signature: they lack a keepdims parameter. Even though this example doesn't pass keepdims directly, Dask uses it under the hood while reducing a (potentially high-dimensional) Array over some of its dimensions: each block is summed with keepdims=True (to preserve its dimensionality), those block-sums are combined (again with keepdims=True), and the appropriate dimensions are collapsed as part of a final concatenation/post-compute pass (I think this is how it works, but a few details may be off).
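The signature mismatch is easy to reproduce outside Dask (a minimal sketch; np.sum forwards keepdims to the subclass's own sum method, which is exactly what the traceback above shows happening inside compute_meta):

```python
import numpy as np

a = np.ones((2, 2))
m = np.matrix(a)

# ndarray.sum accepts keepdims, so reductions can preserve block shapes:
assert np.sum(a, axis=1, keepdims=True).shape == (2, 1)

# np.matrix.sum has no keepdims parameter; np.sum forwards the kwarg
# to the subclass method, so the same call raises TypeError:
try:
    np.sum(m, axis=1, keepdims=True)
    raised = False
except TypeError:
    raised = True
assert raised
```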

Anndata use-case for summing spmatrix-block Arrays

It would be useful to perform sums over sparse blocks in particular. As an example, Anndata frequently builds its X array from huge CSR or CSC matrices in HDF5 or Zarr files, and using spmatrix blocks is important when Dask-ifying this loading and downstream operations (e.g. summing across one axis or the other).

Adding keepdims to matrix/spmatrix

Many months ago I tried to fix this upstream by adding the keepdims param to matrix and spmatrix. The existing behavior of those classes makes it tricky to do without breaking backwards compatibility, but celsiustx/numpy and celsiustx/scipy have implementations that I think have a shot at being upstreamed. Some day I'll give that a go, but waiting on that (and assuming those changes are accepted) before this stops crashing in Dask is not ideal. Using these numpy+scipy forks in Dask (and Anndata) in the meantime also required an unwieldy Docker build + dev environment.

Work-around in Dask (this PR)

After that attempt to solve the issue upstream, I got more comfortable with the existing block-type special-casing in Dask, and decided a work-around hack here is reasonable: a couple of pieces of compute machinery check for matrix/spmatrix _meta and patch some of the reduction flows accordingly. It's not ideal, but it seems like the best option to me atm.

Preserving spmatrix subtypes

When Dask computes a CSR or CSC, it ends up concat'ing blocks along axis 1 and then along axis 0.

scipy.sparse's CSR and CSC each coerce to COO when stacked along their more-awkward axis (1 for CSR, 0 for CSC).

The result is that you'd always get a COO back after a compute on an Array that quacked like a CSR/CSC.

I decided it's worth the hit to do an in-memory tocsr()/tocsc() in Dask to avoid that surprise.

That does create a different (but, I think, less bad) inconsistency with scipy.sparse: e.g. da.hstack can now return a CSR from a bunch of hstacked CSRs, where scipy.sparse would coerce to COO. I think it's better to err on the side of returning a more "precise" spmatrix type than to lose that info where the non-Dask version would retain it, but I'm definitely open to discussion there.
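The scipy.sparse coercion described above can be seen directly (a minimal sketch; the exact fallback format may vary across scipy versions): stacking CSR along axis 0 hits a fast path that stays CSR, while stacking along its awkward axis 1 has historically fallen back to COO.

```python
import numpy as np
import scipy.sparse as sp

csr = sp.csr_matrix(np.eye(2))

# vstack (axis 0) of CSR blocks has a fast path that keeps CSR...
print(sp.vstack([csr, csr]).format)   # 'csr'

# ...but hstack (axis 1, CSR's awkward axis) has historically coerced
# to COO, which is the post-compute surprise this PR avoids.
print(sp.hstack([csr, csr]).format)
```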

pydata/sparse vs. scipy/sparse

I'm also aware that pydata/sparse may be the recommended alternative to scipy.sparse; it's not been clear to me yet that just dropping the latter and moving to the former would solve these issues, and in any case if there's enough matrix/spmatrix still floating around it may be worth having this work-around anyway, but I'm definitely interested in folks' thoughts there. In particular I'm watching pydata/sparse#443 (thanks @ivirshup for bringing it to my attention, and cc @GenevieveBuckley as well for visibility here).

@jsignell added the "needs review" (Needs review from a contributor.) label Feb 28, 2022
@jcrist
Member

jcrist commented Mar 16, 2022

Hi @ryan-williams - since numpy no longer recommends using np.matrix types, and spmatrix has many of the same issues, we generally recommend that users use pydata/sparse instead. np.matrix-like classes will never support 100% of the functionality that dask.array exposes; adding little fixes to methods here and there may let some methods work, but won't provide a consistent enough experience for us to want to support it. We recommend using pydata/sparse for sparse support, and avoiding np.matrix in general. Closing.

@jcrist jcrist closed this Mar 16, 2022