
improve support for np.matrix / scipy.sparse.spmatrix Array-blocks #7468

Closed
ryan-williams wants to merge 9 commits into dask:main from celsiustx:sum2

Conversation

@ryan-williams
Contributor

  • Closes #xxxx (there's probably a few relevant issues, and it might be worth filing one or two just describing the bugs being fixed…)
  • Tests added / passed
  • Passes black dask / flake8 dask

Usually an Array's blocks materialize as ndarrays, but in some cases they are np.matrix or scipy.sparse.spmatrix instances. Some operations already work as expected with such blocks, but this PR improves a few that don't (.A and .sum; concatenation/stacking of such blocks is also improved).

Array.A

.A generally coerces an array-like object to be an actual ndarray. In the case of a Dask Array, _meta and the blocks' types should become ndarray when .A is applied.

The current behavior returns the Array, unchanged. This is fine for ndarray-blocks, but incorrect for matrix/spmatrix-blocks, where _meta (and the blocks themselves) remain as matrix/spmatrix (instead of ndarray).

The new behavior is that _meta and the blocks become ndarrays.
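For context, here's what .A does on these block types outside Dask (a minimal sketch using numpy and scipy directly): in both cases it densifies/coerces to a plain ndarray, which is the behavior this PR mirrors for _meta and the blocks.

```python
import numpy as np
import scipy.sparse as sp

m = np.matrix([[1, 2], [3, 4]])
s = sp.csr_matrix(m)

# On np.matrix, .A is an alias for np.asarray(self): a plain ndarray.
assert type(m.A) is np.ndarray

# On spmatrix, .A is an alias for .toarray(): a dense ndarray.
assert type(s.toarray()) is np.ndarray
assert (m.A == s.toarray()).all()
```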

Array.sum

Array.sum crashes in the presence of matrix/spmatrix blocks today:

da \
.ones((5, 5), chunks=(2, 2)) \
.map_blocks(np.matrix) \
.sum(axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dask/array/core.py", line 2184, in sum
    return sum(
  File "dask/array/reductions.py", line 347, in sum
    result = reduction(
  File "dask/array/reductions.py", line 166, in reduction
    tmp = blockwise(
  File "dask/array/blockwise.py", line 279, in blockwise
    meta = compute_meta(func, dtype, *args[::2], **kwargs)
  File "dask/array/utils.py", line 145, in compute_meta
    meta = func(*args_meta, **kwargs_meta)
  File "<__array_function__ internals>", line 5, in sum
  File "/Users/ryan/.pyenv/versions/dask-3.8.1/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2247, in sum
    return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
  File "/Users/ryan/.pyenv/versions/dask-3.8.1/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 83, in _wrapreduction
    return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
TypeError: sum() got an unexpected keyword argument 'keepdims'

keepdims

This is because matrix/spmatrix don't conform to ndarray's sum() signature: they lack a keepdims parameter. Even though this example doesn't pass keepdims directly, Dask uses it under the hood while reducing a (potentially high-dimensional) Array over some of its dimensions: each block is summed with keepdims=True (to preserve its dimensionality), those block-sums are combined (again with keepdims=True), and the appropriate dimensions are collapsed as part of a final concatenation/post-compute pass (I think this is how it works, but a few details may be off).
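The signature mismatch is easy to reproduce outside Dask (a minimal sketch; np.sum forwards keepdims to the subclass's own sum method, which is exactly what the traceback above shows happening inside compute_meta):

```python
import numpy as np

a = np.ones((2, 2))
m = np.matrix(a)

# ndarray.sum accepts keepdims, so reductions can preserve block shapes:
assert np.sum(a, axis=1, keepdims=True).shape == (2, 1)

# np.matrix.sum has no keepdims parameter; np.sum forwards the kwarg
# to the subclass method, so the same call raises TypeError:
try:
    np.sum(m, axis=1, keepdims=True)
    raised = False
except TypeError:
    raised = True
assert raised
```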

Anndata use-case for summing spmatrix-block Arrays

It would be useful to perform sums over sparse blocks in particular. As an example, Anndata frequently builds its X array from huge CSR or CSC matrices in HDF5 or Zarr files, and using spmatrix blocks is important when Dask-ifying this loading and downstream operations (e.g. summing across one axis or the other).

Adding keepdims to matrix/spmatrix

Many months ago I tried to fix this upstream by adding the keepdims param to matrix and spmatrix. The existing behavior of those classes makes it tricky to do without breaking backwards compatibility, but celsiustx/numpy and celsiustx/scipy have implementations that I think have a shot at being upstreamed. Some day I'll give that a go, but waiting on that (and assuming those changes are accepted) before this stops crashing in Dask is not ideal. Using these numpy+scipy forks in Dask (and Anndata) in the meantime also required an unwieldy Docker build + dev environment.

Work-around in Dask (this PR)

After that attempt to solve the issue upstream, I got more comfortable with the existing block-type special-casing in Dask, and decided a work-around hack here is reasonable: a couple of pieces of compute machinery check for matrix/spmatrix _meta and patch some of the reduction flows accordingly. It's not ideal, but it seems like the best option to me atm.

Preserving spmatrix subtypes

When Dask computes a CSR or CSC, it ends up concat'ing blocks along axis 1 and then along axis 0.

scipy.sparse's CSR and CSC each coerce to COO when stacked along their more-awkward axis (1 for CSR, 0 for CSC).

The result is that you'd always get a COO back after a compute on an Array that quacked like a CSR/CSC.

I decided it's worth the hit to do an in-memory tocsr()/tocsc() in Dask to avoid that surprise.

That does create a different (but, I think, less bad) inconsistency with scipy.sparse: e.g. da.hstack can now return a CSR from a bunch of hstacked CSRs, where scipy.sparse would coerce to COO. I think it's better to err on the side of returning a more "precise" spmatrix type than to lose that info where the non-Dask version would retain it, but I'm definitely open to discussion there.
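The scipy.sparse coercion described above can be seen directly (a minimal sketch; the exact fallback format may vary across scipy versions): stacking CSR along axis 0 hits a fast path that stays CSR, while stacking along its awkward axis 1 has historically fallen back to COO.

```python
import numpy as np
import scipy.sparse as sp

csr = sp.csr_matrix(np.eye(2))

# vstack (axis 0) of CSR blocks has a fast path that keeps CSR...
print(sp.vstack([csr, csr]).format)   # 'csr'

# ...but hstack (axis 1, CSR's awkward axis) has historically coerced
# to COO, which is the post-compute surprise this PR avoids.
print(sp.hstack([csr, csr]).format)
```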

pydata/sparse vs. scipy/sparse

I'm also aware that pydata/sparse may be the recommended alternative to scipy.sparse; it's not been clear to me yet that just dropping the latter and moving to the former would solve these issues, and in any case if there's enough matrix/spmatrix still floating around it may be worth having this work-around anyway, but I'm definitely interested in folks' thoughts there. In particular I'm watching pydata/sparse#443 (thanks @ivirshup for bringing it to my attention, and cc @GenevieveBuckley as well for visibility here).

@jsignell added the "needs review" (Needs review from a contributor.) label Feb 28, 2022
@jcrist
Member

jcrist commented Mar 16, 2022

Hi @ryan-williams - since numpy no longer recommends using np.matrix types, and spmatrix has many of the same issues, we generally recommend that users use pydata/sparse instead. np.matrix-like classes will never support 100% of the functionality that dask.array exposes; adding little fixes to methods here and there may let some methods work, but won't provide a consistent enough experience for us to want to support it. We recommend using pydata/sparse for sparse support, and avoiding np.matrix in general. Closing.

@jcrist jcrist closed this Mar 16, 2022