mean of sparse array: AxisError: axis 0 is out of bounds for array of dimension 0 #2842

Closed
ogrisel wants to merge 2 commits into dask:master from ogrisel:sparse-mean

Conversation

@ogrisel (Contributor) commented Oct 30, 2017

s.mean(axis=0).compute() triggers the following uncaught exception:

________________________________________________________________________________________________ test_basic[func18] ________________________________________________________________________________________________

func = <function <lambda> at 0x7eff69e3ee18>

    @pytest.mark.parametrize('func', functions)
    def test_basic(func):
        x = da.random.random((2, 3, 4), chunks=(1, 2, 2))
        x[x < 0.8] = 0
    
        y = x.map_blocks(sparse.COO.from_numpy)
    
        xx = func(x)
        yy = func(y)
    
>       assert_eq(xx, yy)

func       = <function <lambda> at 0x7eff69e3ee18>
x          = dask.array<where, shape=(2, 3, 4), dtype=float64, chunksize=(1, 2, 2)>
xx         = dask.array<mean_agg-aggregate, shape=(3, 4), dtype=float64, chunksize=(2, 2)>
y          = dask.array<from_numpy, shape=(2, 3, 4), dtype=float64, chunksize=(1, 2, 2)>
yy         = dask.array<mean_agg-aggregate, shape=(3, 4), dtype=float64, chunksize=(2, 2)>

dask/array/tests/test_sparse.py:58: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dask/array/utils.py:83: in assert_eq
    b = b.compute(get=get_sync)
dask/base.py:138: in compute
    (result,) = compute(self, traverse=False, **kwargs)
dask/base.py:336: in compute
    results = get(dsk, keys, **kwargs)
dask/local.py:562: in get_sync
    return get_async(apply_sync, 1, dsk, keys, **kwargs)
dask/local.py:529: in get_async
    fire_task()
dask/local.py:504: in fire_task
    callback=queue.put)
dask/local.py:551: in apply_sync
    res = func(*args, **kwds)
dask/local.py:295: in execute_task
    result = pack_exception(e, dumps)
dask/local.py:290: in execute_task
    result = _execute_task(task, data)
dask/local.py:271: in _execute_task
    return func(*args2)
dask/compatibility.py:47: in apply
    return func(*args, **kwargs)
dask/array/reductions.py:243: in mean_chunk
    n = numel(x, dtype=dtype, **kwargs)
dask/array/reductions.py:234: in numel
    return chunk.sum(np.ones_like(x), **kwargs)
../../.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/fromnumeric.py:1834: in sum
    out=out, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = array(1, dtype=object), axis = (0,), dtype = dtype('float64'), out = None, keepdims = True

    def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
>       return umr_sum(a, axis, dtype, out, keepdims)
E       numpy.core._internal.AxisError: axis 0 is out of bounds for array of dimension 0

a          = array(1, dtype=object)
axis       = (0,)
dtype      = dtype('float64')
keepdims   = True
out        = None
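For context (not part of the original report): the 0-d object array visible in the locals above comes from calling np.ones_like on a chunk that is not an ndarray. NumPy wraps the unknown object as a 0-d scalar of dtype object, so any axis= reduction on the result fails with AxisError. A minimal illustration, using a hypothetical stand-in class rather than a real sparse.COO:

```python
import numpy as np

class FakeSparseChunk:
    """Hypothetical stand-in for a sparse.COO chunk (not an ndarray)."""
    shape = (2, 3)

# NumPy cannot interpret the object as an array, so it wraps it
# as a 0-d object scalar; ones_like then returns array(1, dtype=object).
ones = np.ones_like(FakeSparseChunk())
print(ones, ones.ndim)  # array(1, dtype=object) 0
# ones.sum(axis=(0,)) would raise AxisError, matching the traceback above.
```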

This PR adds a non-regression test along with a naive fix that calls todense() on each chunk during the reduction. This is probably sub-optimal from a performance point of view but I am not sure which kwargs should be supported in numel besides axis.
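A minimal sketch of that naive approach (a hypothetical helper, not the exact patch; it assumes sparse chunks expose a todense() method, as sparse.COO does):

```python
import numpy as np

def numel_dense(x, **kwargs):
    # Hypothetical sketch of the naive fix: densify sparse chunks first,
    # so that the usual np.sum kwargs (axis=, dtype=, keepdims=) behave
    # exactly as they do for plain ndarrays.
    if hasattr(x, 'todense'):
        x = np.asarray(x.todense())
    # Count elements per position by summing an array of ones.
    return np.sum(np.ones_like(x), **kwargs)
```

For example, numel_dense(np.zeros((2, 3)), axis=0, keepdims=True) yields an array of shape (1, 3) filled with 2.0, which is the per-column element count the mean reduction needs.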

I will add an entry to the changelog if we agree on the correct fix.

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

Review comment on the numel hunk in dask/array/reductions.py (the tail of the patched function):

        x_ones[:] = 1
    else:
        x_ones = np.ones_like(x)
    return chunk.sum(x_ones, **kwargs)
Member commented:

Instead perhaps we replace np.ones_like with np.ones(shape=x.shape, dtype='u1') ?

ogrisel (Contributor, Author) commented:

Indeed :)

ogrisel (Contributor, Author) commented:

I amended my commit with your suggestion. It can still allocate a lot of unnecessary memory if the arrays are very sparse and the chunk dimensions comparatively large.

ogrisel (Contributor, Author) commented:

Actually using u1 is wrong. We should probably use u8 to be able to count large dimensions.

Member commented:

Yes, ideally we would re-implement the axis and keepdims logic. We've been lazy so far.

U1 seems to work for me?

In [1]: import numpy as np

In [2]: np.ones(shape=(1000, 2000), dtype='u1').sum()
Out[2]: 2000000
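The reason that check passes is that np.sum upcasts small integer inputs to the platform's default integer for accumulation unless an explicit dtype is given; the u1 overflow only surfaces if the accumulator itself is forced to u1. A quick demonstration:

```python
import numpy as np

a = np.ones(300, dtype='u1')

# Default accumulator: the u1 input is summed in a wider integer dtype,
# so no overflow occurs even though 300 > 255.
print(a.sum())            # 300

# Forcing the accumulator to u1 wraps around modulo 256.
print(a.sum(dtype='u1'))  # 300 % 256 == 44
```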

ogrisel (Contributor, Author) commented:

There are still broken tests with masked arrays. Need to investigate.

@jrbourbeau (Member) commented:

Is this still an issue today, @ogrisel? It looks like things might work on the current master branch:

In [1]: import dask.array as da

In [2]: import sparse

In [3]: x = da.random.random((100, 100), chunks=(10, 10))

In [4]: x[x < 0.95] = 0

In [5]: s = x.map_blocks(sparse.COO)

In [6]: s.mean(axis=0).compute()
Out[6]: <COO: shape=(100,), dtype=float64, nnz=99, fill_value=0.0>

@jrbourbeau (Member) commented:

Closing as the originally posted issue seems to be resolved. @ogrisel feel free to re-open if this is not the case.

@jrbourbeau jrbourbeau closed this Jun 14, 2019