mean of sparse array: AxisError: axis 0 is out of bounds for array of dimension 0 #2842

Closed
ogrisel wants to merge 2 commits into dask:master from ogrisel:sparse-mean

Conversation

@ogrisel (Contributor) commented Oct 30, 2017

s.mean(axis=0).compute() triggers the following uncaught exception:

________________________________________________________________________________________________ test_basic[func18] ________________________________________________________________________________________________

func = <function <lambda> at 0x7eff69e3ee18>

    @pytest.mark.parametrize('func', functions)
    def test_basic(func):
        x = da.random.random((2, 3, 4), chunks=(1, 2, 2))
        x[x < 0.8] = 0
    
        y = x.map_blocks(sparse.COO.from_numpy)
    
        xx = func(x)
        yy = func(y)
    
>       assert_eq(xx, yy)

func       = <function <lambda> at 0x7eff69e3ee18>
x          = dask.array<where, shape=(2, 3, 4), dtype=float64, chunksize=(1, 2, 2)>
xx         = dask.array<mean_agg-aggregate, shape=(3, 4), dtype=float64, chunksize=(2, 2)>
y          = dask.array<from_numpy, shape=(2, 3, 4), dtype=float64, chunksize=(1, 2, 2)>
yy         = dask.array<mean_agg-aggregate, shape=(3, 4), dtype=float64, chunksize=(2, 2)>

dask/array/tests/test_sparse.py:58: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dask/array/utils.py:83: in assert_eq
    b = b.compute(get=get_sync)
dask/base.py:138: in compute
    (result,) = compute(self, traverse=False, **kwargs)
dask/base.py:336: in compute
    results = get(dsk, keys, **kwargs)
dask/local.py:562: in get_sync
    return get_async(apply_sync, 1, dsk, keys, **kwargs)
dask/local.py:529: in get_async
    fire_task()
dask/local.py:504: in fire_task
    callback=queue.put)
dask/local.py:551: in apply_sync
    res = func(*args, **kwds)
dask/local.py:295: in execute_task
    result = pack_exception(e, dumps)
dask/local.py:290: in execute_task
    result = _execute_task(task, data)
dask/local.py:271: in _execute_task
    return func(*args2)
dask/compatibility.py:47: in apply
    return func(*args, **kwargs)
dask/array/reductions.py:243: in mean_chunk
    n = numel(x, dtype=dtype, **kwargs)
dask/array/reductions.py:234: in numel
    return chunk.sum(np.ones_like(x), **kwargs)
../../.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/fromnumeric.py:1834: in sum
    out=out, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = array(1, dtype=object), axis = (0,), dtype = dtype('float64'), out = None, keepdims = True

    def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
>       return umr_sum(a, axis, dtype, out, keepdims)
E       numpy.core._internal.AxisError: axis 0 is out of bounds for array of dimension 0

a          = array(1, dtype=object)
axis       = (0,)
dtype      = dtype('float64')
keepdims   = True
out        = None
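For context (not part of the original report): the 0-d object array visible in the locals above comes from calling np.ones_like on a chunk that is not an ndarray. NumPy wraps the unknown object as a 0-d scalar of dtype object, so any axis= reduction on the result fails with AxisError. A minimal illustration, using a hypothetical stand-in class rather than a real sparse.COO:

```python
import numpy as np

class FakeSparseChunk:
    """Hypothetical stand-in for a sparse.COO chunk (not an ndarray)."""
    shape = (2, 3)

# NumPy cannot interpret the object as an array, so it wraps it
# as a 0-d object scalar; ones_like then returns array(1, dtype=object).
ones = np.ones_like(FakeSparseChunk())
print(ones, ones.ndim)  # array(1, dtype=object) 0
# ones.sum(axis=(0,)) would raise AxisError, matching the traceback above.
```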

This PR adds a non-regression test along with a naive fix that calls todense() on each chunk during the reduction. This is probably sub-optimal from a performance point of view but I am not sure which kwargs should be supported in numel besides axis.
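A minimal sketch of that naive approach (a hypothetical helper, not the exact patch; it assumes sparse chunks expose a todense() method, as sparse.COO does):

```python
import numpy as np

def numel_dense(x, **kwargs):
    # Hypothetical sketch of the naive fix: densify sparse chunks first,
    # so that the usual np.sum kwargs (axis=, dtype=, keepdims=) behave
    # exactly as they do for plain ndarrays.
    if hasattr(x, 'todense'):
        x = np.asarray(x.todense())
    # Count elements per position by summing an array of ones.
    return np.sum(np.ones_like(x), **kwargs)
```

For example, numel_dense(np.zeros((2, 3)), axis=0, keepdims=True) yields an array of shape (1, 3) filled with 2.0, which is the per-column element count the mean reduction needs.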

I will add an entry to the changelog if we agree on the correct fix.

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

Review comment on the numel hunk in dask/array/reductions.py (the tail of the patched function):

        x_ones[:] = 1
    else:
        x_ones = np.ones_like(x)
    return chunk.sum(x_ones, **kwargs)
Member commented:

Instead perhaps we replace np.ones_like with np.ones(shape=x.shape, dtype='u1') ?

ogrisel (Contributor, Author) commented:

Indeed :)

ogrisel (Contributor, Author) commented:

I amended my commit with your suggestion. It can still allocate a lot of unnecessary memory if the arrays are very sparse and the chunk dimensions comparatively large.

ogrisel (Contributor, Author) commented:

Actually using u1 is wrong. We should probably use u8 to be able to count large dimensions.

Member commented:

Yes, ideally we would re-implement the axis and keepdims logic. We've been lazy so far.

U1 seems to work for me?

In [1]: import numpy as np

In [2]: np.ones(shape=(1000, 2000), dtype='u1').sum()
Out[2]: 2000000
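The reason that check passes is that np.sum upcasts small integer inputs to the platform's default integer for accumulation unless an explicit dtype is given; the u1 overflow only surfaces if the accumulator itself is forced to u1. A quick demonstration:

```python
import numpy as np

a = np.ones(300, dtype='u1')

# Default accumulator: the u1 input is summed in a wider integer dtype,
# so no overflow occurs even though 300 > 255.
print(a.sum())            # 300

# Forcing the accumulator to u1 wraps around modulo 256.
print(a.sum(dtype='u1'))  # 300 % 256 == 44
```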

ogrisel (Contributor, Author) commented:

There are still broken tests with masked arrays. Need to investigate.

@jrbourbeau (Member) commented:

Is this still an issue today, @ogrisel? It looks like things might work on the current master branch:

In [1]: import dask.array as da

In [2]: import sparse

In [3]: x = da.random.random((100, 100), chunks=(10, 10))

In [4]: x[x < 0.95] = 0

In [5]: s = x.map_blocks(sparse.COO)

In [6]: s.mean(axis=0).compute()
Out[6]: <COO: shape=(100,), dtype=float64, nnz=99, fill_value=0.0>

@jrbourbeau (Member) commented:

Closing as the originally posted issue seems to be resolved. @ogrisel feel free to re-open if this is not the case.

@jrbourbeau jrbourbeau closed this Jun 14, 2019