Support slicing with out-of-order numpy arrays#3808
mrocklin wants to merge 1 commit into dask:master
Conversation
This turns a slice with an out-of-order numpy array into a two-stage slice in a way that reduces overhead. We will now get, at worst, n-squared behavior. However, as currently implemented this also opens us up to all sorts of corner cases. We'll probably have to either do a bunch of work or else reduce the scope of when this optimization gets applied.
In [1]: import dask.array as da
In [2]: x = da.random.random((10000, 10000), chunks=(1000, 1000))
In [3]: import numpy as np
In [4]: index = np.random.randint(0, len(x), size=len(x))
In [5]: len(x[index].dask)
/home/mrocklin/workspace/dask/dask/array/core.py:1398: PerformanceWarning: Slicing with an out-of-order index is generating 10 times more chunks
return self[index3].rechunk({axis: c})[index4]
Out[5]: 1300
In [6]: len(x.dask)
Out[6]: 100
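For reference, the two-stage idea can be sketched in plain numpy (a hedged illustration of the general technique, not the PR's actual code; all names here are made up): sort the requested index so the take is monotonic, then apply the inverse permutation to restore the requested order.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10)
index = rng.integers(0, len(x), size=len(x))  # out-of-order index

order = np.argsort(index, kind="stable")  # stage 1: sort the index
sorted_take = x[index[order]]             # stage 2: monotonic take
inverse = np.argsort(order)               # stage 3: undo the sort
result = sorted_take[inverse]

assert (result == x[index]).all()  # same answer as the direct fancy index
```

In dask the monotonic take in stage 2 is what keeps each output chunk reading from a bounded set of input chunks, at the cost of the extra reordering pass.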
Still costly, but feasible. This is also a case where we might consider auto-rechunking.
Thanks for working on this Matt. Will give it a closer look soon. Have one question ATM. What happens if you replace index with:

index = np.arange(10000)
np.random.shuffle(index)
I think it'd be about the same.
(slice(0, 5), index),
(slice(0, None, 2), index),
# (None, slice(None, None), index),
# (None, index, slice(None, None)),
return out
def shuffle_convert(index, chunks, axis):
    locs = np.concatenate([[0], np.where(ind2[1:] < ind2[:-1])[0] + 1, [len(ind2)]])
    c = tuple(np.diff(locs).tolist())
    return self[index3].rechunk({axis: c})[index4]
This whole block is quite complicated, and could use at least some block-level comments to explain what each operation is doing.
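For what it's worth, the boundary computation can be illustrated standalone (the values are made up, and `ind2` here stands in for the index variable in the diff): it finds where the sequence stops increasing, and the lengths of the resulting monotonic runs become the new chunk sizes.

```python
import numpy as np

# Illustrative index with three monotonic runs: [3, 7, 9], [2, 5], [1, 4, 8]
ind2 = np.array([3, 7, 9, 2, 5, 1, 4, 8])

# Positions where the value decreases mark the start of a new run.
breaks = np.where(ind2[1:] < ind2[:-1])[0] + 1      # -> [3, 5]

# Bracket with 0 and len(ind2) to get all run boundaries.
locs = np.concatenate([[0], breaks, [len(ind2)]])   # -> [0, 3, 5, 8]

# Run lengths become the chunk sizes along the sliced axis.
c = tuple(np.diff(locs).tolist())                   # -> (3, 2, 3)
```

Comments along these lines in the block itself would go a long way.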
# (None, index, slice(None, None)),
(index[::2],),
(slice(None, None), index[::2],),
(5, index)]:
Would be nice to have some cases like the following (with a higher-dimensional array):
(*mix_of_nones_slices_and_integers1, index, *mix_of_nones_slices_and_integers2)
to check that prior and subsequent axes are being applied correctly.
Would also be good to explicitly check that the error from providing multiple array indices is still raised with this optimization.
flake8 dask

cc @jakirkham