Support slicing with out-of-order numpy arrays#3808

Closed
mrocklin wants to merge 1 commit into dask:master from mrocklin:slice-shuffle

Conversation

@mrocklin
Member

This rewrites slicing with a random, out-of-order numpy index array as a two-stage
slice, in a way that reduces overhead. We will now get, at worst, n-squared
behavior.

However, as currently implemented this also opens us up to a number of
corner cases. We'll probably have to either do a fair amount of work or else
narrow the scope of when this optimization gets applied.
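For reference, the core idea of the two-stage slice can be sketched in plain numpy (the names here are illustrative, not the PR's internals): first take with the sorted index so reads are contiguous, then apply the inverse permutation to restore the requested order.

```python
import numpy as np

x = np.arange(20) * 10
index = np.random.randint(0, len(x), size=len(x))

# Stage 1: take with the sorted index, so reads hit chunks in order.
sorter = np.argsort(index, kind="stable")
staged = x[index[sorter]]

# Stage 2: apply the inverse permutation to restore the requested order.
inverse = np.empty_like(sorter)
inverse[sorter] = np.arange(len(sorter))
result = staged[inverse]

assert (result == x[index]).all()
```

In dask the sorted take maps onto whole chunks, so the expensive shuffling happens once, in the middle rechunk step.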

  • Tests added / passed
  • Passes flake8 dask

cc @jakirkham

@mrocklin
Member Author

In [1]: import dask.array as da
In [2]: x = da.random.random((10000, 10000), chunks=(1000, 1000))
In [3]: import numpy as np
In [4]: index = np.random.randint(0, len(x), size=len(x))

In [5]: len(x[index].dask)
/home/mrocklin/workspace/dask/dask/array/core.py:1398: PerformanceWarning: Slicing with an out-of-order index is generating 10 times more chunks
  return self[index3].rechunk({axis: c})[index4]
Out[5]: 1300

In [6]: len(x.dask)
Out[6]: 100

@mrocklin
Member Author

Still costly, but feasible. This is also a case where we might consider auto-rechunking.

@jakirkham
Member

Thanks for working on this, Matt. Will give it a closer look soon.

Have one question ATM. What happens if you replace index with the following?

index = np.arange(10000)
np.random.shuffle(index)

@mrocklin
Member Author

I think it'd be about the same.
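One rough way to sanity-check that guess (a sketch, not part of the PR): the chunk blow-up is driven by how disordered the index is, and a random-with-replacement index and a shuffled permutation have roughly the same number of descending breaks.

```python
import numpy as np

np.random.seed(0)
n = 10000

idx_rand = np.random.randint(0, n, size=n)  # random, with replacement
idx_perm = np.arange(n)
np.random.shuffle(idx_perm)                 # a pure permutation

def descents(ind):
    # Count positions where the index decreases; both styles of index
    # are about equally disordered, so the slicing cost should be similar.
    return int((ind[1:] < ind[:-1]).sum())

print(descents(idx_rand), descents(idx_perm))  # both around n / 2
```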

(slice(0, 5), index),
(slice(0, None, 2), index),
# (None, slice(None, None), index),
# (None, index, slice(None, None)),
Member

Why are these commented out?

return out


def shuffle_convert(index, chunks, axis):
Member

This could use a docstring

locs = np.concatenate([[0], np.where(ind2[1:] < ind2[:-1])[0] + 1, [len(ind2)]])
c = tuple(np.diff(locs).tolist())
return self[index3].rechunk({axis: c})[index4]
Member

This whole block is quite complicated, and could use at least some block-level comments to explain what each operation is doing.
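For what it's worth, here is one hedged reading of the chunk-boundary computation, run on an illustrative `ind2` (not the PR's actual data): each monotone run in the staged index becomes one output chunk.

```python
import numpy as np

# ind2 stands in for the staged index: sorted within runs.
ind2 = np.array([2, 5, 7, 1, 3, 9, 0])

# A new run starts wherever the index decreases; prepend 0 and
# append len(ind2) to bracket the first and last runs.
locs = np.concatenate([[0], np.where(ind2[1:] < ind2[:-1])[0] + 1, [len(ind2)]])

# The run lengths become the chunk sizes used for the rechunk step.
c = tuple(np.diff(locs).tolist())

print(locs.tolist(), c)  # → [0, 3, 6, 7] (3, 3, 1)
```

The three runs here are `[2, 5, 7]`, `[1, 3, 9]`, and `[0]`, matching the chunk sizes `(3, 3, 1)`.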

# (None, index, slice(None, None)),
(index[::2],),
(slice(None, None), index[::2],),
(5, index)]:
Member

Would be nice to have some cases like the following (with a higher-dimensional array):

(*mix_of_nones_slices_and_integers1, index, *mix_of_nones_slices_and_integers1)

to check that prior and subsequent axes are being applied correctly.

Would also be good to explicitly check that the error from providing multiple array indices is still raised with this optimization.
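A numpy analogue of the suggested axis check (a dask-free sketch with assumed names): a two-stage take along one axis should leave the prior and subsequent axes untouched.

```python
import numpy as np

x = np.random.random((3, 8, 4))
index = np.random.randint(0, 8, size=8)
axis = 1

sorter = np.argsort(index, kind="stable")
inverse = np.empty_like(sorter)
inverse[sorter] = np.arange(len(sorter))

# Sorted take, then the inverse permutation, along the same axis.
staged = np.take(np.take(x, index[sorter], axis=axis), inverse, axis=axis)

# Must match the direct one-shot take on that axis.
assert np.array_equal(staged, np.take(x, index, axis=axis))
```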

@mrocklin mrocklin closed this Jan 3, 2019
@mrocklin mrocklin deleted the slice-shuffle branch January 3, 2019 17:45