Skip to content

API: shuffle dask array#3901

Merged
TomAugspurger merged 13 commits intodask:masterfrom
TomAugspurger:ndarray-shuffle
Aug 8, 2019
Merged

API: shuffle dask array#3901
TomAugspurger merged 13 commits intodask:masterfrom
TomAugspurger:ndarray-shuffle

Conversation

@TomAugspurger
Copy link
Member

Closes #3409

Some brief timings on

x = da.random.random((100_000, 10), chunks=10_000)
index = np.arange(len(x))
np.random.shuffle(index)
task old new
build graph 419 ms 28.7 ms
compute 14.5 s 48.9 ms

@TomAugspurger
Copy link
Member Author

IIUC, In #3409, @mrocklin mentioned adjusting slicing_plan to detect when we should use this slicing method. I've not attempted that. So a "naive" shuffle of a dask array with

index = np.arange(len(arr))
np.random.shuffle(index)
arr[index]

is still going to be very slow. But the use cases I have in mind (#3409, lmcinnes/umap#62, approximate nearest neighbors) can opt into the faster slicing, when we know we have the right kind of index array.

@mrocklin
Copy link
Member

mrocklin commented Aug 24, 2018 via email

@TomAugspurger
Copy link
Member Author

Thanks, I had forgotten about that. I'll take a look.


offsets = np.roll(np.cumsum(chunks[0]), 1)
offsets[0] = 0
offsets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

@jcrist
Copy link
Member

jcrist commented Apr 30, 2019

@TomAugspurger, what's left to be done here?

@TomAugspurger
Copy link
Member Author

Sorry, forgot about this. I think that this is useful, so I've fixed the merge conflicts. I haven't re-reviewed the implementation though.

@jakirkham
Copy link
Member

@shoyer, any thoughts on this implementation of shuffle for Dask Arrays?

@martindurant
Copy link
Member

Ping: this seems to have been left to go stale

@TomAugspurger
Copy link
Member Author

The difference in inplace vs. a new Array is the main thing concerning me. Do we have other places in dask.array that differ from NumPy like this?

@martindurant
Copy link
Member

So long as the doc is clear, it should be OK - we are different from numpy in a number of ways in a number of places; I expect that includes in-place behaviour somewhere, although I don't know for sure.

@jcrist
Copy link
Member

jcrist commented Jun 25, 2019

The difference in inplace vs. a new Array is the main thing concerning me.

There are other places in the api where we mutate an existing array/dataframe object inplace (this is fine as long as we don't mutate the graph). I'd prefer we match numpy's mutating api here if possible, as I suspect differing will lead to user issues in the future.

@TomAugspurger
Copy link
Member Author

Merging later today if there aren't any objections.

@TomAugspurger TomAugspurger merged commit 51ff4e6 into dask:master Aug 8, 2019
@jakirkham
Copy link
Member

Thanks for working on this Tom! 😄

@stsievert
Copy link
Member

Thanks for working on this @TomAugspurger!

I've rerun the timing comparison to see how the new implementation works and show how to use it.

Implementation Graph build Computation
shuffle_blocks (this PR) 77.8ms 67.1 ms
Naive indexing 721ms 13.007s

Here's the code I used to generate it:

Details
import dask.array as da
import numpy as np
import dask
from time import time
from dask.array.slicing import shuffle_slice

if __name__ == "__main__":
    x = da.random.random((100_000, 10), chunks=10_000)
    index = np.arange(len(x))
    np.random.shuffle(index)

    start = time()
    y2 = shuffle_slice(x, index)  # 0.07785s
    print(time() - start)
    start = time()
    z2 = y2.compute()  # 0.06716
    print(time() - start)

    start = time()
    y1 = x[index]  # 0.721
    print(time() - start)
    start = time()
    z1 = y1.compute()  # 13.0067
    print(time() - start)

The core of this code is

x = da.random.random((100_000, 10), chunks=10_000)
index = np.arange(len(x))
np.random.shuffle(index)

y1 = shuffle_blocks(x, index)  # shuffle_blocks
y2 = x[index]  # naive indexing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add RandomState shuffle/permute

7 participants