API: shuffle dask array by TomAugspurger · Pull Request #3901 · dask/dask

TomAugspurger · 2018-08-24T21:44:10Z

Some brief timings on

x = da.random.random((100_000, 10), chunks=10_000)
index = np.arange(len(x))
np.random.shuffle(index)

| task | old | new |
| --- | --- | --- |
| build graph | 419 ms | 28.7 ms |
| compute | 14.5 s | 48.9 ms |

Closes dask#3409

TomAugspurger · 2018-08-24T21:48:51Z

IIUC, in #3409 @mrocklin mentioned adjusting slicing_plan to detect when we should use this slicing method. I've not attempted that, so a "naive" shuffle of a dask array with

index = np.arange(len(arr))
np.random.shuffle(index)
arr[index]

is still going to be very slow. But the use cases I have in mind (#3409, lmcinnes/umap#62, approximate nearest neighbors) can opt into the faster slicing, when we know we have the right kind of index array.

mrocklin · 2018-08-24T22:38:05Z

There might be some valuable code in #3808 for this operation.

TomAugspurger · 2018-08-26T17:48:50Z

Thanks, I had forgotten about that. I'll take a look.

jakirkham · 2018-08-27T23:35:49Z

dask/array/slicing.py

+
+    offsets = np.roll(np.cumsum(chunks[0]), 1)
+    offsets[0] = 0
+    offsets


Is this needed?
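For context, a minimal sketch of what the quoted lines compute, assuming `chunks[0]` holds the chunk sizes along the first axis (the sizes below are made up for illustration). The first two lines produce the starting row offset of each chunk; the bare `offsets` expression on the third line has no effect on its own, which is presumably what the question refers to.

```python
import numpy as np

# Illustrative chunk sizes along axis 0 (an assumption, not taken from the PR).
chunks = ((3, 4, 2),)

# Cumulative sizes are [3, 7, 9]; rolling by one gives [9, 3, 7].
offsets = np.roll(np.cumsum(chunks[0]), 1)
# Overwrite the wrapped-around total so the first chunk starts at row 0.
offsets[0] = 0

print(offsets.tolist())
```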

jcrist · 2019-04-30T21:55:30Z

@TomAugspurger, what's left to be done here?

TomAugspurger · 2019-05-01T03:05:09Z

Sorry, forgot about this. I think that this is useful, so I've fixed the merge conflicts. I haven't re-reviewed the implementation though.

jakirkham · 2019-05-29T01:29:05Z

@shoyer, any thoughts on this implementation of shuffle for Dask Arrays?

martindurant · 2019-06-19T13:24:47Z

Ping: this seems to have been left to go stale

TomAugspurger · 2019-06-19T13:27:33Z

The difference in inplace vs. a new Array is the main thing concerning me. Do we have other places in dask.array that differ from NumPy like this?

martindurant · 2019-06-19T13:31:40Z

So long as the doc is clear, it should be OK - we are different from numpy in a number of ways in a number of places; I expect that includes in-place behaviour somewhere, although I don't know for sure.

jcrist · 2019-06-25T16:23:30Z

> The difference in inplace vs. a new Array is the main thing concerning me.

There are other places in the api where we mutate an existing array/dataframe object inplace (this is fine as long as we don't mutate the graph). I'd prefer we match numpy's mutating api here if possible, as I suspect differing will lead to user issues in the future.
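To make the in-place question concrete, here is a small NumPy-only sketch of the two behaviours being compared. It shows only how NumPy itself behaves and says nothing about what the final dask API does; the seed and array sizes are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, for repeatability only

# NumPy's shuffle mutates its argument and returns None.
a = np.arange(10)
result = rng.shuffle(a)

# permutation leaves its argument untouched and returns a new array.
b = np.arange(10)
shuffled = rng.permutation(b)

print(result, a)
print(b, shuffled)
```

A function that returns a new array, like `permutation`, is the natural fit for lazy arrays, while matching `shuffle` means rebinding the existing object's state in place.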

TomAugspurger · 2019-07-01T18:39:26Z

Merging later today if there aren't any objections.

jakirkham · 2019-08-08T15:51:34Z

Thanks for working on this Tom! 😄

stsievert · 2019-08-30T02:34:33Z

Thanks for working on this @TomAugspurger!

I've rerun the timing comparison to see how the new implementation works and show how to use it.

| Implementation | Graph build | Computation |
| --- | --- | --- |
| `shuffle_slice` (this PR) | 77.8 ms | 67.1 ms |
| Naive indexing | 721 ms | 13.007 s |

Here's the code I used to generate it:

import dask.array as da
import numpy as np
from time import time
from dask.array.slicing import shuffle_slice

if __name__ == "__main__":
    x = da.random.random((100_000, 10), chunks=10_000)
    index = np.arange(len(x))
    np.random.shuffle(index)

    # shuffle_slice (this PR)
    start = time()
    y2 = shuffle_slice(x, index)  # graph build: 0.07785 s
    print(time() - start)
    start = time()
    z2 = y2.compute()  # computation: 0.06716 s
    print(time() - start)

    # naive indexing
    start = time()
    y1 = x[index]  # graph build: 0.721 s
    print(time() - start)
    start = time()
    z1 = y1.compute()  # computation: 13.0067 s
    print(time() - start)

The core of this code is

x = da.random.random((100_000, 10), chunks=10_000)
index = np.arange(len(x))
np.random.shuffle(index)

y1 = shuffle_slice(x, index)  # shuffle_slice (this PR), from dask.array.slicing
y2 = x[index]  # naive indexing

TomAugspurger added 2 commits August 24, 2018 16:28

API: shuffle dask array

3780677

Closes dask#3409

api docs

fa03909

docs, warnings

12a3f3b

doctest

b3f25ed

jakirkham reviewed Aug 27, 2018


TomAugspurger added 2 commits April 30, 2019 22:00

Merge remote-tracking branch 'upstream/master' into ndarray-shuffle

8c28755

remove line

0fcc827

TomAugspurger added 4 commits May 24, 2019 16:57

Merge remote-tracking branch 'upstream/master' into ndarray-shuffle

bc9effc

lint

260bf43

revert

1bc5e83

doc

0f8f788

shoyer reviewed Jun 24, 2019

dask/array/random.py (outdated, resolved)

dask/array/tests/test_random.py (outdated, resolved)

TomAugspurger added 2 commits June 26, 2019 09:51

Merge remote-tracking branch 'upstream' into ndarray-shuffle

9f2a3d9

update

ec0db9f

Merge remote-tracking branch 'upstream/master' into ndarray-shuffle

8ee16d8

TomAugspurger merged commit 51ff4e6 into dask:master Aug 8, 2019

TomAugspurger added the array label Aug 8, 2019

stsievert mentioned this pull request Aug 30, 2019

Partial fit / Incremental UMAP lmcinnes/umap#62

Open

TomAugspurger deleted the ndarray-shuffle branch August 30, 2019 11:41

dcherian mentioned this pull request Jul 31, 2024

Implement task-based array shuffle #11262

Merged

3 tasks
