Conversation
IIUC, in #3409, @mrocklin mentioned adjusting `slicing_plan` to detect when we should use this slicing method. I've not attempted that. So a "naive" shuffle of a dask array with

```python
index = np.arange(len(arr))
np.random.shuffle(index)
arr[index]
```

is still going to be very slow. But the use cases I have in mind (#3409, lmcinnes/umap#62, approximate nearest neighbors) can opt into the faster slicing, when we know we have the right kind of index array.
There might be some valuable code in #3808
for this operation.
…On Fri, Aug 24, 2018 at 5:51 PM, Tom Augspurger wrote:

> IIUC, In #3409, @mrocklin mentioned adjusting `slicing_plan` to detect when we should use this slicing method. I've not attempted that. So a "naive" shuffle of a dask array with
>
> ```python
> index = np.arange(len(arr))
> np.random.shuffle(index)
> arr[index]
> ```
>
> is still going to be very slow. But the use cases I have in mind (#3409, lmcinnes/umap#62, approximate nearest neighbors) can opt into the faster slicing, when we know we have the right kind of index array.
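As an aside on why the naive version is so slow: with a fully random permutation, each output chunk draws rows from nearly every input chunk, so the task graph couples every output block to almost every input block along that axis. A numpy-only sketch of that worst case (the sizes here are made up for illustration):

```python
import numpy as np

n, c = 100, 10                      # 100 rows split into 10 chunks of size 10
rng = np.random.default_rng(0)
index = rng.permutation(n)          # a fully random shuffle index

first_out_chunk = index[:c]         # rows that land in output chunk 0
src_chunks = np.unique(first_out_chunk // c)  # source chunks those rows come from
print(len(src_chunks))              # typically close to 10, i.e. nearly all chunks
```

Each of the 10 output chunks repeats this pattern, so the graph ends up with roughly chunks × chunks dependencies.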
Thanks, I had forgotten about that. I'll take a look.
dask/array/slicing.py (review comment on an outdated diff):

```python
offsets = np.roll(np.cumsum(chunks[0]), 1)
offsets[0] = 0
offsets
```
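For context, the `offsets` computation in that snippet is an exclusive prefix sum of the chunk sizes along axis 0 — the starting row of each chunk. A small standalone illustration (the chunk sizes here are made up):

```python
import numpy as np

chunks0 = (2, 3, 4)  # example chunk sizes along axis 0

# exclusive prefix sum: cumsum gives chunk ends, rolling and zeroing
# the first entry turns them into chunk starts
offsets = np.roll(np.cumsum(chunks0), 1)
offsets[0] = 0
print(offsets)  # [0 2 5]
```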
@TomAugspurger, what's left to be done here?
Sorry, forgot about this. I think this is useful, so I've fixed the merge conflicts. I haven't re-reviewed the implementation, though.
@shoyer, any thoughts on this implementation of shuffle for Dask Arrays?
Ping: this seems to have been left to go stale.
The difference between shuffling in place and returning a new Array is the main thing concerning me. Do we have other places in dask.array that differ from NumPy like this?
So long as the documentation is clear, it should be OK; we differ from NumPy in a number of ways in a number of places. I expect that includes in-place behaviour somewhere, although I don't know for sure.
There are other places in the API where we mutate an existing array/dataframe object in place (this is fine as long as we don't mutate the graph). I'd prefer we match NumPy's mutating API here if possible, as I suspect differing will lead to user issues in the future.
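For reference, the NumPy convention being discussed: `Generator.shuffle` mutates its argument in place and returns `None`, while `Generator.permutation` returns a new shuffled array. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.arange(10)

ret = rng.shuffle(a)                 # mutates a in place...
print(ret)                           # ...and returns None

b = rng.permutation(np.arange(10))   # returns a new shuffled array
print(sorted(b) == list(range(10)))  # True: same elements, new order
```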
Merging later today if there aren't any objections. |
Thanks for working on this Tom! 😄 |
Thanks for working on this @TomAugspurger! I've rerun the timing comparison to see how the new implementation performs and to show how to use it.

Here's the code I used to generate it:

<details>
<summary>Details</summary>

```python
import dask.array as da
import numpy as np
from time import time
from dask.array.slicing import shuffle_slice

if __name__ == "__main__":
    x = da.random.random((100_000, 10), chunks=10_000)
    index = np.arange(len(x))
    np.random.shuffle(index)

    start = time()
    y2 = shuffle_slice(x, index)   # 0.07785 s
    print(time() - start)

    start = time()
    z2 = y2.compute()              # 0.06716 s
    print(time() - start)

    start = time()
    y1 = x[index]                  # 0.721 s
    print(time() - start)

    start = time()
    z1 = y1.compute()              # 13.0067 s
    print(time() - start)
```

</details>

The core of this code is:

```python
x = da.random.random((100_000, 10), chunks=10_000)
index = np.arange(len(x))
np.random.shuffle(index)

y1 = shuffle_slice(x, index)  # shuffle_slice
y2 = x[index]                 # naive indexing
```
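To give a feel for why a chunk-aware shuffle can beat plain fancy indexing, here's a small, hypothetical illustration (not dask's actual implementation) of the bookkeeping involved: using chunk offsets like those in the review snippet above, `np.searchsorted` maps each shuffled index to the source chunk it lives in, so data movement can be planned per chunk rather than per element.

```python
import numpy as np

# Made-up chunk sizes along axis 0
chunks0 = (3, 3, 4)
offsets = np.roll(np.cumsum(chunks0), 1)
offsets[0] = 0                      # [0, 3, 6]: starting row of each chunk

index = np.array([9, 1, 4, 0, 7])   # a shuffled selection of rows

# For each requested row, find which source chunk it falls in
chunk_of = np.searchsorted(offsets, index, side="right") - 1
print(chunk_of)  # [2 0 1 0 2]
```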
Closes #3409
Some brief timings on