Partially revert chunk-splitting in indexing #6665
Conversation
This partially reverts the changes made in dask#6514. It restores the old behavior with a warning that large chunks (10x array.chunk-size) are being produced. Additionally, it adds a new config option to control the behavior (`array.slicing.split-large-chunks`). Setting that to `False` silences the warnings and keeps the "old" behavior (one output block per input block touched, even if this makes a large output). Setting that to `True` silences the warning and restores the Dask 2.26 behavior of splitting. Closes dask#6646
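The three settings (`None` by default, `False`, `True`) can be sketched as a small decision helper. This is a hypothetical simplification to illustrate the behavior described above, not Dask's actual implementation; the function name and the `"keep"`/`"split"` return values are made up for illustration.

```python
import warnings

def decide_chunk_handling(output_bytes, chunk_size_limit, split_option):
    """Sketch of the warn/keep/split logic this PR describes.

    split_option mirrors the "array.slicing.split-large-chunks" config:
    None (default) keeps the old behavior but warns on large chunks,
    False keeps the old behavior silently, True splits silently.
    """
    too_large = output_bytes > 10 * chunk_size_limit
    if split_option is None:
        if too_large:
            warnings.warn("Slicing is producing a large chunk.")
        return "keep"
    if split_option:
        return "split" if too_large else "keep"
    return "keep"
```

For example, with the default setting a 10,000-byte output against a 128-byte limit warns but still produces one large chunk, while setting the option to `True` splits it without warning.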
mrocklin
left a comment
Some minor comments about typos
dask/array/slicing.py
>>> chunks, dsk = take('y', 'x', [(1, 1, 1), (1000, 1000), (1000, 1000)],
...                    [0] + [1] * 6 + [2], axis=0, itemsize=8)
>>> import dask
>>> with dask.config.set(**{"array.slicing.split-large-chunks": True}):
Suggested change:
- >>> with dask.config.set(**{"array.slicing.split-large-chunks": True}):
+ >>> with dask.config.set({"array.slicing.split-large-chunks": True}):
I think that dask.config.set is smart enough to handle being given a dict as a positional arg
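The reason both spellings are equivalent can be shown with a toy stand-in for `dask.config.set` (this `toy_set` is a made-up function for illustration, not part of Dask): CPython's `**` unpacking delivers non-identifier string keys like `"array.slicing.split-large-chunks"` to a `**kwargs` parameter just as a positional dict would.

```python
def toy_set(arg=None, **kwargs):
    # Toy stand-in for dask.config.set: merge a positional mapping
    # with any keyword arguments into one config dict.
    config = dict(arg or {})
    config.update(kwargs)
    return config

positional = toy_set({"array.slicing.split-large-chunks": True})
# CPython allows non-identifier string keys through ** unpacking:
unpacked = toy_set(**{"array.slicing.split-large-chunks": True})
assert positional == unpacked
```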
dask/dask-schema.yaml
- bool
- "null"
description: |
  How to large chunks created when slicing Arrays. By default a
Suggested change:
- How to large chunks created when slicing Arrays. By default a
+ How to split large chunks created when slicing Arrays. By default a
docs/source/array-slicing.rst
>>> a = da.ones((4, 10000, 10000), chunks=(1, -1, -1))

If we slice that with a *sorted* sequence of integers, Dask will return one chunk
per intput chunk
Suggested change:
- per intput chunk
+ per input chunk
docs/source/array-slicing.rst
Previously we had a chunksize of ``1`` along the first dimension. But we've
selected 15 elements from that first chunk, producing a large output chunk.

Dask warns when indexing like this produces a chunk that's 10x larger
Maybe 2x or 5x would be better? 10x seems like a large chunk to me.
Thank you for the doc by the way. This was very informative to me (I only sort of tracked the previous conversation). I'm now curious to learn about situations where we don't want to split up these chunks, but I suppose that that's already covered in conversation on the issue, and I can go and read there.
Changed to 5x, it's pretty arbitrary.
I'm now curious to learn about situations where we don't want to split up these chunks
It sounds like there might be a few issues. For xarray, some operations in a Dataset require uniform chunks. Since the chunk splitting logic depends on the itemsize of the array, you could end up with a call where
ds = load_datasets()  # int64 and int32 arrays, uniform chunks
ds.sort_by("column")  # eventually calls Array.__getitem__, splits according to itemsize
ds.<some_operation_requiring_uniform_chunks>  # raises, since they've split to different sizes
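A rough sketch of why itemsize-dependent splitting breaks uniformity: if a sliced-out block is split so each piece stays under a byte limit, a wider dtype yields different chunk boundaries than a narrower one. This is a simplified model with made-up names (`split_rows`), not Dask's actual splitting algorithm.

```python
def split_rows(n_rows, row_elems, itemsize, limit_bytes):
    # Split n_rows rows (each row_elems elements wide) into chunks
    # whose size stays under limit_bytes; wider dtypes split more.
    rows_per_chunk = max(1, limit_bytes // (row_elems * itemsize))
    chunks = []
    while n_rows > 0:
        take = min(rows_per_chunk, n_rows)
        chunks.append(take)
        n_rows -= take
    return tuple(chunks)

# The same 15-row selection splits differently for int64 vs int32,
# so formerly-uniform arrays in a Dataset end up with mismatched chunks:
assert split_rows(15, 100, itemsize=8, limit_bytes=4000) == (5, 5, 5)
assert split_rows(15, 100, itemsize=4, limit_bytes=4000) == (10, 5)
```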
@jrbourbeau would you have a chance to glance over this before the release today?
I'm going to go ahead and merge this for now. It should restore the old behavior, which will hopefully resolve issues downstream with xarray. However, in general I agree with the previous change that tried to split up these large blocks. I think that we might start applying pressure on downstream libraries to either be robust to uneven block sizes, or to explicitly rechunk in their algorithms if that is necessary. I think that the warning in this PR is the right amount of pressure. I suspect that this is going to be a continued conversation. Thanks for responding to this issue rapidly @TomAugspurger. Merging.
Thanks for your work on this @TomAugspurger, and for reviewing @mrocklin!
Thanks @TomAugspurger and @mrocklin! Really appreciate this.
cc @JSKenyon from the issue. @jrbourbeau this would be nice to include in the 2.28 release if we're able to get it done today or tomorrow.