Partially revert chunk-splitting in indexing. #6665

Merged
mrocklin merged 5 commits into dask:master from TomAugspurger:index-slice-warn on Sep 25, 2020
Conversation

@TomAugspurger (Member)

This partially reverts the changes made in
#6514. It restores the old behavior
with a warning that large chunks (10x array.chunk-size) are being
produced.

Additionally, it adds a new config option to control the behavior
(array.slicing.split-large-chunks). Setting that to False silences
the warnings and keeps the "old" behavior (one output block per input
block touched, even if this makes a large output). Setting that to
True silences the warning and restores the Dask 2.26 behavior of
splitting.

Closes #6646

cc @JSKenyon from the issue. @jrbourbeau this would be nice to include in the 2.28 release if we're able to get it done today or tomorrow.
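For readers following along, the three settings described above could be exercised roughly like this (a minimal sketch using the config key introduced by this PR; the warning class itself is not shown here):

```python
import dask

# Default (option unset): old one-output-block-per-input-block behavior,
# plus a warning when a much larger output chunk is produced.

# False: keep the old behavior and silence the warning.
with dask.config.set({"array.slicing.split-large-chunks": False}):
    assert dask.config.get("array.slicing.split-large-chunks") is False

# True: silence the warning and split large chunks (the Dask 2.26 behavior).
with dask.config.set({"array.slicing.split-large-chunks": True}):
    assert dask.config.get("array.slicing.split-large-chunks") is True
```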

@mrocklin (Member) left a comment

Some minor comments about typos

>>> chunks, dsk = take('y', 'x', [(1, 1, 1), (1000, 1000), (1000, 1000)],
... [0] + [1] * 6 + [2], axis=0, itemsize=8)
>>> import dask
>>> with dask.config.set(**{"array.slicing.split-large-chunks": True}):
@mrocklin (Member)
Suggested change
>>> with dask.config.set(**{"array.slicing.split-large-chunks": True}):
>>> with dask.config.set({"array.slicing.split-large-chunks": True}):

I think that dask.config.set is smart enough to handle being given a dict as a positional arg
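As a quick sanity check on this suggestion (assuming a dask installation is available), both spellings set the same option:

```python
import dask

# Positional-dict form, as suggested in the review:
with dask.config.set({"array.slicing.split-large-chunks": True}):
    assert dask.config.get("array.slicing.split-large-chunks") is True

# The keyword-unpacking form from the original docstring is equivalent:
with dask.config.set(**{"array.slicing.split-large-chunks": True}):
    assert dask.config.get("array.slicing.split-large-chunks") is True
```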

- bool
- "null"
description: |
How to large chunks created when slicing Arrays. By default a
@mrocklin (Member)
Suggested change
How to large chunks created when slicing Arrays. By default a
How to split large chunks created when slicing Arrays. By default a

>>> a = da.ones((4, 10000, 10000), chunks=(1, -1, -1))

If we slice that with a *sorted* sequence of integers, Dask will return one chunk
per intput chunk
@mrocklin (Member)
Suggested change
per intput chunk
per input chunk

Previously we had a chunksize of ``1`` along the first dimension. But we've
selected 15 elements from that first chunk, producing a large output chunk.

Dask warns when indexing like this produces a chunk that's 10x larger
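The situation in the doc excerpt can be reproduced at a smaller scale (the array sizes below are illustrative stand-ins for the doc's 10000 x 10000 example, chosen so the snippet runs quickly):

```python
import dask.array as da

# One input chunk per element along the first axis, as in the doc example.
a = da.ones((4, 100, 100), chunks=(1, -1, -1))

# Select 15 elements from the first input chunk. Without splitting, all
# 15 rows land in a single output chunk along axis 0, 15x the input chunk
# size; depending on the config option this warns or splits.
b = a[[0] * 15]
assert b.shape == (15, 100, 100)
assert sum(b.chunks[0]) == 15
```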
@mrocklin (Member)
Maybe 2x or 5x would be better? 10x seems like a large chunk to me.

@mrocklin (Member)
Thank you for the doc by the way. This was very informative to me (I only sort of tracked the previous conversation). I'm now curious to learn about situations where we don't want to split up these chunks, but I suppose that that's already covered in conversation on the issue, and I can go and read there.

@TomAugspurger (Member, Author) Sep 25, 2020

Changed to 5x, it's pretty arbitrary.

> I'm now curious to learn about situations where we don't want to split up these chunks

It sounds like there might be a few issues. For xarray, some operations in a Dataset require uniform chunks. Since the chunk splitting logic depends on the itemsize of the array, you could end up with a call where

ds = load_datasets()  # int64 and int32 arrays, uniform chunks
ds.sortby("column")  # eventually calls Array.__getitem__, splits according to itemsize
ds.<some_operation_requiring_uniform_chunks>  # raises, since they've split to different sizes
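To make the itemsize dependence above concrete, here is a hypothetical byte-budget split (`rows_per_chunk` and the budget are invented for illustration; this is not dask's actual splitting code):

```python
import numpy as np

# Hypothetical sketch: split a row selection so each output chunk stays
# under a byte budget. Because the rows-per-chunk count depends on the
# dtype's itemsize, int32 and int64 arrays that started with identical
# chunking end up split into different chunk sizes.
def rows_per_chunk(n_cols, itemsize, budget_bytes=1024):
    return max(1, budget_bytes // (n_cols * itemsize))

n_cols = 16
assert rows_per_chunk(n_cols, np.dtype("int64").itemsize) == 8
assert rows_per_chunk(n_cols, np.dtype("int32").itemsize) == 16
```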

@TomAugspurger (Member, Author)

@jrbourbeau would you have a chance to glance over this before the release today?

@mrocklin (Member)

I'm going to go ahead and merge this for now. It should restore the old behavior, which will hopefully resolve issues downstream with xarray. However, in general I agree with the previous change that tried to split up these large blocks. I think that we might start applying pressure on downstream libraries to either be robust to uneven block sizes, or to explicitly rechunk in their algorithms if that is necessary. I think that the warning in this PR is the right amount of pressure. I suspect that this is going to be a continued conversation.

Thanks for responding to this issue rapidly @TomAugspurger . Merging.

@mrocklin mrocklin merged commit 588a212 into dask:master Sep 25, 2020
@jrbourbeau (Member)

Thanks for your work on this @TomAugspurger and reviewing @mrocklin!

@JSKenyon (Contributor)

Thanks @TomAugspurger and @mrocklin! Really appreciate this.

@TomAugspurger TomAugspurger deleted the index-slice-warn branch September 30, 2020 02:15
@TomAugspurger TomAugspurger restored the index-slice-warn branch September 30, 2020 02:15
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020

Development

Successfully merging this pull request may close these issues.

Changes to chunking behaviour in dask==2.26.0

4 participants