Partially revert chunk-splitting in indexing. #6665

Merged
mrocklin merged 5 commits into dask:master from TomAugspurger:index-slice-warn on Sep 25, 2020
Conversation

@TomAugspurger (Member)

This partially reverts the changes made in
#6514. It restores the old behavior
with a warning that large chunks (10x array.chunk-size) are being
produced.

Additionally, it adds a new config option to control the behavior
(array.slicing.split-large-chunks). Setting that to False silences
the warnings and keeps the "old" behavior (one output block per input
block touched, even if this makes a large output). Setting that to
True silences the warning and restores the Dask 2.26 behavior of
splitting.

Closes #6646

cc @JSKenyon from the issue. @jrbourbeau this would be nice to include in the 2.28 release if we're able to get it done today or tomorrow.
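For readers following along, the three settings described above could be exercised roughly like this (a minimal sketch using the config key introduced by this PR; the warning class itself is not shown here):

```python
import dask

# Default (option unset): old one-output-block-per-input-block behavior,
# plus a warning when a much larger output chunk is produced.

# False: keep the old behavior and silence the warning.
with dask.config.set({"array.slicing.split-large-chunks": False}):
    assert dask.config.get("array.slicing.split-large-chunks") is False

# True: silence the warning and split large chunks (the Dask 2.26 behavior).
with dask.config.set({"array.slicing.split-large-chunks": True}):
    assert dask.config.get("array.slicing.split-large-chunks") is True
```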

@mrocklin (Member) left a comment

Some minor comments about typos

>>> chunks, dsk = take('y', 'x', [(1, 1, 1), (1000, 1000), (1000, 1000)],
... [0] + [1] * 6 + [2], axis=0, itemsize=8)
>>> import dask
>>> with dask.config.set(**{"array.slicing.split-large-chunks": True}):
@mrocklin (Member)
Suggested change
>>> with dask.config.set(**{"array.slicing.split-large-chunks": True}):
>>> with dask.config.set({"array.slicing.split-large-chunks": True}):

I think that dask.config.set is smart enough to handle being given a dict as a positional arg
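As a quick sanity check on this suggestion (assuming a dask installation is available), both spellings set the same option:

```python
import dask

# Positional-dict form, as suggested in the review:
with dask.config.set({"array.slicing.split-large-chunks": True}):
    assert dask.config.get("array.slicing.split-large-chunks") is True

# The keyword-unpacking form from the original docstring is equivalent:
with dask.config.set(**{"array.slicing.split-large-chunks": True}):
    assert dask.config.get("array.slicing.split-large-chunks") is True
```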

- bool
- "null"
description: |
How to large chunks created when slicing Arrays. By default a
@mrocklin (Member)
Suggested change
How to large chunks created when slicing Arrays. By default a
How to split large chunks created when slicing Arrays. By default a

>>> a = da.ones((4, 10000, 10000), chunks=(1, -1, -1))

If we slice that with a *sorted* sequence of integers, Dask will return one chunk
per intput chunk
@mrocklin (Member)
Suggested change
per intput chunk
per input chunk

Previously we had a chunksize of ``1`` along the first dimension. But we've
selected 15 elements from that first chunk, producing a large output chunk.

Dask warns when indexing like this produces a chunk that's 10x larger
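The situation in the doc excerpt can be reproduced at a smaller scale (the array sizes below are illustrative stand-ins for the doc's 10000 x 10000 example, chosen so the snippet runs quickly):

```python
import dask.array as da

# One input chunk per element along the first axis, as in the doc example.
a = da.ones((4, 100, 100), chunks=(1, -1, -1))

# Select 15 elements from the first input chunk. Without splitting, all
# 15 rows land in a single output chunk along axis 0, 15x the input chunk
# size; depending on the config option this warns or splits.
b = a[[0] * 15]
assert b.shape == (15, 100, 100)
assert sum(b.chunks[0]) == 15
```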
@mrocklin (Member)
Maybe 2x or 5x would be better? 10x seems like a large chunk to me.

@mrocklin (Member)
Thank you for the doc by the way. This was very informative to me (I only sort of tracked the previous conversation). I'm now curious to learn about situations where we don't want to split up these chunks, but I suppose that that's already covered in conversation on the issue, and I can go and read there.

@TomAugspurger (Member, Author) Sep 25, 2020

Changed to 5x, it's pretty arbitrary.

> I'm now curious to learn about situations where we don't want to split up these chunks

It sounds like there might be a few issues. For xarray, some operations in a Dataset require uniform chunks. Since the chunk splitting logic depends on the itemsize of the array, you could end up with a call where

ds = load_datasets()  # int64 and int32 arrays, uniform chunks
ds.sortby("column")  # eventually calls Array.__getitem__, splits according to itemsize
ds.<some_operation_requiring_uniform_chunks>  # raises, since they've split to different sizes
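To make the itemsize dependence above concrete, here is a hypothetical byte-budget split (`rows_per_chunk` and the budget are invented for illustration; this is not dask's actual splitting code):

```python
import numpy as np

# Hypothetical sketch: split a row selection so each output chunk stays
# under a byte budget. Because the rows-per-chunk count depends on the
# dtype's itemsize, int32 and int64 arrays that started with identical
# chunking end up split into different chunk sizes.
def rows_per_chunk(n_cols, itemsize, budget_bytes=1024):
    return max(1, budget_bytes // (n_cols * itemsize))

n_cols = 16
assert rows_per_chunk(n_cols, np.dtype("int64").itemsize) == 8
assert rows_per_chunk(n_cols, np.dtype("int32").itemsize) == 16
```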

@TomAugspurger (Member, Author)

@jrbourbeau would you have a chance to glance over this before the release today?

@mrocklin (Member)

I'm going to go ahead and merge this for now. It should restore the old behavior, which will hopefully resolve issues downstream with xarray. However, in general I agree with the previous change that tried to split up these large blocks. I think that we might start applying pressure on downstream libraries to either be robust to uneven block sizes, or to explicitly rechunk in their algorithms if that is necessary. I think that the warning in this PR is the right amount of pressure. I suspect that this is going to be a continued conversation.

Thanks for responding to this issue rapidly @TomAugspurger . Merging.

@mrocklin mrocklin merged commit 588a212 into dask:master Sep 25, 2020
@jrbourbeau (Member)

Thanks for your work on this @TomAugspurger and reviewing @mrocklin!

@JSKenyon (Contributor)

Thanks @TomAugspurger and @mrocklin! Really appreciate this.

@TomAugspurger TomAugspurger deleted the index-slice-warn branch September 30, 2020 02:15
@TomAugspurger TomAugspurger restored the index-slice-warn branch September 30, 2020 02:15
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020

Development

Successfully merging this pull request may close these issues.

Changes to chunking behaviour in dask==2.26.0

4 participants