Skip to content

Avoid large chunks from getitem with lists#6514

Merged
mrocklin merged 7 commits intodask:masterfrom
TomAugspurger:6270-indexer-chunks
Sep 9, 2020
Merged

Avoid large chunks from getitem with lists#6514
mrocklin merged 7 commits intodask:masterfrom
TomAugspurger:6270-indexer-chunks

Conversation

@TomAugspurger
Copy link
Member

@TomAugspurger TomAugspurger commented Aug 14, 2020

This changes Array[list_of_indices] to ensure that we don't generate a task graph that contains too large chunks (larger than array.chunk-size). The basic idea is to

  1. generate the slicing plan as usual
  2. Find chunks that exceed array.chunk-size, and split those

If people familiar with slicing (like @jakirkham) have time I'd appreciate their thoughts on where I did this. It could possibly be done in either take or slicing_plan, but take seemed best since it has access to all the necessary information (the slices generated by slicing_plan, and the sizes of all the other chunks.

Closes #6270

Copy link
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a brief look. I don't understand this code well enough any more to have anything substantive to say though :/

@TomAugspurger
Copy link
Member Author

Thanks for the quick review. Just to clarify one thing: is

I don't understand this code well enough

referring to the changes here, or the slicing code in general. I'm happy to spend time cleaning up these changes to make things simpler if desired.

@mrocklin
Copy link
Member

mrocklin commented Aug 17, 2020 via email

@TomAugspurger
Copy link
Member Author

@jrbourbeau would you have a chance to glance through here?

@mrocklin
Copy link
Member

mrocklin commented Aug 18, 2020 via email

@TomAugspurger
Copy link
Member Author

Fixed the merge conflict. cc @jcrist if you have a chance to review this (no rush).

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me, just one small question.

# Check for chunks from the plan that would violate the user's
# configured chunk size.
nbytes = utils.parse_bytes(config.get("array.chunk-size"))
other_chunks = [chunks[i] for i in range(len(chunks)) if i != axis]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code path might never be hit with unknown chunk sizes, but if it is, are unknown chunks properly handled (and if so, could you add a test for this case)?

Copy link
Member Author

@TomAugspurger TomAugspurger Sep 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. It's not handled "correctly" in the sense that we avoid creating a large block, since we don't actually know what the size in bytes is. But it does at least compute, as long as you aren't slicing on an axis with unknown chunk sizes (that case hits #6586).

I've added a test to ensure that we at least cover this case.

@TomAugspurger
Copy link
Member Author

I think that Jim's comment has been addressed. cc @dask/maintenance.

@mrocklin
Copy link
Member

mrocklin commented Sep 9, 2020

Sorry for the long delay in response here @TomAugspurger . Let's go ahead and merge and hope for the best :)

@mrocklin mrocklin merged commit 99ea30d into dask:master Sep 9, 2020
TomAugspurger added a commit to TomAugspurger/dask that referenced this pull request Sep 24, 2020
This partially reverts the changes made in
dask#6514. It restores the old behavior
with a warning that large chunks (10x array.chunk-size) are being
produced.

Additionally, it adds a new config option to control the behavior
(`array.slicing.split-large-chunks`). Setting that to `False` silences
the warnings and keeps the "old" behavior (one output block per input
block touched, even if this makes a large output). Setting that to
`True` silences the warning and restores the Dask 2.26 behavior of
splitting.

Closes dask#6646
mrocklin pushed a commit that referenced this pull request Sep 25, 2020
This partially reverts the changes made in
#6514. It restores the old behavior
with a warning that large chunks (10x array.chunk-size) are being
produced.

Additionally, it adds a new config option to control the behavior
(`array.slicing.split-large-chunks`). Setting that to `False` silences
the warnings and keeps the "old" behavior (one output block per input
block touched, even if this makes a large output). Setting that to
`True` silences the warning and restores the Dask 2.26 behavior of
splitting.

Closes #6646
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
This partially reverts the changes made in
dask#6514. It restores the old behavior
with a warning that large chunks (10x array.chunk-size) are being
produced.

Additionally, it adds a new config option to control the behavior
(`array.slicing.split-large-chunks`). Setting that to `False` silences
the warnings and keeps the "old" behavior (one output block per input
block touched, even if this makes a large output). Setting that to
`True` silences the warning and restores the Dask 2.26 behavior of
splitting.

Closes dask#6646
@TomAugspurger TomAugspurger deleted the 6270-indexer-chunks branch September 7, 2021 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Array.__getitem__ should consider array.chunk-size for concrete indexers?

3 participants