Skip to content

P2P rechunking#9939

Merged
fjetter merged 11 commits intodask:mainfrom
hendrikmakait:rechunking
Feb 24, 2023
Merged

P2P rechunking#9939
fjetter merged 11 commits intodask:mainfrom
hendrikmakait:rechunking

Conversation

@hendrikmakait
Copy link
Copy Markdown
Member

@hendrikmakait hendrikmakait commented Feb 10, 2023

Sibling to dask/distributed#7534

@github-actions github-actions bot added the array label Feb 10, 2023
@hendrikmakait hendrikmakait marked this pull request as ready for review February 23, 2023 15:45
@hendrikmakait
Copy link
Copy Markdown
Member Author

For now, I'm just adding the keyword to rechunk(...) itself, no functions that use rechunk(...) under the hood.

threshold=None,
block_size_limit=None,
balance=False,
rechunk=None,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the name for this. Naming is hard. I see three options

  1. rechunk (as you are proposing)
  2. algo (more generic but also more descriptive)
  3. shuffle (not great but it would offer some consistency; probably the worst option :D )

Copy link
Copy Markdown
Member Author

@hendrikmakait hendrikmakait Feb 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've chosen 1. for consistency with the shuffle keyword, which I'm not a huge fan of either.

The most descriptive might be
4. rechunk_{impl|algo}

This would be more descriptive and generalizes well to other methods that use a rechunk under the hood, similar to how we use shuffle in other functions as well. impl might give us slightly more freedom (I could see different implementations of a P2P-like algorithm, e.g., an RDMA-based one).

In that case, I'd even go as far as proposing to rename the shuffle keyword.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I'd even go as far as proposing to rename the shuffle keyword.

That's also on the table (with a deprecation cycle, of course)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am -epsilon on "rechunk" and -lots on "shuffle". I think that "algorithm" or "implementation" ("rechunk_implementation" seems too long given the function name is already rechunk) are better choices.

I don't think that consistency with the shuffle keyword necessarily makes much sense: as noted, that's a bad choice too and it takes quite a lot of knowledge of the algorithm implementations to realise that these keywords are doing the same thing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's now algorithm as a keyword within rechunk, which should be named rechunk_algorithm within users of rechunk. In the config, it's array.rechunk.algorithm.

Copy link
Copy Markdown
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a change happened as I was reading (and your crystal ball is good!)

threshold=None,
block_size_limit=None,
balance=False,
rechunk=None,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am -epsilon on "rechunk" and -lots on "shuffle". I think that "algorithm" or "implementation" ("rechunk_implementation" seems too long given the function name is already rechunk) are better choices.

I don't think that consistency with the shuffle keyword necessarily makes much sense: as noted, that's a bad choice too and it takes quite a lot of knowledge of the algorithm implementations to realise that these keywords are doing the same thing.

if ndim <= 1 or not all(new_chunks) or any(has_nans):
# Trivial array / unknown dim => no need / ability for an intermediate
return steps + [new_chunks]
return [new_chunks]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe along with this cleanup, move the initialisation of steps to line 589. Since this short-circuiting also doesn't rely on any of the threshold and block_size_limit setup, you could fast-path this a bit earlier.

I would also inline the definitions of has_nans and ndim into the condition since those names are not subsequently used in the function.

if len(ndim) <= 1 or not all(new_chunks) or any(math.isnan(x) for x in chain.from_iterable(old_chunks)):
    ...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant to inline has_nans since it would turn this into a three-line if statements that arguable harder to read. I've taken care of the other suggestions though.

@fjetter fjetter mentioned this pull request Feb 24, 2023
4 tasks
@fjetter fjetter merged commit a58a2a7 into dask:main Feb 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants