Conversation
|
For now, I'm just adding the keyword to |
dask/array/core.py
Outdated
| threshold=None, | ||
| block_size_limit=None, | ||
| balance=False, | ||
| rechunk=None, |
There was a problem hiding this comment.
I don't love the name for this. Naming is hard. I see three options
- rechunk (as you are proposing)
- algo (more generic but also more descriptive)
- shuffle (not great but it would offer some consistency; probably the worst option :D )
There was a problem hiding this comment.
I've chosen 1. for consistency with the shuffle keyword, which I'm not a huge fan of either.
The most descriptive might be
4. rechunk_{impl|algo}
This would be more descriptive and generalizes well to other methods that use a rechunk under the hood, similar to how we use shuffle in other functions as well. impl might give us slightly more freedom (I could see different implementations of a P2P-like algorithm, e.g., an RDMA-based one).
In that case, I'd even go as far as proposing to rename the shuffle keyword.
There was a problem hiding this comment.
In that case, I'd even go as far as proposing to rename the shuffle keyword.
That's also on the table (with a deprecation cycle, of course)
There was a problem hiding this comment.
I am -epsilon on "rechunk" and -lots on "shuffle". I think that "algorithm" or "implementation" ("rechunk_implementation" seems too long given the function name is already rechunk) are better choices.
I don't think that consistency with the shuffle keyword necessarily makes much sense: as noted, that's a bad choice too and it takes quite a lot of knowledge of the algorithm implementations to realise that these keywords are doing the same thing.
There was a problem hiding this comment.
It's now algorithm as a keyword within rechunk, which should be named rechunk_algorithm within users of rechunk. In the config, it's array.rechunk.algorithm.
wence-
left a comment
There was a problem hiding this comment.
I see a change happened as I was reading (and your crystal ball is good!)
dask/array/core.py
Outdated
| threshold=None, | ||
| block_size_limit=None, | ||
| balance=False, | ||
| rechunk=None, |
There was a problem hiding this comment.
I am -epsilon on "rechunk" and -lots on "shuffle". I think that "algorithm" or "implementation" ("rechunk_implementation" seems too long given the function name is already rechunk) are better choices.
I don't think that consistency with the shuffle keyword necessarily makes much sense: as noted, that's a bad choice too and it takes quite a lot of knowledge of the algorithm implementations to realise that these keywords are doing the same thing.
| if ndim <= 1 or not all(new_chunks) or any(has_nans): | ||
| # Trivial array / unknown dim => no need / ability for an intermediate | ||
| return steps + [new_chunks] | ||
| return [new_chunks] |
There was a problem hiding this comment.
Maybe along with this cleanup, move the initialisation of steps to line 589. Since this short-circuiting also doesn't rely on any of the threshold and block_size_limit setup, you could fast-path this a bit earlier.
I would also inline the definitions of has_nans and ndim into the condition since those names are not subsequently used in the function.
if len(ndim) <= 1 or not all(new_chunks) or any(math.isnan(x) for x in chain.from_iterable(old_chunks)):
...There was a problem hiding this comment.
I'm hesitant to inline has_nans since it would turn this into a three-line if statements that arguable harder to read. I've taken care of the other suggestions though.
Sibling to dask/distributed#7534
pre-commit run --all-files