Convention for automatic rechunking

There are a few operations where we can (and possibly should) meaningfully rechunk arrays for users:

1.  atop operations that include outer products, like tensordot
2.  linear algebra operations like qr and svd
3.  fft, where we need active dimensions to have a single chunk
4.  apply_gufunc, where we need core dimensions to have a single chunk
5.  from_array ?
6.  ...

In all of these cases we generally want to make a few dimensions have as few chunks as possible (often one) and then sometimes make other dimensions as large as possible while respecting chunk size memory limits, while still enabling some parallelism.  This can strongly affect performance.

For some operations like `fft` we currently raise informative errors that encourage users to do the right thing.  For other operations like `atop` we don't necessarily want to err, but do want to encourage users to do the right thing.  We might choose to do some of this automatically.  If so, we should do it consistently across all operations in order to avoid API diversity.

I'll propose a `rechunk=` keyword with the following semantics:

1. `rechunk=False` does no rechunking.
2.  `rechunk=True` tells the operation to do whatever it wants to perform automatic rechunking
3.  `rechunk='10MB'` or other string tells the operation to do whatever it wants to perform automatic rechunking, targetting chunk sizes of 10MB.  Otherwise we default to `dask.config.get('array.chunk-size')` as is done now.

I'll also suggest that we leave it to the operations whether or not they choose to do automatic rechunking.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Convention for automatic rechunking #3587

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Convention for automatic rechunking #3587

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions