Convention for automatic rechunking #3587

@mrocklin

There are a few operations where we can (and possibly should) meaningfully rechunk arrays for users:

  1. atop operations that include outer products, like tensordot
  2. linear algebra operations like qr and svd
  3. fft, where we need active dimensions to have a single chunk
  4. apply_gufunc, where we need core dimensions to have a single chunk
  5. from_array ?
  6. ...

In all of these cases we generally want a few dimensions to have as few chunks as possible (often one), and sometimes want chunks along the other dimensions to be as large as possible, subject to chunk-size memory limits and to keeping enough chunks for some parallelism. These choices can strongly affect performance.
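As a rough illustration of that trade-off, here is a minimal pure-Python sketch. `plan_chunks` and its parameters are hypothetical names made up for this example, not dask API, and the greedy strategy is only one possible choice. It keeps the "core" axes in a single chunk and then grows the remaining axes as far as a byte budget allows. It does not try to preserve parallelism, which a real implementation would also need to weigh.

```python
import math


def plan_chunks(shape, itemsize, core_axes, limit_bytes):
    """Pick one chunk length per axis (hypothetical helper, not dask API).

    Axes in ``core_axes`` get a single chunk spanning the whole axis.
    The remaining axes are made as long as possible, trailing axis first,
    while one chunk stays within ``limit_bytes`` when that is achievable.
    """
    chunks = list(shape)
    # Number of elements in one full slab across the core axes.
    core_elems = math.prod(shape[ax] for ax in core_axes)
    # How many such slabs fit in the byte budget (at least one).
    budget = max(limit_bytes // (itemsize * core_elems), 1)
    free_axes = [ax for ax in range(len(shape)) if ax not in core_axes]
    for ax in reversed(free_axes):
        chunks[ax] = min(shape[ax], budget)
        budget = max(budget // chunks[ax], 1)
    return tuple(chunks)


# float64 array of shape (10000, 10000); axis 1 must be a single chunk,
# and we aim for chunks of at most 10 MB.
print(plan_chunks((10000, 10000), 8, {1}, 10_000_000))  # (125, 10000)
```

Each resulting chunk here holds 125 * 10000 * 8 bytes, which is exactly the 10 MB budget.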

For some operations like fft we currently raise informative errors that encourage users to do the right thing. For other operations like atop we don't necessarily want to raise an error, but we do want to encourage users to do the right thing. We might choose to do some of this automatically. If so, we should do it consistently across all operations to avoid inconsistent APIs.
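For reference, this is roughly what "doing the right thing" looks like by hand today in the fft case. The array shape and chunk sizes are arbitrary values chosen for the example; `rechunk` with `-1` for an axis asks for a single chunk spanning that axis.

```python
import dask.array as da

# A 2-d array split into four chunks along each axis.
x = da.ones((1000, 1000), chunks=(250, 250))

# Taking the FFT along axis 1 is refused at this point, because that
# axis is split into several chunks. Rechunk so that axis 1 is a single
# chunk; axis 0 keeps its existing chunking.
y = x.rechunk({1: -1})
print(y.chunks)  # ((250, 250, 250, 250), (1000,))

result = da.fft.fft(y, axis=1)
```

An automatic version would perform the `rechunk` step inside the operation instead of asking the user to do it.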

I'll propose a rechunk= keyword with the following semantics:

  1. rechunk=False does no rechunking.
  2. rechunk=True tells the operation to rechunk automatically however it sees fit.
  3. rechunk='10MB' (or another size string) tells the operation to rechunk automatically however it sees fit, targeting chunk sizes of 10MB. Without an explicit size we default to dask.config.get('array.chunk-size'), as is done now.

I'll also suggest that we leave it to the operations whether or not they choose to do automatic rechunking.
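To make the three cases concrete, here is a minimal sketch of how an operation might normalize the keyword. Everything in it is hypothetical: `resolve_rechunk` and `parse_size` are names invented for this example, and the `"128MiB"` default is a stand-in for whatever `dask.config.get('array.chunk-size')` returns. dask has its own size parser, `dask.utils.parse_bytes`; the small one below only keeps the sketch free of dependencies.

```python
# Stand-in for dask.config.get('array.chunk-size'); assumed value.
DEFAULT_CHUNK_SIZE = "128MiB"

_UNITS = {
    "B": 1,
    "kB": 10**3, "MB": 10**6, "GB": 10**9,
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30,
}


def parse_size(text):
    """Convert a string such as '10MB' to a number of bytes."""
    text = text.strip()
    # Try longer suffixes first so 'MiB' is not mistaken for 'B'.
    for unit in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * _UNITS[unit])
    raise ValueError(f"Unrecognized size string: {text!r}")


def resolve_rechunk(rechunk, default=DEFAULT_CHUNK_SIZE):
    """Return None for "do not rechunk", else a target chunk size in bytes."""
    if rechunk is False:
        return None
    if rechunk is True:
        return parse_size(default)
    if isinstance(rechunk, str):
        return parse_size(rechunk)
    raise TypeError("rechunk= must be a bool or a size string like '10MB'")


print(resolve_rechunk(False))   # None
print(resolve_rechunk(True))    # 134217728
print(resolve_rechunk("10MB"))  # 10000000
```

An operation that opts in would call something like this once and, when the result is not None, use it as the byte limit for its own choice of chunks.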
