-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
There are a few operations where we can (and possibly should) meaningfully rechunk arrays for users:
- atop operations that include outer products, like tensordot
- linear algebra operations like qr and svd
- fft, where we need active dimensions to have a single chunk
- apply_gufunc, where we need core dimensions to have a single chunk
- from_array ?
- ...
In all of these cases we generally want to make a few dimensions have as few chunks as possible (often one) and then sometimes make other dimensions as large as possible while respecting chunk size memory limits, while still enabling some parallelism. This can strongly affect performance.
For some operations like fft we currently raise informative errors that encourage users to do the right thing. For other operations like atop we don't necessarily want to err, but do want to encourage users to do the right thing. We might choose to do some of this automatically. If so, we should do it consistently across all operations in order to avoid API diversity.
I'll propose a rechunk= keyword with the following semantics:
rechunk=Falsedoes no rechunking.rechunk=Truetells the operation to do whatever it wants to perform automatic rechunkingrechunk='10MB'or other string tells the operation to do whatever it wants to perform automatic rechunking, targetting chunk sizes of 10MB. Otherwise we default todask.config.get('array.chunk-size')as is done now.
I'll also suggest that we leave it to the operations whether or not they choose to do automatic rechunking.