Scheduler fail case: centering data with dask.array

A common use case for many modeling problems (e.g., in machine learning or climate science) is to center data by subtracting an average of some kind over a given axis. The dask scheduler currently falls flat on its face when attempting to schedule these types of problems.

Here's a simple example of such a fail case:

``` python
import dask.array as da
x = da.ones((8, 200, 200), chunks=(1, 200, 200))  # e.g., a large stack of image data
mad = abs(x - x.mean(axis=0)).mean()
mad.visualize()
```

![image](https://cloud.githubusercontent.com/assets/1217238/11696975/18d78eb2-9e6c-11e5-8ca1-8741ca2e38a9.png)

The scheduler will keep each of the initial chunks in memory that it uses to compute the mean, because they will be used later to as an argument to `sub`. In contrast, the appropriate way to handle this graph to avoid blowing up memory would be to compute the initial chunks twice.

I know that in principle this could be avoided by using an on-disk cache. But this seems like a waste, because the initial values are often sitting in a file on disk, anyways.

This is a pretty typical use case for dask.array (one of the first things people try with xray), so it's worth seeing if we can come up with a solution that works by default.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Scheduler fail case: centering data with dask.array #874

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Scheduler fail case: centering data with dask.array #874

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions