Skip to content

Scheduler fail case: centering data with dask.array #874

@shoyer

Description

@shoyer

A common use case for many modeling problems (e.g., in machine learning or climate science) is to center data by subtracting an average of some kind over a given axis. The dask scheduler currently falls flat on its face when attempting to schedule these types of problems.

Here's a simple example of such a fail case:

import dask.array as da
x = da.ones((8, 200, 200), chunks=(1, 200, 200))  # e.g., a large stack of image data
mad = abs(x - x.mean(axis=0)).mean()
mad.visualize()

image

The scheduler will keep each of the initial chunks in memory that it uses to compute the mean, because they will be used later to as an argument to sub. In contrast, the appropriate way to handle this graph to avoid blowing up memory would be to compute the initial chunks twice.

I know that in principle this could be avoided by using an on-disk cache. But this seems like a waste, because the initial values are often sitting in a file on disk, anyways.

This is a pretty typical use case for dask.array (one of the first things people try with xray), so it's worth seeing if we can come up with a solution that works by default.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions