Dask Array graph optimization functions #2472

@jakirkham

Sometimes, when constructing a computation with Dask Arrays, dead ends show up in the graph. While it is true that these get removed at computation time, it is nice to be able to do some cleanup periodically to keep the Dask Graph size reasonable. This cleanup is particularly nice if we know a dead end will show up (e.g. after slicing).

Previously this happened automatically with slicing, but it proved problematic in general (#1732). A reasonable alternative would be to provide these optimization operations directly to the user. That way it is up to them to make the appropriate decision.

While it is true that there are optimization functions for Dask graphs, it remains unclear (at least to me) how one applies these to an array in general outside of Dask. To get the relevant keys, one must call _keys, which appears to be part of the Private API. Trying to get at this from the Public API does not appear to be straightforward. Even once one performs this sort of optimization, there remains the question of how to get the resulting Dask Graph back into a Dask Array.

Below is what I found works to get cull to act on a Dask Array. However, this requires using the private API to get the job done. A reasonable solution to this problem would be to create a wrapper function using a workflow like the one below (with any other things I may have missed) and add the wrapper to the private API. Then every function in dask.optimize can be wrapped with this wrapper function and added to the public API.

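import dask
import dask.array
import dask.array.core
import dask.optimize
import dask.sharedict

# Example array split into four 5x6 chunks.
d = dask.array.ones((10, 12), chunks=(5, 6))

# cull returns (graph, dependencies); keep only the culled graph.
sd = dask.sharedict.ShareDict()
sd.update(dask.optimize.cull(d.dask, d._keys())[0])

# Rebuild an array around the culled graph (private constructor usage).
do = dask.array.core.Array(sd, d.name, d.chunks, d.dtype)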
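
For anyone unsure what a "dead end" means here, this is a rough, Dask-free sketch of what culling does to a plain dict graph. It is not Dask's implementation: `simple_cull`, `_is_key`, and `ones` are hypothetical names made up for illustration, and dependency detection is simplified to "any task argument that is also a key of the graph".

```python
def _is_key(obj, graph):
    """Return True if obj is a key of graph (hypothetical helper)."""
    try:
        return obj in graph
    except TypeError:  # unhashable argument, so it cannot be a key
        return False


def simple_cull(graph, wanted):
    """Return a new dict holding only the tasks that `wanted` depends on."""
    kept = {}
    stack = list(wanted)
    while stack:
        key = stack.pop()
        if key in kept:
            continue
        task = graph[key]
        kept[key] = task
        # Simplification: a task is a tuple (func, arg, ...), and any argument
        # that is itself a key of the graph counts as a dependency.
        if isinstance(task, tuple):
            stack.extend(arg for arg in task[1:] if _is_key(arg, graph))
    return kept


def ones(n):
    """Stand-in for a chunk-producing function."""
    return [1] * n


graph = {
    ("x", 0): (ones, 3),
    ("x", 1): (ones, 3),        # dead end: nothing requested depends on it
    ("y", 0): (sum, ("x", 0)),
}

culled = simple_cull(graph, [("y", 0)])
print(sorted(culled))  # [('x', 0), ('y', 0)]
```

After slicing, the keys of the resulting array play the role of `wanted`, and the chunks that fall outside the slice are the dead ends that get dropped.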
