Avoid large graphs through custom MutableMappings #1763

@mrocklin

Description

For very large computations, such as those that sometimes occur in Dask.array, we run into a limit on graph size. This might happen, for example, when we have hundreds of thousands of chunks.

On the scheduler side I suspect that this limitation will continue into the medium-term future at least. However, there might be some things we can do on the client side to avoid frequently creating and copying many large dictionaries. Both approaches below involve relaxing the type of the .dask attribute from dict to MutableMapping and constructing clever mutable mappings that are more efficient in our case.

  1. We could construct a MutableMapping that is a union of several dictionaries. This is roughly what we do with dask.delayed to avoid overzealous copying there. Making a "copy" generally consists of making a shallow copy of a list of dicts. Updating is just appending to this list. Many MutableMapping operations become slower than constant time, but we rarely care about these operations.
  2. We could build a programmatic MutableMapping. Some of our dicts are fairly simple dict comprehensions. We could encode these more efficiently with some programmatic structure. For dask.array in particular, significant effort could also go into special dicts for atop, which would be difficult but would also cover a large range of cases.
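The first idea above can be sketched as follows. This is a hypothetical illustration (class and method names are mine, not dask's), similar in spirit to `collections.ChainMap`: the mapping is backed by a list of dicts, so a "copy" is a shallow copy of that list and an update appends a new dict, at the cost of lookups and iteration no longer being constant time.

```python
from collections.abc import MutableMapping

class UnionDict(MutableMapping):
    """Sketch of a MutableMapping that is a union of several dicts.

    Later dicts in the list shadow earlier ones. Copying and updating
    are cheap; lookups scan the list, so they are no longer O(1).
    """

    def __init__(self, *dicts):
        self.dicts = list(dicts) or [{}]

    def __getitem__(self, key):
        # Scan from newest to oldest so recent updates win.
        for d in reversed(self.dicts):
            if key in d:
                return d[key]
        raise KeyError(key)

    def __setitem__(self, key, value):
        # Mutate only the most recent layer.
        self.dicts[-1][key] = value

    def __delitem__(self, key):
        found = False
        for d in self.dicts:
            if key in d:
                del d[key]
                found = True
        if not found:
            raise KeyError(key)

    def __iter__(self):
        seen = set()
        for d in self.dicts:
            for key in d:
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        return len(set().union(*self.dicts))

    def copy(self):
        # O(number of layers), not O(number of keys).
        return UnionDict(*self.dicts)

    def update_layer(self, other):
        # "Updating" appends a new layer instead of copying keys over.
        self.dicts.append(dict(other))
```

This mirrors what dask.delayed does to avoid overzealous copying: merging many task graphs costs time proportional to the number of graphs rather than the number of tasks.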

This would help avoid copies. It would possibly help to reduce client-scheduler communication. It would not help to reduce scheduling overhead.
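The second idea, a programmatic mapping, might look like the sketch below. All names here are hypothetical; it encodes a dict comprehension of the form `{(name, i): (func, (dep, i)) for i in range(n)}` in O(1) memory by generating entries on demand instead of materializing n tasks. (It is shown as a read-only Mapping for brevity; a full MutableMapping would additionally layer explicit overrides on top.)

```python
from collections.abc import Mapping

class ComprehensionDict(Mapping):
    """Sketch: encode {(name, i): (func, (dep, i)) for i in range(n)}
    without materializing the n entries."""

    def __init__(self, name, func, dep, n):
        self.name = name    # output key prefix
        self.func = func    # callable applied per chunk
        self.dep = dep      # input key prefix
        self.n = n          # number of chunks

    def __getitem__(self, key):
        # Compute the task tuple on demand rather than storing it.
        name, i = key
        if name == self.name and 0 <= i < self.n:
            return (self.func, (self.dep, i))
        raise KeyError(key)

    def __iter__(self):
        return ((self.name, i) for i in range(self.n))

    def __len__(self):
        return self.n
```

A structure like this could also serialize compactly for client-scheduler communication, since only `(name, func, dep, n)` needs to travel rather than n task tuples.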
