Avoid large graphs through custom MutableMappings #1763

@mrocklin

Description

For very large computations, such as those that sometimes occur in Dask.array, we run into a limit on graph size. This might happen, for example, when we have hundreds of thousands of chunks.

On the scheduler side I suspect that this limitation will continue into the medium-term future at least. However, there might be some things we can do on the client side to avoid frequently creating and copying many large dictionaries. Both approaches below involve relaxing the type of the .dask attribute from dict to MutableMapping and constructing clever mutable mappings that are more efficient in our case.

  1. We could construct a MutableMapping that is a union of several dictionaries. This is roughly what we do with dask.delayed to avoid overzealous copying there. Making a "copy" generally consists of making a shallow copy of a list of dicts. Updating is just appending to this list. Many MutableMapping operations become slower than constant time, but we rarely care about these operations.
  2. We could build a programmatic MutableMapping. Some of our dicts are fairly simple dict comprehensions. We could encode these more efficiently with some programmatic structure. For dask.array in particular, significant effort could also go into special dicts for atop, which would be difficult but would also cover a large range of cases.
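The first idea above can be sketched as follows. This is a hypothetical illustration (class and method names are mine, not dask's), similar in spirit to `collections.ChainMap`: the mapping is backed by a list of dicts, so a "copy" is a shallow copy of that list and an update appends a new dict, at the cost of lookups and iteration no longer being constant time.

```python
from collections.abc import MutableMapping

class UnionDict(MutableMapping):
    """Sketch of a MutableMapping that is a union of several dicts.

    Later dicts in the list shadow earlier ones. Copying and updating
    are cheap; lookups scan the list, so they are no longer O(1).
    """

    def __init__(self, *dicts):
        self.dicts = list(dicts) or [{}]

    def __getitem__(self, key):
        # Scan from newest to oldest so recent updates win.
        for d in reversed(self.dicts):
            if key in d:
                return d[key]
        raise KeyError(key)

    def __setitem__(self, key, value):
        # Mutate only the most recent layer.
        self.dicts[-1][key] = value

    def __delitem__(self, key):
        found = False
        for d in self.dicts:
            if key in d:
                del d[key]
                found = True
        if not found:
            raise KeyError(key)

    def __iter__(self):
        seen = set()
        for d in self.dicts:
            for key in d:
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        return len(set().union(*self.dicts))

    def copy(self):
        # O(number of layers), not O(number of keys).
        return UnionDict(*self.dicts)

    def update_layer(self, other):
        # "Updating" appends a new layer instead of copying keys over.
        self.dicts.append(dict(other))
```

This mirrors what dask.delayed does to avoid overzealous copying: merging many task graphs costs time proportional to the number of graphs rather than the number of tasks.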

This would help avoid copies. It would possibly help to reduce client-scheduler communication. It would not help to reduce scheduling overhead.
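The second idea, a programmatic mapping, might look like the sketch below. All names here are hypothetical; it encodes a dict comprehension of the form `{(name, i): (func, (dep, i)) for i in range(n)}` in O(1) memory by generating entries on demand instead of materializing n tasks. (It is shown as a read-only Mapping for brevity; a full MutableMapping would additionally layer explicit overrides on top.)

```python
from collections.abc import Mapping

class ComprehensionDict(Mapping):
    """Sketch: encode {(name, i): (func, (dep, i)) for i in range(n)}
    without materializing the n entries."""

    def __init__(self, name, func, dep, n):
        self.name = name    # output key prefix
        self.func = func    # callable applied per chunk
        self.dep = dep      # input key prefix
        self.n = n          # number of chunks

    def __getitem__(self, key):
        # Compute the task tuple on demand rather than storing it.
        name, i = key
        if name == self.name and 0 <= i < self.n:
            return (self.func, (self.dep, i))
        raise KeyError(key)

    def __iter__(self):
        return ((self.name, i) for i in range(self.n))

    def __len__(self):
        return self.n
```

A structure like this could also serialize compactly for client-scheduler communication, since only `(name, func, dep, n)` needs to travel rather than n task tuples.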
