For very large computations, such as sometimes arise in dask.array, we run into a limit on graph size. This might occur, for example, when we have hundreds of thousands of chunks.
On the scheduler side I suspect that this limitation will persist into at least the medium-term future. However, there might be some things we can do on the client side to avoid frequently creating and copying many large dictionaries. Both of the ideas below involve relaxing the type of the .dask attribute from dict to MutableMapping and constructing clever mutable mappings that are more efficient for our case.
- We could construct a MutableMapping that is a union of several dictionaries. This is roughly what we do in dask.delayed to avoid overzealous copying there. Making a "copy" then consists of shallow-copying a list of dicts, and updating is just appending to that list. Many MutableMapping operations become slower than constant time, but we rarely care about those operations. A sketch of this idea appears after this section.
- We could build a programmatic MutableMapping. Some of our dicts are fairly simple dict comprehensions, which we could encode more compactly with some programmatic structure. For dask.array in particular, a larger effort could go into special dicts for atop, which would be difficult but would also cover a wide range of cases. A sketch of this idea also appears below.
Either approach would help avoid copies. It might also help reduce client-scheduler communication. It would not help reduce scheduling overhead.
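A minimal sketch of the union-of-dicts idea, using only the standard library. The class name `MergedMapping` and its exact semantics (later dicts shadow earlier ones, writes land in the most recent dict) are illustrative assumptions, not the actual dask.delayed machinery:

```python
from collections.abc import MutableMapping


class MergedMapping(MutableMapping):
    """Hypothetical union of several dicts; later dicts shadow earlier ones."""

    def __init__(self, *dicts):
        self.dicts = list(dicts) or [{}]

    def __getitem__(self, key):
        # Linear in the number of dicts, not in the number of keys
        for d in reversed(self.dicts):
            if key in d:
                return d[key]
        raise KeyError(key)

    def __setitem__(self, key, value):
        # All writes land in the most recently appended dict
        self.dicts[-1][key] = value

    def __delitem__(self, key):
        found = False
        for d in self.dicts:
            if key in d:
                del d[key]
                found = True
        if not found:
            raise KeyError(key)

    def __iter__(self):
        seen = set()
        for d in self.dicts:
            for key in d:
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        return len(set().union(*self.dicts))

    def copy(self):
        # A "copy" shallow-copies the list of dicts and appends a fresh
        # dict to absorb future writes; no keys are ever copied
        return MergedMapping(*self.dicts, {})
```

With this, `MergedMapping({'x': 1}, {'y': 2}).copy()` costs O(number of dicts) rather than O(number of keys), and updating the copy never touches the underlying dicts.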
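And a minimal sketch of a programmatic mapping. A read-only Mapping suffices to show the idea; the name `BlockwiseMapping`, the key layout `('x', i) -> (func, ('y', i))`, and the constructor arguments are all made up for illustration and are not how atop actually structures its dicts. Such an object could be wrapped in the union mapping above to absorb any mutations:

```python
from collections.abc import Mapping


class BlockwiseMapping(Mapping):
    """Hypothetical mapping that encodes
    {('x', i): (func, ('y', i)) for i in range(n)}
    in O(1) memory rather than materializing n entries."""

    def __init__(self, func, n, name="x", dep="y"):
        self.func = func
        self.n = n
        self.name = name
        self.dep = dep

    def __getitem__(self, key):
        try:
            name, i = key
        except (TypeError, ValueError):
            raise KeyError(key)
        if name == self.name and isinstance(i, int) and 0 <= i < self.n:
            # Tasks are generated on demand, never stored
            return (self.func, (self.dep, i))
        raise KeyError(key)

    def __iter__(self):
        return ((self.name, i) for i in range(self.n))

    def __len__(self):
        return self.n


def inc(x):  # stand-in task function
    return x + 1


graph = BlockwiseMapping(inc, 100_000)
assert graph[("x", 42)] == (inc, ("y", 42))  # computed, not looked up
assert len(graph) == 100_000                 # no 100,000-entry dict anywhere
```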