Skip to content

Very large dask arrays #3514

@mrocklin

Description

@mrocklin

When dealing with very large datasets sometimes the task graph handling becomes a bottleneck. What are some ways in which this can be addressed medium-to-long term.

As an example consider a petabyte array cut into gigabyte chunks. This means that each elementwise operation on the array generates a million new tasks. If each task takes up something like 1ms of overhead, then each operation generates around 16 minutes of overhead on the client and scheduler (assuming distributed for the time being). If we're doing operations with a hundred or so operations then we're waiting a day or two with just overhead.

How can we address this situation?

  1. Reduce per-task overhead: 1ms today is actually fairly pessimistic, we're closer to 200-300us, but this requires some effort on our part to ensure that all tasks are cheaply serializable. This means no dynamic functions, or anything that strictly requires cloudpickle or generates a nontrivial amount of data in serialized form.
  2. Fuse tasks: Currently when we do 2 * x + 1 both the 2 * and + 1 operations generate a million tasks. This hurts both when we construct the dask arrays, and when we send them to the scheduler. Ideally we would pre-fuse operations like these, so that we only ever generated one million tasks that did both the 2 * and + 1. This requires that the things we operate on at first to behave differently than dask.arrays as they are currently designed. We need some sort of higher-level array construct. This might either be a fully symbolic array (simliar to SymPy) or it might be a modification to dask.array to support atop fusion.
  3. Overlap overhead with computation: the 16 minutes of overhead might not be that big of a deal if the computation itself is large. In this case it's mostly a UX problem as people feel concerned that their computation hasn't started yet. If we're able to break apart the graph into reasonable chunks then we might be able to stream the graph to the scheduler in pieces. Doing this cheaply and reliably is likely to be very hard though.
  4. Use much larger tasks: Everything becomes easier if we make our tasks larger. Memory and compute power continues to grow.

Metadata

Metadata

Assignees

No one assigned

    Labels

    arraydiscussionDiscussing a topic with no specific actions yet

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions