Description
I've got an idea for an optimization that I'd like to bounce off some core developers, because it would probably have a fairly significant impact on multiple areas of dask. I'd guess @mrocklin and @jcrist would have an interest; is there anyone else who should be cc'd? Is this the appropriate forum for this sort of discussion?
Background
For a while I've been gradually trying to improve the performance of dask for slicing out a small piece of a large dask array and computing it. In our application we present an interface somewhat like h5py does, with a huge virtual array that can be indexed to load data from it; but with dask used under the hood to handle all the messy details of lazily loading the right chunks and stitching bits of them together.
Ideally, I'd like to reach a point where `arr[index].compute()` requires time proportional only to the size of `arr[index]`, not of `arr` (logarithmic scaling in `arr` would be acceptable). Bugs like #5913 show that I'm not the only one running into performance problems here. I think with #5916 the indexing part should finally scale well, but the computation is still a problem.
Array optimisation has a few steps that are slow (i.e. that scale with the number of chunks in the original array):
- After high-level optimisations, it calls ensure_dict. This copies all the keys from the HighLevelGraph into a flat dict.
- Almost any operation (like `__iter__`, `__contains__` or `__getitem__`) on a Blockwise will cause it to instantiate its underlying graph. It's cached, so that's fine if it was a Blockwise in the graph of `arr`, but if it's a transient Blockwise generated by `optimize_blockwise` it will occur every time. `fuse_roots` also turns Blockwises into their underlying graphs. I haven't thought about this problem yet, so I won't discuss it here.
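To make the flattening cost concrete, here is a simplified sketch of what the `ensure_dict` step amounts to. This is not dask's actual implementation; `ensure_dict_sketch` and the toy layers are made up for illustration. The point is that every key in every layer is copied, so the cost is proportional to the total number of tasks even when only a few output keys are needed:

```python
def ensure_dict_sketch(hlg_layers):
    """Flatten a toy HighLevelGraph-like structure into a plain dict.

    hlg_layers: mapping of layer name -> mapping of key -> task.
    Touches every key in every layer, regardless of what is needed.
    """
    flat = {}
    for layer in hlg_layers.values():
        flat.update(layer)
    return flat

# Toy graph: a "load" layer and an "inc" layer with 4 chunks each.
layers = {
    "a": {("a", i): ("load", i) for i in range(4)},
    "b": {("b", i): ("inc", ("a", i)) for i in range(4)},
}
graph = ensure_dict_sketch(layers)
```

Even if we then only want `("b", 2)`, all 8 tasks have already been copied.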
Proposal
I would like to combine the operations of cull and ensure_dict into a single operation that extracts only the necessary keys from a HLG. In fact, cull can already work from a HLG (since it is a Mapping), so I'm not entirely sure why ensure_dict is needed. I'm guessing that key lookups on a HLG are slower because the key has to be searched in every layer, so doing ensure_dict early speeds things up in general. So if we walk the dependencies on the HLG we'd need to speed up lookup: can we assume that keys always either match the layer name or are a tuple starting with the layer name? #5824 and #5850 suggest we might not be quite there yet. We could possibly also use the dependency information in the HLG to restrict the search when chasing dependencies.
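Assuming the key convention holds (a key either equals its layer's name or is a tuple whose first element is the layer name, which #5824 and #5850 suggest is not yet guaranteed), a combined cull-and-flatten could look roughly like this. All names here are hypothetical, and the dependency chasing is deliberately naive (it treats any tuple whose first element is a layer name as a key):

```python
def layer_of(key):
    # Assumed convention: key is "name" or ("name", i, j, ...).
    return key[0] if isinstance(key, tuple) else key

def cull_to_dict(hlg_layers, wanted):
    """Extract only the keys reachable from `wanted` into a flat dict.

    Looks each key up directly in its own layer instead of scanning
    all layers, so cost scales with the culled output, not the graph.
    """
    out = {}
    stack = list(wanted)
    while stack:
        key = stack.pop()
        if key in out:
            continue
        task = hlg_layers[layer_of(key)][key]  # direct layer lookup
        out[key] = task
        # Naive dependency chase: any tuple element that looks like a key.
        for part in (task if isinstance(task, tuple) else ()):
            if isinstance(part, tuple) and part and part[0] in hlg_layers:
                stack.append(part)
    return out

# Toy graph with 100 chunks per layer; culling one output touches 2 keys.
layers = {
    "a": {("a", i): i for i in range(100)},
    "b": {("b", i): ("inc", ("a", i)) for i in range(100)},
}
culled = cull_to_dict(layers, [("b", 3)])
```

A real implementation would presumably use per-layer dependency information rather than inspecting task tuples, but the shape of the walk would be similar.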
That leads to my next proposal: defining some protocol for graph layers, with a default implementation for plain-dict layers, and a custom implementation for Blockwise. My rough idea for the interface is that Blockwise (or any other layer class) would define a method that takes an iterable of keys, and returns a dict representing the graph for those keys (of those that are present) and a list of dependencies (which I'm guessing Blockwise can produce more efficiently than running get_dependencies on the returned graph). Then cull looks something like this:
- Start with `keys` equal to the needed keys.
- For each layer in topological order (from roots), ask the layer for its piece of the graph (passing `keys`). Use it to update the output graph, remove the subgraph keys from `keys`, and add the new dependencies to `keys`.
Possible variations would involve the protocol method modifying data structures in place to avoid double-copies, but that might be a more brittle interface.
As discussed above, that won't scale well with large numbers of layers because every key gets tested against every layer, so it would probably need to be refined to group keys into layers up front.
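The protocol and loop above, including the refinement of grouping keys by layer up front, might be sketched roughly as follows. Everything here is hypothetical (`DictLayer`, `cull_keys`, and the key convention are illustrative, not dask API), and layers are visited consumers-first so each layer's needed keys are known before it is asked for its subgraph:

```python
from collections import defaultdict

class DictLayer:
    """Default protocol implementation for a plain-dict layer."""
    def __init__(self, name, mapping, dep_fn):
        self.name = name
        self.mapping = mapping
        self.dep_fn = dep_fn  # task -> iterable of dependency keys

    def cull_keys(self, keys):
        """Return (subgraph, dependencies) for the requested keys."""
        sub = {k: self.mapping[k] for k in keys if k in self.mapping}
        deps = {d for task in sub.values() for d in self.dep_fn(task)}
        return sub, deps

def cull(layers_consumers_first, wanted):
    # Group needed keys by layer name up front (assumed convention:
    # key == name or key[0] == name), so each key is tested against
    # only one layer rather than every layer.
    by_layer = defaultdict(set)
    def assign(keys):
        for k in keys:
            name = k[0] if isinstance(k, tuple) else k
            by_layer[name].add(k)
    assign(wanted)
    out = {}
    # Visit each layer before the layers it depends on, so its needed
    # keys are fully known; its new dependencies are assigned for later.
    for layer in layers_consumers_first:
        sub, deps = layer.cull_keys(by_layer.get(layer.name, ()))
        out.update(sub)
        assign(deps)
    return out

# Toy example: a Blockwise-like layer "b" depending on a data layer "a".
dep_fn = lambda task: [p for p in (task if isinstance(task, tuple) else ())
                       if isinstance(p, tuple)]
a_layer = DictLayer("a", {("a", i): i for i in range(5)}, dep_fn)
b_layer = DictLayer("b", {("b", i): ("inc", ("a", i)) for i in range(5)}, dep_fn)
graph = cull([b_layer, a_layer], [("b", 2)])
```

In this sketch a Blockwise layer would provide its own `cull_keys` that builds the subgraph and dependency list directly from its blockwise structure, avoiding both full materialisation and `get_dependencies`.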
Issues
My main concern is that while this will improve scaling for one use case, it may significantly increase the overheads in the general case where most or all of the graph is needed for the computation. Flattening things into plain dicts is slow up front but then operations on the plain dict are relatively very fast, and when doing an operation on every key in the graph the current approach will be hard to beat.
The other concern is how feasible it will be to enforce relationships between keys and layer names, given that third-party code may be building its own graphs. I think it will be fine for layers to have internal keys that are only referenced from inside the layer; it's only inter-layer keys that will need to match, and hopefully that's generally the case with our collections.