Description
I've got an idea for an optimization that I'd like to bounce off some core developers, because it would probably have a fairly significant impact on multiple areas of dask. I'd guess @mrocklin and @jcrist would have an interest; is there anyone else who should be cc'd? Is this the appropriate forum for this sort of discussion?
Background
For a while I've been gradually trying to improve the performance of dask for slicing out a small piece of a large dask array and computing it. In our application we present an interface somewhat like h5py does, with a huge virtual array that can be indexed to load data from it; but with dask used under the hood to handle all the messy details of lazily loading the right chunks and stitching bits of them together.
Ideally, I'd like to reach a point where `arr[index].compute()` requires time proportional only to the size of `arr[index]`, not of `arr` (logarithmic scaling in `arr` would be acceptable). Bugs like #5913 show that I'm not the only one running into performance problems here. I think with #5916 the indexing part should finally scale well, but the computation is still a problem.
Array optimisation has a few steps that are slow (i.e. that scale with the number of chunks in the original array):
- After high-level optimisations, it calls ensure_dict. This copies all the keys from the HighLevelGraph into a flat dict.
- Almost any operation (like `__iter__`, `__contains__` or `__getitem__`) on a Blockwise will cause it to instantiate its underlying graph. It's cached, so that's fine if it was a Blockwise in the graph of `arr`, but if it's a transient Blockwise generated by `optimize_blockwise` it will occur every time. `fuse_roots` also turns Blockwises into their underlying graphs. I haven't thought about this problem yet, so I won't discuss it here.
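To make the flattening cost concrete, here is a simplified sketch of what the `ensure_dict` step amounts to. This is not dask's actual implementation; `ensure_dict_sketch` and the toy layers are made up for illustration. The point is that every key in every layer is copied, so the cost is proportional to the total number of tasks even when only a few output keys are needed:

```python
def ensure_dict_sketch(hlg_layers):
    """Flatten a toy HighLevelGraph-like structure into a plain dict.

    hlg_layers: mapping of layer name -> mapping of key -> task.
    Touches every key in every layer, regardless of what is needed.
    """
    flat = {}
    for layer in hlg_layers.values():
        flat.update(layer)
    return flat

# Toy graph: a "load" layer and an "inc" layer with 4 chunks each.
layers = {
    "a": {("a", i): ("load", i) for i in range(4)},
    "b": {("b", i): ("inc", ("a", i)) for i in range(4)},
}
graph = ensure_dict_sketch(layers)
```

Even if we then only want `("b", 2)`, all 8 tasks have already been copied.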
Proposal
I would like to combine the operations of cull and ensure_dict into a single operation that extracts only the necessary keys from a HLG. In fact, cull can already work from a HLG (since it is a Mapping), so I'm not entirely sure why ensure_dict is needed. I'm guessing that key lookups on a HLG are slower because the key has to be searched in every layer, so doing ensure_dict early speeds things up in general. So if we walk the dependencies on the HLG we'd need to speed up lookup: can we assume that keys always either match the layer name or are a tuple starting with the layer name? #5824 and #5850 suggest we might not be quite there yet. We could possibly also use the dependency information in the HLG to restrict the search when chasing dependencies.
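Assuming the key convention holds (a key either equals its layer's name or is a tuple whose first element is the layer name, which #5824 and #5850 suggest is not yet guaranteed), a combined cull-and-flatten could look roughly like this. All names here are hypothetical, and the dependency chasing is deliberately naive (it treats any tuple whose first element is a layer name as a key):

```python
def layer_of(key):
    # Assumed convention: key is "name" or ("name", i, j, ...).
    return key[0] if isinstance(key, tuple) else key

def cull_to_dict(hlg_layers, wanted):
    """Extract only the keys reachable from `wanted` into a flat dict.

    Looks each key up directly in its own layer instead of scanning
    all layers, so cost scales with the culled output, not the graph.
    """
    out = {}
    stack = list(wanted)
    while stack:
        key = stack.pop()
        if key in out:
            continue
        task = hlg_layers[layer_of(key)][key]  # direct layer lookup
        out[key] = task
        # Naive dependency chase: any tuple element that looks like a key.
        for part in (task if isinstance(task, tuple) else ()):
            if isinstance(part, tuple) and part and part[0] in hlg_layers:
                stack.append(part)
    return out

# Toy graph with 100 chunks per layer; culling one output touches 2 keys.
layers = {
    "a": {("a", i): i for i in range(100)},
    "b": {("b", i): ("inc", ("a", i)) for i in range(100)},
}
culled = cull_to_dict(layers, [("b", 3)])
```

A real implementation would presumably use per-layer dependency information rather than inspecting task tuples, but the shape of the walk would be similar.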
That leads to my next proposal: defining some protocol for graph layers, with a default implementation for plain-dict layers, and a custom implementation for Blockwise. My rough idea for the interface is that Blockwise (or any other layer class) would define a method that takes an iterable of keys, and returns a dict representing the graph for those keys (of those that are present) and a list of dependencies (which I'm guessing Blockwise can produce more efficiently than running get_dependencies on the returned graph). Then cull looks something like this:
- Start with `keys` equal to the needed keys.
- For each layer in topological order (from roots), ask the layer for its piece of the graph (passing `keys`). Use it to update the output graph, remove the subgraph keys from `keys`, and add the new dependencies to `keys`.
Possible variations would involve the protocol method modifying data structures in place to avoid double-copies, but that might be a more brittle interface.
As discussed above, that won't scale well with large numbers of layers because every key gets tested against every layer, so it would probably need to be refined to group keys into layers up front.
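The protocol and loop above, including the refinement of grouping keys by layer up front, might be sketched roughly as follows. Everything here is hypothetical (`DictLayer`, `cull_keys`, and the key convention are illustrative, not dask API), and layers are visited consumers-first so each layer's needed keys are known before it is asked for its subgraph:

```python
from collections import defaultdict

class DictLayer:
    """Default protocol implementation for a plain-dict layer."""
    def __init__(self, name, mapping, dep_fn):
        self.name = name
        self.mapping = mapping
        self.dep_fn = dep_fn  # task -> iterable of dependency keys

    def cull_keys(self, keys):
        """Return (subgraph, dependencies) for the requested keys."""
        sub = {k: self.mapping[k] for k in keys if k in self.mapping}
        deps = {d for task in sub.values() for d in self.dep_fn(task)}
        return sub, deps

def cull(layers_consumers_first, wanted):
    # Group needed keys by layer name up front (assumed convention:
    # key == name or key[0] == name), so each key is tested against
    # only one layer rather than every layer.
    by_layer = defaultdict(set)
    def assign(keys):
        for k in keys:
            name = k[0] if isinstance(k, tuple) else k
            by_layer[name].add(k)
    assign(wanted)
    out = {}
    # Visit each layer before the layers it depends on, so its needed
    # keys are fully known; its new dependencies are assigned for later.
    for layer in layers_consumers_first:
        sub, deps = layer.cull_keys(by_layer.get(layer.name, ()))
        out.update(sub)
        assign(deps)
    return out

# Toy example: a Blockwise-like layer "b" depending on a data layer "a".
dep_fn = lambda task: [p for p in (task if isinstance(task, tuple) else ())
                       if isinstance(p, tuple)]
a_layer = DictLayer("a", {("a", i): i for i in range(5)}, dep_fn)
b_layer = DictLayer("b", {("b", i): ("inc", ("a", i)) for i in range(5)}, dep_fn)
graph = cull([b_layer, a_layer], [("b", 2)])
```

In this sketch a Blockwise layer would provide its own `cull_keys` that builds the subgraph and dependency list directly from its blockwise structure, avoiding both full materialisation and `get_dependencies`.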
Issues
My main concern is that while this will improve scaling for one use case, it may significantly increase the overheads in the general case where most or all of the graph is needed for the computation. Flattening things into plain dicts is slow up front but then operations on the plain dict are relatively very fast, and when doing an operation on every key in the graph the current approach will be hard to beat.
The other concern is how feasible it will be to enforce relationships between keys and layer names, given that third-party code may be building its own graphs. I think it will be fine for layers to have internal keys that are only referenced from inside the layer; it's only inter-layer keys that will need to match, and hopefully that's generally the case with our collections.