
Culling massive Blockwise graphs is very slow, not constant-time #8570

@gjoseph92

Description


tl;dr: Could get_all_external_keys and get_output_keys be avoided in HLG culling?

In some workflows, it can be desirable to create a dask Array/DataFrame structure representing some full-size, enormous dataset, then immediately use slicing to sub-select out only a tiny part of it, then work with that. Now that we can use Blockwise for IO (xref #7417), this is an especially appealing pattern, because it should be constant-time to construct the massive graph, since nothing has to be materialized.
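To make the constant-time claim concrete, here's a plain-Python sketch (hypothetical class and names, not dask's actual API) of why Blockwise-style construction is cheap: the layer stores only its block structure symbolically, and per-chunk keys exist only if the layer is materialized.

```python
# Hypothetical sketch, not dask internals: a "blockwise-like" layer holds
# O(1) state no matter how many chunks it represents.

class SymbolicLayer:
    def __init__(self, name, numblocks):
        self.name = name
        self.numblocks = numblocks  # O(1) state, however many chunks

    def materialize(self):
        # Only this step pays O(numblocks); ideally culling never forces it.
        return {(self.name, i): ("load", i) for i in range(self.numblocks)}

layer = SymbolicLayer("load-landsat", 136_000_000)  # cheap: no keys created
```

The question in this issue is whether culling can preserve that property, or whether it inevitably triggers something like `materialize()` above.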

I had hoped it would also be linear-time to cull this massive graph, but it appears that it currently isn't.

Here's an example where I'm trying to create an xarray representing the Landsat-8 collection at full resolution over the entire continental US. This is a 10PB, 136-million-chunk array that involves 1.3 billion data loading tasks. Here I'm using gjoseph92/stackstac#116 (so the data loading graph is fully blockwise) and #8560 (so we know fuse_roots isn't materializing the graph unnecessarily).

[Screenshot: dask array repr of the full-resolution, 10PB, 136-million-chunk Landsat-8 stack]

Then I'm sub-selecting a single chunk out of those 136 million. Based on what I know of the graph, this should cull down to 4 tasks.

[Screenshot: dask array repr after sub-selecting a single chunk]

Here's the HLG for reference

You can see the first layer is materialized with ~100,000 tasks, but the big, 100-million-task one is Blockwise. (#8497 sure would be nice here!)

[Screenshot: HighLevelGraph repr showing the materialized ~100,000-task first layer and the 100-million-task Blockwise layer]

But when I try to optimize this graph, memory usage shoots up until the kernel crashes on a 32GB machine. Interrupting the kernel after a few seconds makes it pretty clear what's going on: HighLevelGraph.cull is calling get_all_external_keys, which forces the generation of all 1.3 billion keys (or 136 million keys? not sure).

[Screenshot: traceback after interrupting the kernel, showing HighLevelGraph.cull inside get_all_external_keys]
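The cost is easy to sketch without dask (illustrative names, not dask's internals): get_all_external_keys unions every layer's keys, so it's O(total keys in the graph) regardless of how few keys the cull actually needs.

```python
def get_all_external_keys(layers):
    # Union of every layer's keys: O(total number of keys in the graph),
    # even if the caller only wants to cull down to a handful of outputs.
    all_keys = set()
    for layer in layers:
        all_keys |= layer.keys()
    return all_keys

layers = [
    {("a", i): None for i in range(1_000)},  # stand-in for a materialized layer
    {("b", i): None for i in range(500)},    # stand-in for a Blockwise layer
]
ext = get_all_external_keys(layers)
# len(ext) == 1500: every key generated, even to select a single output
```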

Even if it didn't call get_all_external_keys, I see that HighLevelGraph.cull is still calling get_output_keys on every layer. For reference:

dask/dask/highlevelgraph.py, lines 944 to 970 at commit 358a5e3:

```python
keys_set = set(flatten(keys))
all_ext_keys = self.get_all_external_keys()
ret_layers = {}
ret_key_deps = {}
for layer_name in reversed(self._toposort_layers()):
    layer = self.layers[layer_name]
    # Let's cull the layer to produce its part of `keys`.
    # Note: use .intersection rather than & because the RHS is
    # a collections.abc.Set rather than a real set, and using &
    # would take time proportional to the size of the LHS, which
    # if there is no culling can be much bigger than the RHS.
    output_keys = keys_set.intersection(layer.get_output_keys())
    if output_keys:
        culled_layer, culled_deps = layer.cull(output_keys, all_ext_keys)
        # Update `keys` with all layer's external key dependencies, which
        # are all the layer's dependencies (`culled_deps`) excluding
        # the layer's output keys.
        external_deps = set()
        for d in culled_deps.values():
            external_deps |= d
        external_deps -= culled_layer.get_output_keys()
        keys_set |= external_deps
        # Save the culled layer and its key dependencies
        ret_layers[layer_name] = culled_layer
        ret_key_deps.update(culled_deps)
```
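Even setting aside get_all_external_keys, the per-layer intersection has the same asymptotic problem. A sketch (illustrative, not dask's classes) of why `keys_set.intersection(layer.get_output_keys())` is O(layer size) rather than O(requested keys):

```python
def get_output_keys(name, numblocks):
    # A blockwise-style layer can describe its keys symbolically, but this
    # call still has to enumerate every (name, i) key: O(numblocks).
    return {(name, i) for i in range(numblocks)}

keys_set = {("x", 7)}  # we want a single output chunk...
# ...yet computing the intersection first generates all 1,000 layer keys:
output_keys = keys_set.intersection(get_output_keys("x", 1_000))
# output_keys == {("x", 7)}
```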

Why is it necessary for HLG.cull to ask the layer for all of its keys, intersect them itself, and then pass the result back into the layer? And why do the layers' cull methods need to take all_hlg_keys?

I had thought the interface would be simply: HLG.cull tells each layer the necessary output keys; the layer figures out the rest on its own. If it needs to generate all its keys internally and do that intersection, fine, but for layers that don't need to do this, shouldn't the optimization be available?
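For the simplest case, a 1-to-1 (elementwise) blockwise layer, that interface could look roughly like this (a hypothetical sketch, not dask's actual Blockwise.cull, and real blockwise layers with contractions or broadcasts would need more bookkeeping):

```python
def cull_blockwise_1to1(name, dep_name, requested_keys):
    # Hypothetical: output block i of an elementwise layer depends only on
    # input block i, so the layer can map requested outputs straight to
    # required inputs in O(len(requested_keys)), never enumerating the
    # full layer or needing all_hlg_keys.
    return {(name, i): {(dep_name, i)} for (_, i) in requested_keys}

deps = cull_blockwise_1to1("y", "x", {("y", 7)})
# deps == {("y", 7): {("x", 7)}}
```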

@rjzamora @madsbk why does culling work this way? Would it be possible to write Blockwise culling without this, in a way that's truly linear-time to only the number of final keys?

cc @ian-r-rose @TomAugspurger
