
Culling massive Blockwise graphs is very slow, not constant-time #8570

@gjoseph92

Description


tl;dr: Could get_all_external_keys and get_output_keys be avoided in HLG culling?

In some workflows, it can be desirable to create a dask Array/DataFrame structure representing some full-size, enormous dataset, then immediately use slicing to sub-select out only a tiny part of it, then work with that. Now that we can use Blockwise for IO (xref #7417), this is an especially appealing pattern, because it should be constant-time to construct the massive graph, since nothing has to be materialized.
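To make the constant-time claim concrete, here's a plain-Python sketch (hypothetical class and names, not dask's actual API) of why Blockwise-style construction is cheap: the layer stores only its block structure symbolically, and per-chunk keys exist only if the layer is materialized.

```python
# Hypothetical sketch, not dask internals: a "blockwise-like" layer holds
# O(1) state no matter how many chunks it represents.

class SymbolicLayer:
    def __init__(self, name, numblocks):
        self.name = name
        self.numblocks = numblocks  # O(1) state, however many chunks

    def materialize(self):
        # Only this step pays O(numblocks); ideally culling never forces it.
        return {(self.name, i): ("load", i) for i in range(self.numblocks)}

layer = SymbolicLayer("load-landsat", 136_000_000)  # cheap: no keys created
```

The question in this issue is whether culling can preserve that property, or whether it inevitably triggers something like `materialize()` above.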

I had hoped it would also be linear-time to cull this massive graph, but it appears that it currently isn't.

Here's an example where I'm trying to create an xarray representing the Landsat-8 collection at full resolution over the entire continental US. This is a 10PB, 136-million-chunk array that involves 1.3 billion data loading tasks. Here I'm using gjoseph92/stackstac#116 (so the data loading graph is fully blockwise) and #8560 (so we know fuse_roots isn't materializing the graph unnecessarily).

[Screenshot: dask array repr of the full-resolution, 10PB, 136-million-chunk Landsat-8 stack]

Then I'm sub-selecting a single chunk out of those 136 million. Based on what I know of the graph, this should cull down to 4 tasks.

[Screenshot: dask array repr after sub-selecting a single chunk]

Here's the HLG for reference

You can see the first layer is materialized with ~100,000 tasks, but the big, 100-million-task one is Blockwise. (#8497 sure would be nice here!)

[Screenshot: HighLevelGraph repr showing the materialized ~100,000-task first layer and the 100-million-task Blockwise layer]

But when I try to optimize this graph, memory usage shoots up until the kernel crashes on a 32GB machine. Interrupting the kernel after a few seconds makes it pretty clear what's going on: HighLevelGraph.cull is calling get_all_external_keys, which forces the generation of all 1.3 billion keys (or 136 million keys? not sure).

[Screenshot: traceback after interrupting the kernel, showing HighLevelGraph.cull inside get_all_external_keys]
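The cost is easy to sketch without dask (illustrative names, not dask's internals): get_all_external_keys unions every layer's keys, so it's O(total keys in the graph) regardless of how few keys the cull actually needs.

```python
def get_all_external_keys(layers):
    # Union of every layer's keys: O(total number of keys in the graph),
    # even if the caller only wants to cull down to a handful of outputs.
    all_keys = set()
    for layer in layers:
        all_keys |= layer.keys()
    return all_keys

layers = [
    {("a", i): None for i in range(1_000)},  # stand-in for a materialized layer
    {("b", i): None for i in range(500)},    # stand-in for a Blockwise layer
]
ext = get_all_external_keys(layers)
# len(ext) == 1500: every key generated, even to select a single output
```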

Even if it didn't call get_all_external_keys, I see that HighLevelGraph.cull is still calling get_output_keys on every layer. For reference:

dask/dask/highlevelgraph.py, lines 944 to 970 at commit 358a5e3:

```python
keys_set = set(flatten(keys))
all_ext_keys = self.get_all_external_keys()
ret_layers = {}
ret_key_deps = {}
for layer_name in reversed(self._toposort_layers()):
    layer = self.layers[layer_name]
    # Let's cull the layer to produce its part of `keys`.
    # Note: use .intersection rather than & because the RHS is
    # a collections.abc.Set rather than a real set, and using &
    # would take time proportional to the size of the LHS, which
    # if there is no culling can be much bigger than the RHS.
    output_keys = keys_set.intersection(layer.get_output_keys())
    if output_keys:
        culled_layer, culled_deps = layer.cull(output_keys, all_ext_keys)
        # Update `keys` with all layer's external key dependencies, which
        # are all the layer's dependencies (`culled_deps`) excluding
        # the layer's output keys.
        external_deps = set()
        for d in culled_deps.values():
            external_deps |= d
        external_deps -= culled_layer.get_output_keys()
        keys_set |= external_deps
        # Save the culled layer and its key dependencies
        ret_layers[layer_name] = culled_layer
        ret_key_deps.update(culled_deps)
```
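Even setting aside get_all_external_keys, the per-layer intersection has the same asymptotic problem. A sketch (illustrative, not dask's classes) of why `keys_set.intersection(layer.get_output_keys())` is O(layer size) rather than O(requested keys):

```python
def get_output_keys(name, numblocks):
    # A blockwise-style layer can describe its keys symbolically, but this
    # call still has to enumerate every (name, i) key: O(numblocks).
    return {(name, i) for i in range(numblocks)}

keys_set = {("x", 7)}  # we want a single output chunk...
# ...yet computing the intersection first generates all 1,000 layer keys:
output_keys = keys_set.intersection(get_output_keys("x", 1_000))
# output_keys == {("x", 7)}
```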

Why is it necessary for HLG.cull to ask the layer for all of its keys, intersect them itself, and then pass the result back into the layer? And why do the layers' cull methods need to take all_hlg_keys?

I had thought the interface would be simply: HLG.cull tells each layer the necessary output keys; the layer figures out the rest on its own. If it needs to generate all its keys internally and do that intersection, fine, but for layers that don't need to do this, shouldn't the optimization be available?
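For the simplest case, a 1-to-1 (elementwise) blockwise layer, that interface could look roughly like this (a hypothetical sketch, not dask's actual Blockwise.cull, and real blockwise layers with contractions or broadcasts would need more bookkeeping):

```python
def cull_blockwise_1to1(name, dep_name, requested_keys):
    # Hypothetical: output block i of an elementwise layer depends only on
    # input block i, so the layer can map requested outputs straight to
    # required inputs in O(len(requested_keys)), never enumerating the
    # full layer or needing all_hlg_keys.
    return {(name, i): {(dep_name, i)} for (_, i) in requested_keys}

deps = cull_blockwise_1to1("y", "x", {("y", 7)})
# deps == {("y", 7): {("x", 7)}}
```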

@rjzamora @madsbk why does culling work this way? Would it be possible to write Blockwise culling without this, in a way that's truly linear-time to only the number of final keys?

cc @ian-r-rose @TomAugspurger
