Conversation
Follows on from dask#6508
Thanks for pushing something up @madsbk. Having something concrete makes it a lot easier to think about this problem. For the process of clearing out high level layers that aren't being used, this is clearly a win. For lower level culling I still don't know exactly how to handle the graphs within layers. Eventually, we're going to want to do this on the scheduler side. That means that we're going to want to cull the concrete low level graphs that we have here on the client side (the layers backed by dicts), but not explicitly expand the layers that have high level representations, like Blockwise or a future Shuffle layer. We're going to want to keep those in abstract form so that we can pass them to the Scheduler. With that in mind, I wonder if we maybe want the following two functions on every layer:
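The split described above (cull dict-backed layers concretely, keep high-level layers symbolic) could be sketched roughly like this. The class and function names here are hypothetical illustrations, not dask's actual API:

```python
def cull_layer(layer, keys):
    """Cull one layer: filter dict-backed layers key by key,
    but let abstract layers narrow themselves symbolically."""
    if isinstance(layer, dict):
        # concrete layer: safe to filter on the client
        return {k: v for k, v in layer.items() if k in keys}
    # abstract layer: stays symbolic so it can be shipped to the scheduler
    return layer.cull(keys)


class AbstractLayer:
    """Stands in for e.g. Blockwise: described by block ids, not a task dict."""

    def __init__(self, name, block_ids):
        self.name = name
        self.block_ids = list(block_ids)

    def cull(self, keys):
        # narrow the symbolic description; no task dict is ever built
        return AbstractLayer(
            self.name, [i for i in self.block_ids if (self.name, i) in keys]
        )


concrete = {("a", 0): 1, ("a", 1): 2}
print(cull_layer(concrete, {("a", 0)}))  # {('a', 0): 1}
abstract = cull_layer(AbstractLayer("b", range(4)), {("b", 1), ("b", 2)})
print(abstract.block_ids)  # [1, 2]
```

The point of the sketch is that `AbstractLayer.cull()` never materializes tasks, so the narrowed layer can still be handed to the scheduler in abstract form.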
As a test case we might create these methods on the following:

```python
df = dd.read_parquet("...")
df.head()
```

And then follow things through and see that we never actually generated the full graph. We could then expand this to:

```python
df = dd.read_parquet("...")
df["z"] = df.x + df.y
df.head()
```

And hopefully see the same thing, even now that there are two layers and also a Blockwise in there. These are just my impressions after looking at this code though. There may be better approaches here.
@madsbk I kicked CI to re-run the tests
mrocklin left a comment:
A couple of small comments, but it's probably too early for this kind of review.
When <dask#6509> passes, we can remove this fix, which introduces a significant overhead.
(force-pushed from b305e37 to 62bc542, then from 62bc542 to 9bad423)
mrocklin left a comment:
In general this looks good to me. I'm actually surprised at how small this change ended up being. In my mind it was considerably larger. Thank you for exploring this space @madsbk . I feel like we both have a much better sense for how this work is going to go now.
I've made a few comments, but they're all pretty minor or future-leaning. If you have time to clean things up tomorrow (I'm guessing that this will take 30m), I'm happy to merge when I wake up.
```python
def get_dependencies(self, all_hlg_keys):
    _ = self._dict  # trigger materialization
    return self._cached_dict["basic_layer"].get_dependencies(all_hlg_keys)
```
Eventually I think that we're going to want to separate dependencies from graph generation. This can safely be future work though.
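To illustrate separating dependency tracking from graph generation: a layer that knows its own structure can answer dependency queries without ever building its task dict. A toy sketch with hypothetical names, not dask code:

```python
class RangeSumLayer:
    """Abstract layer: key ('sum', i) depends on ('x', i) for i in range(n)."""

    def __init__(self, n):
        self.n = n

    def get_dependencies(self):
        # computed from structure alone; no tasks are generated
        return {("sum", i): {("x", i)} for i in range(self.n)}

    def materialize(self):
        # only called when the concrete low-level graph is really needed
        return {("sum", i): (sum, [("x", i)]) for i in range(self.n)}


deps = RangeSumLayer(3).get_dependencies()
print(deps[("sum", 1)])  # {('x', 1)}
```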
In dask/dataframe/io/parquet/core.py (outdated):

```python
part_ids=[i for i in self.part_ids if (self.name, i) in keys],
)
```
I don't know of any currently. It's probably cheaper to sort on an as-needed basis than to do what we're doing here, though.
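The "sort on an as-needed basis" idea can be sketched with `functools.cached_property`: keep the part ids unsorted, and pay for the sort once, only if someone actually asks. The class name is a hypothetical illustration, not dask's actual code:

```python
from functools import cached_property


class PartList:
    """Keeps part ids unsorted; sorts lazily, once, only when asked."""

    def __init__(self, part_ids):
        self._part_ids = list(part_ids)

    @cached_property
    def sorted_part_ids(self):
        # paid only on first access, instead of on every cull
        return sorted(self._part_ids)


parts = PartList([3, 1, 2])
print(parts.sorted_part_ids)  # [1, 2, 3]
```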
Thanks @madsbk! This is in.
@madsbk and @mrocklin this PR seems to have broken some custom graph construction which was working for me prior to 2.28.0. If `optimize_graph` is True I now get errors like `AttributeError: 'dict' object has no attribute 'discard'`. I will likely look into raising a bug report tomorrow, but before I do, I was wondering if either of you might have an intuition regarding what could be going wrong? My instinct is that there is a dictionary where there should be a set, but I am unsure how these changes would have exposed this, given that it definitely worked prior to 2.28.0. (JSKenyon, Mon, Sep 28, 2020 at 7:21 AM)

Thanks @JSKenyon. If you can provide a full traceback that would probably help us to identify the cause. A new issue would be good. Thanks!
This reverts commit f0b2ac2.
Follows on from dask#6508

* Implemented culling of high level graphs
* Skip layers nobody depends on
* Added `class Layer(Mapping)`
* Implemented a default `Layer.cull()`
* Added `core.keys_in_tasks()`
* `Layer.get_external_dependencies()` to use `keys_in_tasks()`
* `Blockwise` to implement the `Layer` protocol
* `optimize_blockwise()`: fixed dependencies
* `ParquetSubgraph()`: implemented `Layer`
* Implemented `HLG.keyset()`
* `Layer.cull()`: now returns a new `Layer` and key dependencies
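One of the bullets above adds `core.keys_in_tasks()`. A toy re-implementation of the idea (simplified, not dask's actual code): walk nested task tuples/lists and collect any elements that are known graph keys:

```python
def keys_in_tasks(keys, tasks):
    """Return the subset of `keys` referenced anywhere inside `tasks`."""
    found = set()
    stack = list(tasks)
    while stack:
        item = stack.pop()
        try:
            if item in keys:  # a direct key reference, e.g. ('x', 0)
                found.add(item)
                continue
        except TypeError:  # unhashable, e.g. a list of arguments
            pass
        if isinstance(item, (tuple, list)):
            # a task like (sum, [('x', 0), 5]) or a nested argument list
            stack.extend(item)
    return found


result = keys_in_tasks({("x", 0), ("y", 0)}, [(sum, [("x", 0), 5])])
print(result)  # {('x', 0)}
```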
In dask#8452 I realized that an incorrect pattern had emerged from dask#6510 of including

```python
if not isinstance(dsk, HighLevelGraph):
    dsk = HighLevelGraph.from_collections(id(dsk), dsk, dependencies=())
```

in optimization functions. Specifically, `id(dsk)` is incorrect as the layer name here. The layer name must match the `.name` of the resulting collection that gets created by `__dask_postpersist__()`, otherwise `__dask_layers__()` on the optimized collection will be wrong. Since `optimize` doesn't know about collections and isn't passed a layer name, the only reasonable thing to do here is to error when given a low-level graph. This is safe to do for Arrays and DataFrames, since their constructors convert any low-level graphs to HLGs. This PR doesn't really fix anything (the code path removed should be unused) but it eliminates a confusing pattern that has already wandered its way into other places: dask#8316 (comment).
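The naming mismatch can be illustrated with toy stand-ins for a collection and a high level graph (hypothetical classes, not dask's real ones): a layer named `id(dsk)` can never match the collection's `.name`, so a lookup via `__dask_layers__()` misses:

```python
class ToyHLG:
    def __init__(self, layers):
        self.layers = layers  # layer name -> dict of tasks


class ToyCollection:
    def __init__(self, name, hlg):
        self.name = name
        self.hlg = hlg

    def __dask_layers__(self):
        # a collection reports its own name as its layer name
        return (self.name,)


low_level = {("x", 0): 1}
# wrong: id(low_level) as the layer name never matches the collection's name
bad = ToyHLG({id(low_level): low_level})
coll = ToyCollection("x", bad)
print(coll.__dask_layers__()[0] in bad.layers)  # False: the lookup fails
```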
As a step towards defining a `Layer(Mapping)` class (#6438), I have been exploring how we can cull high level graphs directly. I think it is useful to have a concrete implementation before deciding on a `Layer` interface.

This PR implements `class Layer(Mapping)`, an abstract class that establishes a protocol for high level graph layers. The class defines three methods that sub-classes can overwrite in order to use domain knowledge to reduce overhead:
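As an illustration, a minimal sketch of what such a Mapping-backed layer with a default `cull()` might look like; the names and method bodies here are assumptions, not the actual dask implementation:

```python
from collections.abc import Mapping


class Layer(Mapping):
    """A single layer of a high level graph: a Mapping of key -> task."""

    def __init__(self, tasks):
        self.tasks = dict(tasks)

    def __getitem__(self, key):
        return self.tasks[key]

    def __iter__(self):
        return iter(self.tasks)

    def __len__(self):
        return len(self.tasks)

    def cull(self, keys_to_keep):
        """Default cull: materialize and keep only the requested keys.
        Sub-classes with structure (e.g. blockwise) can avoid this."""
        return Layer({k: v for k, v in self.tasks.items() if k in keys_to_keep})


layer = Layer({("x", 0): 1, ("x", 1): 2, ("x", 2): 3})
culled = layer.cull({("x", 0)})
print(dict(culled))  # {('x', 0): 1}
```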
Blockwise

This PR also implements a sub-class `Blockwise(Layer)` that uses the blockwise structure to compute key dependencies and culling efficiently. `.cull()` and `.get_external_dependencies()` return an instance of `BasicLayer(Layer)` that embeds the information needed to do culling efficiently.

ParquetSubgraph
This PR also implements a sub-class `ParquetSubgraph(Layer)` that implements `get_external_dependencies()` by always returning the empty set and `cull()` by filtering `parts`.

cc @mrocklin, @rjzamora, @quasiben
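A minimal sketch of the `ParquetSubgraph` idea (simplified and hypothetical, not the real class): external dependencies are always empty, because reading parquet from storage depends on no other graph keys, and culling just filters the part list:

```python
class ToyParquetLayer:
    """Each key (name, i) reads part i of a parquet dataset."""

    def __init__(self, name, part_ids):
        self.name = name
        self.part_ids = list(part_ids)

    def get_external_dependencies(self):
        # reading from storage depends on no other keys in the graph
        return set()

    def cull(self, keys):
        # keep only the parts whose output keys are still needed
        kept = [i for i in self.part_ids if (self.name, i) in keys]
        return ToyParquetLayer(self.name, kept)


layer = ToyParquetLayer("read", range(4))
culled = layer.cull({("read", 0), ("read", 2)})
print(culled.part_ids)  # [0, 2]
```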