Add ShuffleStage HLG Layer by rjzamora · Pull Request #6650 · dask/dask

rjzamora · 2020-09-17T19:16:46Z

Depends on #6510

Implements ShuffleStage - A simple HighLevelGraph (HLG) Layer for a single stage of a task-based shuffle in dask.dataframe. To enable culling (without the need to materialize the full graph), the shuffling algorithm was revised to only include tasks that are needed to produce a specific set of output keys.
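
To illustrate the culling idea, here is a minimal sketch. It is not the code in this PR. It assumes a single-stage shuffle in which each output partition is built from one split of every input partition; the function name `shuffle_tasks_for`, the key names, and the tuple "tasks" are all hypothetical stand-ins.

```python
def shuffle_tasks_for(parts_out, npartitions_in, name="shuffle"):
    """Build only the tasks needed for the requested output partitions.

    Hypothetical sketch: task payloads are plain tuples describing the
    work, not real dask tasks.
    """
    dsk = {}
    for out in sorted(parts_out):
        split_keys = []
        for inp in range(npartitions_in):
            key = (f"{name}-split", out, inp)
            # Rows of input partition `inp` that belong to output `out`.
            dsk[key] = ("getitem", ("input", inp), out)
            split_keys.append(key)
        # Concatenate every split destined for output partition `out`.
        dsk[(name, out)] = ("concat", split_keys)
    return dsk


# Requesting all outputs versus a single output partition.
full = shuffle_tasks_for(range(4), npartitions_in=4)
culled = shuffle_tasks_for({2}, npartitions_in=4)
print(len(full), len(culled))
```

Because the task set is derived from `parts_out`, the culled graph can be produced directly, without first building the full graph and then discarding keys.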

TODO:

Validate behavior and add tests
Benchmark the new approach

cc @madsbk @quasiben @mrocklin

Follows on from dask#6508

When <dask#6509> passes, we can remove this fix, which introduces a significant overhead

Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>

jakirkham · 2020-09-29T17:46:03Z

@TomAugspurger, do you have any thoughts on this? 🙂

mrocklin

Oops. I had a bunch of comments queued. My apologies for the delay.

The optimization pass is now significantly faster...

Note that we're caching things, so the timeit call here may not be representative

dask/dataframe/shuffle.py

rjzamora · 2020-09-29T22:46:30Z

Thanks for the review @mrocklin ! Hopefully I addressed your comments.

Note that we're caching things, so the timeit call here may not be representative

Good point - I updated the "benchmark" results above. We still get a nice performance bump (albeit a slightly smaller one).
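
A minimal sketch of why caching can skew a repeated timing. The thread does not say which results are cached, so `functools.lru_cache` and the `optimize` function below are assumed stand-ins, not dask code.

```python
import functools
import timeit

calls = {"n": 0}  # counts how often the body really runs


@functools.lru_cache(maxsize=None)
def optimize(n):
    """Stand-in for an expensive, cached step (hypothetical)."""
    calls["n"] += 1
    return sum(i * i for i in range(n))


# Warm timing: only the first of the 20 calls does any real work.
warm_time = timeit.timeit(lambda: optimize(10_000), number=20)
warm_calls = calls["n"]


def cold_call():
    # Clear the cache so that every timed call pays the full cost.
    optimize.cache_clear()
    return optimize(10_000)


cold_time = timeit.timeit(cold_call, number=20)
cold_calls = calls["n"] - warm_calls

print(warm_calls, cold_calls)
```

The warm measurement mostly times cache lookups, which is the reason a plain `timeit` loop may overstate a speedup.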

dask/dataframe/shuffle.py

rjzamora · 2020-10-01T20:13:48Z

Update: I realized this morning that this PR was leaving out the common case of the "simple" shuffle (where the number of output partitions is less than max_branch). I revised the class structure to include both a SimpleShuffleLayer and a ShuffleLayer (which inherits from SimpleShuffleLayer).

rjzamora · 2020-10-01T20:23:39Z

dask/dataframe/shuffle.py

+        all input partitions. This method does not require graph
+        materialization.
+        """
+        deps = defaultdict(set)


@madsbk - I just wanted to get your thoughts on something here...

It seems that the HLG cull operation breaks when this deps dictionary is a vanilla dict (rather than a defaultdict(set)). The problem is that we are not actually returning any intra-layer dependencies when the shuffle layer is culled (because we are not actually materializing the graph). This means we will get a KeyError during the culled_deps[k] access in HighLevelGraph.cull (because we are not adding dependencies for all keys).

Does this seem like it could be a problem?

This should be fixed in #6699, in which cull() is only required to return external dependencies.
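
The failure mode described above can be reproduced in isolation. This is a sketch with made-up keys, not the actual `HighLevelGraph.cull` code: a layer reports dependencies for only some of the requested keys, and the caller then looks up every key.

```python
from collections import defaultdict

# Dependencies reported for only one of the two requested keys.
reported = {("shuffle", 0): {("input", 0), ("input", 1)}}

# With a plain dict, looking up the unreported key raises KeyError.
plain = dict(reported)
try:
    plain[("shuffle", 1)]
    missing_raises = False
except KeyError:
    missing_raises = True

# With defaultdict(set), the same lookup quietly yields an empty set
# (and inserts the key as a side effect).
lenient = defaultdict(set, reported)
deps_for_missing = lenient[("shuffle", 1)]

print(missing_raises, deps_for_missing)
```

The `defaultdict` hides the missing entries rather than supplying them, which is why requiring only external dependencies from `cull()` is the cleaner fix.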

Thanks @madsbk !

mrocklin · 2020-10-09T00:01:45Z

@jrbourbeau I'd like to put this on your queue as well

jrbourbeau

Thanks for all your work here @rjzamora!

Please correct me if I'm wrong, but from what I can tell this PR moves existing DataFrame shuffling code into two new layer classes, SimpleShuffleLayer and ShuffleLayer. The corresponding low-level task graphs aren't materialized until we try to inspect the underlying task graph (e.g. via __getitem__), which will eventually help us send a smaller object to the scheduler (once we're directly sending HighLevelGraphs).
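
A minimal sketch of the lazy-materialization pattern described here, assuming a layer that behaves like a read-only mapping. The class `LazyLayer` and its task tuples are hypothetical; the real layer classes in this PR carry much more state.

```python
from collections.abc import Mapping


class LazyLayer(Mapping):
    """Mapping whose task dict is only built on first inspection."""

    def __init__(self, name, npartitions):
        self.name = name
        self.npartitions = npartitions
        self._cached = None  # low-level graph, built on demand

    @property
    def materialized(self):
        return self._cached is not None

    def _dict(self):
        if self._cached is None:
            # Placeholder tasks; a real layer would build shuffle tasks.
            self._cached = {
                (self.name, i): ("work", i) for i in range(self.npartitions)
            }
        return self._cached

    def __getitem__(self, key):
        return self._dict()[key]

    def __iter__(self):
        return iter(self._dict())

    def __len__(self):
        return len(self._dict())


layer = LazyLayer("shuffle-abc", npartitions=3)
before = layer.materialized          # nothing built yet
task = layer[("shuffle-abc", 1)]     # first access builds the graph
after = layer.materialized
print(before, task, after)
```

Until something indexes or iterates the layer, only the compact description (`name`, `npartitions`) exists, which is the smaller object one would want to ship.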

One piece of follow-up work that comes to mind is adding a custom serialization method for these new shuffle layers. The implementation over in #6693 would cause us to fully materialize the task graph for shuffle layers before sending them to the scheduler. Though that's certainly future work we don't need to worry about for this PR : )

Overall I think the changes here look good. The fact that tests are passing gives me confidence. I've left a few small comments, but otherwise this appears to be good to go. Would you recommend we merge this in?

jrbourbeau · 2020-10-15T19:31:19Z

dask/dataframe/tests/test_shuffle.py

+        if name.startswith("shuffle-"):
+            assert isinstance(layer, dd.shuffle.ShuffleLayer)


I think we want to switch these to instead check that if a layer is a ShuffleLayer, then its name begins with "shuffle-"

Suggested change

-        if name.startswith("shuffle-"):
-            assert isinstance(layer, dd.shuffle.ShuffleLayer)
+        if isinstance(layer, dd.shuffle.ShuffleLayer):
+            assert name.startswith("shuffle-")

I added this check to make sure we are actually using ShuffleLayer layers. So, I worry that if we aren't adding them, then this change will miss it?

Gotcha! I had the concern that if we change our ShuffleLayer naming scheme then we would start to miss this check. I just pushed a small commit which asserts that there are ShuffleLayers in the HLG and that the names of the ShuffleLayers are as expected (i.e. they start with "shuffle-"). That should, I think, handle both of our concerns.

dask/dataframe/shuffle.py

dask/dataframe/tests/test_shuffle.py

Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>

jrbourbeau

Thanks @rjzamora! Will merge once CI passes

rjzamora · 2020-10-16T17:09:00Z

Thanks for your help @jrbourbeau !

Sounds good. My only thought: I'm not sure I understand the purpose of f5dfd60, since we will no longer catch the case that the ShuffleLayer layers are missing.

jrbourbeau · 2020-10-16T18:07:13Z

The

    assert any(
        isinstance(layer, dd.shuffle.ShuffleLayer) for layer in dsk.layers.values()
    )

check should ensure that ShuffleLayer layers aren't missing

rjzamora · 2020-10-16T18:10:59Z

Ah - sorry! I was looking at the wrong change and didn't see that. In that case, I don't think the assert name.startswith("shuffle-") check is really necessary (since we "could" use a different name), but it doesn't hurt... So, I think we are good. Thanks again!
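
The two checks discussed in this exchange can be sketched together as follows. The `Layer` and `ShuffleLayer` classes here are empty stand-ins defined only for this example, and `check_shuffle_layers` is a hypothetical helper rather than the test in the PR.

```python
class Layer:
    """Stand-in for a generic graph layer (hypothetical)."""


class ShuffleLayer(Layer):
    """Stand-in for the shuffle layer class (hypothetical)."""


def check_shuffle_layers(layers):
    # 1. At least one shuffle layer must be present at all.
    assert any(isinstance(layer, ShuffleLayer) for layer in layers.values())
    # 2. Every shuffle layer must follow the naming scheme.
    for name, layer in layers.items():
        if isinstance(layer, ShuffleLayer):
            assert name.startswith("shuffle-")


def fails(layers):
    """Return True if the combined check rejects `layers`."""
    try:
        check_shuffle_layers(layers)
    except AssertionError:
        return True
    return False


good = {"from_pandas-1": Layer(), "shuffle-abc": ShuffleLayer()}
no_shuffle = {"from_pandas-1": Layer()}
bad_name = {"rearrange-1": ShuffleLayer()}
print(fails(good), fails(no_shuffle), fails(bad_name))
```

The `any(...)` assertion catches the case where no shuffle layers are produced, and the loop catches a change in the naming scheme, so neither concern depends on the other.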

mrocklin and others added 30 commits August 13, 2020 06:42

Add HighLevelGraph.validate call to assert_eq in tests

a636838

Follows on from dask#6508

Merge branch 'master' into hlg-verify

5636f29

subs(): now compare key hashes

14628a5

subs(): now use hash and equality matching

e0c2756

Merge branch 'master' of github.com:dask/dask into sub_use_key_hash

9039fa8

minor cleanup

27acf33

Implemented culling of high level graphs

5b2ea08

Dataframe to use high level graph culling

cfffa3a

arrays to use high level culling

06bc6d4

For now fixing hlg.dependencies

8bd59fe

When <dask#6509> passes, we can remove this fix, which introduces a significant overhead

Skip layers nobody depend on

b8f2066

Added class Layer(Mapping)

590d4ba

Implemented a default Layer.cull()

b99d94b

clean up of array.optimize()

9bad423

Use set intersection

8eaed75

docs

678795b

moved cull_highlevelgraph() into HighLevelGraph

58ce174

clean up

6ec590c

Added core.keys_in_tasks()

7acf49a

Layer.get_external_dependencies() to use keys_in_tasks()

10d1ffd

Merge branch 'master' of github.com:dask/dask into cull_high_level_graph

ab21639

reformat: black

a750910

BasicLayer(): added an external_dependencies argument

67ab27f

Blockwise to implement the Layer protocol

3c7877e

Added doctest

b8c8731

Added core.keys_in_tasks()

2ce8cee

HLG: added more thorough validation check

d037c66

lingalg: layer dependencies are now sets

85d9c39

linalg.lstsq(): fixed dependencies

f6920bf

linalg.sfqr(): fixed dependencies

e1a621c

Update dask/dataframe/tests/test_shuffle.py

0d7b55a

Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>

mrocklin reviewed Sep 29, 2020

address code review

a7a9974

TomAugspurger reviewed Sep 30, 2020

rjzamora added 5 commits September 30, 2020 12:38

simple naming changes, and removing original shuffle code

9a33b46

add parts_out to _cull_dependencies

e44720e

rename ShuffleStage to ShuffleLayer

c22b25e

support simple shuffle routine

95cae35

fix typo

f7a89cb

rjzamora commented Oct 1, 2020

madsbk mentioned this pull request Oct 8, 2020

Revert "Revert "Use HighLevelGraph layers everywhere in collections (… #6707

Merged

avoid deep copy

8b97685

fix _cull typo

4e71888

jrbourbeau reviewed Oct 15, 2020

rjzamora and others added 4 commits October 15, 2020 17:12

Update dask/dataframe/tests/test_shuffle.py

9394143

Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>

input name change

aa9f0ce

Merge branch 'shuffle-hlg' of https://github.com/rjzamora/dask into s…

52dd184

…huffle-hlg

Update layer name check

f5dfd60

jrbourbeau approved these changes Oct 16, 2020

jrbourbeau merged commit fbe5174 into dask:master Oct 16, 2020

rjzamora deleted the shuffle-hlg branch October 17, 2020 00:23

jrbourbeau mentioned this pull request Oct 21, 2020

Efficient serialization of shuffle layers #6760

Merged

2 tasks

kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020

Add ShuffleStage HLG Layer (dask#6650)

f0b216c
