
Conversation

@mrocklin
Member

@mrocklin mrocklin commented Oct 12, 2018

Fixes #4038

This implements a HighLevelGraph that stores the task graphs of our collections in layers, generally one layer per high level operation. Today this doesn't add any new features except for easier inspectability and visualization, but it opens the door for new and exciting features in the future, notably high-level expression optimization.

I recommend reading through the high-level-graphs.rst doc page first, and then take a look at dask dataframe to see a simple example of use.
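The layered structure can be sketched with plain dicts (a hypothetical reduction for illustration, not the actual implementation): one task dict per high-level operation, plus a mapping recording which layers each layer depends on.

```python
from operator import add

# Hypothetical layered graph for something like ``ones + 1``: each
# high-level operation contributes one layer of tasks.
layers = {
    'ones': {('ones', i): 1.0 for i in range(5)},
    'add': {('add', i): (add, ('ones', i), 1) for i in range(5)},
}
# Which layers each layer reads from.
dependencies = {'ones': set(), 'add': {'ones'}}

# Flattening the layers recovers the familiar low-level task graph
# that the schedulers already understand.
low_level = {k: v for layer in layers.values() for k, v in layer.items()}
```

Keeping the layers separate is what makes per-operation inspection, visualization, and (eventually) high-level optimization possible, while flattening remains cheap when a scheduler needs the full graph.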

@mrocklin
Member Author

mrocklin commented Oct 12, 2018

I think that generally my plan here is as follows:

  • pick up easy wins in dask array
  • refactor top
  • move onto dask dataframe
  • refactor top again so that it can also be used in dask dataframe
  • move onto bag

Then future work for others is probably:

  • add a slicing operation, and figure out how to transpose it with atop operations
  • add parquet and column access operations and transpose them


import toolz

class HighGraph(sharedict.ShareDict):
Member

Two thoughts:

  1. I like the name HighLevelGraph better than HighGraph.
  2. Can you make this extend ShareDict with composition instead of inheritance? I find that easier to understand and less error prone.
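A composition-based version might look roughly like this (an illustrative sketch, not the PR's actual code): hold the layer dicts as attributes and implement the Mapping interface on top of them, rather than inheriting from ShareDict.

```python
from collections.abc import Mapping

class HighLevelGraph(Mapping):
    """Sketch of composition over inheritance: the layer dicts are
    held as attributes, and the Mapping protocol is implemented
    against them. Names here are illustrative."""

    def __init__(self, layers, dependencies):
        self.layers = layers              # {layer name: task dict}
        self.dependencies = dependencies  # {layer name: set of layer names}

    def __getitem__(self, key):
        for layer in self.layers.values():
            if key in layer:
                return layer[key]
        raise KeyError(key)

    def __iter__(self):
        return (k for layer in self.layers.values() for k in layer)

    def __len__(self):
        return sum(map(len, self.layers.values()))

# Usage: behaves like a read-only dict over all layers combined.
hg = HighLevelGraph({'a': {('a', 0): 1}, 'b': {('b', 0): 2}},
                    {'a': set(), 'b': {'a'}})
```

With composition there is no risk of inherited ShareDict methods silently operating on the wrong internal state, which is the "less error prone" point above.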

Member Author

No objection to HighLevelGraph

I would like to just replace ShareDict entirely.

Member

I would like to just replace ShareDict entirely.

OK, works for me. I just wouldn't bother with inheritance, then.

Member Author

Agreed, it's only there short-term while I trade things out.

Member

I second HighLevelGraph

@mrocklin
Member Author

import dask.array as da
a = da.ones(100, chunks=(20,))
b = a + 1
c = a + 2
d = (b + c).sum()
d.visualize('low-level.png')
d.dask.visualize('high-level.png')

Low level: [low-level.png]

High level: [high-level.png]

@mrocklin
Member Author

mrocklin commented Oct 15, 2018

OK, so I've been through most of the code and changed things around. So far all this does is introduce HighGraphs in graph generation code. It doesn't add optimization handling, use top in dataframes, or do anything with the high level graphs (I'd like to defer these to future PRs). I did have to screw around a bit in the atop code and with delayed to_task_dasks.

I plan to change the name later (but before merging) in a global find-replace.

I think that now this could use some review. I recommend the following:

  1. Take a look at the HighGraph implementation
  2. Take a look at the changes to dataframe/core.py. This is representative of the workflow that most devs would have to adopt day-to-day when working on Dask
  3. Take a look at array/linalg.py::tsqr. This is representative of the worst case.

There is still work to do here with some corners, but I think that this is now at a stage where it's ready for review. (cc @jcrist)

@mrocklin mrocklin changed the title from "WIP - High Level Graphs" to "High Level Graphs" on Oct 15, 2018
Member

@jcrist jcrist left a comment


On initial review this (and this comment) seems sane to me. Documentation on HighGraph, as well as on the intent of __dask_layers__, would be welcome. From what I understand these are the keys in the HighGraph that the collection output depends on directly (usually just the key)? It would be good to formally state this, as well as the desired output type (in the code I saw lists, sets, and tuples all being returned by different collections).

if hasattr(dask, 'dask'):
    dask = dask.dask
assert isinstance(dask, dict)
assert isinstance(dask, collections.Mapping)
Member

nit: should be dask.compatibility.Mapping, abc classes on the collections module directly are deprecated.
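For context, the compatibility shim being referred to amounts to a version-safe import along these lines (a sketch; the exact contents of dask.compatibility may differ):

```python
# collections.Mapping (the ABC accessed directly on the collections
# module) was deprecated in Python 3.3 and removed in Python 3.10;
# the abstract base classes live in collections.abc. A version-safe
# import looks like:
try:
    from collections.abc import Mapping
except ImportError:  # very old Pythons without collections.abc
    from collections import Mapping
```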

return out


def finalize(collection):
Member

Nit: This conflicts in my mind with the old version of dask finalize functions (results -> in memory version). It's also not a very descriptive name. Perhaps collection_to_delayed? Or just inline in unpack_collections below (my preference).

return self.layers

@classmethod
def from_collections(cls, name, layer, dependencies=()):
Member

The overloading of "dependencies" here was confusing to me (especially without a docstring). Took me a second to realize that while dependencies above are collections, the collections aren't being stored directly in the .dependencies attribute.
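The overloading can be illustrated with a heavily simplified stand-in (hypothetical names; the real method works with full collections and their `__dask_layers__`): the `dependencies` argument receives collection objects, but only the names of their layers end up stored.

```python
from types import SimpleNamespace

def from_collections(name, layer, dependencies=()):
    # Hypothetical reduction of HighGraph.from_collections: the
    # ``dependencies`` argument receives collection-like objects, but
    # what gets stored under ``name`` is only the *set of layer names*,
    # while the collections' layers are merged in alongside.
    layers = {name: layer}
    deps = {name: {c.name for c in dependencies}}
    for c in dependencies:
        layers.update(c.layers)       # merge the dependency's layers
        deps.update(c.dependencies)   # and its layer-to-layer edges
    return layers, deps

# A stand-in for an existing collection with a single layer 'a':
a = SimpleNamespace(name='a',
                    layers={'a': {('a', 0): 1}},
                    dependencies={'a': set()})
layers, deps = from_collections('b', {('b', 0): (sum, [('a', 0)])},
                                dependencies=[a])
```

So "dependencies" means collections at the call site but layer names in the stored attribute, which is the confusion being flagged; a docstring spelling this out would help.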

dask/delayed.py Outdated
Returns
-------
task : normalized task to be run
collections : a set of collections
Member

While this may be true for our implementations, I don't think we should enforce dask collections be hashable. Elsewhere you use toolz.unique by id, which seems safer to me.
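Deduplicating by identity sidesteps the hashability requirement entirely. `toolz.unique(seq, key=id)` does this; a dependency-free equivalent for illustration:

```python
def unique_by_id(seq):
    # Deduplicate by object identity rather than by equality/hash,
    # so the items (here, dask collections) never need to be hashable.
    seen = set()
    out = []
    for item in seq:
        if id(item) not in seen:
            seen.add(id(item))
            out.append(item)
    return out

# Two equal but distinct (and unhashable) objects stay distinct:
a, b = [1, 2], [1, 2]
```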

# divisions is ignored, only present to be compatible with other
# objects.
if not isinstance(dsk, HighGraph):
    dsk = HighGraph.from_collections(name, dsk, dependencies=[])
Member

Is specifying dependencies=[] necessary? I'd expect the default to be fine.

Member Author

The default is fine, but I like to call out that we're not including dependencies here, which is undesirable.



        dependencies[id(g)] = set()
    else:
        raise TypeError(g)
return HighGraph(layers, dependencies)
Member

Should probably use cls here and above, makes subclassing easier in the future if that's ever needed.
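The suggestion, in a minimal sketch (simplified constructor, ignoring the dependency merging):

```python
class HighGraph:
    def __init__(self, layers, dependencies):
        self.layers = layers
        self.dependencies = dependencies

    @classmethod
    def from_collections(cls, name, layer, dependencies=()):
        # Returning ``cls(...)`` instead of the hard-coded
        # ``HighGraph(...)`` means a subclass calling this
        # constructor gets an instance of itself back.
        return cls({name: layer}, {name: set()})

class CustomGraph(HighGraph):
    pass
```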

dask/dot.py Outdated
return handle_graphviz(g, filename, format)


def handle_graphviz(g, filename, format):
Member

This could use a better name. Perhaps graph_to_file?


# replace keys in kwargs with _0 tokens
new_keys = list(core.get_dependencies(dsk_kwargs, task=kwargs))
# new_keys = list(core.get_dependencies(dsk_kwargs, task=kwargs))
Member

nit: Should this line be dropped?

pdb.set_trace()
if any(isinstance(layer, HighLevelGraph) for layer in dsk.layers.values()):
    import pdb
    pdb.set_trace()
Member

Are these pdb parts still supposed to be here?

'add': {('add', 0): (operator.add, 'myfile.0.csv', 100),
('add', 1): (operator.add, 'myfile.1.csv', 100),
('add', 2): (operator.add, 'myfile.2.csv', 100),
('add', 3): (operator.add, 'myfile.3.csv', 100)}
Member

@jakirkham jakirkham Oct 30, 2018


Sorry, trying to follow this here: should 'myfile.*.csv' be ('read-csv', *), or am I missing something?

Edit: FWIW this shows up in the docs too. Not sure that it needs to change (since I'm still learning). Just footnoting it in case it does.
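If the guess is right, the corrected example would wire the 'add' tasks to the read-csv layer's output keys, roughly as follows (`read_csv` here is a hypothetical placeholder, not the real loader):

```python
import operator

def read_csv(path):
    # Hypothetical placeholder standing in for the tasks a real
    # read-csv layer would contain.
    return path

graph = {
    'read-csv': {('read-csv', i): (read_csv, 'myfile.%d.csv' % i)
                 for i in range(4)},
    'add': {('add', i): (operator.add, ('read-csv', i), 100)
            for i in range(4)},
}
```

That is, the second element of each 'add' task is a key into the 'read-csv' layer rather than a raw filename string.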

f = dask.pop(task)
assert f == (tuple, ['a', 'b', 'c'])
assert dask == x._dask
with warnings.catch_warnings(record=True):
Member

@jakirkham jakirkham Oct 30, 2018


What's this needed for?

Edit: NVM because to_task_dask is being deprecated in this PR.

@jakirkham
Member

Thanks @mrocklin. Looks pretty good.

Mostly found a few nits upon first reading the code. Probably need to let it sink in a bit more before any deeper discussion. That said, the idea and logic of it all seemed good.

Currently we are just building towards high level optimizations, correct? Just wanted to make sure there weren't any optimizations included in this PR that I missed and probably deserve a closer look.

Seems like some thought was given to handling existing custom graph code (as long as it is being turned into a Dask collection), which is great. That should make it easier for older code to adopt, I would think.

Trying to think if there is anything we can do to make it easier for people currently using ShareDict to migrate (especially if they will be straddling different versions of Dask). It's just deprecated at this stage, so this isn't too bad. Just wondering.

@mrocklin
Member Author

Mostly found a few nits upon first reading the code

Thanks for the review, @jakirkham. Those were helpful.

Currently we are just building towards high level optimizations, correct? Just wanted to make sure there weren't any optimizations included in this PR that I missed and probably deserve a closer look.

You're correct here. We don't actually add any new functionality in this PR. It should make features like high level optimizations much easier to achieve though.

Trying to think if there is anything we can do to make it easier for people using ShareDict currently to migrate (especially if they will be straddling different versions of Dask). It's just deprecated at this stage. So this isn't too bad really. Just wondering really.

I'm inclined not to care too much about this. I suspect that this only affects a few fairly sophisticated users.

Member

@jakirkham jakirkham left a comment


Sorry for the additional nits. Tried to group them into a PR review to reduce noise while still isolating them. May have missed a few myself in the last review.

This makes me wonder if there is a way to get Sphinx to reference one example in multiple places to make maintenance efforts more focused. Guess there are tradeoffs in terms of potential added indirection. Maybe something to think about though.

('add', 0): (operator.add, 'myfile.0.csv', 100),
('add', 1): (operator.add, 'myfile.1.csv', 100),
('add', 2): (operator.add, 'myfile.2.csv', 100),
('add', 3): (operator.add, 'myfile.3.csv', 100),
Member

As in the HighLevelGraph docstring, guessing 'myfile.*.csv' should be ('read-csv', *).

('add', 0): (operator.add, 'myfile.0.csv', 100),
('add', 1): (operator.add, 'myfile.1.csv', 100),
('add', 2): (operator.add, 'myfile.2.csv', 100),
('add', 3): (operator.add, 'myfile.3.csv', 100),
Member

As in the HighLevelGraph docstring, guessing 'myfile.*.csv' should be ('read-csv', *).

'add': {('add', 0): (operator.add, 'myfile.0.csv', 100),
('add', 1): (operator.add, 'myfile.1.csv', 100),
('add', 2): (operator.add, 'myfile.2.csv', 100),
('add', 3): (operator.add, 'myfile.3.csv', 100)}
Member

As in the HighLevelGraph docstring, guessing 'myfile.*.csv' should be ('read-csv', *).

'add': {('add', 0): (operator.add, 'myfile.0.csv', 100),
('add', 1): (operator.add, 'myfile.1.csv', 100),
('add', 2): (operator.add, 'myfile.2.csv', 100),
('add', 3): (operator.add, 'myfile.3.csv', 100)}
Member

As in the HighLevelGraph docstring, guessing 'myfile.*.csv' should be ('read-csv', *).

@jakirkham
Member

Thanks for the follow-up, @mrocklin.

Caught a few more nits. Should be easy fixes though. Otherwise am pretty happy with this as is.

Also thanks for the clarification regarding the intent of the PR. Am happy with how this simplifies the graph construction code. It was pleasing to read the much shorter and clearer lines in a few places, just to appreciate that effect alone.

Had some other ideas about optimizations we might try that we can discuss back in the issue.

Not too worried about the ShareDict deprecation either. Mainly mentioned it since a smoother path might be helpful for the advanced users that are employing ShareDict generously in their codebases. Though there is really no substitute for migrating at some point.

@mrocklin mrocklin mentioned this pull request Nov 20, 2018
@jakirkham
Member

Where do you think this fits relative to 1.0.0?

@mrocklin
Member Author

mrocklin commented Nov 27, 2018 via email

@mrocklin
Member Author

mrocklin commented Dec 7, 2018

If there are no objections then I plan to merge this on Monday.
