HighLevelGraphs to the Scheduler #4140

Merged
mrocklin merged 14 commits into dask:master from madsbk:update_graph_hlg
Nov 5, 2020

Conversation

@madsbk
Contributor

@madsbk madsbk commented Oct 1, 2020

This PR implements pack and unpack of high-level graphs, which makes it possible for the Client to send an HLG directly to the scheduler.

@madsbk madsbk mentioned this pull request Oct 1, 2020
@jakirkham
Member

What about having serialize and deserialize methods on these objects that we call instead? That would avoid needing pickling and the alluded to security concerns around it.

@madsbk
Contributor Author

madsbk commented Oct 14, 2020

What about having serialize and deserialize methods on these objects that we call instead? That would avoid needing pickling and the alluded to security concerns around it.

I think this is essentially what we are doing right now, just using __reduce__() instead of serialize()?
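For context, the __reduce__() protocol referenced here is standard Python pickling: an object tells pickle how to reconstruct itself, which amounts to a built-in serialize/deserialize pair. A minimal sketch (ToyLayer is illustrative, not dask's actual Layer class):

```python
import pickle

class ToyLayer:
    """Illustrative stand-in for a graph layer; not dask's actual Layer class."""
    def __init__(self, tasks):
        self.tasks = tasks

    def __reduce__(self):
        # Tell pickle how to rebuild this object: call ToyLayer(self.tasks).
        # This plays the same role an explicit serialize()/deserialize() pair would.
        return (ToyLayer, (self.tasks,))

layer = ToyLayer({("x", 0): ("inc", 1)})
restored = pickle.loads(pickle.dumps(layer))
assert restored.tasks == {("x", 0): ("inc", 1)}
```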

@jrbourbeau
Member

Woo, nice green check marks 🎉

@sjperkins sjperkins mentioned this pull request Oct 26, 2020
@madsbk
Contributor Author

madsbk commented Oct 27, 2020

I have been working on this PR for some time now. Initially, my idea was to implement and use map_tasks() from dask/dask#6689 to handle preprocessing of the high-level graph and use the Layer.__reduce__() methods to handle serialization before sending it to the scheduler. This is a clean, generic approach, but it doesn't work: Distributed requires a very specific protocol when updating the scheduler task graph, which makes a generic implementation very hard (if not impossible).

I have compiled a list of requirements and thoughts:

  • We should support BasicLayer efficiently

  • Converting all keys to strings requires domain knowledge in layers like Blockwise where all keys aren't materialized.

  • A layer needs ALL keys in the HLG in order to calculate dependencies.

  • Dependencies must be calculated before dumps_task() thus dependency calculation must be part of serialization. Some layers can infer key dependencies but consider layers such as Blockwise where kwargs can contain any key. Not to mention BasicLayer, which has no prior dependency knowledge.

  • Before serialization, we need to unpack remote data (Futures). This requires map_tasks() to provide key information, which can be problematic: [WIP] Add optional pass_key argument to map_tasks dask#6761 (comment).

  • Additionally, we need to search for Futures inside tuples, which map_tasks() cannot support, and we need to incorporate the dependencies of the unpacked Futures into the existing key dependencies without having to recalculate anything.

  • Currently, ordering has to be done on the original keys and not after they have been converted to string: Prioritize tasks with their true keys, before stringifying #2006


Because of these requirements, I am exploring a new approach where the serialization and deserialization of layers is specialized. I introduce dumps_highlevelgraph(), which does all the preprocessing (unpacking Futures, converting keys to strings, calculating dependencies that cannot be inferred on the scheduler, and dumping tasks), and loads_highlevelgraph(), which deserializes the required layers and returns a materialized graph and all key dependencies.
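A rough sketch of the shape this pack/unpack pair takes. The function names come from the comment above, but the bodies here are simplified assumptions: the real implementation also unpacks Futures, calls dumps_task(), and specializes per layer type.

```python
def dumps_highlevelgraph(layers, key_dependencies):
    """Pack each layer's tasks plus client-side key dependencies (simplified sketch)."""
    return [
        {
            "tasks": {str(k): v for k, v in layer.items()},
            "deps": {str(k): [str(d) for d in key_dependencies.get(k, ())]
                     for k in layer},
        }
        for layer in layers
    ]

def loads_highlevelgraph(packed):
    """Rebuild a materialized flat graph and its key dependencies (simplified sketch)."""
    dsk, deps = {}, {}
    for layer in packed:
        dsk.update(layer["tasks"])
        for k, d in layer["deps"].items():
            deps[k] = set(d)
    return dsk, deps

layers = [{("x", 0): ("inc", 1)}, {("y", 0): ("add", ("x", 0), 1)}]
dsk, deps = loads_highlevelgraph(
    dumps_highlevelgraph(layers, {("y", 0): {("x", 0)}})
)
assert deps["('y', 0)"] == {"('x', 0)"}
```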

The current implementation does all this in Distributed; however, we should consider moving it to the Layer classes in Dask. But note that these serialization methods will be very Distributed-specific and use a lot of functions and classes from Distributed.

NB: the current push is just a rough implementation that always falls back to BasicLayer serialization.

@mrocklin
Member

@madsbk if you're around and want to have a high bandwidth conversation I would be interested. Some quick comments/questions

Converting all keys to strings requires domain knowledge in layers like Blockwise where all keys aren't materialized.

Materializing all keys might be ok. Ideally we don't have to materialize the entire graph until we get to the scheduler.

We could also convert to strings after we get to the scheduler and have the full graph.

A layer needs ALL keys in the HLG in order to calculate dependencies.

Again, I think that materializing keys will be ok for now (this is, I think, likely to be faster than the graph). In the future I hope that we can avoid cases where we need to calculate dependencies in the common case.

Dependencies must be calculated before dumps_task() thus dependency calculation must be part of serialization. Some layers can infer key dependencies but consider layers such as Blockwise where kwargs can contain any key. Not to mention BasicLayer, which has no prior dependency knowledge.

Hrm, I'm curious about this. I'll need to think about this more.

Before serialization, we need to unpack remote data (Futures). This requires map_tasks() to provide key information, which can be problematic: dask/dask#6761 (comment).

Also curious.

Additionally, we need to search for Futures inside tuples, which map_tasks() cannot support, and we need to incorporate the dependencies of the unpacked Futures into the existing key dependencies without having to recalculate anything.

We can drop this if we need to

Currently, ordering has to be done on the original keys and not after they have been converted to string: #2006

Ah, right. Well, maybe we can change how we stringify in order to keep ordering. Maybe we do both ordering and stringifying on the scheduler after we've moved everything over.
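The ordering hazard here can be shown with plain Python: dask keys are tuples, and their string forms sort differently, so ordering after stringification can disagree with ordering before it (illustrative example, not dask code):

```python
# Tuple keys sort numerically, but their string forms sort lexicographically.
keys = [("x", 2), ("x", 10)]
assert sorted(keys) == [("x", 2), ("x", 10)]                  # numeric: 2 < 10
assert sorted(map(str, keys)) == ["('x', 10)", "('x', 2)"]    # "1" < "2" flips it

# Zero-padding the index restores numeric order in string form -- one way
# to "change how we stringify in order to keep ordering":
padded = ["('x', %04d)" % i for _, i in keys]
assert sorted(padded) == ["('x', 0002)", "('x', 0010)"]
```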

@madsbk
Contributor Author

madsbk commented Oct 29, 2020

Materializing all keys might be ok. Ideally we don't have to materialize the entire graph until we get to the scheduler.

Notice, the keys here also refer to the keys inside tasks. Since the tasks might not exist yet, it requires domain knowledge to stringify them.

We could also convert to strings after we get to the scheduler and have the full graph.

True, it is probably possible to serialize keys using msgpack in most cases, but dumping tasks before stringification means that a dumped task may contain non-stringified keys.
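The nested-key problem being described, sketched with plain dicts (illustrative; the stringify helper is an assumption for this sketch, not distributed's actual code):

```python
dsk = {
    ("x", 0): ("inc", 1),
    ("x", 1): ("inc", 2),
    ("y", 0): ("add", ("x", 0), ("x", 1)),  # keys nested inside a task
}

# Stringifying only the mapping's keys leaves the embedded keys untouched,
# so a dumped task can still contain non-stringified keys:
naive = {str(k): v for k, v in dsk.items()}
assert naive["('y', 0)"] == ("add", ("x", 0), ("x", 1))

# Rewriting nested keys requires knowing which tuples *are* keys,
# i.e. the full key set -- the "domain knowledge" referred to above:
all_keys = set(dsk)

def stringify(task):
    if isinstance(task, tuple) and task in all_keys:
        return str(task)
    if isinstance(task, tuple):
        return tuple(stringify(t) for t in task)
    return task

full = {str(k): stringify(v) for k, v in dsk.items()}
assert full["('y', 0)"] == ("add", "('x', 0)", "('x', 1)")
```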

@madsbk
Contributor Author

madsbk commented Nov 2, 2020

Moved packing of shuffle layers to Dask: dask/dask#6786, so CI will fail until that has been merged.

Comment on lines +78 to +90
layers.append(
    {
        "__module__": None,
        "__name__": None,
        "state": _materialized_layer_pack(
            layer,
            hlg.get_all_external_keys(),
            hlg.key_dependencies,
            allowed_client,
            allows_futures,
        ),
    }
)
Member

What if this was the implementation of Layer.__dask_distributed_pack__? That might simplify things a bit here

Member

We would import the relevant distributed functions within the method in the Dask codebase.

Member

It's not a big deal either way, but it might isolate things a bit.

Contributor Author

I think it is best to have the fallback in Distributed for now. Long-term it would be nice to have it in Layer.__dask_distributed_pack__(), but right now it requires a lot of very Distributed-specific things like key_dependencies, allowed_client, and allows_futures.
Furthermore, my plan is to also incorporate dsk.map_basic_layers() into distributed_pack() in order to remove both map_basic_layers() and map_tasks() from the HLG API.

@mrocklin
Member

mrocklin commented Nov 4, 2020

@madsbk is there anything that I can do or should look at to help here?

@madsbk
Contributor Author

madsbk commented Nov 4, 2020

@madsbk is there anything that I can do or should look at to help here?

The final issue is ordering, which has to be done before converting keys to strings. This is what is failing in CI.

@mrocklin
Member

mrocklin commented Nov 4, 2020

If we do ordering on the scheduler side after stringification I suspect that everything will work the same in almost all cases. I would be willing to take the hit on those cases in order to move forward. Is this an easy option for you?

@madsbk
Contributor Author

madsbk commented Nov 4, 2020 via email

@mrocklin
Member

mrocklin commented Nov 4, 2020

That seems sensible to me

@madsbk
Contributor Author

madsbk commented Nov 5, 2020

@mrocklin, I managed to support ordering but it requires a fancy string comparison in order.order(): dask/dask#6807

@mrocklin
Member

mrocklin commented Nov 5, 2020

Let's xfail the test for now, include a link to the order PR in the reason= keyword, and then merge this.

I want to think a bit about prefixing with zeroes before we commit to it.

@madsbk madsbk marked this pull request as ready for review November 5, 2020 14:24
@madsbk madsbk changed the title [WIP] HighLevelGraphs to the Scheduler HighLevelGraphs to the Scheduler Nov 5, 2020
@madsbk
Contributor Author

madsbk commented Nov 5, 2020

@mrocklin @jrbourbeau, I think this is ready to be merged. It lacks documentation of the pack/unpack API, but let's hold off on that until we settle on an API.

@mrocklin
Member

mrocklin commented Nov 5, 2020

I'll take another look in a bit and then merge in.

@mrocklin
Member

mrocklin commented Nov 5, 2020

This looks slick. Thank you for figuring this out @madsbk

@mrocklin mrocklin merged commit 09d9799 into dask:master Nov 5, 2020
@madsbk madsbk deleted the update_graph_hlg branch November 6, 2020 07:58
@mrocklin
Member

mrocklin commented Nov 6, 2020

@sjperkins this is in. Now would probably be a good time to add in support for annotations. I think that they should go in client.py::Client._graph_to_futures

@sjperkins
Member

@sjperkins this is in. Now would probably be a good time to add in support for annotations. I think that they should go in client.py::Client._graph_to_futures

@mrocklin Thanks for the heads-up. My schedule is busy until next week Thursday. Would it be OK if I picked it up then?

@mrocklin
Member

mrocklin commented Nov 6, 2020 via email

sonicxml added a commit to sonicxml/distributed that referenced this pull request Dec 3, 2020

5 participants