
Refactor HighLevelGraph Layers to use ToPickle and reduce boilerplate #8672

Closed
rjzamora wants to merge 48 commits into dask:main from rjzamora:to_pickle

Conversation

@rjzamora (Member) commented Feb 7, 2022

This PR refactors Layer and its subclasses to simplify serialization and reduce redundant graph-caching and culling logic. These changes were motivated by the serialization change in distributed#5728 (which proposes a new ToPickle protocol).

This PR ultimately demonstrates that new Layer implementations can be dramatically simplified if we go "all in" on the new ToPickle protocol, and simply require that pickle.loads be allowed on the scheduler for any (non-MaterializedLayer) Layer instance that is sent to and materialized on the scheduler. That is, for cases where the "distributed.scheduler.pickle" config option is set to False, we will simply convert all HighLevelGraph layers to MaterializedLayer instances before communicating them to the scheduler.

TODO:

  • Update base Layer class to implement ToPickle serialization, basic culling, and graph-caching logic
  • Revise all DataFrame-specific Layers to inherit from the centralized serialization/culling/caching logic
  • Revise Blockwise to inherit from the centralized serialization/caching logic
  • (BLOCKER) Get distributed#5728 merged (This PR will "work" without ToPickle, but all layers will be converted to MaterializedLayers on the client)
  • Roll back temporary CI-config changes to point back to main branch of distributed
  • Establish if/how backward compatibility (pre-ToPickle) should be handled (I am thinking we can just convert all Layers to MaterializedLayer objects on the client). [Current Solution: Convert HLG Layers to MaterializedLayer objects when distributed.protocol.serialize.ToPickle is not available.]
  • Establish clear configuration option for scheduler- vs client-side materialization. The indirect nature of the current "optimization.fuse.active" option is confusing. [Current Solution: Use existing "distributed.scheduler.pickle" config option. When this is set to False, we always convert HLG Layers to MaterializedLayer objects before sending to the scheduler.]
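The fallback described in the last two items can be sketched roughly as follows. This is a toy illustration using simplified stand-in classes (not dask's actual Layer/MaterializedLayer implementations): when pickle is disabled on the scheduler, every abstract layer is converted to a materialized (plain-dict) layer before the graph is sent.

```python
from collections.abc import Mapping


class Layer(Mapping):
    """Simplified stand-in for an abstract dask Layer."""

    def __init__(self, tasks, annotations=None):
        self._tasks = tasks
        self.annotations = annotations

    def __getitem__(self, key):
        return self._tasks[key]

    def __iter__(self):
        return iter(self._tasks)

    def __len__(self):
        return len(self._tasks)


class MaterializedLayer(Layer):
    """Layer backed by a fully materialized task dict."""


def materialize_layers(layers, pickle_allowed):
    """If pickling is disabled on the scheduler, convert every abstract
    layer to a MaterializedLayer; otherwise pass the layers through."""
    if pickle_allowed:
        return layers
    return {
        name: layer
        if isinstance(layer, MaterializedLayer)
        else MaterializedLayer(dict(layer), layer.annotations)
        for name, layer in layers.items()
    }
```

Materializing via `dict(layer)` preserves annotations but gives up the compact abstract representation, which is why this path is only a fallback.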

Follow-up Work:

  • (Try to) Remove __dask_distributed_pack__ and __dask_distributed_unpack__ from the code-base completely (since packing will be identical for all materialized layers, and a single ToPickle call will suffice for unmaterialized layers)
  • Redistribute the collection-specific code defined in layers.py (Maybe?)
  • Add/Revise "how-to" developer documentation for implementing a new Layer

attrs = list(self.layer_state.keys()) + ["annotations"]
return (self.__class__, tuple(getattr(self, attr) for attr in attrs))

def __dask_distributed_pack__(self, all_hlg_keys, *args, **kwargs):
Member

Is it possible to remove this protocol?

Member Author

Yes - We should be able to remove this protocol next, since there should only be two different cases now: a "materialized" layer, or an "abstract" layer (requiring a single ToPickle call). However, I would prefer to leave this change for a follow-up PR (mostly to make this PR easier to review).

@rjzamora rjzamora marked this pull request as ready for review February 24, 2022 17:49
@rjzamora rjzamora added the needs review Needs review from a contributor. label Mar 1, 2022
@rjzamora rjzamora added the highlevelgraph Issues relating to HighLevelGraphs. label Mar 22, 2022
@rjzamora (Member Author)

@ian-r-rose @gjoseph92 - Any interest in reviewing this (or know who should)? :)

@ian-r-rose ian-r-rose self-requested a review March 22, 2022 20:08
@ian-r-rose (Collaborator) left a comment

Thanks @rjzamora, this is heroic work. I love a PR that has more deletions than additions. I'm still digesting this, but here are some early thoughts.

My biggest concern is around the interface of the base Layer. It now makes some fairly strong assumptions about graph and partitioning structure (see, e.g., output_blocks, _keys_to_indices) which are not always true. An immediate consequence of this is that we need to allow output_blocks to be None everywhere to handle those cases. Right now that is mostly MaterializedLayer, but I could imagine other, weirder things involving Delayed layers or something. To me, this indicates that the Layer class is trying to do a bit too much. I understand that part of what you are trying to do here is consolidate logic, so that impulse makes a lot of sense! But it feels a bit too far right now.

I'm still thinking about this, and may have misunderstood something fundamental. But my instinct here is to actually have most of your Layer implementation be something like a PartitionedLayer, which handles a lot of logic around partitioned collections, output blocks, keys -> names -> indices -> keys translations, etc. And both that and MaterializedLayer could satisfy an abstract Layer interface that tries hard to be as close as possible to a pure Mapping interface + cull.
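A rough sketch of the split proposed here, with a minimal abstract Layer (close to a pure Mapping interface plus cull) and the partition-aware logic pushed down into a PartitionedLayer subclass. All names and signatures below are illustrative, not dask's actual API:

```python
from collections.abc import Mapping


class AbstractLayer(Mapping):
    """Minimal interface: a Mapping of keys to tasks, plus cull."""

    def cull(self, keys_to_keep):
        """Return a new layer containing only the tasks needed to
        produce ``keys_to_keep``."""
        raise NotImplementedError


class PartitionedLayer(AbstractLayer):
    """Adds the (name, index) key structure and output-block logic
    shared by partitioned collections (arrays, dataframes)."""

    def __init__(self, name, tasks):
        self.name = name
        self._tasks = tasks

    def __getitem__(self, key):
        return self._tasks[key]

    def __iter__(self):
        return iter(self._tasks)

    def __len__(self):
        return len(self._tasks)

    @staticmethod
    def _key_to_parts(key):
        # Assumes keys look like (name, i[, j, ...]) -- exactly the
        # assumption flagged above as too strong for the base class.
        name, *parts = key
        return name, tuple(parts)

    def cull(self, keys_to_keep):
        kept = {k: v for k, v in self._tasks.items() if k in keys_to_keep}
        return PartitionedLayer(self.name, kept)
```

The point of the split is that only PartitionedLayer bakes in the `(name, index)` key convention; layers with other key structures can still satisfy AbstractLayer.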

return keys_in_tasks(all_hlg_keys, [self[key]])

def __copy__(self):
"""Default shallow copy implementation"""
Collaborator

Who is using this?

Member Author

Not sure - this isn't new code, just moved.


# If pickle is disabled on the scheduler, all layers must
# be converted to `MaterializedLayer` objects before packing
materialize = not config.get("distributed.scheduler.pickle")
Collaborator

It's unfortunate that this config value doesn't really have any bearing on whether pickling is actually enabled on the scheduler. Can we instead add some metadata to the comm or similar and check whether the scheduler has it enabled?

@rjzamora (Member Author) commented Mar 24, 2022

Yeah - I guess it is possible for the config setting to be different on the scheduler and client, but I suspect this will cover most cases.

When the client tries to send pickled data to the scheduler, the scheduler will throw an error. Perhaps we should just update this error in distributed to clarify that the "pickle" config must be True on both the client and scheduler?

Collaborator

> I suspect this will cover most cases

I think this is true for local clusters, but if there is a remote cluster somewhere with pickling disabled, I expect it to fail more often than it succeeds (simply because most users won't have a local config for this).

The error message and documentation may be enough, though I still think a better user experience would be to introspect something on the client (which is an arg here) to determine whether to pickle something. Happy to leave that discussion for a follow-up, however, since it's separate from API design.

Member Author

> The error message and documentation may be enough, though I still think a better user experience would be to introspect something on the client (which is an arg here) to determine whether to pickle something. Happy to leave that discussion for a follow-up, however, since it's separate from API design.

I agree with the user-experience concern. Even with a clear error message, it will probably be confusing to run into serialization errors when the external cluster has pickling disabled (however rare it may be for people to explicitly disable pickling). I don't personally have any bright ideas about how we can introspect this information on the client. Do you think we need to implement something new to make this information available? I'm hoping we don't need to do something annoying, like ping the scheduler with a pickled test packet.

Collaborator

> Do you think we need to implement something new to make this information available? I'm hoping we don't need to do something annoying, like ping the scheduler with a pickled test packet.

Yeah, we would probably need to implement something new; I don't think this is possible today (at least, not without pickling a test packet as you suggest). But there is no fundamental reason the information couldn't be made available. I'm happy to defer this discussion, though!
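One hypothetical shape for the introspection discussed here: if the scheduler advertised its pickle setting in handshake/identity metadata, the client could decide locally whether to materialize layers, without a test packet. The `"pickle-enabled"` field name below is invented purely for illustration:

```python
def should_materialize(scheduler_info, local_config):
    """Decide whether to convert layers to materialized form before
    sending the graph.  Prefer a scheduler-advertised setting (the
    hypothetical "pickle-enabled" field); fall back to local config."""
    remote = scheduler_info.get("pickle-enabled")
    if remote is not None:
        return not remote
    return not local_config.get("distributed.scheduler.pickle", True)
```

With the fallback, a client lacking any local config would still default to sending abstract layers, which is the failure mode described above for remote clusters with pickling disabled.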

@rjzamora (Member Author)

Thanks for taking the time to look through this @ian-r-rose !

I completely agree with your concerns here about the Layer class trying to do too much. For example, it does feel a bit “forced” to have the output_blocks attribute living in the base Layer class. I originally wanted to define a PartitionedLayer class, but ended up putting everything in Layer because it reduced the amount of code.

In hindsight - I do agree that we should distinguish collection-specific layers a bit more clearly. (And so I just pushed some changes to do this)

Although I do want a PartitionedLayer class, I think it is worth considering that most layer definitions probably can support a concept very similar to “blocks”. The "big-data" collections use blocks/partitions literally, while (almost) everything else can be described with the same language if we want. For example, every Delayed object is really just a single-block/partition layer, and from_delayed is just mapping each of these single-block layers onto distinct output blocks. I think the only problem arises when you consider that a raw user-defined graph may have an arbitrary number (and naming scheme) of “output” tasks.

@ian-r-rose (Collaborator) commented Mar 24, 2022

> Although I do want a PartitionedLayer class, I do think it is worth considering that most layer definitions probably can support a concept very similar to “blocks”. The "big-data" collections use blocks/partitions literally, while (almost) everything else can be described with the same language if we want.

Fully agree here -- most of the cases we care about for big data assume some sort of partitioning. I wanted to raise it here for a couple of reasons:

  1. defining abstract base classes is exactly the right time to ask "what is the minimum interface I need to accomplish the goal?"
  2. While most of the important cases assume some partitioning, the current implementation makes a stronger assumption than that about key structure. I'm specifically thinking about _key_to_parts, which assumes that all keys can be unpacked into name, part. Of course, this is satisfied today by array and dataframe collections, but baking that into the API of a graph makes it harder to experiment with other partitioning representations. While I don't necessarily foresee that happening any time soon, I wouldn't want to preclude it without discussion.

Edit: another take (disagreeing with mine) on whether Layers should continue looking like Mappings from @jcrist : #7933 (comment)

@rjzamora (Member Author)

> Edit: another take (disagreeing with mine) on whether Layers should continue looking like Mappings from @jcrist : #7933 (comment)

I don't have strong feelings (yet) about whether a Layer should continue looking like a Mapping, but I do agree that there is no great reason to push the construct_graph/caching approach all the way down to the base Layer definition in this PR. For now, it doesn't really hurt to preserve the current "flexibility", but we can urge most developers to inherit from a simplified class (like PartitionedLayer) if/when they want to implement their own subclass.

Comment on lines +604 to +611
def __dask_distributed_pack__(
self,
all_hlg_keys: Iterable[Hashable],
known_key_dependencies: Mapping[Hashable, set],
client,
client_keys: Iterable[Hashable],
) -> Any:

Member

Why is this protocol still necessary? Could we just lift the entire graph and send it to the scheduler? What stops this?

Member

So, to make this concrete, how about the following diff.

diff --git a/distributed/client.py b/distributed/client.py
index f68570f7..bc3afc75 100644
--- a/distributed/client.py
+++ b/distributed/client.py
@@ -2904,7 +2904,6 @@ class Client(SyncMethodMixin):
 
             # Pack the high level graph before sending it to the scheduler
             keyset = set(keys)
-            dsk = dsk.__dask_distributed_pack__(self, keyset, annotations)
 
             # Create futures before sending graph (helps avoid contention)
             futures = {key: Future(key, self, inform=False) for key in keyset}
@@ -2912,7 +2911,7 @@ class Client(SyncMethodMixin):
             self._send_to_scheduler(
                 {
                     "op": "update-graph-hlg",
-                    "hlg": dsk,
+                    "hlg": ToPickle(dsk),
                     "keys": list(map(stringify, keys)),
                     "priority": priority,
                     "submitting_task": getattr(thread_state, "key", None),

I suspect that this doesn't actually work today, but I'd be curious why not, and if it's not doable.
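For context, the ToPickle idea itself is just a marker wrapper: anything wrapped in it is serialized with plain pickle rather than going through dask's per-type serialization handlers. Below is a toy stand-in showing the mechanism, not distributed's actual implementation:

```python
import pickle


class ToPickle:
    """Marker wrapper: serialize the wrapped object with plain pickle."""

    def __init__(self, data):
        self.data = data


def dumps(msg):
    # Serialization layer: ToPickle-wrapped payloads go through pickle
    # wholesale; anything else would take the per-type handler path.
    if isinstance(msg, ToPickle):
        return b"pickle:" + pickle.dumps(msg.data)
    raise NotImplementedError("per-type serialization path")


def loads(frame, allow_pickle=True):
    # Deserialization layer: refuse pickled frames when the receiving
    # side (e.g. the scheduler) has pickle disabled.
    tag, payload = frame.split(b":", 1)
    if tag == b"pickle":
        if not allow_pickle:
            raise RuntimeError("scheduler has pickle disabled")
        return pickle.loads(payload)
    raise NotImplementedError("per-type deserialization path")
```

This is also why the `allow_pickle=False` case matters for the diff above: wrapping the whole graph in ToPickle only works when the scheduler is willing to unpickle it.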

Member Author

> Why is this protocol still necessary? Could we just lift the entire graph and send it to the scheduler? What stops this?

Thanks for looking at this @mrocklin! I'll try to catch up on your various discussions/work from last week as soon as I can.

The short answer is that we absolutely can pickle the entire graph as you are suggesting. Continuing to use __dask_distributed_pack__ at the Layer level for this PR was only meant to be a simpler intermediate step.

My (possibly wrong) impression was that continuing to use the Layer-by-Layer approach makes it a bit easier to materialize layers without losing annotations. For now, I do get the sense that we should preserve the option for the user to disable pickle on the scheduler, because I know of some cases (like the dask-sql CI) where the scheduler has a different Python environment than the client/workers.

Is your suggestion that (1) we should require pickling to be enabled, (2) that we should drop annotations when pickling is disabled, or (3) something else?

Note that I am personally still a bit interested in seeing the scheduler written in another language at some point, so I like the idea of layers being "materialized" individually (whether it is in a dedicated __dask_distributed_pack__ routine or not)

@mrocklin (Member)

mrocklin commented Apr 4, 2022 via email


Labels: dataframe, highlevelgraph, io, needs review

4 participants