
High level graph pack/unpack for Distributed #6786

Merged
mrocklin merged 8 commits into dask:master from madsbk:hlg_serialization
Nov 3, 2020

Conversation

@madsbk (Contributor)

@madsbk madsbk commented Nov 2, 2020

This PR introduces packing and unpacking of HLG layers:

    def distributed_pack(self) -> Optional[Any]:
        """Pack the layer for scheduler communication in Distributed

        This method should pack its current state and is called by the Client when
        communicating with the Scheduler.
        The Scheduler will then use .distributed_unpack(data, ...) to unpack the
        state, materialize the layer, and merge it into the global task graph.

        The returned state must be compatible with Distributed's scheduler, which
        means it must obey the following:
          - Serializable by msgpack
          - All remote data must be unpacked (see unpack_remotedata())
          - All keys must be converted to strings now or when unpacking
          - All tasks must be serialized (see dumps_task())

        Alternatively, the method can return None, which will make Distributed
        materialize the layer and use a default packing method.

        Returns
        -------
        state: Object serializable by msgpack
            Scheduler compatible state of the layer
        """
        return None

    @classmethod
    def distributed_unpack(
        cls, state: Any, dsk: Dict[str, Any], dependencies: Mapping[Hashable, Set]
    ) -> None:
        """Unpack the state of a layer previously packed by .distributed_pack()

        This method is called by the scheduler in Distributed in order to unpack
        the state of a layer and merge it into its global task graph. The method
        should update `dsk` and `dependencies`, which are the already materialized
        state of the preceding layers in the high level graph. The layers of the
        high level graph are unpacked in topological order.

        See Layer.distributed_pack() for packing detail.

        Parameters
        ----------
        state: Any
            The state returned by Layer.distributed_pack()
        dsk: dict
            The materialized low level graph of the already unpacked layers
        dependencies: Mapping
            The dependencies of each key in `dsk`
        """
        raise NotImplementedError(f"{type(cls)} doesn't implement distributed_unpack()")
  • Tests added / passed
  • Passes black dask / flake8 dask
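As a rough illustration of the protocol described in the docstrings above, here is a minimal, self-contained sketch — `SimpleLayer` and its task format are hypothetical stand-ins, not dask's actual `Layer` class, and real tasks would additionally go through `dumps_task()`:

```python
# Hypothetical sketch of the pack/unpack protocol described above.
# SimpleLayer and its task format are illustrative stand-ins, not
# dask's actual Layer class.

class SimpleLayer:
    """A toy layer holding a mapping of key -> task."""

    def __init__(self, tasks):
        self.tasks = tasks

    def distributed_pack(self):
        # Return a msgpack-friendly structure: plain dicts, lists and
        # strings only, with all keys converted to strings.
        return {"tasks": {str(k): v for k, v in self.tasks.items()}}

    @classmethod
    def distributed_unpack(cls, state, dsk, dependencies):
        # Merge the packed state into the global task graph in place.
        for key, task in state["tasks"].items():
            dsk[key] = task
            dependencies[key] = set()


dsk, dependencies = {}, {}
layer = SimpleLayer({("x", 0): ["add", 1, 2]})
state = layer.distributed_pack()
SimpleLayer.distributed_unpack(state, dsk, dependencies)
print(sorted(dsk))  # the layer's keys, now stringified
```

Note how unpack mutates `dsk` and `dependencies` in place rather than returning a new graph, mirroring how the scheduler merges each layer into its global task graph in topological order.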

@madsbk madsbk marked this pull request as ready for review November 2, 2020 16:24
@madsbk (Contributor Author)

madsbk commented Nov 2, 2020

@jrbourbeau @mrocklin, this is ready for review (and merge). We might have to add some extra arguments to distributed_pack() in order to implement blockwise but this should be sufficient for shuffle.

        return out_d

    def is_materialized(self):
        return hasattr(self, "_cached_dict")
Member:

What are these for? Long-term, I would hope that we would not materialize the graphs, or even if we did materialize them we might not want to ship the materialized forms. Am I right to hope that these become unnecessary?

@madsbk (Contributor Author) commented Nov 3, 2020:

Yes, long-term this shouldn't be necessary. But I suspect that it will take some time before we get to that point :)
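The `_cached_dict` pattern behind `is_materialized()` can be sketched as a minimal stand-alone analogue (`LazyLayer` and its helpers are illustrative, not dask's code): the full task dict is built once on demand and cached on the instance, so the presence of the cache attribute tells us whether the layer has been materialized.

```python
# Minimal analogue of the caching pattern behind is_materialized().
# LazyLayer is hypothetical; dask's real layers cache differently in detail.

class LazyLayer:
    def __init__(self, n):
        self.n = n

    def _construct_graph(self):
        # The expensive step: generate every task in the layer.
        return {f"x-{i}": ("inc", i) for i in range(self.n)}

    def materialize(self):
        if not hasattr(self, "_cached_dict"):
            self._cached_dict = self._construct_graph()
        return self._cached_dict

    def is_materialized(self):
        return hasattr(self, "_cached_dict")


layer = LazyLayer(3)
print(layer.is_materialized())  # False: still symbolic
layer.materialize()
print(layer.is_materialized())  # True: tasks have been generated
```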

input_keys.update(v)

raw = dict(obj)
raw = str_graph(raw, extra_values=input_keys)
Member:

I was a bit curious about the call to str_graph here. I understand that we need to do this at some point but I was hoping that it could be general. Thinking about this again though, this seems like the kind of thing that we do want to special-case in some circumstances (maybe we never make the tuples) so I guess it's maybe a good idea regardless.
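Conceptually, the stringification discussed here turns dask's tuple keys into scheduler-friendly strings. A hedged sketch with illustrative helper names (not dask's actual `str_graph` implementation):

```python
# Illustrative sketch of key stringification: dask keys are often
# tuples like ("x", 0), but the distributed scheduler expects string
# keys. These helpers are hypothetical, not dask's str_graph.

def stringify_key(key):
    return str(key) if isinstance(key, tuple) else key

def stringify_graph(dsk):
    return {stringify_key(k): v for k, v in dsk.items()}


graph = {("x", 0): ("inc", 1), ("x", 1): ("inc", 2)}
print(stringify_graph(graph))
# keys become "('x', 0)" and "('x', 1)"
```

The real `str_graph` also has to rewrite key references that appear inside task tuples, which is presumably what the `extra_values=input_keys` argument in the snippet above is for; this sketch only handles top-level keys.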

"npartitions_input": self.npartitions_input,
"ignore_index": self.ignore_index,
"name_input": self.name_input,
"meta_input": self.meta_input.to_json(),
Member:

With to_json I think we can lose data type information. For example:

In [22]: df
Out[22]:
     a  b
0  1.0  4
1  2.0  5
2  3.0  6

In [23]: df.dtypes
Out[23]:
a    float32
b      UInt8
dtype: object

In [24]: pd.read_json(df.to_json())
Out[24]:
   a  b
0  1  4
1  2  5
2  3  6

In [25]: pd.read_json(df.to_json()).dtypes
Out[25]:
a    int64
b    int64
dtype: object

Contributor Author:

Good catch. I have fixed it by using to_serialize(), which also works with other types like cudf.DataFrame.
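The dtype loss shown above is one instance of a general problem: JSON round-trips collapse type information, while pickle-style serialization (the kind of fallback Distributed's serialization machinery can use for arbitrary objects) preserves it. A pandas-free analogue:

```python
import json
import pickle

# Pandas-free analogue of the dtype problem above: a JSON round-trip
# collapses type information, while a pickle round-trip preserves it.
value = (1, 2)  # a tuple

via_json = json.loads(json.dumps(value))
via_pickle = pickle.loads(pickle.dumps(value))

print(type(via_json).__name__)    # list: the tuple type was lost
print(type(via_pickle).__name__)  # tuple: type preserved
```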

Comment on lines +46 to +54
The main motivation of a layer is to represent a collection of tasks
symbolically in order to speed up a series of operations significantly.
Ideally, a layer should stay in this symbolic state until execution,
but in practice some operations will force the layer to generate all
its internal tasks. We say that the layer has been materialized.

Most of the default implementations in this class will materialize the
layer. It is up to derived classes to provide non-materializing
implementations.
Member:

Thank you for taking the time to add this note

@madsbk (Contributor Author) commented Nov 3, 2020

@mrocklin (Member) commented Nov 3, 2020

In [1]: import dask.array as da
In [2]: x = da.ones(10)
In [3]: y = x + 1
In [4]: y.__dask_graph__().layers[y.name]
Out[4]: Blockwise<(('ones-a7883aa065f53e381957f6940baaf48b', ('.0',)), (1, None)) -> add-a701e8a9322af8d1565010d507adf66a>

In [5]: y.__dask_graph__().layers[y.name].__dask_distributed_pack__()

@mrocklin (Member) commented Nov 3, 2020

Whoops, misfire on the comment. Should we be concerned that for some layers this returns None? Or is the plan that if this returns None we materialize?

@mrocklin (Member) commented Nov 3, 2020


    def is_materialized(self):
        return hasattr(self, "_cached_dict")

Member:

Would it make sense to implement Blockwise.__dask_distributed_pack__? Maybe this would be a good task for @rjzamora given his current work on Blockwise.

Contributor Author:

Yes, Blockwise.__dask_distributed_pack__() is definitely next!

Contributor Author:

But we might need some extra API to handle unpacking of remote data.

@mrocklin mrocklin merged commit b439233 into dask:master Nov 3, 2020
@mrocklin (Member) commented Nov 3, 2020

This is in. Thanks @madsbk !

@madsbk madsbk deleted the hlg_serialization branch November 4, 2020 18:48