
Serialization of layers#6693

Merged
jrbourbeau merged 5 commits into dask:master from madsbk:hlg_serialization
Oct 19, 2020
Conversation

@madsbk
Contributor

@madsbk madsbk commented Oct 1, 2020

This PR implements a default serialization of Layer. In order to be compatible with pickle, we use the __reduce__ protocol. However, for security reasons Distributed uses msgpack for serialization when communicating with the scheduler, so this must also be compatible with msgpack (see dask/distributed#4140).

    def __reduce__(self):
        """Default serialization implementation, which materializes the Layer

        This should follow the standard pickle protocol[1] but must always return
        a tuple and the arguments for the callable object must be compatible with
        msgpack. This is because Distributed uses msgpack to send Layers to the
        scheduler.

        [1] <https://docs.python.org/3/library/pickle.html#object.__reduce__>
        """
  • Tests added / passed
  • Passes black dask / flake8 dask
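To illustrate the contract the docstring describes, here is a minimal sketch (hypothetical class name, not dask's actual implementation): a layer whose __reduce__ returns a (callable, args) tuple in which the arguments are plain dicts and basic types, i.e. something msgpack could encode.

```python
# Hypothetical sketch, not the actual dask Layer: a default __reduce__
# that materializes the layer into a plain dict so the (callable, args)
# pair stays msgpack-compatible.


class MaterializedLayer:
    """Stand-in for a dask Layer; wraps a mapping of key -> task."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def __reduce__(self):
        # Materialize to a plain dict: msgpack can encode dicts of
        # basic types, but not arbitrary Python objects.
        return (MaterializedLayer, (self.mapping,))


layer = MaterializedLayer({"x": 1, "y": 2})
cls, args = layer.__reduce__()
restored = cls(*args)
```

Both pickle and a msgpack-based path can then rebuild the layer by calling `cls(*args)`.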

@mrocklin
Member

mrocklin commented Oct 2, 2020

Does msgpack respect these methods?

I'm curious to see how this would operate with Blockwise. We can do this as future work of course. I'm excited :)

@madsbk changed the title from "Serializaion of layers" to "Serialization of layers" on Oct 8, 2020
@madsbk
Contributor Author

madsbk commented Oct 12, 2020

Does msgpack respect these methods?

Not natively, but the changes in dask/distributed#4140 make msgpack call __reduce__.
See https://github.com/dask/distributed/pull/4140/files#diff-7a335df475cb573124e2aedec25f055aR118
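A rough pure-Python sketch of that dispatch idea (this is not distributed's actual code, just the general pattern of a msgpack-style "default" hook): unknown objects are converted via __reduce__ into a tagged, encodable form.

```python
# Sketch of the dispatch pattern only (hypothetical, not distributed's
# actual hook): a msgpack-style "default" callback that serializes
# unknown objects via their __reduce__ method.


def msgpack_default(obj):
    reduced = obj.__reduce__()
    if not isinstance(reduced, tuple):
        raise TypeError("__reduce__ must return a tuple")
    func, args = reduced[0], reduced[1]
    # Encode as a tagged pair the receiving side could use to rebuild
    # the object; args must already be msgpack-encodable.
    return {"__reduced__": (func.__module__ + "." + func.__qualname__, args)}


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __reduce__(self):
        return (Point, (self.x, self.y))


encoded = msgpack_default(Point(1, 2))
```

In real msgpack usage this function would be passed as `msgpack.packb(obj, default=msgpack_default)`.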

@jrbourbeau
Member

Layer.__reduce__ appears to be interacting poorly with dask/tests/test_highgraph.py::test_map_tasks. @madsbk, if you're able to diagnose it, that'd be great; otherwise I can look into it tomorrow.

@madsbk
Contributor Author

madsbk commented Oct 14, 2020

Layer.__reduce__ appears to be interacting poorly with dask/tests/test_highgraph.py::test_map_tasks.

The issue is that Blockwise.map_task() calls copy.copy(), which in turn calls __reduce__(), and since the default implementation of __reduce__() materializes the Layer, we get the error.
I have fixed this by implementing a default __copy__(). Alternatively, we could avoid using __reduce__() altogether and define our own API.
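The failure mode and the fix can be sketched as follows (class name and attributes are hypothetical, not dask's actual Layer): copy.copy() falls back to __reduce_ex__/__reduce__ when a class defines no __copy__, so the default serialization would materialize the graph on every copy, while a custom __copy__ avoids that.

```python
import copy

# Hedged sketch of the bug and fix (hypothetical class, not dask's code).
# Without __copy__, copy.copy() would fall back to __reduce__ and
# materialize the layer; defining __copy__ keeps the copy cheap.


class LazyLayer:
    def __init__(self, func, npartitions):
        self.func = func
        self.npartitions = npartitions
        self.materialize_count = 0  # track the expensive step

    def materialize(self):
        # The expensive step we want to avoid during a plain copy.
        self.materialize_count += 1
        return {("x", i): (self.func, i) for i in range(self.npartitions)}

    def __reduce__(self):
        # Default serialization: materializes the layer into a plain dict.
        return (dict, (self.materialize(),))

    def __copy__(self):
        # The fix: shallow-copy attributes without materializing.
        new = object.__new__(type(self))
        new.__dict__.update(self.__dict__)
        return new


layer = LazyLayer(str, 3)
clone = copy.copy(layer)  # uses __copy__, so materialize() never runs
```

Removing the __copy__ method from this sketch makes copy.copy() route through __reduce__ instead, which is the interaction the test exposed.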

@madsbk
Contributor Author

madsbk commented Oct 19, 2020

@jrbourbeau if CI passes, I think this is ready to be merged.

Member

@jrbourbeau jrbourbeau left a comment

Thanks @madsbk!

There was John's comment (dask/distributed#4140 (comment)) about not using __reduce__ and instead defining separate methods. I'm happy to revisit that later and go with this approach for now, so as not to block progress on dask/distributed#4140.

@jrbourbeau jrbourbeau merged commit 2581be6 into dask:master Oct 19, 2020
@madsbk madsbk deleted the hlg_serialization branch October 21, 2020 07:54
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020

3 participants