
[New feature] full code pickle for arbitrary modules #206

@crusaderky

Description

I have two problems with cloudpickle today:

Python modules outside of PYTHONPATH

I develop a high-level framework and allow the end user to inject plugins (arbitrary .py files containing subclasses of classes defined by my framework). These Python files are shipped by the user along with the data - that is, outside of the PYTHONPATH - and I load them with importlib. After all objects are loaded, I pickle the state of the whole framework for quick recovery and forensics. The problem is that, when loading such pickles, I have no idea where to find the user-provided Python files, so the unpickling fails. The workaround - a sidecar config file that lists all paths to be added to the PYTHONPATH before unpickling - is ugly at best.
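For illustration, here is a minimal sketch of the failure mode; the path and the MyPlugin class name are made up:

import importlib.util
import sys
import pickle

# Load a user-provided .py file from an arbitrary path outside the PYTHONPATH
path = "/data/run42/plugins/my_plugin.py"
spec = importlib.util.spec_from_file_location("my_plugin", path)
module = importlib.util.module_from_spec(spec)
sys.modules["my_plugin"] = module
spec.loader.exec_module(module)

obj = module.MyPlugin()   # subclass of a class defined by my framework
blob = pickle.dumps(obj)  # the class is stored by reference, as my_plugin.MyPlugin

# In a fresh interpreter, nothing knows where my_plugin.py lives:
# pickle.loads(blob)  ->  ModuleNotFoundError: No module named 'my_plugin'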

dask distributed on docker

A typical devops design for dask distributed is to have:

  • A generic docker image for the distributed scheduler (conda install distributed nomkl bokeh)
  • An application-specific docker image for the client
  • An application-specific docker image for the workers

The last point is a pain, particularly in the development phase, as it needs to contain all user-specific modules.
I can see that the distributed developers have already noticed this problem, and this is (I guess) why cloudpickle is rigged to fully serialise the code of any function or class that has been defined on the fly in a Jupyter Notebook. This is great for the very first stage of prototyping, but it doesn't help when the user has part of their development-phase code written down in custom Python files - which is the norm on any project that is not completely brand new.
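To illustrate the current behaviour (a minimal sketch; mymodule stands for any custom .py file that is importable on the client but absent from the worker image):

import cloudpickle

def defined_on_the_fly(x):  # lives in __main__ when typed in a notebook or REPL
    return x + 1

# Serialised by value: the function's code travels inside the pickle, so a
# bare-bones worker can execute it without having the source file at all.
payload = cloudpickle.dumps(defined_on_the_fly)

# A function imported from a custom module is instead serialised by reference,
# roughly as the pair ("mymodule", "some_function"); the worker then fails with
# ModuleNotFoundError unless mymodule is installed in its image.
# import mymodule
# cloudpickle.dumps(mymodule.some_function)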

It would be awesome if one could develop their whole application against a barebones, general-purpose docker worker (e.g. conda install dask distributed numpy scipy), which would just work(tm) as long as no custom C extensions are introduced.

Proposed design

I'm proposing two changes to address these problems:

  1. in the cloudpickle package, expose a set of packages whose contents must be pickled completely. This would replace the hardcoded __main__ references. The set would default to __main__ only, and the user would be able to add their own packages.
    It should not be necessary to specify individual subpackages - e.g. to enable full serialisation for foo.bar, adding foo to the set should be enough (see the sketch after this list).
  2. in the distributed package, add a config switch that specifies the order of pickle-like modules to try. This would also solve performance issues (see Everything is pickled twice dask/distributed#1371). It would default to the current pickle -> cloudpickle, but a user could choose to disable pickle and go for cloudpickle directly, or plug in custom modules.
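As a rough sketch of how the lookup behind point 1 could work (the names below are the proposed ones, not an existing cloudpickle API), matching on the top-level package is enough:

# Proposed: a module-level set inside cloudpickle (hypothetical, not today's API)
full_pickle_packages = {'__main__'}

def _should_pickle_by_value(obj):
    """Return True if obj's code should be serialised in full."""
    module = getattr(obj, '__module__', None) or '__main__'
    top_level = module.split('.')[0]  # 'foo.bar.baz' -> 'foo'
    return top_level in full_pickle_packages

# Adding 'foo' is enough to cover foo, foo.bar, foo.bar.baz, ...
full_pickle_packages.add('foo')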

So for example, in the first lines of my own code, I would have:

import cloudpickle
cloudpickle.full_pickle_packages.add('mypackage')

And in .dask/config.yaml:

# Order of pickle-like modules to be tried in sequence to serialize objects
pickle:
    # - pickle  # need to disable this to allow for full serialisation of mypackage
    - cloudpickle

With these changes, mypackage would only need to exist on the client side, not on the workers.
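For example, the client-side code could then look like this (everything here is hypothetical: full_pickle_packages is the proposed API, and mypackage.process and the scheduler address are placeholders):

import cloudpickle
cloudpickle.full_pickle_packages.add('mypackage')  # proposed API, as above

from distributed import Client
import mypackage

client = Client('scheduler:8786')
# mypackage.process is shipped by value, so the generic worker image
# does not need mypackage installed
future = client.submit(mypackage.process, 123)
result = future.result()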

Opinions / suggestions?
