
[New feature] full code pickle for arbitrary modules #206

@crusaderky

Description

I have two problems with cloudpickle today:

Python modules outside of PYTHONPATH

I develop a high-level framework and allow the end user to inject plugins (arbitrary .py files containing subclasses of classes defined by my framework). These Python files are shipped by the user along with the data - that is, outside of the PYTHONPATH - and I load them with importlib. After all objects are loaded, I pickle the state of the whole framework for quick recovery and forensics. The problem is that, when loading such pickles, I have no idea where to find the user-provided Python files, so the unpickling fails. The workaround - a sidecar config file that lists all paths to be added to the PYTHONPATH before unpickling - is ugly at best.
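For illustration, here is a minimal sketch of the failure mode; the path and the MyPlugin class name are made up:

import importlib.util
import sys
import pickle

# Load a user-provided .py file from an arbitrary path outside the PYTHONPATH
path = "/data/run42/plugins/my_plugin.py"
spec = importlib.util.spec_from_file_location("my_plugin", path)
module = importlib.util.module_from_spec(spec)
sys.modules["my_plugin"] = module
spec.loader.exec_module(module)

obj = module.MyPlugin()   # subclass of a class defined by my framework
blob = pickle.dumps(obj)  # the class is stored by reference, as my_plugin.MyPlugin

# In a fresh interpreter, nothing knows where my_plugin.py lives:
# pickle.loads(blob)  ->  ModuleNotFoundError: No module named 'my_plugin'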

dask distributed on docker

A typical devops design for dask distributed is to have:

  • A generic docker image for the distributed scheduler (conda install distributed nomkl bokeh)
  • An application-specific docker image for the client
  • An application-specific docker image for the workers

The last point is a pain, particularly in the development phase, as it needs to contain all user-specific modules.
I can see that the distributed developers have already noticed this problem, and this is (I guess) why cloudpickle is rigged to fully serialise the code of any function or class that has been defined on the fly in a Jupyter Notebook. This is great for the very first stage of prototyping, but it doesn't help when the user has part of their development-phase code written down in custom Python files - which is the norm on any project that is not completely brand new.
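To illustrate the current behaviour (a minimal sketch; mymodule stands for any custom .py file that is importable on the client but absent from the worker image):

import cloudpickle

def defined_on_the_fly(x):  # lives in __main__ when typed in a notebook or REPL
    return x + 1

# Serialised by value: the function's code travels inside the pickle, so a
# bare-bones worker can execute it without having the source file at all.
payload = cloudpickle.dumps(defined_on_the_fly)

# A function imported from a custom module is instead serialised by reference,
# roughly as the pair ("mymodule", "some_function"); the worker then fails with
# ModuleNotFoundError unless mymodule is installed in its image.
# import mymodule
# cloudpickle.dumps(mymodule.some_function)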

It would be awesome if one could develop their whole application against a barebones, general-purpose docker worker (e.g. conda install dask distributed numpy scipy), which would just work(tm) as long as no custom C extensions are introduced.

Proposed design

I'm proposing two changes to address these problems:

  1. in the cloudpickle package, expose a set of packages whose contents must be pickled completely. This would replace the hardcoded __main__ references. The set would default to __main__ only, and the user would be able to add their own packages.
    It should not be necessary to specify individual subpackages - e.g. to enable full serialisation for foo.bar, adding foo to the set should be enough (see the sketch after this list).
  2. in the distributed package, add a config switch that specifies the order of pickle-like modules to try. This would also solve performance issues (see Everything is pickled twice dask/distributed#1371). It would default to the current pickle -> cloudpickle, but a user could choose to disable pickle and go for cloudpickle directly, or plug in custom modules.
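As a rough sketch of how the lookup behind point 1 could work (the names below are the proposed ones, not an existing cloudpickle API), matching on the top-level package is enough:

# Proposed: a module-level set inside cloudpickle (hypothetical, not today's API)
full_pickle_packages = {'__main__'}

def _should_pickle_by_value(obj):
    """Return True if obj's code should be serialised in full."""
    module = getattr(obj, '__module__', None) or '__main__'
    top_level = module.split('.')[0]  # 'foo.bar.baz' -> 'foo'
    return top_level in full_pickle_packages

# Adding 'foo' is enough to cover foo, foo.bar, foo.bar.baz, ...
full_pickle_packages.add('foo')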

So for example, in the first lines of my own code, I would have:

import cloudpickle
cloudpickle.full_pickle_packages.add('mypackage')

And in .dask/config.yaml:

# Order of pickle-like modules to be tried in sequence to serialize objects
pickle:
    # - pickle  # need to disable this to allow for full serialisation of mypackage
    - cloudpickle

With these changes, mypackage would only need to exist on the client side, not on the workers.
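For example, the client-side code could then look like this (everything here is hypothetical: full_pickle_packages is the proposed API, and mypackage.process and the scheduler address are placeholders):

import cloudpickle
cloudpickle.full_pickle_packages.add('mypackage')  # proposed API, as above

from distributed import Client
import mypackage

client = Client('scheduler:8786')
# mypackage.process is shipped by value, so the generic worker image
# does not need mypackage installed
future = client.submit(mypackage.process, 123)
result = future.result()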

Opinions / suggestions?
