
Allow pickling by value #6464

@ian-r-rose

Should we allow objects sent to schedulers or workers to be pickled by value more often? A small user story (where the user was me):

I was recently working with a small scheduler plugin to instrument some things on a cluster. It was small (a few tens of lines of code, mostly boilerplate), and I was developing it in a REPL and small scripts. Things were going great until I wanted to share that small plugin among a couple of different modules. In my case it was a test fixture, but it could just as easily have been a module alongside a set of related scripts or notebooks. Once the plugin was in a separate module (or conftest), things went very poorly. This is because once the plugin was no longer in __main__, it started to be pickled by reference instead of by value (objects defined in __main__ are pickled by value). Pickling by reference only works if the receiving process can import the module, and the scheduler could not.
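To make the failure mode concrete, here is a minimal sketch using only the standard library `pickle`. The module name `tiny_plugin_mod` and the class `CountingPlugin` are hypothetical stand-ins for the plugin described above, and removing the module from `sys.path` and `sys.modules` only emulates a scheduler process that does not have the module installed.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a small plugin that was moved out of __main__.
SOURCE = (
    "class CountingPlugin:\n"
    "    def __init__(self):\n"
    "        self.count = 0\n"
)

error = None
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "tiny_plugin_mod.py").write_text(SOURCE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    module = importlib.import_module("tiny_plugin_mod")

    # By reference: the payload records the module and class names,
    # not the class body.
    payload = pickle.dumps(module.CountingPlugin())

    # Emulate a process (e.g. the scheduler) where the module is absent.
    sys.path.remove(tmp)
    del sys.modules["tiny_plugin_mod"]
    importlib.invalidate_caches()

try:
    pickle.loads(payload)
except ModuleNotFoundError as exc:
    # Unpickling tries to import tiny_plugin_mod and fails.
    error = exc
```

The payload contains the strings `tiny_plugin_mod` and `CountingPlugin` but none of the class definition, which is why the receiving side has to be able to import the module.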

Now, if I had been working on a worker plugin, I would have been able to reach for Client.upload_file for my plugin. This might not work well for things defined in a conftest.py, but would probably be fine if it were in a separate module. But the scheduler doesn't have any file uploading functionality, so I think I would be forced into packaging my ~50 LoC module into its own package and rebuilding the scheduler software environment to include it. This is a huge pain! And a steep cliff for a small refactor of some reused code out of the __main__ context.

One solution could be to implement an equivalent scheduler file-uploading functionality. But I don't find that particularly ergonomic.
What if instead we allowed for pickling-by-value of more things than just those in __main__? CloudPickle 2.0 introduced a new "pickle-by-value" registry that allows the user to flag certain modules as ones that should not be pickled by reference.
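As a rough sketch of what that registry does, the following uses `cloudpickle.register_pickle_by_value` and `cloudpickle.unregister_pickle_by_value` (available in cloudpickle 2.0 and later). The module `tiny_helper_mod`, the function `double`, and the helpers `dumps_by_value` and `demo` are hypothetical names made up for illustration. Nothing here involves distributed itself, and dropping the module from `sys.path` and `sys.modules` only emulates a process that cannot import it.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical helper module living outside __main__.
SOURCE = "def double(x):\n    return 2 * x\n"


def dumps_by_value(obj, module):
    """Serialize obj while `module` is flagged as pickle-by-value."""
    import cloudpickle  # third-party; assumes cloudpickle >= 2.0

    cloudpickle.register_pickle_by_value(module)
    try:
        return cloudpickle.dumps(obj)
    finally:
        # Leave the global registry as we found it.
        cloudpickle.unregister_pickle_by_value(module)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "tiny_helper_mod.py").write_text(SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("tiny_helper_mod")
            payload = dumps_by_value(module.double, module)
        finally:
            # Emulate a process where the module cannot be imported.
            sys.path.remove(tmp)
            sys.modules.pop("tiny_helper_mod", None)
            importlib.invalidate_caches()

    # The payload carries the function's code, so no import is needed.
    restored = pickle.loads(payload)
    return restored(21)
```

Calling `demo()` should round-trip the function even though `tiny_helper_mod` is no longer importable, which is the behavior a scheduler would need in order to accept a plugin defined in a small side module.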

An initial implementation could just allow that registry to take effect in distributed. This would put some onus on the user to actually register things, but I think it could prove useful. If so, we could add some more logic around when to add or remove things from that registry.

Labels: feature (Something is missing)
