ENH use cloudpickle to pickle functions and class from __main__#677
ENH use cloudpickle to pickle functions and class from __main__#677tomMoral merged 13 commits intojoblib:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #677 +/- ##
=========================================
+ Coverage 95.08% 95.2% +0.12%
=========================================
Files 39 39
Lines 5427 5462 +35
=========================================
+ Hits 5160 5200 +40
+ Misses 267 262 -5
Continue to review full report at Codecov.
|
|
The failure in CI should be fixed with #685 . I noticed today that the pickling of function in main might be very long compared to the pickling of a function in another module. It can drastically degrade the performance for very large number of small tasks. I think this should be benchmarked and documented in this PR to make sure we are not introducing bottleneck in the computations. |
joblib/parallel.py
Outdated
| return self._size | ||
|
|
||
| @staticmethod | ||
| def _wrap_non_picklable_objects(obj): |
There was a problem hiding this comment.
Please insert a comment explaining why we do this shallow introspection of objects instead of using the cloudpickle.Pickler class directly to pickle everything.
# The following does a shallow introspection of the Python objects passed as
# args to a parallel call. Interactively defined functions are pre-pickled manually
# using cloudpickle.
# We do not use cloudpickle for all arguments as it would incur a significant
# performance overhead when pickling large cardinality dicts or lists for
# instance.
joblib/parallel.py
Outdated
| def _wrap_non_picklable_objects(obj): | ||
| need_wrap = "__main__" in getattr(obj, "__module__", "") | ||
| if callable(obj): | ||
| # Need wrap if the object is a local function |
There was a problem hiding this comment.
Can you please explain what a "local function" is?
joblib/test/test_parallel.py
Outdated
| print(Parallel(n_jobs=2, backend=backend)( | ||
| delayed(run)(square, i) for i in range(5))) | ||
| print(Parallel(n_jobs=2, backend=backend)( | ||
| delayed(run)(f=square, x=i) for i in range(5))) |
There was a problem hiding this comment.
Please add a test with a lambda expression.
- This only works for shallow object, for instance this does not work on a list of functions defined in main: args=([func_main_1, func_main_2,..],)
- Test for lambda-expr, locally defined functions and function from __main__ - Remove duplicated test
joblib/parallel.py
Outdated
| self.n_dispatched_batches = 0 | ||
| self.n_dispatched_tasks = 0 | ||
| self.n_completed_tasks = 0 | ||
| # Use a caching dict for callable that are pickled with cloudpickle to |
joblib/parallel.py
Outdated
| # Use a caching dict for callable that are pickled with cloudpickle to | ||
| # improve performances. This cache is used only in the case of | ||
| # functions that are defined in the __main__ module, function that are | ||
| # defined locally (inside another function) and lambda expression. |
joblib/parallel.py
Outdated
| self.n_completed_tasks = 0 | ||
| # Use a caching dict for callable that are pickled with cloudpickle to | ||
| # improve performances. This cache is used only in the case of | ||
| # functions that are defined in the __main__ module, function that are |
* tag '0.12': (116 commits) Release 0.12 typo typo typo ENH add initializer limiting n_threads for C-libs (joblib#701) DOC better parallel docstring (joblib#704) [MRG] Nested parallel call thread bomb mitigation (joblib#700) MTN vendor loky2.1.3 (joblib#699) Make it possible to configure the reusable executor workers timeout (joblib#698) MAINT increase timeouts to make test more robust on travis DOC: use the .joblib extension instead of .pkl (joblib#697) [MRG] Fix exception handling in nested parallel calls (joblib#696) Fix skip test lz4 not installed (joblib#695) [MRG] numpy_pickle: several enhancements (joblib#626) Introduce Parallel.__call__ backend callbacks (joblib#689) Add distributed on readthedocs (joblib#686) Support registration of external backends (joblib#655) [MRG] Add a dask.distributed example (joblib#613) ENH use cloudpickle to pickle interactively defined callable (joblib#677) CI freeze the version of sklearn0.19.1 and scipy1.0.1 (joblib#685) ...
Fix for #674
This PR introspects the
BatchCallsand pickles functions and class methods from__main__usingcloudpickle. This is one solution that we discussed with @ogrisel . However, it does not handle deep object introspection. For instance, this would fail ifargscontains a list of functions that are defined in__main__.One solution would be to use a
try..exceptwhen launching the job but I am not sure it is possible as the pickling is handled in a separate thread at least for theProcessPoolExecutor.Another solution would be to implement a custom reducer for the function but it might not be possible to fall back on default function pickling in this case.