Skip to content

FIX pickle roundtrip for Memory and related classes.#746

Merged
ogrisel merged 2 commits intojoblib:masterfrom
lesteve:fix-memory-pickle
Aug 23, 2018
Merged

FIX pickle roundtrip for Memory and related classes.#746
ogrisel merged 2 commits intojoblib:masterfrom
lesteve:fix-memory-pickle

Conversation

@lesteve
Copy link
Copy Markdown
Member

@lesteve lesteve commented Aug 14, 2018

Fix #741. Close #743.

A few things:

  • this does not address the Memory pickle being broken between joblib 0.11 and 0.12 mentioned in Fix #741: Memory pickling was broken with mmap_mode != None #743 (comment). This potentially affects also scikit-learn 0.19 to 0.20 if you pickled estimators that use Memory (e.g. Pipeline, AgglomerativeClustering). This could be argued that we don't guarantee pickles compatibility between scikit-learn versions anyway. The error is quite misleading but at least this is an error and not a silent change ...
  • it feels like there is some tension between hashing and cloning in the implementation of __reduce__. This PR is leaning towards cloning quite heavily but maybe a bit too much ...

@lesteve
Copy link
Copy Markdown
Member Author

lesteve commented Aug 14, 2018

@aabadie if you are around and have some spare bandwidth, I would definitely be interested to have your opinion on this!

I may have overlooked some important reasons about why __reduce__ functions were implemented this way.

@codecov
Copy link
Copy Markdown

codecov bot commented Aug 14, 2018

Codecov Report

Merging #746 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #746      +/-   ##
==========================================
- Coverage    95.4%   95.36%   -0.05%     
==========================================
  Files          43       43              
  Lines        6160     6191      +31     
==========================================
+ Hits         5877     5904      +27     
- Misses        283      287       +4
Impacted Files Coverage Δ
joblib/test/test_memory.py 98.05% <100%> (+0.1%) ⬆️
joblib/memory.py 95.5% <100%> (-0.25%) ⬇️
joblib/disk.py 81.66% <0%> (-6.67%) ⬇️
joblib/test/test_store_backends.py 91.17% <0%> (-5.89%) ⬇️
joblib/backports.py 93.75% <0%> (-2.09%) ⬇️
joblib/_store_backends.py 89.94% <0%> (-0.53%) ⬇️
joblib/test/test_parallel.py 96.88% <0%> (-0.15%) ⬇️
joblib/_parallel_backends.py 97.2% <0%> (+2.4%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ce40a7...c24775e. Read the comment docs.

@lesteve
Copy link
Copy Markdown
Member Author

lesteve commented Aug 14, 2018

it feels like there is some tension between hashing and cloning in the implementation of __reduce__.

Just to elaborate on this, let's take the example of verbose in Memory:

  • argument for hashing. If you have a snippet like this:
memory = Memory(location='/my/location')
pipe = Pipeline(memory=memory, ...)

func_cached = memory.cache(func)

func_cached(pipe)

you don't want the hash of pipe to change if you change memory.verbose because that could lead to time-consuming recomputations. In this case, you don't want verbose in Memory.__reduce__.

  • argument for cloning. You have a snippet like this:
memory = Memory(location='/my/location', verbose=10)
memory_copy = copy.deepcopy(memory)
assert memory_copy.verbose == 10  # this seems like a very reasonable expectation

You need verbose in Memory.__reduce__ if you want the assert to pass: if verbose is not in Memory.__reduce__, memory_copy.verbose will be the verbose default value.

Copy link
Copy Markdown
Contributor

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a commit added an attribute, would your tests fail?

Copy link
Copy Markdown
Contributor

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I misunderstood. Yes, they would.

Copy link
Copy Markdown
Contributor

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your tests do cover the addition of attributes.

I think this is more complex than it should be. I don't think we need to introduce _reconstruction_from_parameters_dict. I don't think we should change the stored attributes on the objects just to make pickling easier to manage. And I wonder if the tests are too complicated to read: what are the practical differences between this and my tests?

This also appears harder to maintain than my PR, in that any change to Memory or MemorizedFunc's __init__ args will result in a test failure. Yes, this is more explicit, but to what end?

joblib/memory.py Outdated
self.timestamp = time.time()
self.bytes_limit = bytes_limit
self.backend = backend
self.location = location
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to change the attributes?

joblib/memory.py Outdated
self.mmap_mode = mmap_mode
self.compress = compress
self.func = func
self.location = location
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to change the attributes?

@lesteve
Copy link
Copy Markdown
Member Author

lesteve commented Aug 16, 2018

I think this is more complex than it should be. I don't think we need to introduce _reconstruction_from_parameters_dict.

I tried using __getstate__ to make it simpler. Looking at the code in 0.11, it feels like all the __reduce__ was trying to do is ignore timestamp (for hashing purposes)

I don't think we should change the stored attributes on the objects just to make pickling easier to manage.

I have reduced attribute changes to a minimum. The only thing I have kept is compress because it is needed to be passed properly to MemorizedFunc

And I wonder if the tests are too complicated to read: what are the practical differences between this and my tests?

IIRC you were only testing Memory, whereas because I found problems with MemorizedFunc and MemorizedResult so I decided to test them too. More than happy to make the test easier to read if you have suggestions!

@lesteve lesteve force-pushed the fix-memory-pickle branch 4 times, most recently from e38c589 to 6c9e06f Compare August 21, 2018 12:50
@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Aug 21, 2018

This does not solve the problem of piclkle in 0.11 and load in 0.12 though:

>>> import pickle
>>> pickle
pickle.loads(b'\x80\x03cjoblib.memory\nMemory\nq\x00(X\t\x00\x00\x00/tmp/testq\x01N\x89K\x01tq\x02Rq\x03.')
Traceback (most recent call last):
  File "<ipython-input-3-1cde88ec75ed>", line 2, in <module>
    pickle.loads(b'\x80\x03cjoblib.memory\nMemory\nq\x00(X\t\x00\x00\x00/tmp/testq\x01N\x89K\x01tq\x02Rq\x03.')
  File "/volatile/ogrisel/code/joblib/joblib/memory.py", line 837, in __init__
    location, cachedir))
ValueError: You set both "location='/tmp/test' and "cachedir=False". 'cachedir' has been deprecated in version 0.12 and will be removed in version 0.14.
Please only set "location='/tmp/test'"

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Aug 21, 2018

@lesteve restoring the __init__ args as follows restores the compatibility with 0.11:

diff --git a/joblib/memory.py b/joblib/memory.py
index 016554a..eac15b7 100644
--- a/joblib/memory.py
+++ b/joblib/memory.py
@@ -809,9 +809,9 @@ class Memory(Logger):
     # Public interface
     # ------------------------------------------------------------------------
 
-    def __init__(self, location=None, backend='local', cachedir=None,
-                 mmap_mode=None, compress=False, verbose=1, bytes_limit=None,
-                 backend_options=None):
+    def __init__(self, location=None, mmap_mode=None, compress=False,
+                 verbose=1, bytes_limit=None, backend='local',
+                 backend_options=None, cachedir=None):
         # XXX: Bad explanation of the None value of cachedir
         Logger.__init__(self)
         self._verbose = verbose

and all the test still pass along with the scikit-learn example as explained in #743 (comment).

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me once the ordering of the init args has been restored to enable compat with 0.11 objects. Please also include the 0.11 loads case as a new test:

import pickle
memory_joblib011 = pickle.loads(b'\x80\x03cjoblib.memory\nMemory\nq\x00(X\t\x00\x00\x00/tmp/testq\x01N\x89K\x01tq\x02Rq\x03.')
compare(memory_joblib011, Memory(location='/dev/test'),
        ignored_attrs='timestamp')

while including the joblib 0.11 snippet that generated the pickle in a comment:

import pickle
from joblib import Memory

mem = Memory(cachedir='/tmp/test')
print(pickle.dumps(mem)) 

@aabadie
Copy link
Copy Markdown
Contributor

aabadie commented Aug 22, 2018

Thanks for providing a fix @lesteve. Using set_state looks more robust than reduce.

I may have overlooked some important reasons about why reduce functions were implemented this way.

The changes here may impact related projects that already provide custom store backends, like joblib-s3 or joblib-hadoop. They will probably need an update (not verified yet)

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Aug 22, 2018

Thanks for the fix @lesteve!

@jnothman I think I prefer the __getstate__ solution to yours and plan to merge this PR so as to be able to release joblib 0.12.2 with this fix before scikit-learn 0.20 final. Any final review/comment prior to merging?

@aabadie
Copy link
Copy Markdown
Contributor

aabadie commented Aug 22, 2018

@lesteve, just in case #752 was merged and this one needs rebase now.

@lesteve lesteve force-pushed the fix-memory-pickle branch from d4242ae to 168fca0 Compare August 22, 2018 11:52
@lesteve
Copy link
Copy Markdown
Member Author

lesteve commented Aug 22, 2018

Rebased.

@aabadie
Copy link
Copy Markdown
Contributor

aabadie commented Aug 22, 2018

Adding an entry in the changelog would be great.

@lesteve
Copy link
Copy Markdown
Member Author

lesteve commented Aug 22, 2018

I added an entry in CHANGES.rst. Feel free to improve it!

@ogrisel ogrisel merged commit c492cf3 into joblib:master Aug 23, 2018
@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Aug 23, 2018

Ok merged. Thank you very much @lesteve!

@lesteve lesteve deleted the fix-memory-pickle branch August 24, 2018 09:04
yarikoptic added a commit to yarikoptic/joblib that referenced this pull request Sep 4, 2018
* tag '0.12.3': (23 commits)
  Release 0.12.3
  Loky 2.2.1 (joblib#760)
  FIX: FileSystemStoreBackend string representation only returning location (path) (joblib#765)
  Add optional dependency on psutil
  MAINT remove brittle time based assertion in test (joblib#761)
  Fix a bug in nesting level computation with FallbackToBackend(SequentialBackend()) (joblib#759)
  Make docstring more consistent with project style
  Improved performance of call_and_shelve (joblib#757)
  Better test name
  Fix default context handling (joblib#754)
  cloudpickle 0.5.5 (joblib#756)
  Fixed error where filter args was consuming kwargs (joblib#750)
  FIX pickle roundtrip for Memory and related classes. (joblib#746)
  test that passes but timeout appveyor because of unclosed semaphore tracker (joblib#676)
  DOC: fine tune compressor example dataset size for ReadTheDocs (joblib#753)
  FIX: MemorizedResult not picklable (joblib#752)
  [MRG] Better message with py27 when lz4 is not installed (joblib#740)
  MNT remove mutable default value for backend_options parameter (joblib#748)
  MNT create test file for _store_backends module (joblib#749)
  DOC consistently use memory rather than mem in memory.rst (joblib#744)
  ...
yarikoptic added a commit to yarikoptic/joblib that referenced this pull request Sep 4, 2018
* releases: (23 commits)
  Release 0.12.3
  Loky 2.2.1 (joblib#760)
  FIX: FileSystemStoreBackend string representation only returning location (path) (joblib#765)
  Add optional dependency on psutil
  MAINT remove brittle time based assertion in test (joblib#761)
  Fix a bug in nesting level computation with FallbackToBackend(SequentialBackend()) (joblib#759)
  Make docstring more consistent with project style
  Improved performance of call_and_shelve (joblib#757)
  Better test name
  Fix default context handling (joblib#754)
  cloudpickle 0.5.5 (joblib#756)
  Fixed error where filter args was consuming kwargs (joblib#750)
  FIX pickle roundtrip for Memory and related classes. (joblib#746)
  test that passes but timeout appveyor because of unclosed semaphore tracker (joblib#676)
  DOC: fine tune compressor example dataset size for ReadTheDocs (joblib#753)
  FIX: MemorizedResult not picklable (joblib#752)
  [MRG] Better message with py27 when lz4 is not installed (joblib#740)
  MNT remove mutable default value for backend_options parameter (joblib#748)
  MNT create test file for _store_backends module (joblib#749)
  DOC consistently use memory rather than mem in memory.rst (joblib#744)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants