Improved performance of call_and_shelve by MaximeWeyl · Pull Request #757 · joblib/joblib

MaximeWeyl · 2018-08-24T13:31:57Z

Here is a pull request that improves the performance of MemorizedFunc.call_and_shelve when a result has already been cached.

Basically, before the code request, joblib loads the entire cached file into memory, returns a reference to the file and forgets about what is in memory.
With this pull request, joblib creates the reference without loading the file into memory.

This PR passed the Travis tests on my fork.

Simple Benchmark

Definitions

import joblib
import numpy as np
memory = joblib.Memory("C:/TEMP/features/memory/")

@memory.cache
def large_array():
    return np.random.normal(size=(100000,10000))

Before the pull request

First call (no cache exists)

It takes 1m34s to create and return the cached array.

%%time
a = large_array.call_and_shelve()

________________________________________________________________________________
[Memory] Calling __main__-E%3A-Projets-Maxime-DSSP9-dssp9_maxime_project-02-model-__ipython-input__.large_array...
large_array()
_____________________________________________________large_array - 93.3s, 1.6min
Wall time: 1min 34s

Second call (array is already cached)

It takes 19.7s to get a reference on the array.

%%time
a = large_array.call_and_shelve()

Wall time: 19.7 s

Loading into memory

Then it takes 11.9 seconds to load the reference into memory.

%%time
a.get()

Wall time: 11.9 s





array([[-0.22347288,  0.1004392 ,  0.73821429, ..., -0.17401156,
         0.45169724,  0.47828354],
       [-1.17715096, -0.8288893 , -0.39735764, ...,  1.60723914,
        -0.43427087,  2.91798611],
       [-0.42882539,  0.58131643,  0.7192454 , ..., -0.59368638,
        -0.18502537, -0.67700329],
       ...,
       [-1.06988518, -0.40570113,  0.83195304, ...,  1.16679448,
         2.07565885,  0.19337173],
       [-0.99542231, -0.33650311, -1.01535946, ...,  0.36570109,
        -0.42031285, -0.80028137],
       [-0.17242059,  0.36116285, -0.76737515, ...,  0.12322468,
        -1.85473507, -0.6005992 ]])

After the pull request

First call (no cache exists)

It still takes time to create the cached object.

Second call (array is already cached)

It takes 120ms to get a reference on the array.
It took near 20s before.

%%time
a = large_array.call_and_shelve()

Wall time: 120 ms

Loading into memory

Then it takes 21.3 seconds to load the reference into memory.

%%time
a.get()

Wall time: 21.3 s

array([[-1.98366351, -1.48535076,  0.76613301, ..., -0.22057336,
         1.64428191,  1.25740524],
       [-1.49231368,  0.66844315, -1.72310301, ..., -0.68349175,
        -0.85695767,  1.1905448 ],
       [-0.72434763,  0.10530391,  0.28199925, ..., -0.22877864,
         0.61428269, -0.54311867],
       ...,
       [ 1.32977779, -0.95798603, -1.18254297, ..., -2.57229086,
         0.13107809, -1.08506098],
       [ 1.38374842, -0.37575841,  0.84639939, ...,  0.98320816,
        -0.47348908,  1.14810158],
       [-2.18351856, -0.39519966, -2.88522986, ..., -1.30701304,
         0.43026396,  0.37374147]])

codecov · 2018-08-24T13:44:21Z

Codecov Report

Merging #757 into master will decrease coverage by 0.8%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #757      +/-   ##
==========================================
- Coverage   95.42%   94.62%   -0.81%     
==========================================
  Files          43       43              
  Lines        6208     6210       +2     
==========================================
- Hits         5924     5876      -48     
- Misses        284      334      +50

Impacted Files	Coverage Δ
joblib/memory.py	`95.53% <100%> (+0.02%)`	⬆️
joblib/backports.py	`39.58% <0%> (-54.17%)`	⬇️
joblib/disk.py	`80% <0%> (-8.34%)`	⬇️
joblib/pool.py	`87.93% <0%> (-3.45%)`	⬇️
joblib/_parallel_backends.py	`93.6% <0%> (-2.41%)`	⬇️
joblib/test/common.py	`86.44% <0%> (-1.7%)`	⬇️
joblib/logger.py	`85.52% <0%> (-1.32%)`	⬇️
joblib/func_inspect.py	`94.31% <0%> (-1.14%)`	⬇️
joblib/test/test_parallel.py	`96.47% <0%> (-0.59%)`	⬇️
joblib/test/test_memmapping.py	`99.7% <0%> (-0.3%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 139e13e...1d67fcb. Read the comment docs.

aabadie · 2018-08-24T14:26:12Z

Nice one! It would be great to also add a test.

MaximeWeyl · 2018-08-24T16:11:11Z

Sure I could write one. I have been thinking about what should be tested, but did not find any good idea.

Indeed, this PR does not change, add, nor removes any functionality. Just a tiny bit of the implementation of the current functionalities.

I could test that the performance remains below a given threshold, but I'm not so sure it is a good idea (mainly because it depends on the worker of the test).

I could also mock the store_backend and assert that it has not been called, but : it depends on the implementation, and I have not been able to find any use of mocking in the entire repository. It is the best idea I was able to come up with nonetheless.

What do you think about it ?

aabadie · 2018-08-24T17:31:07Z

I could also mock the store_backend and assert that it has not been called, but : it depends on the implementation, and I have not been able to find any use of mocking in the entire repository. It is the best idea I was able to come up with nonetheless.

I had the same idea and that's what I would also do. But maybe other maintainers have better ideas. @lesteve @ogrisel @tomMoral, what do you think ?

ogrisel · 2018-08-27T07:21:44Z

I think it's fine not to add any new test in this case. We could have an automated benchmark suite for performance test (for instance using airspeed velocity) but this would require some significant infrastructure investment.

ogrisel · 2018-08-27T07:30:21Z

LGTM, merging. Thank you very much for this contribution @MaximeWeyl !

MaximeWeyl · 2018-08-27T11:56:54Z

You're welcome !

* tag '0.12.3': (23 commits) Release 0.12.3 Loky 2.2.1 (joblib#760) FIX: FileSystemStoreBackend string representation only returning location (path) (joblib#765) Add optional dependency on psutil MAINT remove brittle time based assertion in test (joblib#761) Fix a bug in nesting level computation with FallbackToBackend(SequentialBackend()) (joblib#759) Make docstring more consistent with project style Improved performance of call_and_shelve (joblib#757) Better test name Fix default context handling (joblib#754) cloudpickle 0.5.5 (joblib#756) Fixed error where filter args was consuming kwargs (joblib#750) FIX pickle roundtrip for Memory and related classes. (joblib#746) test that passes but timeout appveyor because of unclosed semaphore tracker (joblib#676) DOC: fine tune compressor example dataset size for ReadTheDocs (joblib#753) FIX: MemorizedResult not picklable (joblib#752) [MRG] Better message with py27 when lz4 is not installed (joblib#740) MNT remove mutable default value for backend_options parameter (joblib#748) MNT create test file for _store_backends module (joblib#749) DOC consistently use memory rather than mem in memory.rst (joblib#744) ...

* releases: (23 commits) Release 0.12.3 Loky 2.2.1 (joblib#760) FIX: FileSystemStoreBackend string representation only returning location (path) (joblib#765) Add optional dependency on psutil MAINT remove brittle time based assertion in test (joblib#761) Fix a bug in nesting level computation with FallbackToBackend(SequentialBackend()) (joblib#759) Make docstring more consistent with project style Improved performance of call_and_shelve (joblib#757) Better test name Fix default context handling (joblib#754) cloudpickle 0.5.5 (joblib#756) Fixed error where filter args was consuming kwargs (joblib#750) FIX pickle roundtrip for Memory and related classes. (joblib#746) test that passes but timeout appveyor because of unclosed semaphore tracker (joblib#676) DOC: fine tune compressor example dataset size for ReadTheDocs (joblib#753) FIX: MemorizedResult not picklable (joblib#752) [MRG] Better message with py27 when lz4 is not installed (joblib#740) MNT remove mutable default value for backend_options parameter (joblib#748) MNT create test file for _store_backends module (joblib#749) DOC consistently use memory rather than mem in memory.rst (joblib#744) ...

Improved performance of call_and_shelve

1d67fcb

ogrisel merged commit 79b7ed8 into joblib:master Aug 27, 2018

aabadie mentioned this pull request Oct 15, 2018

memory: add test for call_and_shelve performance #791

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved performance of call_and_shelve#757

Improved performance of call_and_shelve#757
ogrisel merged 1 commit intojoblib:masterfrom
MaximeWeyl:master

MaximeWeyl commented Aug 24, 2018 •

edited

Loading

Uh oh!

codecov bot commented Aug 24, 2018 •

edited

Loading

Uh oh!

aabadie commented Aug 24, 2018

Uh oh!

MaximeWeyl commented Aug 24, 2018

Uh oh!

aabadie commented Aug 24, 2018

Uh oh!

ogrisel commented Aug 27, 2018

Uh oh!

ogrisel commented Aug 27, 2018

Uh oh!

MaximeWeyl commented Aug 27, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

MaximeWeyl commented Aug 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Simple Benchmark

Definitions

Before the pull request

First call (no cache exists)

Second call (array is already cached)

Loading into memory

After the pull request

First call (no cache exists)

Second call (array is already cached)

Loading into memory

Uh oh!

codecov bot commented Aug 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

aabadie commented Aug 24, 2018

Uh oh!

MaximeWeyl commented Aug 24, 2018

Uh oh!

aabadie commented Aug 24, 2018

Uh oh!

ogrisel commented Aug 27, 2018

Uh oh!

ogrisel commented Aug 27, 2018

Uh oh!

MaximeWeyl commented Aug 27, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

MaximeWeyl commented Aug 24, 2018 •

edited

Loading

codecov bot commented Aug 24, 2018 •

edited

Loading