[WIP] introduce object shelving #619

Open
aabadie wants to merge 9 commits into joblib:main from aabadie:object_store

@aabadie
Contributor

@aabadie aabadie commented Jan 26, 2018

This PR fixes #593 but is still WIP.

Note that this basic shelving only works within a single Python process, because the futures returned by the shelf are only referenced in that process. It may therefore not work as expected with Parallel using the loky or multiprocessing backends, but it should work with the threading backend.

There are also some tests for the basic functionalities and top-level functions exposed to users:

  • shelving objects of built-in types: dict, list, string, numbers
  • shelving numpy arrays
  • verifying that folders containing the shelved data are deleted when expected and not deleted when not expected (most important!)

Data are deleted in the following cases:

  • shelved data is deleted from disk only when the last reference to its future is deleted
  • if the shelf object is deleted, the global shelf directory is fully deleted only once all futures are dereferenced. This ensures that a future never points to data that has already been removed.
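
The two deletion rules above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: `SketchShelf`, `SketchFuture` and `_ShelfDir` are hypothetical names, and the sketch assumes CPython, where an object is finalized as soon as its last reference goes away.

```python
import os
import pickle
import shutil
import tempfile
import weakref


def _remove_file(path):
    # Tolerate a file that has already been removed.
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


class _ShelfDir:
    """Hypothetical owner of the shelf directory.

    The directory is removed once neither the shelf nor any future
    references this object anymore.
    """

    def __init__(self):
        self.path = tempfile.mkdtemp(prefix="sketch_shelf_")
        weakref.finalize(self, shutil.rmtree, self.path, ignore_errors=True)


class SketchFuture:
    """Hypothetical light-weight handle on an object pickled to disk."""

    def __init__(self, path, shelf_dir):
        self._path = path
        # Holding the directory owner keeps the directory alive.
        self._shelf_dir = shelf_dir
        # Delete the pickle when the last reference to this future is dropped.
        weakref.finalize(self, _remove_file, path)

    def result(self):
        with open(self._path, "rb") as f:
            return pickle.load(f)


class SketchShelf:
    """Hypothetical shelf: pickles objects into a private directory."""

    def __init__(self):
        self._dir = _ShelfDir()
        self.path = self._dir.path
        self._count = 0

    def put(self, obj):
        self._count += 1
        path = os.path.join(self.path, f"{self._count}.pkl")
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        return SketchFuture(path, self._dir)
```

With this sketch, `future = SketchShelf().put(data)` keeps both the pickle and its directory on disk until `future` is dropped, even though the shelf itself is gone. Because the bookkeeping relies on in-process reference counts, it shares the single-process limitation described above.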

Here are examples showing how to use this new feature:

>>> from joblib import shelve
>>> data = "A very big data"
>>> future = shelve(data)
>>> # Now the data can be removed
>>> del data
>>> future
<joblib.shelf.JoblibShelfFuture at 0x7fc15fad7a90>
>>> # The data can be retrieved from the future
>>> print(future.result())
A very big data

Here is the version with memmap:

>>> import numpy as np
>>> from joblib import shelve_mmap
>>> data = np.ones((10, 10))
>>> mmap = shelve_mmap(data)
>>> # the initial data can be removed
>>> del data
>>> # but can still be retrieved from the mmap
>>> mmap
memmap([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

@aabadie
Contributor Author

aabadie commented Jan 26, 2018

I updated the initial comment with a bit more information and better wording.

@codecov

codecov bot commented Jan 26, 2018

Codecov Report

Merging #619 into master will increase coverage by 0.26%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #619      +/-   ##
==========================================
+ Coverage   95.02%   95.29%   +0.26%     
==========================================
  Files          39       41       +2     
  Lines        5427     5586     +159     
==========================================
+ Hits         5157     5323     +166     
+ Misses        270      263       -7

| Impacted Files | Coverage Δ |
|---|---|
| joblib/__init__.py | 100% <100%> (ø) ⬆️ |
| joblib/test/test_shelf.py | 100% <100%> (ø) |
| joblib/memory.py | 95.38% <100%> (+0.05%) ⬆️ |
| joblib/shelf.py | 100% <100%> (ø) |
| joblib/test/test_parallel.py | 96.31% <0%> (+0.52%) ⬆️ |
| joblib/_parallel_backends.py | 95.68% <0%> (+1.72%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d1507e2...7338508.

@GaelVaroquaux
Member

Can you add an example while you work on this PR? I find this very useful, because it helps thinking about the use case.

from .parallel import register_parallel_backend
from .parallel import parallel_backend
from .parallel import effective_n_jobs
from .shelf import JoblibShelf, shelve, shelve_mmap
Member

Should we expose to the user JoblibShelf? It seems to me that it should be internal.

Contributor Author

Indeed, will change that

All values are cached on the filesystem, in a deep directory
structure.

see :ref:`memory_reference`
Member

Hum, did we lose the docstring?

Contributor Author

@aabadie aabadie Jan 26, 2018

It was already like this IIRC. The __init__ docstring has moved to StoreBase

@aabadie aabadie changed the title WIP: introduce object shelving [WIP] introduce object shelving Jan 27, 2018
@aabadie aabadie force-pushed the object_store branch 8 times, most recently from cbb7361 to 999feab Compare January 27, 2018 21:19
@aabadie
Contributor Author

aabadie commented Jan 27, 2018

> Can you add an example while you work on this PR? I find this very useful, because it helps thinking about the use case.

I added a couple of examples in each function docstring. But maybe you are talking about a sphinx-gallery example? If you have good ideas for examples, I'll gladly take them.

@GaelVaroquaux
Member

Aside from the example, what remains to be done here?

joblib/shelf.py Outdated
memory. The future, a light-weight object, can be used later to reload the
initial object.

During the life of the future, the input object is kept written on a store
Member

I would rather say "The input object is kept in a store (by default a file on a disk) as long as the future object exists (technically: as long as there is a reference on the future)".

Contributor Author

ok

joblib/shelf.py Outdated
return _active_shelf.put(input_object)


def shelve_mmap(input_array):
Member

I think that the variable should rather be called "input_object" (here and in the docstring below, the word "array" should often be replaced by "object").

Contributor Author

@aabadie aabadie Apr 19, 2018

This function is only meant to be used with numpy arrays, since it returns a future on a mmap. That's why I think it's important to keep the 'array' in the variable name

@aabadie
Contributor Author

aabadie commented Apr 19, 2018

> Aside from the example, what remains to be done here?

@ogrisel suggested that shelve_mmap could directly return the np.memmap object instead of the future. The future would then be used internally to flush the array on disk when no more references to the memmap exist.

This will allow a transparent use of this function with Parallel calls: no need to use the result() method of the future to retrieve the memmap.
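
The idea of returning the memmap directly can be illustrated as follows. This is only a sketch under stated assumptions, not the change made in this PR: `sketch_shelve_mmap` is a hypothetical name, "flush" is read here as removing the backing file (in line with the deletion rules in the PR description), and the cleanup relies on CPython reference counting and on POSIX semantics, where a file can be unlinked while it is still mapped.

```python
import os
import tempfile
import weakref

import numpy as np


def sketch_shelve_mmap(input_array):
    """Dump an array to a temporary .npy file and return it as an np.memmap.

    Hypothetical helper: the backing file is removed when the returned
    memmap object itself is garbage collected.
    """
    fd, path = tempfile.mkstemp(suffix=".npy")
    os.close(fd)
    np.save(path, input_array)
    # Re-open the dumped array as a read-only memory map.
    mmap = np.load(path, mmap_mode="r")
    # Remove the backing file once the memmap object is no longer referenced.
    weakref.finalize(mmap, os.remove, path)
    return mmap
```

The caller gets an array-like object and never has to call `result()`. One caveat of this simplification: the finalizer is tied to the returned object only, so views or slices that outlive it would no longer have a file on disk behind them.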

@aabadie
Contributor Author

aabadie commented Apr 19, 2018

@ogrisel, added the mmap change in e35aa4a
As expected, it works with the threading backend but not with loky (multiprocessing)

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 19, 2018 via email

@aabadie
Contributor Author

aabadie commented Apr 19, 2018

It's not limited to arrays. It should be able to take any object as an input.

I updated the input parameter and the shelve_mmap function documentation a bit. Now it directly returns a memory map (np.memmap).

@aabadie
Contributor Author

aabadie commented Nov 1, 2022

Is this still of interest to joblib? I'd like to close it if possible :) And I'm not sure it's in a rebasable state.

@Nanored4498 Nanored4498 mentioned this pull request Feb 12, 2026

Development

Successfully merging this pull request may close these issues.

Expose storing dumping of objects to disk

2 participants