
Implement in_store method to check if a result is already in store#730

Closed
nwilming wants to merge 4 commits into joblib:master from nwilming:master

Conversation

@nwilming

In some cases it is nice to be able to check whether the result of a function call is already available in the backend store. For example, when dispatching long computations to a cluster, this allows checking if and which computations have already run. Being able to check for this makes it particularly easy to collect all available results without triggering computation of those that are still running. At least two issues have mentioned this (#407, a pull request that no longer merges cleanly with master, and #668).

This is an attempt to provide this functionality by introducing an in_store method that can be called with the same signature as the original function. It returns True when a result is in store for the given input arguments, and False otherwise.
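The idea can be sketched with a toy in-memory cache. This is an illustrative mock, not joblib's implementation: the MemoizedFunc class, its dict store, and the pickle-based argument digest are all invented here for the example.

```python
import hashlib
import pickle


class MemoizedFunc:
    """Toy cache wrapper illustrating the proposed in_store check."""

    def __init__(self, func):
        self.func = func
        self._store = {}

    def _args_id(self, *args, **kwargs):
        # Hash the call signature, loosely analogous to joblib's
        # output identifiers.
        payload = pickle.dumps((args, sorted(kwargs.items())))
        return hashlib.sha256(payload).hexdigest()

    def __call__(self, *args, **kwargs):
        key = self._args_id(*args, **kwargs)
        if key not in self._store:
            self._store[key] = self.func(*args, **kwargs)
        return self._store[key]

    def in_store(self, *args, **kwargs):
        # True if a result for these arguments is already cached;
        # never triggers a computation.
        return self._args_id(*args, **kwargs) in self._store


g = MemoizedFunc(lambda a, b: a + b)
assert not g.in_store(1, 2)   # nothing computed yet
g(1, 2)                       # compute and cache
assert g.in_store(1, 2)       # now present
assert not g.in_store(3, 4)   # other arguments still absent
```

The key property is that `in_store` only hashes the arguments and probes the store; it never calls the wrapped function.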

Contributor

@aabadie aabadie left a comment


Thanks @nwilming for working on this. I have a few minor comments, see below.

There's also this comment, raised in #407; I'm not sure whether it's still valid or whether it should be addressed here.

joblib/memory.py Outdated
return (self.__class__, (self.func, self.store_backend, self.ignore,
self.mmap_mode, self.compress, self._verbose))

def in_store(self, *args, **kwargs):
Contributor


I would prefer a name that better reflects what Memory is about: caching results. Why not call this function cached?
The concept of a store is more related to the underlying mechanism used to handle the cache.

Author


Sounds good to me, I'll rename the function to cached.

joblib/memory.py Outdated
load from backend, else False.
"""
func_id, args_id = self._get_output_identifiers(*args, **kwargs)
if (self._check_previous_func_code(stacklevel=4) and
Contributor


This could be done in a one-liner:

return (self._check_previous_func_code(stacklevel=4) and self.store_backend.contains_item([func_id, args_id]))

(You will just have to split the line to be PEP 8 compliant.)

args = (1, 2, 3, 4)
kwargs = {'e': 5, 'f': 6}
g(*args, **kwargs)
assert(g.in_store(*args, **kwargs))
Contributor


No need for parentheses with assert. Use the following instead (and be consistent with other uses of assert):

assert g.in_store(*args, **kwargs)

joblib/memory.py Outdated

Returns
-------
cached: bool
Contributor


No need for this line

@codecov

codecov bot commented Aug 10, 2018

Codecov Report

Merging #730 into master will decrease coverage by 0.01%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #730      +/-   ##
==========================================
- Coverage   95.31%   95.29%   -0.02%     
==========================================
  Files          42       42              
  Lines        6128     6145      +17     
==========================================
+ Hits         5841     5856      +15     
- Misses        287      289       +2
Impacted Files Coverage Δ
joblib/test/test_memory.py 97.54% <100%> (+0.05%) ⬆️
joblib/memory.py 95.15% <80%> (-0.22%) ⬇️
joblib/backports.py 93.75% <0%> (-2.09%) ⬇️
joblib/_parallel_backends.py 96.8% <0%> (-1.21%) ⬇️
joblib/_store_backends.py 90.47% <0%> (-0.53%) ⬇️
joblib/disk.py 88.33% <0%> (+6.66%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35901b8...aaf5abf. Read the comment docs.

@nwilming
Author

I've rebased my code on upstream/master and made the changes you requested. Let me know if something is missing or needs additional changes.

Regarding the comment in #407 that cached can return false positives: my use case is that I distribute function calls onto a Torque cluster (across args) and, once everything is done, collect the results by calling the function with all relevant args. In this context, being able to check for the presence of results in the backend is useful on its own, and I've never hit a false positive yet.

But I guess these will eventually occur, and then the current cached function would not suffice to allow explicit rescheduling of the affected function calls. That would require a complementary load function that loads results and raises an exception when a result cannot be loaded. I've added this function as well (essentially just a refactoring of _cached_call) in commit aaf5abf.

Let me know what you think.
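The cached/load distinction described above can be sketched with a toy in-memory store. ResultNotAvailable, load, and collect are illustrative names invented for this example, not joblib's API: the point is that load never computes, so callers can reschedule exactly the missing calls.

```python
class ResultNotAvailable(KeyError):
    """Raised when a result is not (or no longer) in the store."""


store = {}  # toy backend: maps an args tuple to a computed result


def load(args):
    # Load a previously computed result; never triggers computation.
    try:
        return store[args]
    except KeyError:
        raise ResultNotAvailable(args) from None


def collect(all_args):
    """Gather finished results; report the rest for explicit rescheduling."""
    done, missing = {}, []
    for args in all_args:
        try:
            done[args] = load(args)
        except ResultNotAvailable:
            missing.append(args)  # e.g. resubmit these to the cluster
    return done, missing


store[(1,)] = 1
store[(3,)] = 9
done, missing = collect([(1,), (2,), (3,)])
assert done == {(1,): 1, (3,): 9}
assert missing == [(2,)]
```

Because load raises instead of recomputing, a stale cached answer (a false positive) surfaces as an exception at collection time rather than silently triggering a computation.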

@AlJohri

AlJohri commented Oct 24, 2018

I would love to see this integrated @aabadie. Thanks for doing this @nwilming!

Contributor

@tomMoral tomMoral left a comment


Hello,

Sorry for the rather long reviewing process. Thank you very much for your work, I think this is indeed a nice feature that has been missing for quite some time.

The PR is a bit outdated, and it seems that introducing load might degrade performance a bit. Since there is a concurrent PR in #820, I am in favor of getting that one merged, but we need to integrate some of the changes you proposed (I had not realized in the other PR that the no-op was not updated).

Let me know if you think I missed a point somewhere.

_, name = get_func_name(self.func)
msg = '%s cache loaded - %s' % (name, format_time(t))
print(max(0, (80 - len(msg))) * '_' + msg)
out = self.load(*args, **kwargs)
Contributor


This might hurt performance here.
With this solution, we need to call _get_output_identifiers twice and thus hash the arguments multiple times. This can be slow when large arguments are passed.
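To make the cost concrete, here is a toy sketch that counts hash computations; it is not joblib's code, and args_id merely stands in for _get_output_identifiers. A check-then-load pattern built from two independent entry points hashes the arguments twice, while computing the identifier once and reusing it avoids the second hash.

```python
import hashlib
import pickle

hash_calls = 0
store = {}


def args_id(*args, **kwargs):
    # Stand-in for joblib's output identifiers: hash the call signature.
    global hash_calls
    hash_calls += 1
    payload = pickle.dumps((args, sorted(kwargs.items())))
    return hashlib.sha256(payload).hexdigest()


def cached(*args, **kwargs):
    return args_id(*args, **kwargs) in store


def load(*args, **kwargs):
    return store[args_id(*args, **kwargs)]


store[args_id(10)] = "result"

# Check-then-load through separate entry points: two hashes per lookup.
hash_calls = 0
if cached(10):
    out = load(10)
assert hash_calls == 2

# Hash once and reuse the identifier for both the check and the load.
hash_calls = 0
key = args_id(10)
out = store[key] if key in store else None
assert hash_calls == 1
```

For small arguments the difference is negligible, but hashing is linear in the pickled size of the arguments, so large arrays pay the full cost on every extra pass.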

@tomMoral
Contributor

This feature has been added in #820 so I am closing this PR.

@tomMoral tomMoral closed this Feb 23, 2022