Add a public is_cached method #407
alec-deason wants to merge 6 commits into joblib:master from alec-deason:master
Conversation
aabadie left a comment
Thanks for proposing this PR @alec-deason. I think it kind of makes sense. Btw, I made a few comments on the changes.
I'm also wondering if it's worth adding a dummy is_cached method to the NotMemorizedFunc object, as it may happen that a cached function is not memorized (because cachedir is None, see this line)
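A dummy is_cached on the non-memorized fallback could be as simple as always returning False. The sketch below uses a minimal stand-in for joblib's NotMemorizedFunc (the real class has more machinery); the is_cached semantics follow what this PR proposes, so treat it as an illustration, not joblib's actual code.

```python
class NotMemorizedFunc:
    """Minimal stand-in for joblib's NotMemorizedFunc, the wrapper used
    when cachedir is None and no caching happens at all."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        # No cache: just call the wrapped function directly.
        return self.func(*args, **kwargs)

    def is_cached(self, *args, **kwargs):
        # Nothing is ever cached when cachedir is None, so this is
        # always False, regardless of the arguments.
        return False
```

This keeps the public API uniform: client code can call is_cached on any wrapped function without first checking whether caching is enabled.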
Maybe @GaelVaroquaux has an opinion on this new is_cached method?
joblib/memory.py (outdated)

        output_dir, argument_hash = self._get_output_dir(*args, **kwargs)
        return self._needs_call(output_dir, argument_hash)

    def _needs_call(self, output_dir, argument_hash):
argument_hash is not used by the implementation of this function. I think it can be removed.
    def _needs_call(self, output_dir, argument_hash):
        """Check if the function needs to be called or if its output can be
        loaded from cache
can you add a Parameters section in the docstring?
joblib/memory.py (outdated)

        cached: bool
            Whether or not the call is already cached
        """
        output_dir, argument_hash = self._get_output_dir(*args, **kwargs)
see my comment below, argument_hash is not used afterwards. The variable can be replaced by _
can you replace argument_hash with _ as it is not used ?
            out, metadata = self.call(*args, **kwargs)
            argument_hash = None
        else:
            # FIXME: The statements below should be try/excepted
I'm unsure about this. The comment is in the original code that I modified and I didn't want to lose it, because it's not clear to me how important it is, so I moved it along with the block it was referring to. I can dig more deeply if you like, but my goal here was to keep my changes minimal since I'm a tourist in this code base.
Ah yes, sorry I missed the line in the previous code.
Out of interest, can you expand on this a little bit so I get a better feeling for your use case?

@lesteve I have a group of workers who share a cache in which the objects take a non-trivial amount of time to read. I'd like the first worker that requests an object to set a lock while it calculates, and then release the lock once the object is on disk so the other workers can read it in parallel. In order to do that I need to know whether calling the function will cause the object to be recalculated.
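The locking pattern described above could be sketched as follows, assuming the is_cached API proposed in this PR. The get_result helper and the lock-file scheme are hypothetical illustrations, not joblib's own machinery; a real deployment would also want lock timeouts and stale-lock cleanup.

```python
import os
import time


def get_result(memoized_func, lock_path, *args):
    """Hypothetical sketch: first worker computes under a lock, the
    others wait, then everyone reads the cached result in parallel."""
    if not memoized_func.is_cached(*args):
        try:
            # O_CREAT | O_EXCL makes lock-file creation atomic across
            # processes: exactly one worker wins the race.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL)
        except FileExistsError:
            # Another worker is computing; poll until its result lands.
            while not memoized_func.is_cached(*args):
                time.sleep(0.1)
        else:
            try:
                # We hold the lock: compute (and cache) the result.
                return memoized_func(*args)
            finally:
                os.close(fd)
                os.remove(lock_path)
    # Cached (or just computed by another worker): read from the cache.
    return memoized_func(*args)
```

Note the small race between the is_cached check and the lock acquisition: two workers may occasionally both compute, which is acceptable here since joblib's cache is already safe for concurrent writers.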
|
Just in case: are you aware that joblib supports your use case? What I mean is that joblib is supposed to be robust to different processes trying to write to the same cache location. Let us know if you have found problems with this! I am guessing you are worried about wasted computation if all the workers try to compute, at around the same time, a result that has not yet been cached. Is that right?

Yes, and I've used it without locking previously. The computation time and the load on external systems are my primary concerns. In general I prewarm the cache, so none of this is a problem, but I'd like to have the tools to guarantee that it won't stampede if for some reason the cache is cold.
The error in Travis appears to be an internal timeout not related to this change. Is there a way to rerun it without pushing a dummy commit or such? I don't seem to have the necessary permissions.

I restarted the build manually. git commit --amend plus a force push is a sneaky way to get all the CIs to rerun. To rebuild via the Travis web UI you indeed need to have rights on the project.

Good to know. Thanks.
@GaelVaroquaux @ogrisel do you have an opinion on this one? The obvious problem I see with this one is that …
Another approach which I think could address this would be to have callbacks that get triggered when a cache read and a cache recomputation begin, so that client code has an opportunity to fiddle with its locks (or send a progress notification, or whatever the need is). That's a slightly more complicated implementation, but only slightly. I'm happy to write it up. Also, at least for my use case, getting occasional false positives out of is_cached is acceptable. We'd want to make it clear to users that it isn't 100% guaranteed.
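The callback idea above could be sketched like this. The hook names (on_cache_read, on_recompute) and the wrapper class are hypothetical, not a joblib API; they only illustrate where the two notification points would sit relative to the cache check.

```python
class CallbackMemoizedFunc:
    """Hypothetical wrapper firing callbacks around cache activity.

    is_cached is injected here for self-containment; in joblib it would
    be the method this PR adds to the memoized function itself.
    """

    def __init__(self, func, is_cached, on_cache_read=None, on_recompute=None):
        self.func = func
        self._is_cached = is_cached
        self.on_cache_read = on_cache_read
        self.on_recompute = on_recompute

    def __call__(self, *args, **kwargs):
        if self._is_cached(*args, **kwargs):
            # A cache read is about to begin: let client code react
            # (e.g. take a shared read lock, emit a progress event).
            if self.on_cache_read is not None:
                self.on_cache_read(args, kwargs)
        else:
            # A recomputation is about to begin: client code could take
            # an exclusive lock here to avoid a stampede.
            if self.on_recompute is not None:
                self.on_recompute(args, kwargs)
        return self.func(*args, **kwargs)
```

Unlike a standalone is_cached check, the callbacks fire at the moment the decision is made, which narrows (though does not eliminate) the window for false positives.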
What's the status on this? Can it be added? |
Make it so that client code can check whether a call to a wrapped function will require recomputation or whether the result can be loaded from disk.

My use case for this is allowing a distributed system to more intelligently manage locks around the calculation of cached objects.