
Add a public is_cached method #407

Closed
alec-deason wants to merge 6 commits into joblib:master from alec-deason:master

Conversation

@alec-deason

Make it so that client code can check whether a call to a wrapped function will require recomputation or whether its result can be loaded from disk.

My use case for this is to allow a distributed system to more intelligently manage locks around the calculation of cached objects.
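
For illustration, here is a minimal, self-contained sketch of how the proposed method could behave. `FakeMemorizedFunc`, `_output_path`, and the repr-based file naming are invented stand-ins for joblib's real `MemorizedFunc` and argument hashing, not the actual implementation:

```python
import os
import tempfile

class FakeMemorizedFunc:
    """Toy stand-in for joblib's MemorizedFunc, for illustration only."""

    def __init__(self, func, cachedir):
        self.func = func
        self.cachedir = cachedir

    def _output_path(self, *args):
        # Real joblib hashes the arguments; a plain repr stands in here.
        return os.path.join(self.cachedir, repr(args) + ".out")

    def is_cached(self, *args):
        """Return True if a previous call with these args is on disk."""
        return os.path.exists(self._output_path(*args))

    def __call__(self, *args):
        path = self._output_path(*args)
        if os.path.exists(path):
            # Cache hit: load the stored result instead of recomputing.
            with open(path) as f:
                return int(f.read())
        result = self.func(*args)
        with open(path, "w") as f:
            f.write(str(result))
        return result

with tempfile.TemporaryDirectory() as d:
    square = FakeMemorizedFunc(lambda x: x * x, d)
    print(square.is_cached(3))   # False: nothing cached yet
    print(square(3))             # 9, computed and written to disk
    print(square.is_cached(3))   # True: a later call would load from cache
```

The key point is that `is_cached` only checks for the output file; it never loads or recomputes anything.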

Contributor

@aabadie aabadie left a comment


Thanks for proposing this PR @alec-deason. I think it kind of makes sense. Btw, I made a few comments on the changes.
I'm also wondering if it's worth adding a dummy is_cached function to the NotMemorizedFunc object, as it may happen that a cached function is not memorized (because cachedir is None, see this line).

Maybe @GaelVaroquaux has an opinion on this new is_cached method ?
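
That suggestion could look something like the sketch below. `NotMemorizedFuncSketch` is an invented stand-in for joblib's `NotMemorizedFunc`; the point is only that a trivial `is_cached` returning False would keep the two classes interchangeable:

```python
# When cachedir is None, Memory.cache returns a wrapper that never caches,
# so nothing is ever on disk and is_cached can always answer False.
class NotMemorizedFuncSketch:
    """Toy stand-in for joblib's NotMemorizedFunc, for illustration only."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        # No caching: just call through to the wrapped function.
        return self.func(*args, **kwargs)

    def is_cached(self, *args, **kwargs):
        # No cachedir means no call is ever cached.
        return False

f = NotMemorizedFuncSketch(abs)
print(f(-2), f.is_cached(-2))  # 2 False
```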

joblib/memory.py Outdated
output_dir, argument_hash = self._get_output_dir(*args, **kwargs)
return self._needs_call(output_dir, argument_hash)

def _needs_call(self, output_dir, argument_hash):
Contributor


argument_hash is not used by the implementation of this function. I think it can be removed.

Author


Yep, totally.

def _needs_call(self, output_dir, argument_hash):
"""Check if the function needs to be called or if its output can be
loaded from cache

Contributor


can you add a Parameters section in the docstring?

joblib/memory.py Outdated
cached: bool
Whether or not the call is already cached
"""
output_dir, argument_hash = self._get_output_dir(*args, **kwargs)
Contributor


See my comment below: argument_hash is not used afterwards. The variable can be replaced by _.

Contributor

@aabadie aabadie Oct 5, 2016


Can you replace argument_hash with _ as it is not used?

Author


Yep

out, metadata = self.call(*args, **kwargs)
argument_hash = None
else:
# FIXME: The statements below should be try/excepted
Contributor


Why?

Author


I'm unsure about this. The comment is in the original code that I modified, and I didn't want to lose it because it's not clear to me how important it is, so I moved it along with the block it was referring to. I can dig more deeply if you like, but my goal here was to keep my changes minimal since I'm a tourist in this code base.

Contributor


Ah yes, sorry I missed the line in the previous code.

@lesteve
Member

lesteve commented Oct 3, 2016

My use case for this is to allow a distributed system to more intelligently manage locks around the calculation of cached objects.

Out of interest, can you expand on this a little bit so I get a better feeling for your use case?

@alec-deason
Author

@lesteve I have a group of workers who share a cache where the objects take a non-trivial amount of time to read in. I'd like for the first worker to request an object to be able to set a lock while it calculates and then once the object is on disk release the lock so the other workers can read in parallel. In order to do that I need to know if calling the function will cause the object to be recalculated.
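
The pattern described above can be sketched with `threading.Lock` standing in for a distributed lock and a dict standing in for the on-disk cache; `is_cached` here simulates the proposed method:

```python
import threading

cache = {}                 # stand-in for the shared on-disk cache
lock = threading.Lock()    # stand-in for a distributed lock

def is_cached(key):
    # Simulates the proposed is_cached: cheap check, no loading.
    return key in cache

def expensive(key):
    # Placeholder for a slow computation.
    return key * 2

def get(key):
    if not is_cached(key):
        # Only the first worker to arrive computes; the others block on
        # the lock, then find the value cached and read it in parallel.
        with lock:
            if not is_cached(key):   # re-check after acquiring the lock
                cache[key] = expensive(key)
    return cache[key]

threads = [threading.Thread(target=get, args=(21,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cache)  # {21: 42}
```

The double check inside the lock is what prevents a stampede when several workers see a cold cache at the same time.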

@lesteve
Member

lesteve commented Oct 4, 2016

Just in case: are you aware that joblib already supports your use case? What I mean is that joblib is supposed to be robust to different processes trying to write to the same cache location. Let us know if you have found problems with this!

I am guessing you are worried about wasted computation if all the workers try to compute a result that has not yet been cached at around the same time, is that right?

@alec-deason
Author

Yes, and I've used it without locking previously. The computation time and load on external systems are my primary concerns. In general I prewarm the cache so none of this is a problem, but I'd like to have the tools to guarantee that it won't stampede if for some reason the cache is cold.

@alec-deason
Author

The error in Travis appears to be an internal timeout unrelated to this change. Is there a way to rerun it without pushing a dummy commit or the like? I don't seem to have the necessary permission.

@lesteve
Member

lesteve commented Oct 5, 2016

Is there a way to rerun it without pushing a dummy commit or such? I don't seem to have the necessary permission.

I restarted the build manually. git commit --amend + force push is a sneaky way to get all the CIs to rerun. To rebuild via the Travis web UI you need to have rights on the project indeed.

@alec-deason
Author

git commit --amend + force push is a sneaky way to get all the CIs to rerun.

Good to know. Thanks.

@lesteve
Member

lesteve commented Oct 6, 2016

@GaelVaroquaux @ogrisel do you have an opinion on this one?

The obvious problem I see with this one is that is_cached may give false positives, e.g. if output.pkl is corrupted somehow. In other words there is no way of knowing if the result is cached other than loading it.

@alec-deason
Author

The obvious problem I see with this one is that is_cached may give false positives, e.g. if output.pkl is corrupted somehow. In other words there is no way of knowing if the result is cached other than loading it.

Another approach which I think could address this would be to have a callback that gets triggered when a cache read or a cache recomputation begins, so that client code has an opportunity to fiddle with its locks (or send a progress notification, or whatever the need is). That's a slightly more complicated implementation, but only slightly. I'm happy to write it up.

Also, at least for my use case, getting occasional false positives out of is_cached is acceptable. We'd want to make it clear to users that it isn't 100% guaranteed.
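
A rough sketch of this callback alternative; the hook names on_cache_read and on_compute are invented here for illustration, and nothing like them exists in joblib today:

```python
class CachedCall:
    """Toy memoizer that fires hooks when it reads the cache or recomputes."""

    def __init__(self, func, on_cache_read=None, on_compute=None):
        self.func = func
        self.store = {}
        self.on_cache_read = on_cache_read or (lambda key: None)
        self.on_compute = on_compute or (lambda key: None)

    def __call__(self, *args):
        if args in self.store:
            self.on_cache_read(args)   # e.g. take a shared/read lock
            return self.store[args]
        self.on_compute(args)          # e.g. take an exclusive lock
        result = self.func(*args)
        self.store[args] = result
        return result

events = []
f = CachedCall(lambda x: x + 1,
               on_cache_read=lambda k: events.append(("read", k)),
               on_compute=lambda k: events.append(("compute", k)))
f(1)
f(1)
print(events)  # [('compute', (1,)), ('read', (1,))]
```

Because the hooks fire at the moment joblib decides to read or recompute, they cannot give the false positives that a separate is_cached check can.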

@mfranzs

mfranzs commented Apr 27, 2018

What's the status on this? Can it be added?


5 participants