
Add a public is_cached method #407

Closed
alec-deason wants to merge 6 commits into joblib:master from alec-deason:master

Conversation

@alec-deason

Make it so that client code can check whether a call to a wrapped function will require recomputation or whether its result can be loaded from disk.

My use case for this is to allow a distributed system to more intelligently manage locks around the calculation of cached objects.
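
For illustration, here is a minimal, self-contained sketch of how the proposed method could behave. `FakeMemorizedFunc`, `_output_path`, and the repr-based file naming are invented stand-ins for joblib's real `MemorizedFunc` and argument hashing, not the actual implementation:

```python
import os
import tempfile

class FakeMemorizedFunc:
    """Toy stand-in for joblib's MemorizedFunc, for illustration only."""

    def __init__(self, func, cachedir):
        self.func = func
        self.cachedir = cachedir

    def _output_path(self, *args):
        # Real joblib hashes the arguments; a plain repr stands in here.
        return os.path.join(self.cachedir, repr(args) + ".out")

    def is_cached(self, *args):
        """Return True if a previous call with these args is on disk."""
        return os.path.exists(self._output_path(*args))

    def __call__(self, *args):
        path = self._output_path(*args)
        if os.path.exists(path):
            # Cache hit: load the stored result instead of recomputing.
            with open(path) as f:
                return int(f.read())
        result = self.func(*args)
        with open(path, "w") as f:
            f.write(str(result))
        return result

with tempfile.TemporaryDirectory() as d:
    square = FakeMemorizedFunc(lambda x: x * x, d)
    print(square.is_cached(3))   # False: nothing cached yet
    print(square(3))             # 9, computed and written to disk
    print(square.is_cached(3))   # True: a later call would load from cache
```

The key point is that `is_cached` only checks for the output file; it never loads or recomputes anything.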

Contributor

@aabadie aabadie left a comment


Thanks for proposing this PR @alec-deason. I think it kind of makes sense. Btw, I made a few comments on the changes.
I'm also wondering if it's worth adding a dummy is_cached function to the NotMemorizedFunc object, as it may happen that a cached function is not memorized (because cachedir is None, see this line).

Maybe @GaelVaroquaux has an opinion on this new is_cached method ?
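
That suggestion could look something like the sketch below. `NotMemorizedFuncSketch` is an invented stand-in for joblib's `NotMemorizedFunc`; the point is only that a trivial `is_cached` returning False would keep the two classes interchangeable:

```python
# When cachedir is None, Memory.cache returns a wrapper that never caches,
# so nothing is ever on disk and is_cached can always answer False.
class NotMemorizedFuncSketch:
    """Toy stand-in for joblib's NotMemorizedFunc, for illustration only."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        # No caching: just call through to the wrapped function.
        return self.func(*args, **kwargs)

    def is_cached(self, *args, **kwargs):
        # No cachedir means no call is ever cached.
        return False

f = NotMemorizedFuncSketch(abs)
print(f(-2), f.is_cached(-2))  # 2 False
```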

joblib/memory.py Outdated
output_dir, argument_hash = self._get_output_dir(*args, **kwargs)
return self._needs_call(output_dir, argument_hash)

def _needs_call(self, output_dir, argument_hash):
Contributor


argument_hash is not used by the implementation of this function. I think it can be removed.

Author


Yep, totally.

def _needs_call(self, output_dir, argument_hash):
"""Check if the function needs to be called or if its output can be
loaded from cache

Contributor


can you add a Parameters section in the docstring?

joblib/memory.py Outdated
cached: bool
Whether or not the call is already cached
"""
output_dir, argument_hash = self._get_output_dir(*args, **kwargs)
Contributor


See my comment below: argument_hash is not used afterwards. The variable can be replaced by _.

Contributor

@aabadie aabadie Oct 5, 2016


Can you replace argument_hash with _ as it is not used?

Author


Yep

out, metadata = self.call(*args, **kwargs)
argument_hash = None
else:
# FIXME: The statements below should be try/excepted
Contributor


Why?

Author


I'm unsure about this. The comment is in the original code that I modified, and I didn't want to lose it because it's not clear to me how important it is, so I moved it along with the block it was referring to. I can dig more deeply if you like, but my goal here was to keep my changes minimal since I'm a tourist in this code base.

Contributor


Ah yes, sorry I missed the line in the previous code.

@lesteve
Member

lesteve commented Oct 3, 2016

My use case for this is to allow a distributed system to more intelligently manage locks around the calculation of cached objects.

Out of interest, can you expand on this a little bit so I get a better feeling for your use case?

@alec-deason
Author

@lesteve I have a group of workers who share a cache where the objects take a non-trivial amount of time to read in. I'd like for the first worker to request an object to be able to set a lock while it calculates and then once the object is on disk release the lock so the other workers can read in parallel. In order to do that I need to know if calling the function will cause the object to be recalculated.
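
The pattern described above can be sketched with `threading.Lock` standing in for a distributed lock and a dict standing in for the on-disk cache; `is_cached` here simulates the proposed method:

```python
import threading

cache = {}                 # stand-in for the shared on-disk cache
lock = threading.Lock()    # stand-in for a distributed lock

def is_cached(key):
    # Simulates the proposed is_cached: cheap check, no loading.
    return key in cache

def expensive(key):
    # Placeholder for a slow computation.
    return key * 2

def get(key):
    if not is_cached(key):
        # Only the first worker to arrive computes; the others block on
        # the lock, then find the value cached and read it in parallel.
        with lock:
            if not is_cached(key):   # re-check after acquiring the lock
                cache[key] = expensive(key)
    return cache[key]

threads = [threading.Thread(target=get, args=(21,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cache)  # {21: 42}
```

The double check inside the lock is what prevents a stampede when several workers see a cold cache at the same time.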

@lesteve
Member

lesteve commented Oct 4, 2016

Just in case: are you aware that joblib already supports your use case? What I mean is that joblib is supposed to be robust to different processes trying to write to the same cache location. Let us know if you have found problems with this!

I am guessing you are worried about wasted computation if all the workers try to compute a result that has not yet been cached at around the same time, is that right?

@alec-deason
Author

Yes, and I've used it without locking previously. The computation time and load on external systems are my primary concerns. In general I prewarm the cache so none of this is a problem, but I'd like to have the tools to guarantee that it won't stampede if for some reason the cache is cold.

@alec-deason
Author

The error in Travis appears to be an internal timeout unrelated to this change. Is there a way to rerun it without pushing a dummy commit or the like? I don't seem to have the necessary permission.

@lesteve
Member

lesteve commented Oct 5, 2016

Is there a way to rerun it without pushing a dummy commit or such? I don't seem to have the necessary permission.

I restarted the build manually. git commit --amend + force push is a sneaky way to get all the CIs to rerun. To rebuild via the Travis web UI you need to have rights on the project indeed.

@alec-deason
Author

git commit --amend + force push is a sneaky way to get all the CIs to rerun.

Good to know. Thanks.

@lesteve
Member

lesteve commented Oct 6, 2016

@GaelVaroquaux @ogrisel do you have an opinion on this one?

The obvious problem I see with this one is that is_cached may give false positives, e.g. if output.pkl is corrupted somehow. In other words there is no way of knowing if the result is cached other than loading it.

@alec-deason
Author

The obvious problem I see with this one is that is_cached may give false positives, e.g. if output.pkl is corrupted somehow. In other words there is no way of knowing if the result is cached other than loading it.

Another approach which I think could address this would be to have a callback that gets triggered when a cache read or a cache recomputation begins, so that client code has an opportunity to fiddle with its locks (or send a progress notification, or whatever the need is). That's a slightly more complicated implementation, but only slightly. I'm happy to write it up.

Also, at least for my use case, getting occasional false positives out of is_cached is acceptable. We'd want to make it clear to users that it isn't 100% guaranteed.
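
A rough sketch of this callback alternative; the hook names on_cache_read and on_compute are invented here for illustration, and nothing like them exists in joblib today:

```python
class CachedCall:
    """Toy memoizer that fires hooks when it reads the cache or recomputes."""

    def __init__(self, func, on_cache_read=None, on_compute=None):
        self.func = func
        self.store = {}
        self.on_cache_read = on_cache_read or (lambda key: None)
        self.on_compute = on_compute or (lambda key: None)

    def __call__(self, *args):
        if args in self.store:
            self.on_cache_read(args)   # e.g. take a shared/read lock
            return self.store[args]
        self.on_compute(args)          # e.g. take an exclusive lock
        result = self.func(*args)
        self.store[args] = result
        return result

events = []
f = CachedCall(lambda x: x + 1,
               on_cache_read=lambda k: events.append(("read", k)),
               on_compute=lambda k: events.append(("compute", k)))
f(1)
f(1)
print(events)  # [('compute', (1,)), ('read', (1,))]
```

Because the hooks fire at the moment joblib decides to read or recompute, they cannot give the false positives that a separate is_cached check can.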

@mfranzs

mfranzs commented Apr 27, 2018

What's the status on this? Can it be added?


5 participants