Skip to content

ENH: delegate runtime resource cleanup to resource_tracker#228

Merged
ogrisel merged 33 commits intojoblib:masterfrom
pierreglaser:resource-tracker-refcounting
Feb 19, 2020
Merged

ENH: delegate runtime resource cleanup to resource_tracker#228
ogrisel merged 33 commits intojoblib:masterfrom
pierreglaser:resource-tracker-refcounting

Conversation

@pierreglaser
Copy link
Copy Markdown
Collaborator

@pierreglaser pierreglaser commented Dec 5, 2019

This PR proposes to make resource_tracker in charge of cleaning up shared resources between between Python processes AT RUNTIME (and not only during shutdown as it is currently done). It does so by reference counting objects each time a Python process demands it (e.g means a Python process sends a "REGISTER"/"UNREGISTER" request to the resource_tracker). If an object's refcount drops to 0, the resource_tracker cleans it up. Previously, Python processes themselves would clean up the shared objects, and the resource_tracker would only intervene at the end of the program when all other processes connected to it have exited.

This PR is principally motivated by joblib/joblib#806, that shows the need for a proper way to track shared memmaped files usage between processes, as well as clean them up as soon as possible (and not wait for the shutdown of the joblib worker pool).

The resource tracker is now in charge of cleaning up shared
resources between between Python processes. It does so by
reference counting objects each time a Python process demands it
(this means the Python process sends a "REGISTER"/"UNREGISTER"
request to the resource_tracker). If an object's refcount
drops to 0, the resource tracker cleans it up. Previously,
Python processes themselves would clean up the shared objects,
and the resource_tracker would only intervene at the end of the program
when all other processes connected to it have exited.
Copy link
Copy Markdown
Collaborator Author

@pierreglaser pierreglaser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing myself to not forget anything because I'm about to do something else afterwards. @ogrisel it's still WIP but feel free to take a look.

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it I think we should still preserve the existing behavior with "at exit" cleanup of registered resources by default. Un-registering should not trigger cleanups but instead make sure that the resource tracker does not collect the resource at all.

So instead we should probably add a new method for handling the "decref" with optional cleanup when the refcounts drops to 0. This new method could be called def maybe_cleanup(self, name).

The incref part can still be achieved via the "register" method (no need to introduce a new method).

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Dec 5, 2019

maybe_collect or even maybe_unlink instead of maybe_cleanup?

The module docstring uses the unlink verb everywhere. So let's stick to it.

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on typos / wording improvements.

_CLEANUP_FUNCS = {
'folder': shutil.rmtree
'folder': shutil.rmtree,
'file': os.unlink
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should wrap those functions to ignore OSError (no such file or directory) if the resource has already been unlinked by another process.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree It would make some test logs less verbose, but I'm not sure about this change yet because it is useful information to know when we overshoot the number of maybe_cleanup calls, especially as I am calibrating the reference counting process of shared numpy arrays in joblib.

@pierreglaser
Copy link
Copy Markdown
Collaborator Author

I still have to push the MAYBE_UNLINK feature that i'm testing locally now.

now that unregister conserves its previous behavior, there is
no reason to change the semaphores finalizer.
while True:
line = f.readline()
if line == b'': # EOF
break
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: why no for loop anymore?

Copy link
Copy Markdown
Collaborator Author

@pierreglaser pierreglaser Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a for-loop, in Python 2, the resource_tracker would wait until it got EOF before iterating over the lines written inside the pipe, processing them all at once (during shutdown). This is a bug that is actually present in loky master (but only for Python 2). I have not looked whether this behavior was documented or not.

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is starting to look good. Here are some more comments. Don't forget to add an entry to the changelog.

@pierreglaser pierreglaser force-pushed the resource-tracker-refcounting branch from 3f3e410 to f88ed6d Compare December 12, 2019 20:32
Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments:

@pierreglaser pierreglaser force-pushed the resource-tracker-refcounting branch from 608abff to c3d383e Compare January 14, 2020 16:51
Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New batch of review:

)
p.wait()
out, err = p.communicate()
assert p.returncode == 0, err
Copy link
Copy Markdown
Collaborator Author

@pierreglaser pierreglaser Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test errors, using this syntax instead of check_output leads to a much cleaner error reporting.

@pierreglaser
Copy link
Copy Markdown
Collaborator Author

@ogrisel all tests are passing. Only coverage is complaining (for now)

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remarks I made in the top level descriptions were marked as resolved where not addressed if I am not mistaken.

Co-Authored-By: Olivier Grisel <olivier.grisel@ensta.org>
@pierreglaser
Copy link
Copy Markdown
Collaborator Author

My bad.

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Feb 19, 2020

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, just a few suggestions for additional fixes / improvements:

try:
FileNotFoundError = FileNotFoundError
except NameError: # FileNotFoundError is Python 3-only
from loky.backend.semlock import FileNotFoundError
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this could be simplified by always importing:

from loky.backend.semlock import FileNotFoundError

Copy link
Copy Markdown
Collaborator Author

@pierreglaser pierreglaser Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, because FileNotFoundError is not a global of loky.backend.semlock for Python 3, it is a item of the __builtins__. At least it does not work on my machine.

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Feb 19, 2020

Can you try to merge master into this branch to check whether that can help codedoc? The report link is broken on this PR for some reason.

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Feb 19, 2020

Hum codecov is now green... Maybe that was just the PyPy report that was late.

pierreglaser and others added 4 commits February 19, 2020 15:55
Co-Authored-By: Olivier Grisel <olivier.grisel@ensta.org>
Co-Authored-By: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Feb 19, 2020

The pthread import needs to be protected on Windows:

ImportError: cannot import name 'pthread' from 'loky.backend.semlock' (D:\a\1\s\loky\backend\semlock.py

@pierreglaser
Copy link
Copy Markdown
Collaborator Author

pierreglaser commented Feb 19, 2020

For some reason my last commit (fixinng the windows error) are not showing up.
EDIT: github's back on.

@ogrisel ogrisel merged commit 74737d7 into joblib:master Feb 19, 2020
@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Feb 19, 2020

Merged! Thank you very much @pierreglaser.

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Feb 19, 2020

I forgot to ask you to document the new feature in the changelog.

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented Feb 19, 2020

@pierreglaser we should now target 2.7.0 in the changelog because this is a new feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants