FIX make resource_tracker compatible with py3.13.7+ #461
tomMoral merged 4 commits into joblib:master
I ran the tests for Python 3.13.7 on both Windows and Linux and they pass. Mentioning this because the CI uses conda to install Python, which means that it is not going to test the Python 3.13.7 issue (but of course it still makes sure that everything still works with older Python versions).

I will also test on macOS. Can you please add a changelog entry in the meantime? I think we want to do a minor release just for this fix.

I had a quick look and it seems like the diff is simpler than I thought (having looked a bit at CPython's Lib/multiprocessing/resource_tracker.py in the past few days).
My understanding is that you took the Python 3.13.7 code from Lib/multiprocessing/resource_tracker.py (e.g. _launch, _teardown_dead_process, _ensure_running_and_write) and adapted it for loky's Windows support.
I guess you still need to override ensure_running because this is going to be called for Python <= 3.13.6 when using register (_ensure_running_and_write is the one getting called instead in Python 3.13.7 when using register), see #459 (comment).
You also left some things out (like the re-entrant logic) but I think it's fine to fix the issue first and we can maybe look at the re-entrant logic later.
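To make the point about the two entry points concrete, here is a minimal sketch of the delegation pattern, assuming a simplified tracker. SketchTracker, launch_count, written and the use_new_entry_point flag are hypothetical names for illustration only; this is not loky's or CPython's actual code, and the message format is only illustrative.

```python
class SketchTracker:
    """Hypothetical stand-in for a resource tracker (not loky's real class)."""

    def __init__(self):
        self._fd = None        # placeholder for the pipe to the tracker process
        self.launch_count = 0  # how many times the tracker was launched
        self.written = []      # messages that would be sent over the pipe

    def _launch(self):
        # Placeholder for spawning the tracker process and opening its pipe.
        self.launch_count += 1
        self._fd = object()

    def _ensure_running_and_write(self, msg=None):
        # Entry point that register uses on Python 3.13.7 according to this
        # thread: start the tracker if needed, then optionally write a message.
        if self._fd is None:
            self._launch()
        if msg is not None:
            self.written.append(msg)

    def ensure_running(self):
        # Entry point that register uses on Python <= 3.13.6 according to
        # this thread: delegate so both paths share a single launch routine.
        return self._ensure_running_and_write()

    def register(self, name, rtype, use_new_entry_point):
        # use_new_entry_point is an illustrative switch that mimics the
        # version-dependent behaviour of the caller.
        msg = f"REGISTER:{name}:{rtype}\n".encode("ascii")
        if use_new_entry_point:
            self._ensure_running_and_write(msg)
        else:
            self.ensure_running()
            self.written.append(msg)


if __name__ == "__main__":
    tracker = SketchTracker()
    tracker.register("a", "folder", use_new_entry_point=False)
    tracker.register("b", "file", use_new_entry_point=True)
    print(tracker.launch_count, len(tracker.written))
```

Whichever entry point the caller picks, the custom launch routine runs at most once per tracker, which is the property the override is meant to guarantee.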
loky/backend/resource_tracker.py
        return self._ensure_running_and_write()

    def _teardown_dead_process(self):
        # Backward compatibility for python version before 3.13.7
Not sure how this is related to backward compatibility, but maybe I am missing something ...
I expanded on this comment: we need to add this function so it can be called before Python 3.13.7, and to handle compatibility with Windows.
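A quick way to see which of the helpers named in this thread the running interpreter provides is to inspect the stdlib class directly. This is only a diagnostic sketch: HELPERS and available_helpers are hypothetical names, and apart from ensure_running the result depends on the CPython version in use.

```python
import sys
from multiprocessing.resource_tracker import ResourceTracker

# Method names mentioned in this thread. Whether the private ones exist
# depends on the CPython version that is running.
HELPERS = (
    "ensure_running",
    "_ensure_running_and_write",
    "_teardown_dead_process",
)


def available_helpers(cls=ResourceTracker):
    """Map each helper name to whether cls defines or inherits it."""
    return {name: hasattr(cls, name) for name in HELPERS}


if __name__ == "__main__":
    print(sys.version_info[:3], available_helpers())
```

A subclass that defines the missing helpers itself does not need to branch on the version at call time, which seems to be the intent of the compatibility comment above.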

When I run on macOS with Python 3.13.7 on this branch, then I get a freeze on EDIT: when I run the I will try to get

Actually, if I wait long enough then the built-in timeout mechanism of the test works:
________________________________________________________________ TestLokyBackend.test_terminate _________________________________________________________________
self = <tests.test_loky_backend.TestLokyBackend object at 0x103b67a70>
    def test_terminate(self):
        manager = self.Manager()
        event = manager.Event()
        p = self.Process(target=self._test_terminate, args=(event,))
        p.daemon = True
        p.start()
        assert p.is_alive()
        assert p in self.active_children()
        assert p.exitcode is None
        join = TimingWrapper(p.join)
        assert join(0) is None
        join.assert_timing_almost_zero()
        assert p.is_alive()
        assert join(-1) is None
        join.assert_timing_almost_zero()
        assert p.is_alive()
        # wait for child process to be fully setup
        event.wait(5)
        p.terminate()
        MAX_JOIN_TIME = 10
        if hasattr(signal, "alarm"):
            # On the Gentoo buildbot waitpid() often seems to block forever.
            # We use alarm() to interrupt it if it blocks for too long.
            def handler(*args):
                raise RuntimeError(f"join took too long: {p}")
            old_handler = signal.signal(signal.SIGALRM, handler)
            try:
                signal.alarm(MAX_JOIN_TIME)
                assert join() is None
            finally:
                signal.alarm(0)
                signal.signal(signal.SIGALRM, old_handler)
        else:
            assert join() is None
>       join.assert_timing_lower_than(MAX_JOIN_TIME)
MAX_JOIN_TIME = 10
event = <EventProxy object, typeid 'Event' at 0x103caba10>
handler = <function TestLokyBackend.test_terminate.<locals>.handler at 0x103e60cc0>
join = <tests.utils.TimingWrapper object at 0x103cabb60>
manager = <multiprocessing.managers.SyncManager object at 0x103cab8c0>
old_handler = <function pytest_timeout_set_timer.<locals>.handler at 0x103cb8040>
p = <LokyProcess name='LokyProcess-13' pid=80182 parent=80030 stopped exitcode=0 daemon>
self = <tests.test_loky_backend.TestLokyBackend object at 0x103b67a70>
tests/test_loky_backend.py:340:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <tests.utils.TimingWrapper object at 0x103cabb60>, delay = 10
    def assert_timing_lower_than(self, delay):
        msg = (
            f"expected duration lower than {delay:.3f}s, "
            f"got {self.elapsed:.3f}s"
        )
>       assert self.elapsed < delay, msg
               ^^^^^^^^^^^^^^^^^^^^
E       AssertionError: expected duration lower than 10.000s, got 100.032s
delay = 10
msg = 'expected duration lower than 10.000s, got 100.032s'
self = <tests.utils.TimingWrapper object at 0x103cabb60>
tests/utils.py:101: AssertionError
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
[DEBUG:MainProcess:MainThread] launched python with pid 80181 and cmd:
['/Users/ogrisel/uvvenv/bin/python', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'SyncManager-12', '--pipe', '16']
[INFO:SyncManager-12:MainThread] child process calling self.run()
[INFO:SyncManager-12:MainThread] manager serving at '/var/folders/93/0_k_9bh97fj_15hzk2bhy5r00000gn/T/pymp-4x8da0hs/sock-q07jd17_'
[DEBUG:MainProcess:MainThread] requesting creation of a shared 'Event' object
[DEBUG:SyncManager-12:Thread-2 (handle_request)] 'Event' callable returned object with id '104fe5220'
[DEBUG:MainProcess:MainThread] INCREF '104fe5220'
[DEBUG:MainProcess:MainThread] launched python with pid 80182 and cmd:
['/Users/ogrisel/uvvenv/bin/python', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-13', '--pipe', '14']
[DEBUG:MainProcess:MainThread] thread 'MainThread' does not own a connection
[DEBUG:MainProcess:MainThread] making connection to manager
[DEBUG:SyncManager-12:MainProcess] starting server thread to service 'MainProcess'
[DEBUG:LokyProcess-13:MainThread] INCREF '104fe5220'
[INFO:LokyProcess-13:MainThread] child process calling self.run()
[DEBUG:LokyProcess-13:MainThread] thread 'MainThread' does not own a connection
[DEBUG:LokyProcess-13:MainThread] making connection to manager
[DEBUG:SyncManager-12:LokyProcess-13] starting server thread to service 'LokyProcess-13'
[INFO:LokyProcess-13:MainThread] process exiting with exitcode 0
[INFO:LokyProcess-13:MainThread] process shutting down
[DEBUG:LokyProcess-13:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:LokyProcess-13:MainThread] DECREF '104fe5220'
[DEBUG:LokyProcess-13:MainThread] thread 'MainThread' has no more proxies so closing conn
[DEBUG:LokyProcess-13:MainThread] running the remaining "atexit" finalizers
[DEBUG:SyncManager-12:LokyProcess-13] got EOF -- exiting thread serving 'LokyProcess-13'
==================================================================== short test summary info ====================================================================
FAILED tests/test_loky_backend.py::TestLokyBackend::test_terminate - AssertionError: expected duration lower than 10.000s, got 100.032s
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=========================================================== 1 failed, 35 passed in 102.62s (0:01:42) ============================================================
[INFO:MainProcess:MainThread] process shutting down
[DEBUG:MainProcess:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:MainProcess:MainThread] DECREF '104fe5220'
[DEBUG:MainProcess:MainThread] ... decref failed [Errno 61] Connection refused
[DEBUG:MainProcess:MainThread] thread 'MainThread' has no more proxies so closing conn
[DEBUG:MainProcess:MainThread] running the remaining "atexit" finalizers

Unfortunately I cannot spot the root cause of the freeze from this output alone.
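For readers unfamiliar with the guard used in test_terminate above, here is a standalone sketch of the same alarm-based pattern. timed_call_with_alarm is a hypothetical helper, not part of loky's test suite; it assumes it is called from the main thread, since signal handlers can only be installed there.

```python
import signal
import time


def timed_call_with_alarm(func, max_seconds):
    """Call func() and return (result, elapsed seconds).

    Where signal.alarm is available (POSIX), a SIGALRM handler raises
    RuntimeError if func() blocks for longer than max_seconds.
    """

    def handler(signum, frame):
        raise RuntimeError(f"call took longer than {max_seconds}s")

    start = time.monotonic()
    if hasattr(signal, "alarm"):
        old_handler = signal.signal(signal.SIGALRM, handler)
        try:
            signal.alarm(max_seconds)  # schedule SIGALRM in max_seconds
            result = func()
        finally:
            signal.alarm(0)  # cancel any pending alarm
            signal.signal(signal.SIGALRM, old_handler)  # restore handler
    else:
        result = func()
    return result, time.monotonic() - start


if __name__ == "__main__":
    value, elapsed = timed_call_with_alarm(lambda: 42, 5)
    print(value, elapsed < 5)
```

Note that the guard temporarily replaces whatever SIGALRM handler was installed before the call and restores it afterwards.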

I managed to get more info using the following command:
but this is not the full story because the pytest process is still stuck after that and I have to keep killing Python processes manually.

If I skip this test then the next failure is:
tests/test_loky_backend.py::TestLokyBackend::test_wait_sentinel FAILED [ 12%]
=========================================================================== FAILURES ============================================================================
______________________________________________________________ TestLokyBackend.test_wait_sentinel _______________________________________________________________
self = <tests.test_loky_backend.TestLokyBackend object at 0x1074c28f0>
    def test_wait_sentinel(self):
        p = self.Process(target=self._test_wait_sentinel)
        with pytest.raises(ValueError):
            p.sentinel
        p.start()
        assert isinstance(p.sentinel, int)
        assert not wait([p.sentinel], timeout=0.0)
        assert wait([p.sentinel], timeout=5), p.exitcode
        expected_code = 15 if sys.platform == "win32" else -15
        p.join()  # force refresh of p.exitcode
>       assert p.exitcode == expected_code
E       AssertionError: assert 0 == -15
E        +  where 0 = <LokyProcess name='LokyProcess-16' pid=90410 parent=90236 stopped exitcode=0>.exitcode
expected_code = -15
p = <LokyProcess name='LokyProcess-16' pid=90410 parent=90236 stopped exitcode=0>
self = <tests.test_loky_backend.TestLokyBackend object at 0x1074c28f0>
tests/test_loky_backend.py:423: AssertionError
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
[DEBUG:MainProcess:MainThread] launched python with pid 90410 and cmd:
['/Users/ogrisel/uvvenv/bin/python', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-16', '--pipe', '13']
[INFO:LokyProcess-16:MainThread] child process calling self.run()
[INFO:LokyProcess-16:MainThread] process exiting with exitcode 0
[INFO:LokyProcess-16:MainThread] process shutting down
[DEBUG:LokyProcess-16:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:LokyProcess-16:MainThread] running the remaining "atexit" finalizers
==================================================================== short test summary info ====================================================================
FAILED tests/test_loky_backend.py::TestLokyBackend::test_wait_sentinel - AssertionError: assert 0 == -15
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================================== 1 failed, 38 passed, 1 deselected in 3.47s ===========================================================
[INFO:MainProcess:MainThread] process shutting down
[DEBUG:MainProcess:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:MainProcess:MainThread] running the remaining "atexit" finalizers
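For context on the expected_code value in the failing assertion above: on POSIX, a child killed by a signal reports the negated signal number as its exit code, so -15 means "terminated by SIGTERM", while 0 means the child exited normally. The sketch below illustrates that convention with the stdlib subprocess module rather than LokyProcess; exit_code_after_terminate is a hypothetical helper and the Windows value used by the test is not covered here.

```python
import signal
import subprocess
import sys


def exit_code_after_terminate():
    """Start a sleeping child, terminate it, and return its exit code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.terminate()  # sends SIGTERM on POSIX
    return child.wait(timeout=30)


if __name__ == "__main__" and sys.platform != "win32":
    # On POSIX the return code is the negated signal number.
    print(exit_code_after_terminate() == -signal.SIGTERM)
```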

I can also reproduce the If I run the resource tracker tests (with
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
The gist of this fix is to make sure ensure_running and _ensure_running_and_write call our version of launch, which handles:
resource_tracker, which does some ref counting for the folders/files.
Maybe to avoid future headaches, we would like to upstream:
resource_tracker
The latter part is fairly easy (just adapting the launch to duplicate the pipe and handling specific actions) but the former is more involved as it requires an API....
Fixes #459