
FIX make resource_tracker compatible with py3.13.7+#461

Merged
tomMoral merged 4 commits into joblib:master from tomMoral:FIX_resource_tracker
Aug 27, 2025

@tomMoral
Contributor

@tomMoral tomMoral commented Aug 26, 2025

The gist of this fix is to make sure that ensure_running and ensure_running_and_write call our version of launch, which handles:

  • launching on Windows
  • running our loop for the resource_tracker, which does some ref counting for the folders/files.

To avoid future headaches, we may want to upstream:

  • registering new resources in the default resource_tracker
  • making it runnable on Windows.

The latter part is fairly easy (just adapting the launch to duplicate the pipe and handle specific actions), but the former is more involved as it requires an API....

Fixes #459
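
As a rough illustration of the idea described above, here is a minimal sketch of routing both entry points to a custom launch. It is not loky's actual code: `BaseTracker`, `PatchedTracker`, the `launched_by` attribute and the bodies of all methods are hypothetical stand-ins, and only the method names `ensure_running`, `_ensure_running_and_write` and `_launch` come from the discussion.

```python
class BaseTracker:
    """Hypothetical stand-in for the stdlib tracker, heavily simplified."""

    def __init__(self):
        self._fd = None
        self.launched_by = None  # records which launch code path ran

    def ensure_running(self):
        # Older entry point: the launch logic is inlined here, so a
        # subclass that only overrides _launch would be bypassed.
        if self._fd is None:
            self.launched_by = "stdlib"
            self._fd = 0

    def _ensure_running_and_write(self, msg=None):
        # Newer entry point: delegates the launch to _launch.
        if self._fd is None:
            self._launch()

    def _launch(self):
        self.launched_by = "stdlib"
        self._fd = 0


class PatchedTracker(BaseTracker):
    """Makes both entry points reach the custom launch."""

    def ensure_running(self):
        # Delegate so that older callers also go through _launch.
        return self._ensure_running_and_write()

    def _launch(self):
        # Placeholder for the custom launch (Windows support, custom loop).
        self.launched_by = "custom"
        self._fd = 0


old_style = PatchedTracker()
old_style.ensure_running()

new_style = PatchedTracker()
new_style._ensure_running_and_write()
```

In this sketch both `old_style` and `new_style` end up launched by the custom code path, whichever entry point the caller uses.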

@lesteve
Member

lesteve commented Aug 26, 2025

I ran the tests for Python 3.13.7 on both Windows and Linux and they pass.

Mentioning this because the CI is using conda to install Python which means that it is not going to test the Python 3.13.7 issue (but of course it still makes sure that everything still works with older Python versions).
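
A quick way to check whether a local interpreter is recent enough to exercise the issue is to compare `sys.version_info` against 3.13.7. The helper name below is hypothetical, and the threshold is an assumption taken from the PR title ("py3.13.7+"); it is only a sketch.

```python
import sys


def is_affected_version(version_info=sys.version_info):
    """Return True for interpreters at or above 3.13.7 (assumed threshold)."""
    return tuple(version_info[:3]) >= (3, 13, 7)


# Report the running interpreter version and whether it is in the range.
print(sys.version.split()[0], is_affected_version())
```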

@ogrisel
Contributor

ogrisel commented Aug 26, 2025

I will also test on macOS. Can you please add a changelog entry in the meantime? I think we want to do a minor release just for this fix.

Member

@lesteve lesteve left a comment

I had a quick look and the diff seems simpler than I thought (having looked a bit at CPython's Lib/multiprocessing/resource_tracker.py in the past few days).

My understanding is that you took the Lib/multiprocessing/resource_tracker.py code from Python 3.13.7 (e.g. _launch/_teardown_dead_process/_ensure_running_and_write) and adapted it for loky Windows support.

I guess you still need to override ensure_running because this is going to be called for Python <= 3.13.6 when using register (_ensure_running_and_write is the one getting called instead in Python 3.13.7 when using register), see #459 (comment).

You also left some things out (like the re-entrant logic) but I think it's fine to fix the issue first and we can maybe look at the re-entrant logic later.

        return self._ensure_running_and_write()

    def _teardown_dead_process(self):
        # Backward compatibility for python version before 3.13.7
Member

Not sure how this is related to backward compatibility, but maybe I am missing something ...

Contributor Author

I expanded on this comment: we need to add this function so that it can be called on Python versions before 3.13.7, and to handle compatibility with Windows.

@ogrisel
Contributor

ogrisel commented Aug 26, 2025

When I run on macOS with Python 3.13.7 on this branch, I get a freeze on tests/test_loky_backend.py::TestLokyBackend::test_terminate. I have to kill -9 it manually.

EDIT: when I run test_terminate alone, it passes. So there is a side effect from previous tests (which all pass as well...).

I will try to get a faulthandler traceback, but unfortunately pytest-faulthandler does not seem to load for some reason.

@ogrisel
Contributor

ogrisel commented Aug 26, 2025

Actually, if I wait long enough then the builtin timeout mechanism of the test works:

Details
________________________________________________________________ TestLokyBackend.test_terminate _________________________________________________________________

self = <tests.test_loky_backend.TestLokyBackend object at 0x103b67a70>

    def test_terminate(self):
        manager = self.Manager()
        event = manager.Event()
    
        p = self.Process(target=self._test_terminate, args=(event,))
        p.daemon = True
        p.start()
    
        assert p.is_alive()
        assert p in self.active_children()
        assert p.exitcode is None
    
        join = TimingWrapper(p.join)
    
        assert join(0) is None
        join.assert_timing_almost_zero()
        assert p.is_alive()
    
        assert join(-1) is None
        join.assert_timing_almost_zero()
        assert p.is_alive()
    
        # wait for child process to be fully setup
        event.wait(5)
    
        p.terminate()
    
        MAX_JOIN_TIME = 10
        if hasattr(signal, "alarm"):
            # On the Gentoo buildbot waitpid() often seems to block forever.
            # We use alarm() to interrupt it if it blocks for too long.
            def handler(*args):
                raise RuntimeError(f"join took too long: {p}")
    
            old_handler = signal.signal(signal.SIGALRM, handler)
            try:
                signal.alarm(MAX_JOIN_TIME)
                assert join() is None
            finally:
                signal.alarm(0)
                signal.signal(signal.SIGALRM, old_handler)
        else:
            assert join() is None
    
>       join.assert_timing_lower_than(MAX_JOIN_TIME)

MAX_JOIN_TIME = 10
event      = <EventProxy object, typeid 'Event' at 0x103caba10>
handler    = <function TestLokyBackend.test_terminate.<locals>.handler at 0x103e60cc0>
join       = <tests.utils.TimingWrapper object at 0x103cabb60>
manager    = <multiprocessing.managers.SyncManager object at 0x103cab8c0>
old_handler = <function pytest_timeout_set_timer.<locals>.handler at 0x103cb8040>
p          = <LokyProcess name='LokyProcess-13' pid=80182 parent=80030 stopped exitcode=0 daemon>
self       = <tests.test_loky_backend.TestLokyBackend object at 0x103b67a70>

tests/test_loky_backend.py:340: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tests.utils.TimingWrapper object at 0x103cabb60>, delay = 10

    def assert_timing_lower_than(self, delay):
        msg = (
            f"expected duration lower than {delay:.3f}s, "
            f"got {self.elapsed:.3f}s"
        )
>       assert self.elapsed < delay, msg
               ^^^^^^^^^^^^^^^^^^^^
E       AssertionError: expected duration lower than 10.000s, got 100.032s

delay      = 10
msg        = 'expected duration lower than 10.000s, got 100.032s'
self       = <tests.utils.TimingWrapper object at 0x103cabb60>

tests/utils.py:101: AssertionError
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
[DEBUG:MainProcess:MainThread] launched python with pid 80181 and cmd:
['/Users/ogrisel/uvvenv/bin/python', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'SyncManager-12', '--pipe', '16']
[INFO:SyncManager-12:MainThread] child process calling self.run()
[INFO:SyncManager-12:MainThread] manager serving at '/var/folders/93/0_k_9bh97fj_15hzk2bhy5r00000gn/T/pymp-4x8da0hs/sock-q07jd17_'
[DEBUG:MainProcess:MainThread] requesting creation of a shared 'Event' object
[DEBUG:SyncManager-12:Thread-2 (handle_request)] 'Event' callable returned object with id '104fe5220'
[DEBUG:MainProcess:MainThread] INCREF '104fe5220'
[DEBUG:MainProcess:MainThread] launched python with pid 80182 and cmd:
['/Users/ogrisel/uvvenv/bin/python', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-13', '--pipe', '14']
[DEBUG:MainProcess:MainThread] thread 'MainThread' does not own a connection
[DEBUG:MainProcess:MainThread] making connection to manager
[DEBUG:SyncManager-12:MainProcess] starting server thread to service 'MainProcess'
[DEBUG:LokyProcess-13:MainThread] INCREF '104fe5220'
[INFO:LokyProcess-13:MainThread] child process calling self.run()
[DEBUG:LokyProcess-13:MainThread] thread 'MainThread' does not own a connection
[DEBUG:LokyProcess-13:MainThread] making connection to manager
[DEBUG:SyncManager-12:LokyProcess-13] starting server thread to service 'LokyProcess-13'
[INFO:LokyProcess-13:MainThread] process exiting with exitcode 0
[INFO:LokyProcess-13:MainThread] process shutting down
[DEBUG:LokyProcess-13:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:LokyProcess-13:MainThread] DECREF '104fe5220'
[DEBUG:LokyProcess-13:MainThread] thread 'MainThread' has no more proxies so closing conn
[DEBUG:LokyProcess-13:MainThread] running the remaining "atexit" finalizers
[DEBUG:SyncManager-12:LokyProcess-13] got EOF -- exiting thread serving 'LokyProcess-13'
==================================================================== short test summary info ====================================================================
FAILED tests/test_loky_backend.py::TestLokyBackend::test_terminate - AssertionError: expected duration lower than 10.000s, got 100.032s
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
=========================================================== 1 failed, 35 passed in 102.62s (0:01:42) ============================================================
[INFO:MainProcess:MainThread] process shutting down
[DEBUG:MainProcess:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:MainProcess:MainThread] DECREF '104fe5220'
[DEBUG:MainProcess:MainThread] ... decref failed [Errno 61] Connection refused
[DEBUG:MainProcess:MainThread] thread 'MainThread' has no more proxies so closing conn
[DEBUG:MainProcess:MainThread] running the remaining "atexit" finalizers

Unfortunately I cannot spot the root cause of the freeze from this output alone.

@ogrisel
Contributor

ogrisel commented Aug 26, 2025

I managed to get more info using the following command:

python -q -X faulthandler -m pytest -o faulthandler_timeout=10 -vlx
Details
tests/test_loky_backend.py::TestLokyBackend::test_terminate Timeout (0:00:10)!
Thread 0x000000016dbbf000 (most recent call first):
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/socket.py", line 295 in accept
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/multiprocessing/connection.py", line 626 in accept
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/multiprocessing/connection.py", line 480 in accept
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/multiprocessing/resource_sharer.py", line 138 in _serve
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 994 in run
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1043 in _bootstrap_inner
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/threading.py", line 1014 in _bootstrap

Thread 0x000000020e24e0c0 (most recent call first):
  File "/Users/ogrisel/code/loky/loky/backend/popen_loky_posix.py", line 58 in poll
  File "/Users/ogrisel/code/loky/loky/backend/popen_loky_posix.py", line 79 in wait
  File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/multiprocessing/process.py", line 149 in join
  File "/Users/ogrisel/code/loky/tests/utils.py", line 92 in __call__
  File "/Users/ogrisel/code/loky/tests/test_loky_backend.py", line 333 in test_terminate
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/python.py", line 157 in pytest_pyfunc_call
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/python.py", line 1671 in runtest
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/runner.py", line 178 in pytest_runtest_call
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/runner.py", line 246 in <lambda>
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/runner.py", line 344 in from_call
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/runner.py", line 245 in call_and_report
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/runner.py", line 136 in runtestprotocol
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/runner.py", line 117 in pytest_runtest_protocol
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/main.py", line 367 in pytest_runtestloop
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/main.py", line 343 in _main
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/main.py", line 289 in wrap_session
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/main.py", line 336 in pytest_cmdline_main
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_callers.py", line 121 in _multicall
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pluggy/_hooks.py", line 512 in __call__
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/Users/ogrisel/uvvenv/lib/python3.13/site-packages/pytest/__main__.py", line 9 in <module>
  File "<frozen runpy>", line 88 in _run_code
  File "<frozen runpy>", line 198 in _run_module_as_main

But this is not the full story: the pytest process is still stuck after that, and I have to keep killing python processes manually.
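
For reference, the same kind of timeout dump can be requested from inside a script with the standard library `faulthandler` module. The sketch below is independent of pytest and loky; the helper name `dump_stacks_after` and the sleep used as a stand-in for a hanging call are illustrative assumptions.

```python
import faulthandler
import tempfile
import time


def dump_stacks_after(timeout, work):
    """Run work() and capture a dump of all thread stacks if it outlasts timeout."""
    with tempfile.TemporaryFile() as log:
        # Schedule a traceback dump of every thread into `log` after
        # `timeout` seconds; exit=False keeps the process alive afterwards.
        faulthandler.dump_traceback_later(timeout, file=log, exit=False)
        try:
            work()
        finally:
            # Cancel the pending dump (harmless if it already fired).
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read().decode(errors="replace")


# A one-second sleep stands in for the blocking join() seen above.
report = dump_stacks_after(0.2, lambda: time.sleep(1.0))
print(report)
```

The printed report should list each thread with its frames, most recent call first, in the same layout as the traceback above.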

@ogrisel
Contributor

ogrisel commented Aug 26, 2025

If I skip this test then the next failure is:

python -q -X faulthandler -m pytest -o faulthandler_timeout=10 -vlx -k "not test_terminate"
tests/test_loky_backend.py::TestLokyBackend::test_wait_sentinel FAILED                                                                                    [ 12%]

=========================================================================== FAILURES ============================================================================
______________________________________________________________ TestLokyBackend.test_wait_sentinel _______________________________________________________________

self = <tests.test_loky_backend.TestLokyBackend object at 0x1074c28f0>

    def test_wait_sentinel(self):
        p = self.Process(target=self._test_wait_sentinel)
        with pytest.raises(ValueError):
            p.sentinel
        p.start()
        assert isinstance(p.sentinel, int)
        assert not wait([p.sentinel], timeout=0.0)
        assert wait([p.sentinel], timeout=5), p.exitcode
        expected_code = 15 if sys.platform == "win32" else -15
        p.join()  # force refresh of p.exitcode
>       assert p.exitcode == expected_code
E       AssertionError: assert 0 == -15
E        +  where 0 = <LokyProcess name='LokyProcess-16' pid=90410 parent=90236 stopped exitcode=0>.exitcode

expected_code = -15
p          = <LokyProcess name='LokyProcess-16' pid=90410 parent=90236 stopped exitcode=0>
self       = <tests.test_loky_backend.TestLokyBackend object at 0x1074c28f0>

tests/test_loky_backend.py:423: AssertionError
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
[DEBUG:MainProcess:MainThread] launched python with pid 90410 and cmd:
['/Users/ogrisel/uvvenv/bin/python', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-16', '--pipe', '13']
[INFO:LokyProcess-16:MainThread] child process calling self.run()
[INFO:LokyProcess-16:MainThread] process exiting with exitcode 0
[INFO:LokyProcess-16:MainThread] process shutting down
[DEBUG:LokyProcess-16:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:LokyProcess-16:MainThread] running the remaining "atexit" finalizers
==================================================================== short test summary info ====================================================================
FAILED tests/test_loky_backend.py::TestLokyBackend::test_wait_sentinel - AssertionError: assert 0 == -15
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================================== 1 failed, 38 passed, 1 deselected in 3.47s ===========================================================
[INFO:MainProcess:MainThread] process shutting down
[DEBUG:MainProcess:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:MainProcess:MainThread] running the remaining "atexit" finalizers
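
For context on the expected value: on POSIX, a process killed by a signal reports the negated signal number as its exit code, hence -15 for SIGTERM. The sketch below shows this with the standard library `subprocess` module rather than loky's process class, so it only illustrates the convention and does not reproduce the failure.

```python
import signal
import subprocess
import sys

# Start a child that would sleep for a long time if left alone.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# On POSIX, terminate() sends SIGTERM to the child.
child.terminate()
returncode = child.wait(timeout=30)

if sys.platform != "win32":
    # A negative return code -N means "killed by signal N".
    assert returncode == -signal.SIGTERM
print(returncode)
```

An exit code of 0, as in the failure above, instead indicates that the child returned normally from its target.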

@ogrisel
Contributor

ogrisel commented Aug 26, 2025

I can also reproduce the test_terminate freeze and the test_wait_sentinel failure with Python 3.13.7 on master instead of this branch. This means that we have other problems with Python versions more recent than 3.13.5 on macOS besides the resource tracker problem (the macOS CI is running 3.13.5 and has no problem, and similarly I cannot reproduce any of those problems with Python 3.13.5 locally).

If I run the resource tracker tests (with pytest -k resource_tracker) on CPython 3.13.7 with this branch, they all pass, while many fail on master. So there is definitely a net improvement with this branch, but it is not enough to fix CPython 3.13.7 support.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@tomMoral tomMoral merged commit 48bf700 into joblib:master Aug 27, 2025
11 of 12 checks passed
@tomMoral tomMoral deleted the FIX_resource_tracker branch August 27, 2025 09:22