Nanny restarts worker once if it fails to start / doesn't clean up processes on startup failure #6426

@gjoseph92

Description

If the worker fails during startup, the Nanny will try to restart it once. This is not too important, but it's a bug and it makes testing difficult.
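The log below suggests a race: the worker process is killed during cleanup (signal 15), and the process-exit watcher fires and schedules a restart before the Nanny has marked itself as closing. This is a minimal sketch of that suspected race, under stated assumptions; the `Supervisor`, `on_exit`, and `status` names are illustrative, not the real `distributed.nanny` API.

```python
# Illustrative sketch of the suspected race, NOT the actual Nanny code:
# an exit watcher restarts the worker unless the supervisor is closing,
# but during a failed startup the process dies before close() runs.
class Supervisor:
    def __init__(self):
        self.status = "running"
        self.restarts = 0

    def on_exit(self):
        # Fired when the worker process dies.
        if self.status != "closing":   # race: close() may not have run yet
            self.restarts += 1         # -> spurious "Restarting worker"

    def close(self):
        self.status = "closing"


sup = Supervisor()
sup.on_exit()   # process killed during failed startup, before close()
sup.close()
assert sup.restarts == 1   # one unwanted restart, matching the bug
```

Under this reading, the fix would be to record "startup failed / closing" before tearing down the process, so the watcher can tell an intentional kill from a crash.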

If we slightly modify test_worker_start_exception to verify that "Restarting worker" isn't in the logs...

@gen_cluster(nthreads=[])
async def test_worker_start_exception(s):
    # make sure this raises the right Exception:
    with raises_with_cause(RuntimeError, None, StartException, None):
        async with Nanny(s.address, worker_class=BrokenWorker) as n:
            pass

@gen_cluster(nthreads=[])
async def test_worker_start_exception(s):
    with captured_logger(logger="distributed.nanny", level=logging.DEBUG) as logs:
        # make sure this raises the right Exception:
        with raises_with_cause(RuntimeError, None, StartException, None):
            async with Nanny(s.address, worker_class=BrokenWorker) as n:
                pass

    print("HERE")
    print(logs.getvalue())
    assert "Restarting worker" not in logs.getvalue()
    assert False # fail test so we can see stdout

...it passes.

Yet if we run it, the message we didn't want to see actually does appear in the output (just not in the captured logs)!
(😬 emojis added for visibility at the relevant spots)
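The capture gap can be reproduced with plain stdlib logging: a handler attached inside a context manager only sees records emitted while the context is active, so anything logged after the `with` block exits (like the Nanny's restart message here) never reaches the captured buffer. The `captured_logger` below is a hypothetical stand-in for `distributed.utils_test.captured_logger`, not its actual implementation.

```python
import io
import logging
from contextlib import contextmanager


# Hypothetical stand-in for distributed.utils_test.captured_logger:
# attach a StringIO handler to a logger for the duration of the context.
@contextmanager
def captured_logger(name, level=logging.DEBUG):
    logger = logging.getLogger(name)
    sio = io.StringIO()
    handler = logging.StreamHandler(sio)
    old_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        yield sio
    finally:
        logger.removeHandler(handler)
        logger.setLevel(old_level)


logger = logging.getLogger("demo.nanny")

with captured_logger("demo.nanny") as logs:
    logger.warning("Failed to start worker")   # emitted inside: captured

logger.warning("Restarting worker")            # emitted after exit: NOT captured

captured = logs.getvalue()
assert "Failed to start worker" in captured
assert "Restarting worker" not in captured     # passes despite the message being logged
```

This is why the assertion in the test above passes even though "Restarting worker" is plainly visible on stdout: the restart happens during Nanny teardown, after the capturing context has already been torn down.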

(dask-distributed) gabe dev/distributed ‹fix-6363*› » pytest distributed/tests/test_nanny.py::test_worker_start_exception -s                                               1 ↵
============================================================================= test session starts =============================================================================
platform darwin -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/gabe/miniconda3/envs/dask-distributed/bin/python
cachedir: .pytest_cache
rootdir: /Users/gabe/dev/distributed, configfile: setup.cfg
plugins: repeat-0.8.0, xdist-2.3.0, asyncio-0.12.0, timeout-1.4.2, forked-1.3.0, cov-3.0.0, rerunfailures-10.0
timeout: 300.0s
timeout method: thread
timeout func_only: False
collected 1 item                                                                                                                                                              

distributed/tests/test_nanny.py::test_worker_start_exception 2022-05-23 16:11:06,515 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-05-23 16:11:06,519 - distributed.scheduler - INFO - State start
2022-05-23 16:11:06,522 - distributed.scheduler - INFO - Clear task state
2022-05-23 16:11:06,523 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:61231
2022-05-23 16:11:06,523 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:61230
2022-05-23 16:11:06,531 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:61232'
2022-05-23 16:11:07,260 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 872, in run
    await worker
  File "/Users/gabe/dev/distributed/distributed/tests/test_nanny.py", line 482, in start
    raise StartException("broken")
test_nanny.StartException: broken
2022-05-23 16:11:07,283 - distributed.nanny - ERROR - Failed while trying to start worker process: broken
2022-05-23 16:11:07,284 - distributed.nanny - ERROR - Failed to connect to process
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 681, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 801, in _wait_until_connected
    raise msg["exception"]
test_nanny.StartException: broken
2022-05-23 16:11:07,285 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 425, in instantiate
    result = await self.process.start()
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 681, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 801, in _wait_until_connected
    raise msg["exception"]
test_nanny.StartException: broken
2022-05-23 16:11:07,285 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:61232'.
HERE  # <--- 😬😬😬 OUR LOGGING CAPTURE ENDS. `async with` block has exited
        Start Nanny at: 'tcp://127.0.0.1:61232'
Failed while trying to start worker process: broken
Failed to connect to process
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 681, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 801, in _wait_until_connected
    raise msg["exception"]
test_nanny.StartException: broken
Failed to start process
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 425, in instantiate
    result = await self.process.start()
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 681, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 801, in _wait_until_connected
    raise msg["exception"]
test_nanny.StartException: broken
Closing Nanny at 'tcp://127.0.0.1:61232'.

Dumped cluster state to test_cluster_dump/test_worker_start_exception.yaml
2022-05-23 16:11:07,599 - distributed.nanny - INFO - Worker process 37919 was killed by signal 15
2022-05-23 16:11:07,601 - distributed.nanny - WARNING - Restarting worker  # <--- 😬😬😬 WORKER IS RESTARTING
2022-05-23 16:11:07,602 - distributed.scheduler - INFO - Scheduler closing...
2022-05-23 16:11:07,612 - distributed.scheduler - INFO - Scheduler closing all comms
FAILED

================================================================================== FAILURES ===================================================================================
_________________________________________________________________________ test_worker_start_exception _________________________________________________________________________

s = <Scheduler 'tcp://127.0.0.1:61231', workers: 0, cores: 0, tasks: 0>

    @gen_cluster(nthreads=[])
    async def test_worker_start_exception(s):
        with captured_logger(logger="distributed.nanny", level=logging.DEBUG) as logs:
            # make sure this raises the right Exception:
            with raises_with_cause(RuntimeError, None, StartException, None):
                async with Nanny(s.address, worker_class=BrokenWorker) as n:
                    pass
    
        print("HERE")
        print(logs.getvalue())
        assert "Restarting worker" not in logs.getvalue()
>       assert False # fail test so we can see stdout
E       assert False

distributed/tests/test_nanny.py:511: AssertionError
============================================================================ slowest 20 durations =============================================================================
1.11s call     distributed/tests/test_nanny.py::test_worker_start_exception

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================================================================== short test summary info ===========================================================================
FAILED distributed/tests/test_nanny.py::test_worker_start_exception - assert False
============================================================================== 1 failed in 1.37s ==============================================================================
