If the worker fails during startup, the Nanny will try to restart it once. This is not too important, but it's a bug and it makes testing difficult.
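The restart-on-startup-failure behavior can be illustrated with a generic supervisor sketch. Everything here (`start_worker`, `supervise`, the retry count) is an illustrative stand-in, not distributed's actual Nanny implementation:

```python
import asyncio

class StartException(Exception):
    pass

attempts = []

async def start_worker():
    # Stand-in for a worker whose startup always fails.
    attempts.append("start")
    raise StartException("broken")

async def supervise(retries=1):
    # Generic restart-on-failure loop (not distributed's actual Nanny
    # code): a startup failure triggers one retry before the error
    # propagates -- the behavior the issue argues is wrong for startup.
    for attempt in range(retries + 1):
        try:
            await start_worker()
            return
        except StartException:
            if attempt == retries:
                raise  # give up and surface the startup error
            print("Restarting worker")  # the unwanted log line

try:
    asyncio.run(supervise())
except StartException:
    pass  # expected: startup still fails, but only after one retry

print(len(attempts))  # -> 2: the worker start was attempted twice
```

The caller still sees the startup exception either way; the retry only delays it and emits the extra "Restarting worker" log line.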
If we slightly modify `test_worker_start_exception` to verify that "Restarting worker" isn't in the logs...
distributed/tests/test_nanny.py, lines 485 to 490 in 4211e5c:

```python
@gen_cluster(nthreads=[])
async def test_worker_start_exception(s):
    # make sure this raises the right Exception:
    with raises_with_cause(RuntimeError, None, StartException, None):
        async with Nanny(s.address, worker_class=BrokenWorker) as n:
            pass
```
```python
@gen_cluster(nthreads=[])
async def test_worker_start_exception(s):
    with captured_logger(logger="distributed.nanny", level=logging.DEBUG) as logs:
        # make sure this raises the right Exception:
        with raises_with_cause(RuntimeError, None, StartException, None):
            async with Nanny(s.address, worker_class=BrokenWorker) as n:
                pass
    print("HERE")
    print(logs.getvalue())
    assert "Restarting worker" not in logs.getvalue()
    assert False  # fail test so we can see stdout...it passes.
```
Yet if we run it, the message we didn't want to see does show up in the logs (just not in the captured logs)!
(😬 emojis added for visibility to the relevant spots)
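A plausible mechanism: `captured_logger` only attaches its handler for the duration of the `with` block, so anything logged after the block exits reaches the real stream but never the buffer. A minimal, self-contained sketch (the `capture_logs` helper below is a rough stand-in for `captured_logger`, not its actual implementation):

```python
import io
import logging
from contextlib import contextmanager

@contextmanager
def capture_logs(name):
    # Attach a StringIO handler for the duration of the block only,
    # mimicking how a context-scoped log capture works.
    logger = logging.getLogger(name)
    buf = io.StringIO()
    handler = logging.StreamHandler(buf)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    try:
        yield buf
    finally:
        logger.removeHandler(handler)

logger = logging.getLogger("demo")
with capture_logs("demo") as logs:
    logger.warning("inside the block")  # captured
logger.warning("Restarting worker")     # too late: handler already removed

assert "inside the block" in logs.getvalue()
assert "Restarting worker" not in logs.getvalue()  # passes, yet the line was logged
```

So an assertion that a message is absent from the captured logs can pass even though the message was in fact emitted, exactly as in the test run below.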
```
(dask-distributed) gabe dev/distributed ‹fix-6363*› » pytest distributed/tests/test_nanny.py::test_worker_start_exception -s 1 ↵
============================================================================= test session starts =============================================================================
platform darwin -- Python 3.9.5, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /Users/gabe/miniconda3/envs/dask-distributed/bin/python
cachedir: .pytest_cache
rootdir: /Users/gabe/dev/distributed, configfile: setup.cfg
plugins: repeat-0.8.0, xdist-2.3.0, asyncio-0.12.0, timeout-1.4.2, forked-1.3.0, cov-3.0.0, rerunfailures-10.0
timeout: 300.0s
timeout method: thread
timeout func_only: False
collected 1 item

distributed/tests/test_nanny.py::test_worker_start_exception 2022-05-23 16:11:06,515 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-05-23 16:11:06,519 - distributed.scheduler - INFO - State start
2022-05-23 16:11:06,522 - distributed.scheduler - INFO - Clear task state
2022-05-23 16:11:06,523 - distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:61231
2022-05-23 16:11:06,523 - distributed.scheduler - INFO - dashboard at: 127.0.0.1:61230
2022-05-23 16:11:06,531 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:61232'
2022-05-23 16:11:07,260 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 872, in run
    await worker
  File "/Users/gabe/dev/distributed/distributed/tests/test_nanny.py", line 482, in start
    raise StartException("broken")
test_nanny.StartException: broken
2022-05-23 16:11:07,283 - distributed.nanny - ERROR - Failed while trying to start worker process: broken
2022-05-23 16:11:07,284 - distributed.nanny - ERROR - Failed to connect to process
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 681, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 801, in _wait_until_connected
    raise msg["exception"]
test_nanny.StartException: broken
2022-05-23 16:11:07,285 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 425, in instantiate
    result = await self.process.start()
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 681, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 801, in _wait_until_connected
    raise msg["exception"]
test_nanny.StartException: broken
2022-05-23 16:11:07,285 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:61232'.
HERE # <--- 😬😬😬 OUR LOGGING CAPTURE ENDS. `async with` block has exited
Start Nanny at: 'tcp://127.0.0.1:61232'
Failed while trying to start worker process: broken
Failed to connect to process
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 681, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 801, in _wait_until_connected
    raise msg["exception"]
test_nanny.StartException: broken
Failed to start process
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 425, in instantiate
    result = await self.process.start()
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 681, in start
    msg = await self._wait_until_connected(uid)
  File "/Users/gabe/dev/distributed/distributed/nanny.py", line 801, in _wait_until_connected
    raise msg["exception"]
test_nanny.StartException: broken
Closing Nanny at 'tcp://127.0.0.1:61232'.
Dumped cluster state to test_cluster_dump/test_worker_start_exception.yaml
2022-05-23 16:11:07,599 - distributed.nanny - INFO - Worker process 37919 was killed by signal 15
2022-05-23 16:11:07,601 - distributed.nanny - WARNING - Restarting worker # <--- 😬😬😬 WORKER IS RESTARTING
2022-05-23 16:11:07,602 - distributed.scheduler - INFO - Scheduler closing...
2022-05-23 16:11:07,612 - distributed.scheduler - INFO - Scheduler closing all comms
FAILED
================================================================================== FAILURES ===================================================================================
_________________________________________________________________________ test_worker_start_exception _________________________________________________________________________
s = <Scheduler 'tcp://127.0.0.1:61231', workers: 0, cores: 0, tasks: 0>

    @gen_cluster(nthreads=[])
    async def test_worker_start_exception(s):
        with captured_logger(logger="distributed.nanny", level=logging.DEBUG) as logs:
            # make sure this raises the right Exception:
            with raises_with_cause(RuntimeError, None, StartException, None):
                async with Nanny(s.address, worker_class=BrokenWorker) as n:
                    pass
        print("HERE")
        print(logs.getvalue())
        assert "Restarting worker" not in logs.getvalue()
>       assert False  # fail test so we can see stdout
E       assert False

distributed/tests/test_nanny.py:511: AssertionError
============================================================================ slowest 20 durations =============================================================================
1.11s call     distributed/tests/test_nanny.py::test_worker_start_exception
(2 durations < 0.005s hidden. Use -vv to show these durations.)
=========================================================================== short test summary info ===========================================================================
FAILED distributed/tests/test_nanny.py::test_worker_start_exception - assert False
============================================================================== 1 failed in 1.37s ==============================================================================
```
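The timing in the output above (the restart warning appears after "HERE") suggests a background task logging after the capture has ended. The same gap in async form, with a task standing in for the Nanny's process watcher (all names here are illustrative, not distributed's internals):

```python
import asyncio
import io
import logging

async def main():
    logger = logging.getLogger("nanny-demo")
    logger.setLevel(logging.DEBUG)
    buf = io.StringIO()
    handler = logging.StreamHandler(buf)
    logger.addHandler(handler)

    async def watcher():
        # Stand-in for the Nanny's process watcher: it notices the dead
        # worker process (and logs the restart) only after a delay.
        await asyncio.sleep(0.05)
        logger.warning("Restarting worker")

    task = asyncio.create_task(watcher())
    logger.removeHandler(handler)  # capture ends, like leaving captured_logger
    await task                     # the warning fires after capture ended
    return buf.getvalue()

captured = asyncio.run(main())
assert "Restarting worker" not in captured  # passes even though it was logged
```

This is why the `not in logs.getvalue()` assertion in the test passes while the restart still happens: the assertion races the watcher and wins.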