
Restart can kill a worker #7312

@fjetter

Description


When restarting a cluster of Nannies, we expect every worker to be restarted and to reconnect to the scheduler.

There appears to be a race condition where the worker instead closes for good and tears down the Nanny with it.

This has been frequently observed in the coiled-runtime benchmarks, see coiled/benchmarks#468
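For context, the benchmarks hit this simply by restarting the cluster between tests. Here is a minimal sketch of the triggering call; `LocalCluster` is just a stand-in for the Coiled cluster in the logs:

```python
from distributed import Client, LocalCluster

# Minimal sketch of the triggering operation, assuming a cluster of Nannies;
# processes=True gives each worker a Nanny, like the Coiled cluster below.
with LocalCluster(n_workers=2, processes=True) as cluster, Client(cluster) as client:
    # The scheduler asks every Nanny to kill its worker process and spawn a
    # fresh one that reconnects; the race below sometimes kills it for good.
    client.restart()
```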

Some sample logs (I added line numbers so the walkthrough below can reference individual log lines):

1. 2022-11-09 22:16:04,681 - distributed.nanny - INFO -         Start Nanny at: 'tls://10.0.6.215:45015'
2. 2022-11-09 22:16:05,304 - distributed.worker - INFO -       Start worker at:     tls://10.0.6.215:41071
3. 2022-11-09 22:16:05,305 - distributed.worker - INFO -          Listening to:     tls://10.0.6.215:41071
4. 2022-11-09 22:16:05,305 - distributed.worker - INFO -           Worker name: test_dataframe-86e8569e-worker-012275ba16
5. 2022-11-09 22:16:05,306 - distributed.worker - INFO -          dashboard at:            10.0.6.215:8787
6. 2022-11-09 22:16:05,306 - distributed.worker - INFO - Waiting to connect to:       tls://10.0.5.46:8786
7. 2022-11-09 22:16:05,306 - distributed.worker - INFO - -------------------------------------------------
8. 2022-11-09 22:16:05,306 - distributed.worker - INFO -               Threads:                          2
9. 2022-11-09 22:16:05,306 - distributed.worker - INFO -                Memory:                   7.16 GiB
10. 2022-11-09 22:16:05,306 - distributed.worker - INFO -       Local Directory: /scratch/dask-worker-space/worker-p5w95n2u
11. 2022-11-09 22:16:05,306 - distributed.worker - INFO - -------------------------------------------------
12. 2022-11-09 22:16:05,607 - distributed.worker - INFO - Starting Worker plugin coiled-aws-spot
13. 2022-11-09 22:16:05,607 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
14. 2022-11-09 22:16:05,607 - distributed.worker - INFO - coiled updating aws sts token environment variables
15. 2022-11-09 22:16:05,933 - distributed.worker - INFO - coiled cleared s3fs/boto session
16. 2022-11-09 22:16:05,933 - distributed.worker - INFO -         Registered to:       tls://10.0.5.46:8786
17. 2022-11-09 22:16:05,933 - distributed.worker - INFO - -------------------------------------------------
18. 2022-11-09 22:16:05,934 - distributed.core - INFO - Starting established connection to tls://10.0.5.46:8786
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  First Worker up and in Worker.handle_scheduler ^^^^^^^^^^^^^^^^^^^^^^^^^^
19. 2022-11-09 22:16:09,480 - distributed.dask_worker - INFO - Not all prometheus metrics available are exported. Digest-based metrics require crick to be installed
20. 2022-11-09 22:16:41,493 - distributed.utils_perf - INFO - full garbage collection released 55.36 MiB from 403 reference cycles (threshold: 9.54 MiB)
21. 2022-11-09 22:17:07,596 - distributed.nanny - INFO - Nanny asking worker to close. Reason: scheduler-restart
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  First Restart ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
22. 2022-11-09 22:17:07,597 - distributed.worker - INFO - Stopping worker at tls://10.0.6.215:41071. Reason: scheduler-restart
23. 2022-11-09 22:17:07,598 - distributed.core - INFO - Connection to tls://10.0.5.46:8786 has been closed.
24. 2022-11-09 22:17:07,601 - distributed.nanny - INFO - Worker closed
# ^^^^^^^^^^^^^^^^ First Worker Process closed triggering worker_on_exit callback to restart ^^^^^^^^^^^^^^
25. 2022-11-09 22:17:07,857 - distributed.nanny - WARNING - Restarting worker
26. 2022-11-09 22:17:08,549 - distributed.worker - INFO -       Start worker at:     tls://10.0.6.215:41295
27. 2022-11-09 22:17:08,549 - distributed.worker - INFO -          Listening to:     tls://10.0.6.215:41295
28. 2022-11-09 22:17:08,549 - distributed.worker - INFO -           Worker name: test_dataframe-86e8569e-worker-012275ba16
29. 2022-11-09 22:17:08,549 - distributed.worker - INFO -          dashboard at:            10.0.6.215:8787
30. 2022-11-09 22:17:08,549 - distributed.worker - INFO - Waiting to connect to:       tls://10.0.5.46:8786
31. 2022-11-09 22:17:08,550 - distributed.worker - INFO - -------------------------------------------------
32. 2022-11-09 22:17:08,550 - distributed.worker - INFO -               Threads:                          2
33. 2022-11-09 22:17:08,550 - distributed.worker - INFO -                Memory:                   7.16 GiB
34. 2022-11-09 22:17:08,550 - distributed.worker - INFO -       Local Directory: /scratch/dask-worker-space/worker-k7maxf3u
35. 2022-11-09 22:17:08,551 - distributed.worker - INFO - -------------------------------------------------
36. 2022-11-09 22:17:09,076 - distributed.worker - INFO - Starting Worker plugin coiled-aws-spot
37. 2022-11-09 22:17:09,076 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
38. 2022-11-09 22:17:09,076 - distributed.worker - INFO - coiled updating aws sts token environment variables
39. 2022-11-09 22:17:09,545 - distributed.worker - INFO - coiled cleared s3fs/boto session
40. 2022-11-09 22:17:09,546 - distributed.worker - INFO -         Registered to:       tls://10.0.5.46:8786
# ^^^^^^^^^^^^^ Second worker established connection to Scheduler but not in handle_scheduler yet ^^^^^^^^^^^
41. 2022-11-09 22:17:09,550 - distributed.worker - INFO - -------------------------------------------------
43. 2022-11-09 22:17:09,551 - distributed.nanny - INFO - Nanny asking worker to close. Reason: scheduler-restart
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  Second Restart ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
44. 2022-11-09 22:17:09,597 - distributed.dask_worker - INFO - Not all prometheus metrics available are exported. Digest-based metrics require crick to be installed
45. 2022-11-09 22:17:09,604 - distributed.core - INFO - Starting established connection to tls://10.0.5.46:8786
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Second worker in Worker.handle_scheduler ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
46. 2022-11-09 22:17:09,604 - distributed.worker - INFO - Stopping worker at tls://10.0.6.215:41295. Reason: scheduler-restart
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Second worker in Worker.close ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
47. 2022-11-09 22:17:09,606 - distributed.core - INFO - Connection to tls://10.0.5.46:8786 has been closed.
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Second worker in Worker.handle_scheduler encounters CommClosed ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
48. 2022-11-09 22:17:09,607 - distributed.worker - INFO - Stopping worker at tls://10.0.6.215:41295. Reason: worker-handle-scheduler-connection-broken
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ CommClosed triggers another close with nanny=True which is not waiting for the first one ^^^^^^^^^^^^^^^^^^^^^^^
49. 2022-11-09 22:17:09,614 - distributed.nanny - INFO - Closing Nanny gracefully at 'tls://10.0.6.215:45015'. Reason: worker-handle-scheduler-connection-broken
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Nanny is closed as well, such that nothing is restarted ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
50. 2022-11-09 22:17:09,615 - distributed.nanny - INFO - Worker closed
51. 2022-11-09 22:17:09,679 - distributed.worker - ERROR -
52. Traceback (most recent call last):
53.   File "/opt/coiled/env/lib/python3.9/site-packages/distributed/utils.py", line 742, in wrapper
54.     return await func(*args, **kwargs)
55.   File "/opt/coiled/env/lib/python3.9/site-packages/distributed/worker.py", line 1618, in close
56.     await to_thread(_close, executor=executor, wait=executor_wait)
57.   File "/opt/coiled/env/lib/python3.9/asyncio/threads.py", line 25, in to_thread
58.     return await loop.run_in_executor(None, func_call)
59. asyncio.exceptions.CancelledError
# ^^^^^^^^^^^^^^^^^^ One of the two close requests is trying to shut down the threadpool but this hangs for some reason. Related? Who is cancelling? ^^^^^^^^^^^^^^
60. 2022-11-09 22:17:09,814 - distributed.nanny - INFO - Closing Nanny at 'tls://10.0.6.215:45015'. Reason: nanny-close-gracefully
61. 2022-11-09 22:17:09,814 - distributed.preloading - INFO - Run preload teardown: https://cloud.coiled.io/api/v2/cluster_facing/preload/worker

What this translates to is:

  • L1: Nanny is started.
  • L2-L19: First worker is started + connected.
  • L21-L25: First worker is restarted.
  • L26-L41 + L45: Second worker is starting (note the interleaving with the subsequent restart).
  • L43: Before the startup is complete, the Nanny already issues another restart.
  • L46: Worker enters Worker.close(nanny=False) for the first time.
  • The worker likely sends a stream-close to the scheduler (TODO: verify if possible) such that the batched stream is aborted.
  • L47: Worker.handle_stream encounters an exception due to the closed stream.
  • L48ff: Worker initiates another Worker.close(nanny=True) due to the severed connection. This second close should wait for the first one to finish and then do nothing. Instead, it runs the full close procedure again, this time with nanny=True, which shuts down the Nanny, and with it the server, for good (see the sketch below).
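To make the last step concrete, here is a rough sketch of the control flow as I read it out of the logs. The names mirror distributed's Worker, but the bodies are heavily simplified stand-ins, not the real implementation:

```python
import asyncio

# Heavily simplified, partly hypothetical sketch of log lines 46-49;
# names mirror distributed's Worker, but none of this is the real code.

class SketchWorker:
    async def handle_stream(self):
        # Stands in for Worker.handle_stream: raises once the batched stream
        # to the scheduler is torn down by the close() already in flight (L47).
        raise ConnectionError("Connection to scheduler has been closed.")

    async def close(self, nanny, reason):
        # Stands in for Worker.close; with nanny=True (L48) it also asks the
        # Nanny to close gracefully (L49), so nothing ever restarts the worker.
        print(f"Stopping worker (nanny={nanny}). Reason: {reason}")

    async def handle_scheduler(self):
        try:
            await self.handle_stream()
        except Exception:
            # The severed connection triggers a second close with nanny=True.
            await self.close(
                nanny=True, reason="worker-handle-scheduler-connection-broken"
            )

asyncio.run(SketchWorker().handle_scheduler())
```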

What I do not understand is why the second close is not waiting for Server.finished in:

```python
if self.status in (Status.closed, Status.closing, Status.failed):
    logging.debug(
        "Attempted to close worker that is already %s. Reason: %s",
        self.status,
        reason,
    )
    await self.finished()
    return None
```

This can only happen if the status set by the first close (L46) is altered afterwards. As far as I can tell, all status mutations trigger a log message (either on the scheduler or the worker side), but I couldn't find any such message.
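One way the guard could be bypassed, purely as a toy model and not distributed's actual close logic: if close() awaits anything before it mutates self.status, a concurrently scheduled second close(nanny=True) can pass the status check before the first close records that it is closing:

```python
import asyncio

# Toy model of one possible mechanism; NOT distributed's actual code.
class ToyWorker:
    def __init__(self):
        self.status = "running"
        self.nanny_closed = False

    async def close(self, nanny=False):
        if self.status in ("closing", "closed"):
            return  # the guard: a second caller should stop here
        await asyncio.sleep(0)  # any await before the status flip opens the race
        self.status = "closing"
        if nanny:
            self.nanny_closed = True  # stands in for tearing down the Nanny
        self.status = "closed"

async def main():
    w = ToyWorker()
    await asyncio.gather(w.close(nanny=False), w.close(nanny=True))
    print(w.nanny_closed)  # True: the Nanny goes down despite the guard

asyncio.run(main())
```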

The entire thing is over in slightly under 2s.
