`Client.restart()` immediately removes `WorkerState`s before the workers have fully shut down (#6390) by calling `remove_worker`. However, it doesn't flush the `BatchedSend` or wait for confirmation that the worker has received the message. So if the worker heartbeats in the interval after its `WorkerState` has been removed, but before the `op: close` message has reached it, the scheduler replies that the worker is missing, and the worker shuts itself down instead of restarting. Only after it has started shutting down does it receive the `op: close, nanny=True` message from the scheduler, which it then effectively ignores.
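A minimal, self-contained sketch of the race (the `Scheduler` class, the `outbox` queue, and the message shapes below are illustrative stand-ins for the real scheduler internals, not the actual distributed API):

```python
import asyncio

class Scheduler:
    def __init__(self):
        self.workers = {"worker-1": object()}  # WorkerState stand-in
        self.outbox = asyncio.Queue()          # BatchedSend stand-in

    def restart(self):
        # State is removed immediately...
        del self.workers["worker-1"]
        # ...but the close message is only queued, not yet delivered.
        self.outbox.put_nowait({"op": "close", "nanny": True})

    def handle_heartbeat(self, addr):
        # A heartbeat from a worker with no WorkerState gets "missing".
        return {"status": "OK"} if addr in self.workers else {"status": "missing"}

async def main():
    s = Scheduler()
    s.restart()
    # The worker heartbeats before the queued close message is drained:
    reply = s.handle_heartbeat("worker-1")
    # Only afterwards does the close message reach the worker:
    msg = await s.outbox.get()
    return reply, msg

reply, msg = asyncio.run(main())
print(reply, msg)
```

In this window the worker is told it is `missing` even though a `close` was already intended for it, which is the whole problem.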
There are a few ways to address this:
- short-term: the worker should restart when it gets the `missing` message instead of shutting down. This is reasonable to do anyway. (Restart worker via Nanny on connection failure #6387)
- medium-term: Eliminate partially-removed-worker state on scheduler (comms open, state removed) (#6390)
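The short-term fix could look roughly like the sketch below; the `Worker` class, `handle_heartbeat_reply`, and the `action` field are hypothetical stand-ins for the real worker logic:

```python
class Worker:
    def __init__(self, has_nanny=True):
        self.has_nanny = has_nanny
        self.action = None

    def handle_heartbeat_reply(self, reply):
        if reply.get("status") == "missing":
            if self.has_nanny:
                # Restart via the nanny: the process comes back up
                # and re-registers with the scheduler.
                self.action = "restart"
            else:
                # With no nanny to revive the process, closing is
                # the only remaining option.
                self.action = "close"
        return self.action

w = Worker(has_nanny=True)
print(w.handle_heartbeat_reply({"status": "missing"}))
```

The point is simply that `missing` with a nanny present should map to a restart, not a permanent shutdown.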
@jrbourbeau discovered this, and we initially thought it was an issue with the fact that `Worker.close` awaits a number of things before turning off its heartbeats to the scheduler. We thought the worker was partway through the closing process but still heartbeating. It's true that this can happen, and it probably shouldn't. However, if an extraneous heartbeat does occur while closing, and the scheduler replies with `missing`, then the `Worker.close()` call made in response will just join the first close call that's already running, so it won't actually cause a shutdown if the first call was doing a restart.
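The "joins the first close call" behavior is essentially an idempotent async close. A sketch under stated assumptions (the names and the `restart` flag are illustrative, not the real `Worker.close` signature):

```python
import asyncio

class Worker:
    def __init__(self):
        self._close_task = None
        self.close_count = 0

    async def _do_close(self, restart):
        await asyncio.sleep(0)  # stand-in for real teardown work
        self.close_count += 1
        return "restart" if restart else "shutdown"

    def close(self, restart=False):
        # Only the first caller starts a shutdown; later callers get
        # the already-running task, with the first call's arguments.
        if self._close_task is None:
            self._close_task = asyncio.ensure_future(self._do_close(restart))
        return self._close_task

async def main():
    w = Worker()
    t1 = w.close(restart=True)   # restart already in progress
    t2 = w.close(restart=False)  # stray close in response to "missing"
    results = await asyncio.gather(t1, t2)
    return w.close_count, results

count, results = asyncio.run(main())
print(count, results)
```

Because the second call returns the first call's task, teardown runs once and the restart intent wins, which is why the extraneous heartbeat alone doesn't cause a shutdown.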
Therefore, I think this is purely about the race condition on the scheduler.
cc @fjetter