`Client.restart()` immediately removes `WorkerState`s before the workers have fully shut down (#6390) by calling `remove_worker`. However, it doesn't flush the `BatchedSend` or wait for confirmation that the worker has received the message. So if the worker heartbeats in the interval after its `WorkerState` has been removed, but before the `op: close` message has reached it, the scheduler replies that the worker is missing, and the worker shuts itself down instead of restarting. Only after it has started shutting down does it receive the `op: close, nanny=True` message from the scheduler, which it then effectively ignores.
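A minimal, self-contained sketch of the race (the `Scheduler` class, the `outbox` queue, and the message shapes below are illustrative stand-ins for the real scheduler internals, not the actual distributed API):

```python
import asyncio

class Scheduler:
    def __init__(self):
        self.workers = {"worker-1": object()}  # WorkerState stand-in
        self.outbox = asyncio.Queue()          # BatchedSend stand-in

    def restart(self):
        # State is removed immediately...
        del self.workers["worker-1"]
        # ...but the close message is only queued, not yet delivered.
        self.outbox.put_nowait({"op": "close", "nanny": True})

    def handle_heartbeat(self, addr):
        # A heartbeat from a worker with no WorkerState gets "missing".
        return {"status": "OK"} if addr in self.workers else {"status": "missing"}

async def main():
    s = Scheduler()
    s.restart()
    # The worker heartbeats before the queued close message is drained:
    reply = s.handle_heartbeat("worker-1")
    # Only afterwards does the close message reach the worker:
    msg = await s.outbox.get()
    return reply, msg

reply, msg = asyncio.run(main())
print(reply, msg)
```

In this window the worker is told it is `missing` even though a `close` was already intended for it, which is the whole problem.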
There are a few ways to address this:
- short-term: the worker should restart when it gets the `missing` message instead of shutting down. This is reasonable to do anyway. (Restart worker via Nanny on connection failure #6387)
- medium-term: Eliminate partially-removed-worker state on scheduler (comms open, state removed) (#6390)
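The short-term fix could look roughly like the sketch below; the `Worker` class, `handle_heartbeat_reply`, and the `action` field are hypothetical stand-ins for the real worker logic:

```python
class Worker:
    def __init__(self, has_nanny=True):
        self.has_nanny = has_nanny
        self.action = None

    def handle_heartbeat_reply(self, reply):
        if reply.get("status") == "missing":
            if self.has_nanny:
                # Restart via the nanny: the process comes back up
                # and re-registers with the scheduler.
                self.action = "restart"
            else:
                # With no nanny to revive the process, closing is
                # the only remaining option.
                self.action = "close"
        return self.action

w = Worker(has_nanny=True)
print(w.handle_heartbeat_reply({"status": "missing"}))
```

The point is simply that `missing` with a nanny present should map to a restart, not a permanent shutdown.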
@jrbourbeau discovered this, and we initially thought it was an issue with the fact that `Worker.close` awaits a number of things before turning off its heartbeats to the scheduler. We thought the worker was partway through the closing process but still heartbeating. It's true that this can happen, and it probably shouldn't. However, if an extraneous heartbeat does occur while closing, and the scheduler replies with `missing`, then the `Worker.close()` call made in response will just join the first close call that's already running, so it won't actually cause a shutdown if the first call was doing a restart.
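The "joins the first close call" behavior is essentially an idempotent async close. A sketch under stated assumptions (the names and the `restart` flag are illustrative, not the real `Worker.close` signature):

```python
import asyncio

class Worker:
    def __init__(self):
        self._close_task = None
        self.close_count = 0

    async def _do_close(self, restart):
        await asyncio.sleep(0)  # stand-in for real teardown work
        self.close_count += 1
        return "restart" if restart else "shutdown"

    def close(self, restart=False):
        # Only the first caller starts a shutdown; later callers get
        # the already-running task, with the first call's arguments.
        if self._close_task is None:
            self._close_task = asyncio.ensure_future(self._do_close(restart))
        return self._close_task

async def main():
    w = Worker()
    t1 = w.close(restart=True)   # restart already in progress
    t2 = w.close(restart=False)  # stray close in response to "missing"
    results = await asyncio.gather(t1, t2)
    return w.close_count, results

count, results = asyncio.run(main())
print(count, results)
```

Because the second call returns the first call's task, teardown runs once and the restart intent wins, which is why the extraneous heartbeat alone doesn't cause a shutdown.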
Therefore, I think this is purely about the race condition on the scheduler.
cc @fjetter