client.restart() may cause workers to shut down instead of restarting #6494

@gjoseph92

Description

Client.restart() immediately removes WorkerStates before the workers have fully shut down (#6390) by calling remove_worker. However, remove_worker doesn't flush the BatchedSend or wait for confirmation that the worker has received the message. So if the worker heartbeats in the interval after its WorkerState has been removed, but before the op: close has reached it, the scheduler will reply that the worker is missing, and the worker will shut itself down instead of restarting. Only after it starts to shut down will it receive the op: close, nanny=True message from the scheduler, which it will effectively ignore.
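To make the timing concrete, here's a minimal asyncio sketch of the race (not actual distributed code; the Scheduler/Worker names and the queue standing in for the BatchedSend buffer are all illustrative assumptions): state is removed first, the close message sits unflushed, and a heartbeat lands in the gap.

```python
import asyncio

class Scheduler:
    """Toy stand-in for the scheduler side of the race."""

    def __init__(self):
        self.workers = {"w1"}
        self.outbox = asyncio.Queue()  # stands in for the unflushed BatchedSend

    async def restart_worker(self, addr):
        self.workers.discard(addr)      # WorkerState removed immediately...
        await asyncio.sleep(0.01)       # ...but the send hasn't gone out yet
        await self.outbox.put({"op": "close", "nanny": True})

    def handle_heartbeat(self, addr):
        # A heartbeat from a worker the scheduler no longer knows gets "missing"
        return {"status": "missing" if addr not in self.workers else "OK"}

async def main():
    s = Scheduler()
    events = []
    restart = asyncio.create_task(s.restart_worker("w1"))
    await asyncio.sleep(0)              # let the removal happen first
    reply = s.handle_heartbeat("w1")    # heartbeat lands in the gap
    events.append(reply["status"])      # worker now begins a plain shutdown
    await restart
    msg = await s.outbox.get()          # op: close only arrives afterwards
    events.append(msg["op"])
    return events

print(asyncio.run(main()))  # ['missing', 'close']
```

The ordering is deterministic here because removal happens synchronously before the first await, mirroring how remove_worker runs before the close message is flushed.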

There are a few ways to address this:

@jrbourbeau discovered this, and we initially thought it was an issue with the fact that Worker.close awaits a number of things before turning off its heartbeats to the scheduler. We thought the worker was partway through the closing process but still heartbeating. It's true that this can happen, and it probably shouldn't. However, if an extraneous heartbeat does occur while closing and the scheduler replies with missing, the resulting Worker.close() call will just jump on the bandwagon of the first close call that's already running, so it won't actually cause a shutdown if the first call was doing a restart.
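That "jump on the bandwagon" behavior can be sketched as a close method that returns the already-running close task instead of starting a new one (again a hedged illustration, not the real Worker implementation; the names _close_task and _do_close are hypothetical):

```python
import asyncio

class Worker:
    """Toy worker whose close() is idempotent: later calls join the first."""

    def __init__(self):
        self._close_task = None
        self.closed_with_nanny = None

    async def _do_close(self, nanny):
        await asyncio.sleep(0.01)   # stands in for the real teardown work
        self.closed_with_nanny = nanny

    def close(self, nanny=True):
        # Only the first caller decides the close parameters; subsequent
        # callers just await the close that is already in flight.
        if self._close_task is None:
            self._close_task = asyncio.ensure_future(self._do_close(nanny))
        return self._close_task

async def main():
    w = Worker()
    first = w.close(nanny=True)     # the restart path starts closing
    second = w.close(nanny=False)   # reaction to a "missing" heartbeat reply
    await asyncio.gather(first, second)
    return w.closed_with_nanny

print(asyncio.run(main()))  # True: the restart's parameters win
```

Because the second call returns the same task, a late "missing" reply can't turn an in-progress restart into a plain shutdown, which is why the worker-side behavior isn't the culprit here.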

Therefore, I think this is purely about the race condition on the scheduler.

cc @fjetter

Labels

bug (Something is broken), stability (Issue or feature related to cluster stability, e.g. deadlock)
