Deadlock when emerging from closing_gracefully #6867

@crusaderky

Description

This line at the top of Worker.execute is wrong:

if self.status in {Status.closing, Status.closed, Status.closing_gracefully}:
    return None

The problem is that closing_gracefully is reversible. This normally doesn't happen; however, there are legitimate use cases where Scheduler.retire_workers can give up and revert a worker from closing_gracefully back to running, namely when no peer workers remain that can accept its unique in-memory tasks.

This can cause a rather extreme race condition where

  1. The worker receives a {op: compute-task} command followed, within the same batched-send packet, by {op: worker-status-change, status: closing_gracefully}.
  2. This causes a Worker.execute asyncio task to be spawned which, as soon as it gets its turn on the event loop, returns None.
  3. The task is now stuck in the running state forever. This is not a problem for closing and closed, since we're irreversibly tearing everything down anyway.
  4. However, later on the scheduler decides to resuscitate the worker: {op: worker-status-change, status: running}.
  5. The scheduler and the WorkerState now both think that the task is running, but it's not.

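The sequence above can be sketched with a toy model. All names here (ToyWorker, compute_task, task_states) are hypothetical and only illustrate the state-machine interaction; this is not the actual distributed codebase:

```python
from enum import Enum, auto


class Status(Enum):
    running = auto()
    closing_gracefully = auto()
    closing = auto()
    closed = auto()


class ToyWorker:
    def __init__(self):
        self.status = Status.running
        self.task_states = {}  # key -> "running" | "memory"

    def compute_task(self, key):
        # The compute-task command marks the task as running.
        self.task_states[key] = "running"

    def execute(self, key):
        # Buggy guard: it also bails out on the *reversible*
        # closing_gracefully status.
        if self.status in {Status.closing, Status.closed, Status.closing_gracefully}:
            return  # the task silently never leaves "running"
        self.task_states[key] = "memory"


w = ToyWorker()
w.compute_task("x")                   # 1. compute-task arrives...
w.status = Status.closing_gracefully  # ...followed by the status change
w.execute("x")                        # 2. execute() returns early
w.status = Status.running             # 4. scheduler reverts the worker
print(w.task_states["x"])             # 5. prints "running": stuck forever
```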
The fix for this issue is trivial (just remove closing_gracefully from the check above); a deterministic reproducer will probably be very ugly.

This issue interacts with #3761.

FYI @fjetter

Metadata

Labels

deadlock: The cluster appears to not make any progress