Deadlock when emerging from closing_gracefully #6867

@crusaderky

Description

This line at the top of Worker.execute is wrong:

if self.status in {Status.closing, Status.closed, Status.closing_gracefully}:
    return None

The problem is that closing_gracefully is reversible. This normally doesn't happen; however, there are legitimate use cases where Scheduler.retire_workers can give up and revert a worker from closing_gracefully back to running, namely when no peer workers remain that can accept its unique in-memory tasks.

This can cause a rather extreme race condition where

  1. The worker receives a {op: compute-task} command followed, within the same batched-send packet, by {op: worker-status-change, status: closing_gracefully}.
  2. This causes a Worker.execute asyncio task to be spawned which, as soon as it gets its turn on the event loop, returns None.
  3. The task is now stuck in the running state forever. This is not a problem for closing and closed, since we're irreversibly tearing everything down anyway.
  4. However, later on the scheduler decides to resuscitate the worker: {op: worker-status-change, status: running}.
  5. The scheduler and the WorkerState now both think that the task is running, but it's not.

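The sequence above can be sketched with a toy model. All names here (ToyWorker, compute_task, task_states) are hypothetical and only illustrate the state-machine interaction; this is not the actual distributed codebase:

```python
from enum import Enum, auto


class Status(Enum):
    running = auto()
    closing_gracefully = auto()
    closing = auto()
    closed = auto()


class ToyWorker:
    def __init__(self):
        self.status = Status.running
        self.task_states = {}  # key -> "running" | "memory"

    def compute_task(self, key):
        # The compute-task command marks the task as running.
        self.task_states[key] = "running"

    def execute(self, key):
        # Buggy guard: it also bails out on the *reversible*
        # closing_gracefully status.
        if self.status in {Status.closing, Status.closed, Status.closing_gracefully}:
            return  # the task silently never leaves "running"
        self.task_states[key] = "memory"


w = ToyWorker()
w.compute_task("x")                   # 1. compute-task arrives...
w.status = Status.closing_gracefully  # ...followed by the status change
w.execute("x")                        # 2. execute() returns early
w.status = Status.running             # 4. scheduler reverts the worker
print(w.task_states["x"])             # 5. prints "running": stuck forever
```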
The fix for this issue is trivial (just remove closing_gracefully from the check above); a deterministic reproducer will probably be very ugly.

This issue interacts with #3761.

FYI @fjetter

Metadata

Labels

deadlock: The cluster appears to not make any progress