Closed
Labels: deadlock (The cluster appears to not make any progress)
Description
This line at the top of `Worker.execute` is wrong:

distributed/distributed/worker.py, lines 2144 to 2145 in e1b9e20:

```python
if self.status in {Status.closing, Status.closed, Status.closing_gracefully}:
    return None
```
The problem is that `closing_gracefully` is reversible. This normally doesn't happen; however, there are legitimate use cases where `Scheduler.retire_workers` can give up and revert a worker from `closing_gracefully` back to `running` - namely, when there are no longer any peer workers that can accept its unique in-memory tasks.
This can cause a rather extreme race condition where:

1. The worker receives a `{op: compute-task}` command followed, within the same batched-send packet, by `{op: worker-status-change, status: closing_gracefully}`.
2. This causes a `Worker.execute` asyncio task to be spawned which, as soon as it reaches its turn in the event loop, returns None.
3. The task is now stuck in `running` state forever. This is not a problem for `closing` and `closed`, as we're irreversibly tearing down everything anyways.
4. However, later on the scheduler decides to resuscitate the worker: `{op: worker-status-change, status: running}`.
5. The scheduler and the WorkerState now both think that the task is running, but it's not.
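The race above can be reproduced in miniature with plain asyncio. The `Status` and `Worker` classes below are simplified, hypothetical stand-ins for distributed's real ones (they only mirror the names), and exist solely to show how a status flip between task spawn and task execution drops the task on the floor:

```python
import asyncio
from enum import Enum

# Simplified stand-ins for distributed.Status and distributed.Worker;
# illustrative only, not the real classes.
class Status(Enum):
    running = "running"
    closing_gracefully = "closing_gracefully"
    closing = "closing"
    closed = "closed"

class Worker:
    def __init__(self):
        self.status = Status.running
        self.done = []  # keys of tasks that actually executed

    async def execute(self, key):
        # The problematic guard: it also bails out on closing_gracefully,
        # even though that status is reversible.
        if self.status in {Status.closing, Status.closed, Status.closing_gracefully}:
            return None
        self.done.append(key)

async def main():
    w = Worker()
    # Step 1: compute-task spawns execute(); the status change from the
    # same batched-send packet is processed before execute() gets a turn.
    task = asyncio.ensure_future(w.execute("x"))
    w.status = Status.closing_gracefully
    await task  # execute() hits the guard and returns None; "x" never runs
    # Step 4: the scheduler reverts the worker to running...
    w.status = Status.running
    # Step 5: ...but "x" is stuck forever; nothing will re-run it.
    assert "x" not in w.done

asyncio.run(main())
```

The key point is that the coroutine is scheduled while the worker is still `running`, but by the time the event loop runs it, the status has already changed, so the early return silently discards the task.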
The fix for this issue is trivial (just remove `closing_gracefully` from the line above); a deterministic reproducer is probably going to be very ugly.
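As a sketch (not a verified patch), the corrected guard would keep only the irreversible states. `should_skip_execute` is a hypothetical helper used here just to make the predicate testable in isolation:

```python
from enum import Enum, auto

# Simplified stand-in for distributed.Status; illustrative only.
class Status(Enum):
    running = auto()
    closing_gracefully = auto()
    closing = auto()
    closed = auto()

def should_skip_execute(status):
    # closing_gracefully is removed from the set: it is reversible,
    # so execute() must still run tasks while the worker is in it.
    return status in {Status.closing, Status.closed}

assert not should_skip_execute(Status.closing_gracefully)
assert should_skip_execute(Status.closed)
```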
This issue interacts with #3761.
FYI @fjetter