
Deadlock: tasks stolen to old WorkerState instance of a reconnected worker #6356

@gjoseph92

Description

Task stealing keeps references to WorkerState objects:

self.in_flight[ts] = {
    "victim": victim,  # guaranteed to be processing_on
    "thief": thief,
    "victim_duration": victim_duration,
    "thief_duration": thief_duration,
    "stimulus_id": stimulus_id,
}

If a worker disconnects, then reconnects from the same address (because it restarts, or because of the shutdown-reconnect bug #6354), task stealing can hold a reference to the old WorkerState object for that address, while the scheduler is working with the new WorkerState object for that address.

If tasks are assigned to this old, stale WorkerState, and then the worker leaves, the tasks will be forever stuck in processing (because they're not recognized as being on the worker that just left).
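The hazard above can be sketched in a few lines. This uses a hypothetical stand-in class, not the real distributed internals; only the shape of the problem is taken from the report:

```python
class WorkerState:
    """Minimal stand-in for the scheduler's per-worker state object."""
    def __init__(self, address):
        self.address = address

workers = {}  # stand-in for Scheduler.workers: address -> WorkerState

# Worker connects; stealing stashes the object itself, not the address.
old = WorkerState("tcp://10.0.0.1:1234")
workers[old.address] = old
in_flight_thief = old  # reference held by WorkStealing.in_flight

# Worker disconnects, then reconnects: a *new* WorkerState, same address.
del workers[old.address]
new = WorkerState("tcp://10.0.0.1:1234")
workers[new.address] = new

# The address still resolves, but the stored object is stale: the scheduler
# tracks `new`, while stealing will assign the task to `old`.
address_still_known = in_flight_thief.address in workers
object_is_current = workers[in_flight_thief.address] is in_flight_thief
```

Here `address_still_known` is true while `object_is_current` is false, which is exactly the mismatch that lets the task be assigned to a WorkerState the scheduler no longer tracks.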

Full trace-through

Copied from #6263 (comment)

  1. Stealing decides to move a task to worker X.
    1. It queues a steal-request to worker Y (where the task is currently queued), asking it to cancel the task.
    2. Stores a reference to the victim and thief WorkerStates (not addresses) in WorkStealing.in_flight
  2. Worker X gets removed by the scheduler.
  3. Its WorkerState instance (the one currently referenced by WorkStealing.in_flight) is removed from Scheduler.workers.
  4. Worker X heartbeats to the scheduler, reconnecting (bug described above).
  5. A new WorkerState instance for it is added to Scheduler.workers, at the same address. The scheduler thinks nothing is processing on it.
  6. Worker Y finally replies, "hey yeah, it's all cool if you steal that task".
  7. move_task_confirm handles this, and pops info about the stealing operation from WorkStealing.in_flight.
  8. This info contains a reference to the thief WorkerState object. This is the old WorkerState instance, which is no longer in Scheduler.workers.
  9. The thief's address is in Scheduler.workers, even though the thief object isn't.
  10. The task gets assigned to a worker that, to the scheduler, no longer exists.
  11. When worker X actually shuts itself down, Scheduler.remove_worker goes to reschedule any tasks it's processing. But it's looking at the new WorkerState instance, and the task was assigned to the old one, so the task is never rescheduled.
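One way to guard against steps 7 onward would be an identity check when the steal confirmation arrives: only assign the task if the stored thief is still the live WorkerState for its address. This is a hedged sketch with a hypothetical helper (`confirm_steal`) and a minimal stand-in class, not the actual move_task_confirm fix:

```python
class WorkerState:
    """Minimal stand-in for the scheduler's per-worker state object."""
    def __init__(self, address):
        self.address = address

def confirm_steal(entry, scheduler_workers):
    """Hypothetical guard for move_task_confirm: ignore the confirmation
    if the stored thief is no longer the live object for its address.
    Returns True only when it is safe to assign the task to the thief."""
    thief = entry["thief"]
    # Identity (`is`), not address equality: a restarted worker reuses the
    # address but gets a fresh WorkerState instance.
    return scheduler_workers.get(thief.address) is thief

old = WorkerState("tcp://10.0.0.1:1234")
new = WorkerState("tcp://10.0.0.1:1234")  # same address after reconnect
workers = {new.address: new}

stale_ok = confirm_steal({"thief": old}, workers)  # stale reference: reject
fresh_ok = confirm_steal({"thief": new}, workers)  # live object: proceed
```

Rejecting the stale confirmation leaves the task on its current worker, where normal scheduling (and Scheduler.remove_worker) can still see it.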

I think work stealing should either store worker addresses rather than WorkerState references, or verify that a stored WorkerState is still the current one for its address before using it.

In combination with #6354, this causes #6263 and #6198.

Metadata

Labels

bug: Something is broken
deadlock: The cluster appears to not make any progress
stability: Issue or feature related to cluster stability (e.g. deadlock)
