
Deadlock: tasks stolen to old WorkerState instance of a reconnected worker #6356

@gjoseph92

Description

Task stealing keeps references to WorkerState objects:

self.in_flight[ts] = {
    "victim": victim,  # guaranteed to be processing_on
    "thief": thief,
    "victim_duration": victim_duration,
    "thief_duration": thief_duration,
    "stimulus_id": stimulus_id,
}

If a worker disconnects, then reconnects from the same address (because it restarts, or because of the shutdown-reconnect bug #6354), task stealing can hold a reference to the old WorkerState object for that address, while the scheduler is working with the new WorkerState object for that address.

If tasks are assigned to this old, stale WorkerState, and then the worker leaves, the tasks will be forever stuck in processing (because they're not recognized as being on the worker that just left).
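The hazard above can be sketched in a few lines. This uses a hypothetical stand-in class, not the real distributed internals; only the shape of the problem is taken from the report:

```python
class WorkerState:
    """Minimal stand-in for the scheduler's per-worker state object."""
    def __init__(self, address):
        self.address = address

workers = {}  # stand-in for Scheduler.workers: address -> WorkerState

# Worker connects; stealing stashes the object itself, not the address.
old = WorkerState("tcp://10.0.0.1:1234")
workers[old.address] = old
in_flight_thief = old  # reference held by WorkStealing.in_flight

# Worker disconnects, then reconnects: a *new* WorkerState, same address.
del workers[old.address]
new = WorkerState("tcp://10.0.0.1:1234")
workers[new.address] = new

# The address still resolves, but the stored object is stale: the scheduler
# tracks `new`, while stealing will assign the task to `old`.
address_still_known = in_flight_thief.address in workers
object_is_current = workers[in_flight_thief.address] is in_flight_thief
```

Here `address_still_known` is true while `object_is_current` is false, which is exactly the mismatch that lets the task be assigned to a WorkerState the scheduler no longer tracks.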

Full trace-through

Copied from #6263 (comment)

  1. Stealing decides to move a task to worker X.
    1. It queues a steal-request to worker Y (where the task is currently queued), asking it to cancel the task.
    2. Stores a reference to the victim and thief WorkerStates (not addresses) in WorkStealing.in_flight
  2. Worker X gets removed by the scheduler.
  3. Its WorkerState instance (the one currently referenced by WorkStealing.in_flight) is removed from Scheduler.workers.
  4. Worker X heartbeats to the scheduler, reconnecting (bug described above).
  5. A new WorkerState instance for it is added to Scheduler.workers, at the same address. The scheduler thinks nothing is processing on it.
  6. Worker Y finally replies, "hey yeah, it's all cool if you steal that task".
  7. move_task_confirm handles this, and pops info about the stealing operation from WorkStealing.in_flight.
  8. This info contains a reference to the thief WorkerState object. This is the old WorkerState instance, which is no longer in Scheduler.workers.
  9. The thief's address is in Scheduler.workers, even though the thief object isn't.
  10. The task gets assigned to a worker that, to the scheduler, no longer exists.
  11. When worker X actually shuts itself down, Scheduler.remove_worker goes to reschedule any tasks it's processing. But it's looking at the new WorkerState instance, and the task was assigned to the old one, so the task is never rescheduled.
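One way to guard against steps 7 onward would be an identity check when the steal confirmation arrives: only assign the task if the stored thief is still the live WorkerState for its address. This is a hedged sketch with a hypothetical helper (`confirm_steal`) and a minimal stand-in class, not the actual move_task_confirm fix:

```python
class WorkerState:
    """Minimal stand-in for the scheduler's per-worker state object."""
    def __init__(self, address):
        self.address = address

def confirm_steal(entry, scheduler_workers):
    """Hypothetical guard for move_task_confirm: ignore the confirmation
    if the stored thief is no longer the live object for its address.
    Returns True only when it is safe to assign the task to the thief."""
    thief = entry["thief"]
    # Identity (`is`), not address equality: a restarted worker reuses the
    # address but gets a fresh WorkerState instance.
    return scheduler_workers.get(thief.address) is thief

old = WorkerState("tcp://10.0.0.1:1234")
new = WorkerState("tcp://10.0.0.1:1234")  # same address after reconnect
workers = {new.address: new}

stale_ok = confirm_steal({"thief": old}, workers)  # stale reference: reject
fresh_ok = confirm_steal({"thief": new}, workers)  # live object: proceed
```

Rejecting the stale confirmation leaves the task on its current worker, where normal scheduling (and Scheduler.remove_worker) can still see it.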

I think work stealing should either store worker addresses rather than WorkerState references, or verify that a stored WorkerState is still the current one for its address before using it.

In combination with #6354, this causes #6263 and #6198.

Metadata

Labels

bug: Something is broken
deadlock: The cluster appears to not make any progress
stability: Issue or feature related to cluster stability (e.g. deadlock)
