
Deadlock stealing a resumed task #6159

@gjoseph92

Description

  • Worker starts fetching key abcd from a peer
  • The fetch gets cancelled
  • Scheduler now asks the worker to compute the key abcd
  • Deadlock

Here's the (annotated) worker story for the key in question:

```yaml
# Initially we hear about the key as a dependency to fetch
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - ensure-task-exists
  - released
  - compute-task-1650391863.1634912
  - 2022-04-19 11:11:03.369210
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - released
  - fetch
  - fetch
  - {}
  - compute-task-1650391863.1634912
  - 2022-04-19 11:11:03.369474
- - gather-dependencies
  - tls://10.0.13.152:38353
  - - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 8))
    - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - ensure-communicating-1650391867.838159
  - 2022-04-19 11:11:07.838874
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - fetch
  - flight
  - flight
  - {}
  - ensure-communicating-1650391867.838159
  - 2022-04-19 11:11:07.838897
# We go try to fetch it. I think we're talking to a worker that's still alive,
# but so locked up under memory pressure it never responds—see https://github.com/dask/distributed/issues/6110
- - request-dep
  - tls://10.0.13.152:38353
  - - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 8))
    - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - ensure-communicating-1650391867.838159
  - 2022-04-19 11:11:07.848737
# This is really weird. After the stuck worker's TTL expires (https://github.com/dask/distributed/issues/6110#issuecomment-1102959742),
# the scheduler removes it. But the log we're looking at _isn't from the stuck worker_.
# So why is `Scheduler.transition_processing_released` sending a `free-keys` to this worker?
# That would indicate the scheduler's `processing_on` pointed to this worker.
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - flight
  - released
  - cancelled
  - {}
  - processing-released-1650392050.5300016
  - 2022-04-19 11:14:10.571274
# Regardless, the scheduler then asks us to compute (not fetch!) this task.
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - compute-task
  - compute-task-1650392058.1483586
  - 2022-04-19 11:14:18.165346
# We already know about it---the fetch was just cancelled by `processing-released`---so it stays in cancelled,
# with a recommendation to `(resumed, waiting)`
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - cancelled
  - waiting
  - cancelled
  - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1)):
    - resumed
    - waiting
  - compute-task-1650392058.1483586
  - 2022-04-19 11:14:18.165884
# It goes into resumed and nothing ever happens again
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - cancelled
  - resumed
  - resumed
  - {}
  - compute-task-1650392058.1483586
  - 2022-04-19 11:14:18.165897
```
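Condensed, that story is the following state sequence (a sketch for illustration only; the event names are paraphrased from the log, not actual distributed identifiers):

```python
# Hypothetical sketch of the worker-side sequence above; event names are
# paraphrased from the worker story, not real distributed identifiers.
steps = [
    ("compute-task", "released", "fetch"),          # told to fetch the dependency
    ("ensure-communicating", "fetch", "flight"),    # fetch starts; peer never replies
    ("processing-released", "flight", "cancelled"), # scheduler frees the key here
    ("compute-task", "cancelled", "resumed"),       # compute request resumes the dead fetch
]
final_state = steps[-1][2]  # "resumed": parked behind a gather_dep that never returns
```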

There's definitely something weird about the processing-released message arriving right before the compute-task message. I can't find an obvious reason in scheduler code why that would happen.

But let's ignore that oddity for a second. Pretend it was just a normal work-stealing request that caused the task to be cancelled.

I find it odd that, if a worker is told to compute a task it was previously fetching, it resumes the fetch:

```python
else:
    assert ts._previous == "flight"
    return {ts: ("resumed", "waiting")}, []
```

If previously we were fetching a key, but now we're being asked to compute it, it seems almost certain that the fetch is going to fail. The compute request should probably take precedence.

I imagine here that we're assuming the gather_dep will error out sometime in the future, and when it does, then the key will go from resumed to waiting?
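If compute were to take precedence, the branch above would recommend going straight to waiting instead of resuming. A minimal sketch of that precedence rule, with illustrative names (this is not the actual distributed transition code):

```python
# Illustrative sketch, not distributed's real transition function:
# decide what a compute-task request should do to a "cancelled" task.
def on_compute_while_cancelled(previous: str):
    if previous == "flight":
        # Proposed: drop the (likely doomed) fetch and take the compute path.
        return ("waiting", None)
    assert previous == "executing"
    # Resuming makes sense here: the interrupted computation can continue.
    return ("resumed", "waiting")
```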

Also, this is coming from the #6110 scenario. That one is unusual in that the TCP connection to the stuck worker doesn't get broken; the worker is just unresponsive. So I'm also wondering whether gather_dep to the stuck worker will hang forever? For 300s (it seems to go much longer than that)? For 300s times some number of retries? Basically, could it be that this isn't quite a deadlock, but a very, very, very long wait for a dependency fetch that might never return until the other worker properly dies? If we don't already have explicit timeouts on gather_dep, maybe we should.
(All that said, I still think the proper fix would be to not have transition_cancelled_waiting try to resume the fetch, but instead go down the compute path. The timeout might be something in addition.)
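As a sketch of the timeout idea: the fetch could be bounded with asyncio.wait_for so an unresponsive-but-connected peer surfaces as a failed fetch rather than an indefinite hang. The helper name and the 300s default are assumptions for illustration, not distributed's actual API:

```python
import asyncio

# Hypothetical helper, not part of distributed: bound a dependency fetch
# so a peer that is alive but unresponsive can't stall it forever.
async def gather_with_timeout(fetch_coro, timeout: float = 300.0):
    try:
        return await asyncio.wait_for(fetch_coro, timeout=timeout)
    except asyncio.TimeoutError:
        # Surface the stall as a failed fetch so the caller can transition
        # the task out of flight instead of waiting indefinitely.
        return None
```

Note that on timeout, wait_for cancels the underlying coroutine, so the worker would still need to clean up the cancelled comm.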

cc @fjetter

Labels: bug (Something is broken), deadlock (The cluster appears to not make any progress), stability (Issue or feature related to cluster stability, e.g. deadlock)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions