
Deadlock stealing a resumed task #6159

@gjoseph92

Description

  • Worker starts fetching key abcd from a peer
  • The fetch gets cancelled
  • Scheduler now asks the worker to compute the key abcd
  • Deadlock

Here's the (annotated) worker story for the key in question:

```yaml
# Initially we hear about the key as a dependency to fetch
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - ensure-task-exists
  - released
  - compute-task-1650391863.1634912
  - 2022-04-19 11:11:03.369210
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - released
  - fetch
  - fetch
  - {}
  - compute-task-1650391863.1634912
  - 2022-04-19 11:11:03.369474
- - gather-dependencies
  - tls://10.0.13.152:38353
  - - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 8))
    - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - ensure-communicating-1650391867.838159
  - 2022-04-19 11:11:07.838874
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - fetch
  - flight
  - flight
  - {}
  - ensure-communicating-1650391867.838159
  - 2022-04-19 11:11:07.838897
# We go try to fetch it. I think we're talking to a worker that's still alive,
# but so locked up under memory pressure it never responds—see https://github.com/dask/distributed/issues/6110
- - request-dep
  - tls://10.0.13.152:38353
  - - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 8))
    - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - ensure-communicating-1650391867.838159
  - 2022-04-19 11:11:07.848737
# This is really weird. After the stuck worker's TTL expires (https://github.com/dask/distributed/issues/6110#issuecomment-1102959742),
# the scheduler removes it. But the log we're looking at _isn't from the stuck worker_.
# So why is `Scheduler.transition_processing_released` sending a `free-keys` to this worker?
# That would indicate the scheduler's `processing_on` pointed to this worker.
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - flight
  - released
  - cancelled
  - {}
  - processing-released-1650392050.5300016
  - 2022-04-19 11:14:10.571274
# Regardless, the scheduler then asks us to compute (not fetch!) this task.
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - compute-task
  - compute-task-1650392058.1483586
  - 2022-04-19 11:14:18.165346
# We already know about it---the fetch was just cancelled by `processing-released`---so it stays in cancelled,
# with a recommendation to `(resumed, waiting)`
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - cancelled
  - waiting
  - cancelled
  - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1)):
    - resumed
    - waiting
  - compute-task-1650392058.1483586
  - 2022-04-19 11:14:18.165884
# It goes into resumed and nothing ever happens again
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - cancelled
  - resumed
  - resumed
  - {}
  - compute-task-1650392058.1483586
  - 2022-04-19 11:14:18.165897
```
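Condensed, that story is the following state sequence (a sketch for illustration only; the event names are paraphrased from the log, not actual distributed identifiers):

```python
# Hypothetical sketch of the worker-side sequence above; event names are
# paraphrased from the worker story, not real distributed identifiers.
steps = [
    ("compute-task", "released", "fetch"),          # told to fetch the dependency
    ("ensure-communicating", "fetch", "flight"),    # fetch starts; peer never replies
    ("processing-released", "flight", "cancelled"), # scheduler frees the key here
    ("compute-task", "cancelled", "resumed"),       # compute request resumes the dead fetch
]
final_state = steps[-1][2]  # "resumed": parked behind a gather_dep that never returns
```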

There's definitely something weird about the processing-released message arriving right before the compute-task message. I can't find an obvious reason in scheduler code why that would happen.

But let's ignore that oddity for a second. Pretend it was just a normal work-stealing request that caused the task to be cancelled.

I find it odd that, if a worker is told to compute a task it was previously fetching, it resumes the fetch:

```python
else:
    assert ts._previous == "flight"
    return {ts: ("resumed", "waiting")}, []
```

If previously we were fetching a key, but now we're being asked to compute it, it seems almost certain that the fetch is going to fail. The compute request should probably take precedence.

I imagine here that we're assuming the gather_dep will error out sometime in the future, and when it does, then the key will go from resumed to waiting?
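If compute were to take precedence, the branch above would recommend going straight to waiting instead of resuming. A minimal sketch of that precedence rule, with illustrative names (this is not the actual distributed transition code):

```python
# Illustrative sketch, not distributed's real transition function:
# decide what a compute-task request should do to a "cancelled" task.
def on_compute_while_cancelled(previous: str):
    if previous == "flight":
        # Proposed: drop the (likely doomed) fetch and take the compute path.
        return ("waiting", None)
    assert previous == "executing"
    # Resuming makes sense here: the interrupted computation can continue.
    return ("resumed", "waiting")
```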

Also, this is coming from the #6110 scenario. That one is unusual in that the TCP connection to the stuck worker doesn't get broken; the worker is just unresponsive. So I'm also wondering whether gather_dep to the stuck worker will hang forever? For 300s (it seems to go much longer than that)? For 300s times some number of retries? Basically, could it be that this isn't quite a deadlock, but a very, very, very long wait for a dependency fetch that might never return until the other worker properly dies? If we don't already have explicit timeouts on gather_dep, maybe we should.
(All that said, I still think the proper fix would be to not have transition_cancelled_waiting try to resume the fetch, but instead go down the compute path. The timeout might be something in addition.)
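As a sketch of the timeout idea: the fetch could be bounded with asyncio.wait_for so an unresponsive-but-connected peer surfaces as a failed fetch rather than an indefinite hang. The helper name and the 300s default are assumptions for illustration, not distributed's actual API:

```python
import asyncio

# Hypothetical helper, not part of distributed: bound a dependency fetch
# so a peer that is alive but unresponsive can't stall it forever.
async def gather_with_timeout(fetch_coro, timeout: float = 300.0):
    try:
        return await asyncio.wait_for(fetch_coro, timeout=timeout)
    except asyncio.TimeoutError:
        # Surface the stall as a failed fetch so the caller can transition
        # the task out of flight instead of waiting indefinitely.
        return None
```

Note that on timeout, wait_for cancels the underlying coroutine, so the worker would still need to clean up the cancelled comm.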

cc @fjetter

Labels: bug (Something is broken), deadlock (The cluster appears to not make any progress), stability (Issue or feature related to cluster stability, e.g. deadlock)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions