Description
- Worker starts fetching key `abcd` from a peer
- The fetch gets cancelled
- Scheduler now asks the worker to compute the key `abcd`
- Deadlock
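For clarity, the sequence above can be sketched as a toy state machine. The state and event names mirror the worker log, but this is a simplified illustration, not the actual distributed worker code:

```python
# Toy sketch of the per-task state sequence described above.
# NOT the real distributed worker state machine; names are illustrative.
def transition(state: str, event: str) -> str:
    if state == "flight" and event == "free-keys":
        return "cancelled"  # the in-flight fetch gets cancelled
    if state == "cancelled" and event == "compute-task":
        return "resumed"    # the worker resumes the *fetch*, not the compute
    return state

states = []
state = "flight"  # worker is fetching the key from a peer
for event in ("free-keys", "compute-task"):
    state = transition(state, event)
    states.append(state)

print(states)  # ['cancelled', 'resumed'] -- and then nothing ever happens
```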
Here's the (annotated) worker story for the key in question:
```yaml
# Initially we hear about the key as a dependency to fetch
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - ensure-task-exists
  - released
  - compute-task-1650391863.1634912
  - 2022-04-19 11:11:03.369210
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - released
  - fetch
  - fetch
  - {}
  - compute-task-1650391863.1634912
  - 2022-04-19 11:11:03.369474
- - gather-dependencies
  - tls://10.0.13.152:38353
  - - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 8))
    - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - ensure-communicating-1650391867.838159
  - 2022-04-19 11:11:07.838874
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - fetch
  - flight
  - flight
  - {}
  - ensure-communicating-1650391867.838159
  - 2022-04-19 11:11:07.838897
# We go try to fetch it. I think we're talking to a worker that's still alive,
# but so locked up under memory pressure it never responds—see https://github.com/dask/distributed/issues/6110
- - request-dep
  - tls://10.0.13.152:38353
  - - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 8))
    - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - ensure-communicating-1650391867.838159
  - 2022-04-19 11:11:07.848737
# This is really weird. After the stuck worker's TTL expires (https://github.com/dask/distributed/issues/6110#issuecomment-1102959742),
# the scheduler removes it. But the log we're looking at _isn't from the stuck worker_.
# So why is `Scheduler.transition_processing_released` sending a `free-keys` to this worker?
# That would indicate the scheduler's `processing_on` pointed to this worker.
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - flight
  - released
  - cancelled
  - {}
  - processing-released-1650392050.5300016
  - 2022-04-19 11:14:10.571274
# Regardless, the scheduler then asks us to compute (not fetch!) this task.
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - compute-task
  - compute-task-1650392058.1483586
  - 2022-04-19 11:14:18.165346
# We already know about it---the fetch was just cancelled by `processing-released`---so it stays in cancelled,
# with a recommendation to `(resumed, waiting)`
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - cancelled
  - waiting
  - cancelled
  - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1)):
    - resumed
    - waiting
  - compute-task-1650392058.1483586
  - 2022-04-19 11:14:18.165884
# It goes into resumed and nothing ever happens again
- - ('split-shuffle-1-9698a72b7e1aad6e40a6567d3c06d355', 5, (7, 1))
  - cancelled
  - resumed
  - resumed
  - {}
  - compute-task-1650392058.1483586
  - 2022-04-19 11:14:18.165897
```

There's definitely something weird about the `processing-released` message arriving right before the `compute-task` message. I can't find an obvious reason in the scheduler code why that would happen.
But let's ignore that oddity for a second. Pretend it was just a normal work-stealing request that caused the task to be cancelled.
I find it odd that, if a worker is told to compute a task it was previously fetching, it resumes the fetch:
`distributed/distributed/worker.py`, lines 2269 to 2271 in `c9dcbe7`:

```python
else:
    assert ts._previous == "flight"
    return {ts: ("resumed", "waiting")}, []
```
If previously we were fetching a key, but now we're being asked to compute it, it seems almost certain that the fetch is going to fail. The compute request should probably take precedence.
I imagine the assumption here is that the `gather_dep` will error out sometime in the future, and when it does, the key will go from resumed to waiting?
Also, this is coming from the #6110 scenario. That's an unusual one in that the TCP connection to the stuck worker doesn't get broken; it's just unresponsive. So I'm also wondering whether `gather_dep` to the stuck worker will hang forever? For 300s (it seems to go much longer than that)? For 300s times some number of retries? Basically, could it be that this isn't quite a deadlock, but a very, very, very long wait for a dependency fetch that might never return until the other worker properly dies? If we don't already have explicit timeouts on `gather_dep`, maybe we should.
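To illustrate what an explicit timeout might look like, here's a minimal sketch using `asyncio.wait_for`. `fetch_from_peer` is a hypothetical stand-in for the actual `gather_dep` RPC, and the timeout value is purely illustrative:

```python
import asyncio

async def fetch_from_peer(key: str) -> bytes:
    """Hypothetical stand-in for gather_dep's network call."""
    await asyncio.sleep(3600)  # simulate a stuck-but-connected peer
    return b"data"

async def gather_with_timeout(key: str, timeout: float = 30.0):
    try:
        return await asyncio.wait_for(fetch_from_peer(key), timeout=timeout)
    except asyncio.TimeoutError:
        # In the real worker this would presumably re-schedule the fetch
        # from another peer rather than waiting on the unresponsive one.
        return None

print(asyncio.run(gather_with_timeout("abcd", timeout=0.01)))  # None
```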
(All that said, I still think the proper fix would be to not have transition_cancelled_waiting try to resume the fetch, but instead go down the compute path. The timeout might be something in addition.)
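A minimal sketch of that proposed change, with simplified stand-ins for the real transition methods (the actual recommendation names and signatures in `worker.py` may differ):

```python
# Hypothetical, simplified transition_cancelled_waiting: if the task's
# previous state was "flight" (a fetch), don't resume the doomed fetch;
# recommend the compute path instead. NOT the real distributed code.
def transition_cancelled_waiting(previous: str) -> tuple:
    if previous == "flight":
        # Current behavior: ("resumed", "waiting") -> resume the fetch.
        # Proposed: forget the fetch and take the compute path instead.
        return ("released", "waiting")
    # other previous states (e.g. "executing") left out of this sketch
    raise NotImplementedError(previous)

print(transition_cancelled_waiting("flight"))  # ('released', 'waiting')
```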
cc @fjetter