-
-
Notifications
You must be signed in to change notification settings - Fork 757
Labels
deadlockThe cluster appears to not make any progressThe cluster appears to not make any progress
Description
- Task x is running on worker A
- x is stolen to worker B
- x completes successfully on worker B
- y, which depends on x, is scheduled on worker A
- x eventually completes on worker A, raising an exception. To reiterate: x is a task that does not behave deterministically, and/or changes behaviour depending on which worker it's executed on.
Expected behaviour
Either of these is acceptable:
- the entire computation falls over, re-raising the exception on the client
- worker A logs the incident, without informing the scheduler, and fetches the task from worker B
Actual behaviour
Cluster deadlock; corrupted worker state
Reproducer
def test_worker_state_executing_failure_to_fetch(ws_with_running_task):
ws = ws_with_running_task
ws2 = "127.0.0.1:2"
instructions = ws.handle_stimulus(
FreeKeysEvent(keys=["x"], stimulus_id="s1"),
ComputeTaskEvent.dummy(key="y", who_has={"x": [ws2]}, stimulus_id="s2"),
ExecuteFailureEvent(
key="x",
start=0.0,
stop=1.0,
exception=Serialize(Exception()),
traceback=None,
exception_text="",
traceback_text="",
stimulus_id="s3",
),
)
assert instructions == [
GatherDep(worker=ws2, to_gather={"x"}, total_nbytes=1, stimulus_id="s3")
]
assert ws.tasks["x"].state == "flight"Output:
Expected :[GatherDep(stimulus_id='s3', worker='127.0.0.1:2', to_gather={'x'}, total_nbytes=1)]
Actual :[TaskErredMsg(stimulus_id='s3', key='x', ...)]
for ts_wait in ts.waiting_for_data:
assert ts_wait.key in self.tasks
> assert (
ts_wait.state in READY | {"executing", "flight", "fetch", "missing"}
or ts_wait in self.missing_dep_flight
or ts_wait.who_has.issubset(self.in_flight_workers)
), (ts, ts_wait, self.story(ts), self.story(ts_wait))
E AssertionError: (<TaskState 'y' waiting>, <TaskState 'x' error>, [('y', 'compute-task', 'released', 's2', 1657213114.0749152), ('y', 'released', 'waiting', 'waiting', {'x': 'fetch'}, 's2', 1657213114.0749364)], [('x', 'compute-task', 'released', 'compute', 1657213114.0746293), ('x', 'released', 'waiting', 'waiting', {'x': 'constrained'}, 'compute', 1657213114.0746458), ('x', 'waiting', 'constrained', 'constrained', {'x': 'executing'}, 'compute', 1657213114.0746567), ('x', 'constrained', 'executing', 'executing', {}, 'compute', 1657213114.074662), ('free-keys', ['x'], 's1', 1657213114.0748878), ('x', 'executing', 'released', 'cancelled', {}, 's1', 1657213114.0748944), ('x', 'ensure-task-exists', 'cancelled', 's2', 1657213114.0749264), ('x', 'cancelled', 'fetch', 'resumed', {}, 's2', 1657213114.0749395), ('x', 'resumed', 'error', 'error', {}, 's3', 1657213114.0749602)])
../worker_state_machine.py:3081: AssertionError
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
deadlockThe cluster appears to not make any progressThe cluster appears to not make any progress