
Deadlock when a flaky task is stolen #6689

@crusaderky

Description

  1. Task x is running on worker A
  2. x is stolen to worker B
  3. x completes successfully on worker B
  4. y, which depends on x, is scheduled on worker A
  5. x eventually completes on worker A, raising an exception. To reiterate: x is a task that does not behave deterministically, and/or changes behaviour depending on which worker it's executed on.
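The story log in the traceback below records exactly this sequence of state transitions for x on worker A. A minimal toy replay (not distributed's actual state machine; the states are copied from that story log) shows where the worker ends up:

```python
# Toy replay (not distributed's actual code): transitions taken from
# the worker story log for task x in the traceback.
transitions = [
    ("released", "waiting"),       # compute-task: x arrives on worker A
    ("waiting", "constrained"),
    ("constrained", "executing"),  # x starts running on A
    ("executing", "cancelled"),    # free-keys: scheduler revokes A's copy
    ("cancelled", "resumed"),      # y arrives; x becomes a dependency to fetch
    ("resumed", "error"),          # the still-running execution of x fails
]

state = "released"
for old, new in transitions:
    assert state == old, f"illegal transition {old!r} -> {new!r} from {state!r}"
    state = new

# x ends in "error" while y still waits on it, and no fetch from
# worker B is ever scheduled: the cluster deadlocks.
print(state)  # error
```

The last transition is the problematic one: a task in `resumed` (cancelled-then-reinstated as a dependency) lands in `error` instead of falling back to fetching the result from worker B.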

Expected behaviour

Either of these is acceptable:

  • the entire computation falls over, re-raising the exception on the client
  • worker A logs the incident, without informing the scheduler, and fetches the task from worker B

Actual behaviour

Cluster deadlock; corrupted worker state

Reproducer

# Runs against distributed's worker state machine test harness;
# ws_with_running_task is a fixture from distributed's test suite.
from distributed.protocol import Serialize
from distributed.worker_state_machine import (
    ComputeTaskEvent,
    ExecuteFailureEvent,
    FreeKeysEvent,
    GatherDep,
)


def test_worker_state_executing_failure_to_fetch(ws_with_running_task):
    ws = ws_with_running_task
    ws2 = "127.0.0.1:2"
    instructions = ws.handle_stimulus(
        # Scheduler cancels x while it is executing on this worker
        FreeKeysEvent(keys=["x"], stimulus_id="s1"),
        # y is scheduled here, with x as a dependency held by ws2
        ComputeTaskEvent.dummy(key="y", who_has={"x": [ws2]}, stimulus_id="s2"),
        # The cancelled-then-resumed execution of x finally fails
        ExecuteFailureEvent(
            key="x",
            start=0.0,
            stop=1.0,
            exception=Serialize(Exception()),
            traceback=None,
            exception_text="",
            traceback_text="",
            stimulus_id="s3",
        ),
    )
    # Expected: the worker falls back to fetching x from ws2
    assert instructions == [
        GatherDep(worker=ws2, to_gather={"x"}, total_nbytes=1, stimulus_id="s3")
    ]
    assert ws.tasks["x"].state == "flight"

Output:

Expected :[GatherDep(stimulus_id='s3', worker='127.0.0.1:2', to_gather={'x'}, total_nbytes=1)]
Actual   :[TaskErredMsg(stimulus_id='s3', key='x', ...)]

            for ts_wait in ts.waiting_for_data:
                assert ts_wait.key in self.tasks
>               assert (
                    ts_wait.state in READY | {"executing", "flight", "fetch", "missing"}
                    or ts_wait in self.missing_dep_flight
                    or ts_wait.who_has.issubset(self.in_flight_workers)
                ), (ts, ts_wait, self.story(ts), self.story(ts_wait))
E               AssertionError: (<TaskState 'y' waiting>, <TaskState 'x' error>, [('y', 'compute-task', 'released', 's2', 1657213114.0749152), ('y', 'released', 'waiting', 'waiting', {'x': 'fetch'}, 's2', 1657213114.0749364)], [('x', 'compute-task', 'released', 'compute', 1657213114.0746293), ('x', 'released', 'waiting', 'waiting', {'x': 'constrained'}, 'compute', 1657213114.0746458), ('x', 'waiting', 'constrained', 'constrained', {'x': 'executing'}, 'compute', 1657213114.0746567), ('x', 'constrained', 'executing', 'executing', {}, 'compute', 1657213114.074662), ('free-keys', ['x'], 's1', 1657213114.0748878), ('x', 'executing', 'released', 'cancelled', {}, 's1', 1657213114.0748944), ('x', 'ensure-task-exists', 'cancelled', 's2', 1657213114.0749264), ('x', 'cancelled', 'fetch', 'resumed', {}, 's2', 1657213114.0749395), ('x', 'resumed', 'error', 'error', {}, 's3', 1657213114.0749602)])

../worker_state_machine.py:3081: AssertionError

Labels

deadlock: The cluster appears to not make any progress