
Another deadlock in the preamble of WorkerState.execute #6869

@crusaderky


This is closely related to #6867 and dask/dask#9330.

There is a deadlock which is triggered by this code path:

    if ts.state == "cancelled":
        logger.debug(
            "Trying to execute task %s which is not in executing state anymore",
            ts,
        )
        return AlreadyCancelledEvent(key=ts.key, stimulus_id=stimulus_id)

which in turn triggers:

    @_handle_event.register
    def _handle_already_cancelled(self, ev: AlreadyCancelledEvent) -> RecsInstrs:
        """Task is already cancelled by the time execute() runs"""
        # key *must* be still in tasks. Releasing it directly is forbidden
        # without going through cancelled
        ts = self.tasks.get(ev.key)
        assert ts, self.story(ev.key)
        ts.done = True
        return {ts: "released"}, []

The deadlock should be reproducible as follows:

  1. handle_stimulus(ComputeTaskEvent(key="x"))
    ts.state=executing; create asyncio task for Worker.execute
  2. handle_stimulus(FreeKeysEvent(keys=["x"]))
    ts.state=cancelled
  3. await asyncio.sleep(0)
    Worker.execute runs and returns AlreadyCancelledEvent.
    This causes the _handle_stimulus_from_task callback to be appended to the end of the event loop's queue.
    However, the test code is already ahead of that callback in the queue:
  4. handle_stimulus(ComputeTaskEvent(key="x"))
    ts.state=resumed
  5. await ... (anything that releases the event loop)
    This runs _handle_stimulus_from_task,
    which runs _handle_already_cancelled,
    which returns {ts: "released"},
    which triggers the (resumed, released) transition,
    which sends the task to cancelled state, while the scheduler thinks it's running.
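
The ordering in the steps above can be sketched with plain asyncio. This is only a toy model: `ToyWorker`, its methods, and the string states are hypothetical stand-ins, not distributed's actual `WorkerState` or `Worker` API, and the transition rules are reduced to the handful needed for this sequence. It assumes the standard-library asyncio event loop, where a task's done callback is queued with `call_soon` when the task finishes.

```python
import asyncio


class ToyWorker:
    """Hypothetical toy model; NOT distributed's Worker/WorkerState."""

    def __init__(self):
        self.state = {}      # key -> task state as the worker sees it
        self.history = []    # every state the task went through, in order
        self._tasks = set()  # keep references so asyncio tasks are not GC'd

    def _set(self, key, state):
        self.state[key] = state
        self.history.append(state)

    def compute(self, key):
        # Stand-in for handle_stimulus(ComputeTaskEvent(key=...))
        if self.state.get(key) == "cancelled":
            self._set(key, "resumed")
            return
        self._set(key, "executing")
        task = asyncio.get_running_loop().create_task(self._execute(key))
        self._tasks.add(task)
        task.add_done_callback(self._handle_stimulus_from_task)

    def free(self, key):
        # Stand-in for handle_stimulus(FreeKeysEvent(keys=[...]))
        if self.state.get(key) == "executing":
            self._set(key, "cancelled")

    async def _execute(self, key):
        # Preamble check: the task was cancelled before execute() got to run
        if self.state[key] == "cancelled":
            return ("already-cancelled", key)
        return ("finished", key)

    def _handle_stimulus_from_task(self, task):
        # Runs as a done callback, i.e. one event loop turn after _execute
        self._tasks.discard(task)
        kind, key = task.result()
        if kind == "already-cancelled":
            # The handler recommends "released"; in this toy model the
            # (resumed, released) transition lands in "cancelled"
            if self.state[key] == "resumed":
                self._set(key, "cancelled")
            else:
                self._set(key, "released")
        else:
            self._set(key, "memory")


async def reproduce():
    w = ToyWorker()
    w.compute("x")          # step 1: executing
    w.free("x")             # step 2: cancelled
    await asyncio.sleep(0)  # step 3: _execute runs; callback is queued
    w.compute("x")          # step 4: resumed, before the callback runs
    await asyncio.sleep(0)  # step 5: callback runs against a resumed task
    return w.history


history = asyncio.run(reproduce())
print(history)
```

In this model the task ends up in `cancelled` even though the most recent request was to compute it, which mirrors the mismatch between worker and scheduler described in step 5.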

@fjetter @gjoseph92 my head is spinning.


Labels: deadlock (the cluster appears to not make any progress)
