Description
When a task execution is cancelled, the scheduler releases all of the task's dependencies, unless they have other dependents or waiters on the scheduler side.
On the worker, this causes the task to transition from executing to cancelled, and its dependencies to immediately transition from memory to released.
There are problems with this:
- those dependencies will remain in memory until execute() completes, but they won't show up anywhere; in other words, they become unmanaged memory
- the dependencies will remain released if the task transitions back to resumed or to executing
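The problem can be illustrated with a toy model (this is not the real distributed worker state machine; all names here are hypothetical and purely illustrative): when the task is cancelled, its dependencies' data is dropped from the worker's managed memory, even though the thread running execute() may still hold references to it.

```python
# Toy sketch (hypothetical, NOT distributed's actual WorkerState) of the
# dependency-release behaviour described above.

class ToyWorkerState:
    def __init__(self):
        self.data = {}    # key -> value held in managed memory
        self.states = {}  # key -> task state

    def start_execute(self, key, deps):
        # Dependencies are in memory while the task executes
        for d in deps:
            self.data[d] = f"value-{d}"
            self.states[d] = "memory"
        self.states[key] = "executing"

    def cancel(self, key, deps):
        self.states[key] = "cancelled"
        # Dependencies with no other dependents or waiters are released
        # immediately, even though execute() is still running in a thread
        # and still holds references to their values.
        for d in deps:
            self.data.pop(d, None)
            self.states[d] = "released"


ws = ToyWorkerState()
ws.start_execute("x", ["a", "b"])
ws.cancel("x", ["a", "b"])

# The values are gone from managed memory, but the executing thread may
# still reference them: from the worker's accounting they are invisible,
# i.e. unmanaged memory.
assert "a" not in ws.data and "b" not in ws.data
assert ws.states["a"] == "released"
```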
This use case is virtually untested. For a long time we had these lines in validate_task_executing:
distributed/distributed/worker_state_machine.py
Lines 3095 to 3097 in 3647cfe
```python
for dep in ts.dependencies:
    assert dep.state == "memory", self.story(dep)
    assert dep.key in self.data or dep.key in self.actors
```
This assertion will fail in case of an executing -> cancelled -> executing cycle, but no unit test ever tripped it. The assertion was commented out in #6699. That PR originally tried extending it to the cancelled state, with the result that a single test, test_cancel_fire_and_forget, became flaky.
This also causes a race condition in the preamble of Worker.execute: if the task has had time to go through an executing -> cancelled -> executing/resumed cycle by the time the coroutine tries to load its dependencies, the task will fail with a KeyError.
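The race can be sketched as follows (a hypothetical simplification; `load_dependencies` and the dict-based store are illustrative, not distributed's real API): by the time the dependencies are looked up, the cancelled interlude has already dropped them from the worker's data store.

```python
# Hypothetical sketch of the race described above. After an
# executing -> cancelled -> executing/resumed cycle, the dependencies
# were released, so the lookup that execute() performs raises KeyError.

data = {}  # dependency values were released during the cancelled interlude


def load_dependencies(dep_keys, data):
    """Mimic the preamble of execute() fetching dependency values."""
    try:
        return {k: data[k] for k in dep_keys}
    except KeyError as exc:
        # In the real worker this surfaces as a failed task
        return exc


result = load_dependencies(["a", "b"], data)
assert isinstance(result, KeyError)
```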
The easiest way to hit this use case in real life is when the task has a sizeable run_spec, which is deserialized in a thread, and the threadpool is for whatever reason somewhat slow to respond.
CC @fjetter