cancelled/resumed->long-running transitions#6916
Unit Test Results

See the test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files +3, 15 suites +3, 6h 32m 52s ⏱️ +1h 55m 48s

For more details on these failures, see this check.

Results for commit b9ebe23. ± Comparison against base commit 817ead3.

♻️ This comment has been updated with latest results.
```python
@gen_cluster(client=True, nthreads=[("", 1)], timeout=2)
async def test_secede_cancelled_or_resumed_scheduler(c, s, a):
```
Note that this second test does not test the resumed(fetch) use case. However, the first test above demonstrates that the cancelled and resumed(fetch) use cases are indistinguishable from the scheduler's side.
distributed/worker_state_machine.py (outdated):

```python
def _transition_cancelled_long_running(
    self, ts: TaskState, compute_duration: float, *, stimulus_id: str
) -> RecsInstrs:
    """This transition also serves resumed(fetch) -> long-running"""
```
Not terribly happy about this, but the two alternatives (calling it `_transition_generic_long_running`, when it's definitely not generic, or copy-pasting the whole thing into a `_transition_resumed_long_running`) seemed worse.
`_transition_cancelled_or_resumed_long_running`? I'd be happier with a verbose but accurate name.
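For illustration, here is a minimal sketch (hypothetical, not the actual `WorkerStateMachine` code) of why one handler with a combined name can accurately serve two edges: the transition table simply registers the same function for both source states.

```python
def _transition_cancelled_or_resumed_long_running(ts):
    # Shared logic for both source states: move the task to long-running.
    ts["state"] = "long-running"
    return ts

# One function, two edges in the transition table
TRANSITIONS = {
    ("cancelled", "long-running"): _transition_cancelled_or_resumed_long_running,
    ("resumed", "long-running"): _transition_cancelled_or_resumed_long_running,
}

def transition(ts, to):
    # Dispatch on (current state, target state)
    return TRANSITIONS[(ts["state"], to)](ts)

ts = {"key": "x", "state": "cancelled"}
transition(ts, "long-running")
print(ts["state"])  # long-running
```

This avoids both a misleadingly generic name and a copy-pasted second handler.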
gjoseph92
left a comment
I see how this is probably the only way to solve the problem, but it feels a little odd to me.
What if we just didn't allow cancelled tasks to secede? While `secede` is running, we control the thread: we could just raise an error and refuse to call `tpe_secede()` or submit the `SecedeEvent`. So `cancelled -> long_running` would remain an impossible transition. If the exception causes the task to fail, we ignore it anyway. If the user code decides to handle the exception in some way, that's fine, but it will still never be able to trigger a `cancelled -> long_running` transition, since `secede` would refuse to do it.
In most cases, we can't cancel the running thread. But secede is the rare case where we do have the opportunity. Seems like it would be simpler to just not have to worry about this transition?
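A minimal sketch of this alternative, with hypothetical names (`maybe_secede`, `TaskCancelledError`) standing in for the real worker internals:

```python
class TaskCancelledError(RuntimeError):
    pass

def maybe_secede(task_state, tpe_secede, send_event):
    """Refuse to secede if the task was cancelled while running."""
    if task_state == "cancelled":
        # We control the thread here, so this is the rare place where we
        # can effectively "cancel" a running task by raising into user code.
        raise TaskCancelledError("task was cancelled; refusing to secede")
    tpe_secede()               # free the thread-pool slot
    send_event("SecedeEvent")  # notify the worker state machine

events = []
maybe_secede("executing", lambda: events.append("tpe_secede"), events.append)
print(events)  # ['tpe_secede', 'SecedeEvent']

try:
    maybe_secede("cancelled", lambda: events.append("tpe_secede"), events.append)
except TaskCancelledError:
    print("raised")  # no tpe_secede, no SecedeEvent
```

Under this design the `cancelled -> long_running` edge never appears in the transition table at all.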
distributed/worker_state_machine.py
Outdated
| self.executing.discard(ts) | ||
| self.long_running.add(ts) | ||
|
|
||
| # Do not send LongRunningMsg |
To clarify: the idea is that we don't send the message right now, because the task is cancelled, so from the scheduler's perspective, it's not running anymore on this worker, therefore the scheduler shouldn't hear updates from this worker about that task (xref #6956).
Instead, we postpone sending the LongRunningMsg until the task is un-cancelled. Only then will we send the message, since we know it's relevant.
This seems worth a longer comment?
Overhauled comment
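The deferral described in the comment above can be sketched roughly like this (hypothetical names and a simplified task-state dict; not the actual worker code): the `LongRunningMsg` is suppressed while the task is cancelled and only emitted if the task is later un-cancelled.

```python
def handle_secede(ts, messages):
    ts["previous"] = ts["state"]
    ts["state"] = "long-running"
    if ts.get("cancelled"):
        # Do not send LongRunningMsg now: the scheduler no longer considers
        # this task to be running on this worker, so it must not hear
        # updates about it. Remember to send it later instead.
        ts["deferred_long_running_msg"] = True
    else:
        messages.append(("LongRunningMsg", ts["key"]))

def handle_uncancel(ts, messages):
    ts["cancelled"] = False
    if ts.pop("deferred_long_running_msg", False):
        # The task is relevant to the scheduler again, so send it now.
        messages.append(("LongRunningMsg", ts["key"]))

msgs = []
ts = {"key": "x", "state": "executing", "cancelled": True}
handle_secede(ts, msgs)
print(msgs)  # [] -- nothing sent while cancelled
handle_uncancel(ts, msgs)
print(msgs)  # [('LongRunningMsg', 'x')]
```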
```python
assert ws.processing

await ev4.set()
assert await x == 123
```
Wait, so the expected, correct behavior is that you release a future, submit a new future with the same key, and get back the old (cancelled) future's result instead of the new one? That seems pretty wrong to me.
I'm aware that this could happen even for normal tasks, not just long-running, and it's just a consequence of not cancelling the thread, and keeping the TaskState around until the thread finishes. But from an API and user perspective, that seems wrong. I didn't think keys needed to be unique over the lifetime of the cluster, just that they needed to be unique among all the currently-active keys (and once a client saw a key as released, then it could safely consider it inactive).
Yep, but this is how it works. I spent several weeks trying and failing to make it sensible: #6844
This is a pretty rare use case: a user submits a task with a manually defined key; then, before the task has had time to finish, submits a different task with the same key.
Honestly, I feel that the blame should sit entirely on the user here, and figuring out what went wrong should be pretty straightforward. It also should not really happen outside of prototyping from a notebook, unless there are key collisions, which will cause all sorts of weird behaviour anyway.
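A toy model of why this happens (not the actual distributed code): the worker keys in-flight work by task key, and since the old thread cannot be cancelled, a resubmission with the same key reattaches to the still-running computation.

```python
in_flight = {}  # key -> callable for work that is already running

def submit(key, func):
    # If a thread for this key is still running, the new func is never run;
    # the caller is attached to the existing computation instead.
    if key not in in_flight:
        in_flight[key] = func
    return in_flight[key]

old = submit("x", lambda: 123)   # first submission; "thread" starts
# ... the client releases the future, but the thread keeps running ...
new = submit("x", lambda: 456)   # same key resubmitted before it finishes
print(new())  # 123, not 456: the old computation's result wins
```

This is why unique keys are effectively required for the lifetime of the computation, not just among currently-active futures.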
This makes me very nervous, because …

All review comments have been addressed
gjoseph92
left a comment
Discussed offline. Though using `sync` to run a function on the event loop in `secede` (and raising an error if the task was already cancelled) would be possible and would avoid threading race conditions, it would be difficult to test thoroughly. The benefit of `secede` raising an error if cancelled is also probably small: most tasks call `secede` right away, so there'd be very little time for the task to be cancelled in between. The state-machine-based approach here is much easier to test, so we'll go with this.
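For reference, the rejected pattern would look roughly like this: hop from the task's worker thread onto the event loop to read the task's state without racing the state machine (all names here are hypothetical, and this is only a sketch of the idea, not the real worker API).

```python
import asyncio
import threading

def secede_check(loop, get_state, key):
    async def _check():
        # Runs on the event loop, so it cannot race with state transitions.
        return get_state(key)

    state = asyncio.run_coroutine_threadsafe(_check(), loop).result()
    if state == "cancelled":
        raise RuntimeError(f"{key} was cancelled; refusing to secede")

states = {"x": "executing", "y": "cancelled"}
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

secede_check(loop, states.get, "x")  # fine: task is still executing
print("x ok")
try:
    secede_check(loop, states.get, "y")
except RuntimeError:
    print("y raised")
```

Even in this small form, testing the window between cancellation and the check thoroughly is awkward, which is part of why the state-machine approach won out.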
Closes #6709