Harden preamble of Worker.execute against race conditions#6878
Conversation
```python
# This is just for internal coherence of the WorkerState; the reschedule
# message should not ever reach the Scheduler.
# It is still OK if it does though.
return RescheduleEvent(key=key, stimulus_id=f"worker-closing-{time()}")
```
The two new tests remain green if you revert the first line of this block:

```python
if self.status in {Status.closing, Status.closed, Status.closing_gracefully}:
    return RescheduleEvent(key=key, stimulus_id=f"worker-closing-{time()}")
```

In this case, the reschedule event actually reaches the scheduler.
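To make the guard under discussion concrete, here is a minimal standalone sketch of the pattern: a worker that is shutting down refuses to start new work and instead emits a reschedule event. This is not distributed's actual code; `Status`, `RescheduleEvent`, `ComputeEvent`, and the free function `execute` below are simplified stand-ins for illustration only.

```python
import time
from dataclasses import dataclass
from enum import Enum, auto


class Status(Enum):
    # Simplified stand-in for distributed's Status enum
    running = auto()
    closing = auto()
    closing_gracefully = auto()
    closed = auto()


@dataclass
class RescheduleEvent:
    key: str
    stimulus_id: str


@dataclass
class ComputeEvent:
    key: str


def execute(key: str, status: Status):
    """Preamble guard: a worker that is shutting down must not start new
    work; it returns a RescheduleEvent so the WorkerState stays coherent."""
    if status in {Status.closing, Status.closed, Status.closing_gracefully}:
        return RescheduleEvent(
            key=key, stimulus_id=f"worker-closing-{time.time()}"
        )
    return ComputeEvent(key=key)
```

The point of contention above is what happens if such an event leaks out of the worker anyway: the comment in the diff asserts it is harmless, while the reviewer worries it could trigger a second reschedule on top of the one caused by the worker leaving.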
I'm concerned that this causes the scheduler to reschedule a task twice. First, because the worker left, and second, because the reschedule event arrives.
Why is the previous version of returning None not sufficient?
Edit:
distributed/distributed/scheduler.py
Lines 6740 to 6741 in 75ef3c9
```python
    "Trying to execute task %s which is not in executing state anymore",
    ts,
)
return AlreadyCancelledEvent(key=ts.key, stimulus_id=stimulus_id)
```
The two new tests remain green if, instead of removing this block completely, you just replace the exit event with

```python
return RescheduleEvent(key=key, stimulus_id=f"already-cancelled-{time()}")
```

They also remain green if you copy-paste the same block after the call to `_maybe_deserialize_task`.
In both cases, the reschedule event reaches the scheduler and behaves as expected.
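The reviewer's suggested variant can be sketched in isolation as follows. Again this is not distributed's actual code: `TaskState` is reduced to a key plus a state string, and `check_still_executing` is a hypothetical helper standing in for the preamble check in `Worker.execute`.

```python
import time
from dataclasses import dataclass


@dataclass
class RescheduleEvent:
    key: str
    stimulus_id: str


@dataclass
class TaskState:
    key: str
    state: str  # e.g. "executing", "cancelled", "memory"


def check_still_executing(ts: TaskState):
    """If the task left the executing state while the coroutine was
    suspended, ask the scheduler to reschedule it (the suggested variant)
    instead of returning an event the scheduler never sees."""
    if ts.state != "executing":
        return RescheduleEvent(
            key=ts.key, stimulus_id=f"already-cancelled-{time.time()}"
        )
    return None  # still executing; proceed with the task
```

Running the same check a second time after deserialization, as the comment suggests, would catch tasks cancelled during that await as well.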
```python
    ts,
)
return AlreadyCancelledEvent(key=ts.key, stimulus_id=stimulus_id)
if self.status not in WORKER_ANY_RUNNING:
```
Align `execute` and `gather_dep`.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0  15 suites ±0  6h 24m 56s ⏱️ -8m 24s

For more details on these failures, see this check. Results for commit 38bf365. ± Comparison against base commit 99a2db1.
fjetter left a comment:
Looks great! Thanks @crusaderky !
- `test_blockwise_concatenate` flaky (dask#9330)
- `WorkerState.execute` (#6869)
- `closing_gracefully` status, discussed in "Deadlock when emerging from closing_gracefully" (#6867)