**Unit Test Results**

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 &nbsp; 15 suites ±0 &nbsp; 6h 44m 54s ⏱️ -21m 1s

For more details on these failures and errors, see this check.

Results for commit 2e48ace. ± Comparison against base commit 1d0701b.

♻️ This comment has been updated with latest results.
Not sure how best to add comments to the above table. I added another row with comments.
I fully acknowledge that I only skimmed the implementation. It doesn't look as bad as I thought, but based on the table I would've expected some changes to the scheduler as well, which makes me a bit nervous. FWIW, as already outlined above, I believe the most important real world use case is …
I'll have a closer look at the code and will provide more feedback about this proposal. I'm currently a bit skeptical about removing it, since I think we need it as a complement to …
Force-pushed from 157d4a9 to 871cf51.
The resumed state is exceptionally complicated and a very frequent source of problems. This PR removes the `resumed` state and the `TaskState.next` attribute.

This PR also deals with the issue of tasks with waiters that transition to error — typical, but not exclusive, of the cancelled state; see the issues below. The waiters are now sent back to the scheduler.

Closes:

- `_transition_from_resumed` contains legacy code and documentation (#6693)
- `resumed->rescheduled` is an invalid transition (#6685)

Design
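As context for the table below, here is a toy sketch of the bookkeeping this PR removes. This is *not* the actual `distributed.worker_state_machine` code — the real `TaskState` has many more fields and the transitions are far more involved — it only illustrates how `previous` and `next` interact when a cancelled task is asked for again:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskState:
    """Toy stand-in for the worker-side TaskState (illustrative only)."""
    key: str
    state: str = "released"          # e.g. executing, flight, cancelled, resumed
    previous: Optional[str] = None   # state at the time of cancellation
    next: Optional[str] = None       # state to resume into (removed by this PR)

def cancel(ts: TaskState) -> None:
    # free-keys received while the task is executing or in flight
    if ts.state in ("executing", "flight"):
        ts.previous = ts.state
        ts.state = "cancelled"

def resume(ts: TaskState, wanted: str) -> None:
    # Old design: a cancelled task requested again in a *different* way
    # (executing -> fetch, or flight -> waiting) lands in the ambiguous
    # 'resumed' state, tracked through TaskState.next.
    if ts.state == "cancelled":
        if (ts.previous, wanted) in {("executing", "fetch"), ("flight", "waiting")}:
            ts.state = "resumed"
            ts.next = wanted
        else:
            ts.state, ts.previous = ts.previous, None

ts = TaskState("x", state="executing")
cancel(ts)
resume(ts, "fetch")
assert (ts.state, ts.previous, ts.next) == ("resumed", "executing", "fetch")
```

The point of the sketch: every event handler then has to branch on `state`, `previous`, *and* `next`, which is where the bugs in the table below come from.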
Worker behavior by sequence of events received, before and after this PR:

- `free-keys` (`previous=executing`)
  - On completion: quietly release
- `free-keys` (`previous=flight`)
  - On completion: quietly release
- `free-keys`, then `compute` [1]
- `free-keys`, then `fetch` [2]
- `free-keys`, then `fetch` [2]
  - Before (`previous=executing`, `next=fetch`):
    - On success: `add-keys`
    - On compute failure: cluster deadlock (#6689)
    - On reschedule: `InvalidTransition` (#6685)
  - After (`previous=None`):
    - On success: `task-finished`
    - On compute failure: `task-erred` [5]; reschedule dependents [6]
    - On reschedule: reschedule the task and its dependents [6]
- `free-keys`, then `compute` [1]
  - Before (`previous=flight`, `next=waiting`):
    - On success: `task-finished` with bogus metrics
    - On peer failure [3]: transition to `waiting`
    - On (un)pickle failure [4]: `task-erred`
  - After (`previous=None`):
    - On success: `add-keys`
    - On peer failure [3]: transition to `fetch` or `missing`; on the scheduler side, `request_who_has` reschedules
    - On (un)pickle failure [4]: `task-erred`
- Before:
  - On success: `add-keys`
  - On peer failure [3]: transition to `fetch` or `missing`
  - On (un)pickle failure [4]: cluster deadlock (#6705)
- After:
  - On success: `add-keys`
  - On peer failure [3]: transition to `fetch` or `missing`
  - On (un)pickle failure [4]: `task-erred` [5] and reschedule dependents [6]
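The "transition to `fetch` or `missing`" outcome that recurs above can be sketched as follows. The helper name and shapes are hypothetical — the real logic lives in the worker state machine and operates on `TaskState` objects, not plain dicts:

```python
# Toy sketch of peer-failure handling: when gathering a key from a peer
# fails, drop that peer from the known replicas; the task goes back to
# 'fetch' if other replicas remain, else to 'missing' (names illustrative).
def on_peer_failure(who_has: dict, key: str, failed_worker: str) -> str:
    replicas = [w for w in who_has.get(key, ()) if w != failed_worker]
    who_has[key] = replicas
    return "fetch" if replicas else "missing"

who_has = {"x": ["w1", "w2"], "y": ["w1"]}
assert on_peer_failure(who_has, "x", "w1") == "fetch"    # w2 still holds x
assert on_peer_failure(who_has, "y", "w1") == "missing"  # no replicas left
```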
Notes
[1] `ComputeTaskEvent(key=<key>)`
[2] `ComputeTaskEvent(who_has={<key>: [...]})` or `AcquireReplicasEvent(who_has={<key>: [...]})`
[3] `GatherDepSuccessEvent` without the requested key, or `GatherDepNetworkFailureEvent`
[4] `GatherDepFailureEvent`, typically caused by a failure to unpickle, or a `GatherDepSuccessEvent` for a task that is larger than 60% of max_memory, is thus spilled immediately, and fails to pickle.
[5] The `task-erred` messages introduce a new scheduler-side use case, where the scheduler receives a `task-erred` message for a task that is already in memory. At the moment, this use case is a no-op.
[6] Rescheduling waiters implies introducing a new `waiting->rescheduled` transition.
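Note [6]'s "reschedule dependents" amounts to walking the waiters transitively. A minimal sketch, assuming a plain `waiters` mapping from each key to the keys that depend on it (the function name and data shape are illustrative, not the actual scheduler API):

```python
# Toy sketch of note [6]: when a task is rescheduled or erred, every task
# waiting on it, directly or transitively, must be rescheduled too.
def waiters_to_reschedule(key: str, waiters: dict) -> set:
    seen: set = set()
    stack = [key]
    while stack:
        k = stack.pop()
        for w in waiters.get(k, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

waiters = {"x": ["y"], "y": ["z"], "z": []}
assert waiters_to_reschedule("x", waiters) == {"y", "z"}
```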
TODO
Regardless of TODOs, this is a gargantuan change which won't go in before @fjetter is back and has had time to review it thoroughly.