**Unit Test Results**

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 &nbsp; 15 suites ±0 &nbsp; 6h 44m 54s ⏱️ -21m 1s

For more details on these failures and errors, see this check.

Results for commit 2e48ace. ± Comparison against base commit 1d0701b.

♻️ This comment has been updated with latest results.
Not sure how best to add comments to the above table. I added another row with comments.
I fully acknowledge that I only skimmed the implementation. It doesn't look as bad as I thought, but based on the table I would've expected some changes to the scheduler as well, which makes me a bit nervous. FWIW, as already outlined above, I believe the most important real world use case is …
I'll have a closer look at the code and will provide more feedback about this proposal. I'm currently a bit skeptical about removing it, since I think we need it as a complement to …
Force-pushed from 157d4a9 to 871cf51.
The resumed state is exceptionally complicated and a very frequent source of problems. This PR removes the `resumed` state and the `TaskState.next` attribute.

This PR also deals with the issue of tasks with waiters that transition to error — typical, but not exclusive, of the cancelled state; see the issues below. The waiters are now sent back to the scheduler.

Closes:

- `_transition_from_resumed` contains legacy code and documentation (#6693)
- `resumed->rescheduled` is an invalid transition (#6685)

Design
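As context for the table below, here is a toy sketch of the bookkeeping this PR removes. This is *not* the actual `distributed.worker_state_machine` code — the real `TaskState` has many more fields and the transitions are far more involved — it only illustrates how `previous` and `next` interact when a cancelled task is asked for again:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskState:
    """Toy stand-in for the worker-side TaskState (illustrative only)."""
    key: str
    state: str = "released"          # e.g. executing, flight, cancelled, resumed
    previous: Optional[str] = None   # state at the time of cancellation
    next: Optional[str] = None       # state to resume into (removed by this PR)

def cancel(ts: TaskState) -> None:
    # free-keys received while the task is executing or in flight
    if ts.state in ("executing", "flight"):
        ts.previous = ts.state
        ts.state = "cancelled"

def resume(ts: TaskState, wanted: str) -> None:
    # Old design: a cancelled task requested again in a *different* way
    # (executing -> fetch, or flight -> waiting) lands in the ambiguous
    # 'resumed' state, tracked through TaskState.next.
    if ts.state == "cancelled":
        if (ts.previous, wanted) in {("executing", "fetch"), ("flight", "waiting")}:
            ts.state = "resumed"
            ts.next = wanted
        else:
            ts.state, ts.previous = ts.previous, None

ts = TaskState("x", state="executing")
cancel(ts)
resume(ts, "fetch")
assert (ts.state, ts.previous, ts.next) == ("resumed", "executing", "fetch")
```

The point of the sketch: every event handler then has to branch on `state`, `previous`, *and* `next`, which is where the bugs in the table below come from.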
Worker behavior by sequence of events received, before and after this PR:

- `free-keys` (`previous=executing`)
  - On completion: quietly release
- `free-keys` (`previous=flight`)
  - On completion: quietly release
- `free-keys`, then `compute` [1]
- `free-keys`, then `fetch` [2]
- `free-keys`, then `fetch` [2]
  - Before (`previous=executing`, `next=fetch`):
    - On success: `add-keys`
    - On compute failure: cluster deadlock (#6689)
    - On reschedule: `InvalidTransition` (#6685)
  - After (`previous=None`):
    - On success: `task-finished`
    - On compute failure: `task-erred` [5]; reschedule dependents [6]
    - On reschedule: reschedule the task and its dependents [6]
- `free-keys`, then `compute` [1]
  - Before (`previous=flight`, `next=waiting`):
    - On success: `task-finished` with bogus metrics
    - On peer failure [3]: transition to `waiting`
    - On (un)pickle failure [4]: `task-erred`
  - After (`previous=None`):
    - On success: `add-keys`
    - On peer failure [3]: transition to `fetch` or `missing`; on the scheduler side, `request_who_has` reschedules
    - On (un)pickle failure [4]: `task-erred`
- Before:
  - On success: `add-keys`
  - On peer failure [3]: transition to `fetch` or `missing`
  - On (un)pickle failure [4]: cluster deadlock (#6705)
- After:
  - On success: `add-keys`
  - On peer failure [3]: transition to `fetch` or `missing`
  - On (un)pickle failure [4]: `task-erred` [5] and reschedule dependents [6]
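The "transition to `fetch` or `missing`" outcome that recurs above can be sketched as follows. The helper name and shapes are hypothetical — the real logic lives in the worker state machine and operates on `TaskState` objects, not plain dicts:

```python
# Toy sketch of peer-failure handling: when gathering a key from a peer
# fails, drop that peer from the known replicas; the task goes back to
# 'fetch' if other replicas remain, else to 'missing' (names illustrative).
def on_peer_failure(who_has: dict, key: str, failed_worker: str) -> str:
    replicas = [w for w in who_has.get(key, ()) if w != failed_worker]
    who_has[key] = replicas
    return "fetch" if replicas else "missing"

who_has = {"x": ["w1", "w2"], "y": ["w1"]}
assert on_peer_failure(who_has, "x", "w1") == "fetch"    # w2 still holds x
assert on_peer_failure(who_has, "y", "w1") == "missing"  # no replicas left
```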
Notes
[1] `ComputeTaskEvent(key=<key>)`
[2] `ComputeTaskEvent(who_has={<key>: [...]})` or `AcquireReplicasEvent(who_has={<key>: [...]})`
[3] `GatherDepSuccessEvent` without the requested key, or `GatherDepNetworkFailureEvent`
[4] `GatherDepFailureEvent`, typically caused by a failure to unpickle, or a `GatherDepSuccessEvent` for a task that is larger than 60% of max_memory, is thus spilled immediately, and fails to pickle.
[5] The `task-erred` messages introduce a new scheduler-side use case, where the scheduler receives a `task-erred` message for a task that is already in memory. At the moment, this use case is a no-op.
[6] Rescheduling waiters implies introducing a new `waiting->rescheduled` transition.
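Note [6]'s "reschedule dependents" amounts to walking the waiters transitively. A minimal sketch, assuming a plain `waiters` mapping from each key to the keys that depend on it (the function name and data shape are illustrative, not the actual scheduler API):

```python
# Toy sketch of note [6]: when a task is rescheduled or erred, every task
# waiting on it, directly or transitively, must be rescheduled too.
def waiters_to_reschedule(key: str, waiters: dict) -> set:
    seen: set = set()
    stack = [key]
    while stack:
        k = stack.pop()
        for w in waiters.get(k, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

waiters = {"x": ["y"], "y": ["z"], "z": []}
assert waiters_to_reschedule("x", waiters) == {"y", "z"}
```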
TODO
Regardless of TODOs, this is a gargantuan change which won't go in before @fjetter is back and has had time to review it thoroughly.