Validate and debug state machine on handle_compute_task#6327
Validate and debug state machine on handle_compute_task#6327
Conversation
0a088bf to
91886ad
Compare
Unit Test Results 15 files + 3 15 suites +3 6h 58m 2s ⏱️ + 1h 9m 0s For more details on these failures, see this check. Results for commit 79402f2. ± Comparison against base commit 4b81f06. ♻️ This comment has been updated with latest results. |
ce5476c to
4de1517
Compare
distributed/worker.py
Outdated
| ("released", "error"): self.transition_generic_error, | ||
| ("released", "fetch"): self.transition_released_fetch, | ||
| ("released", "missing"): self.transition_released_fetch, | ||
| ("released", "missing"): self.transition_generic_missing, |
There was a problem hiding this comment.
| for dep_key, value in nbytes.items(): | ||
| self.tasks[dep_key].nbytes = value | ||
|
|
||
| self.update_who_has(who_has) |
There was a problem hiding this comment.
This move prevents deps to be created in resumed state by ensure_task_exists and then remain there because there's nothing actually needing them.
| self.tasks[key].nbytes = value | ||
|
|
||
| if ts.state in READY | {"executing", "waiting", "resumed"}: | ||
| if ts.state in READY | {"executing", "long-running", "waiting", "resumed"}: |
There was a problem hiding this comment.
I omitted a unit test for this - something to write after the state machine refactor for sure
| self.scheduler.who_has, | ||
| keys=[ts.key for ts in self._missing_dep_flight], | ||
| ) | ||
| who_has = {k: v for k, v in who_has.items() if v} |
There was a problem hiding this comment.
Redundant - update_who_has already throws away empty lists of workers
|
|
||
|
|
||
| @gen_cluster(nthreads=[("127.0.0.1", 1)] * 10, client=True, timeout=60) | ||
| @gen_cluster(nthreads=[("", 1)] * 10, client=True) |
There was a problem hiding this comment.
This test is functionally identical to before - all changes are just cosmetic.
Self-review Self-review self-review
f0425fa to
9408cab
Compare
|
The CI failure is on these new lines at the end of for dep_ts in ts.dependencies:
assert dep_ts.state != "released", self.story(dep_ts)I can reproduce it in 0.4% (4 out of 1000) of runs on my desktop. I'll investigate over the next few days. |
|
The one-liner fix for the infinite transition has been merged in #6331. What is left in this PR is a wealth of hardening, which removes some cases for corrupted state and makes other crop up sooner. |
| # ensure_tasks_exists() have been transitioned to fetch or flight | ||
| assert all( | ||
| ts2.state != "released" for ts2 in (ts, *ts.dependencies) | ||
| ), self.story(ts, *ts.dependencies) |
There was a problem hiding this comment.
At the moment of writing, this assertion fails in test_stress_scatter_death 0.4% of the times on a fast desktop.
Explanation in #6305. Resolution out of scope for this PR.
There was a problem hiding this comment.
FYI this assert is not 100% correct. There is a case for valid tasks left in released in the case of cancelled/resumed tasks. I'll open a follow up PR with a case reproducing this condition
There was a problem hiding this comment.
flowchart TD
A1[A1 - forgotten / not known] --> B1[B1 - flight]
A2[A1 - forgotten / not known] --> B1[B1 - flight]
B1 --> C1[C1 - waiting]
free-keys / cancel B1
flowchart TD
A1[A1 - forgotten / not known] --> B1[B1 - cancelled]
A2[A1 - forgotten / not known] --> B1[B1 - cancelled]
B1 --> C1[C1 - forgotten]
compute-task B1
flowchart TD
A1[A1 - released] --> B1[B1 - resumed]
A2[A1 - released] --> B1[B1 - resumed]
B1 --> C1[C1 - forgotten]
gather-dep finishes w/ Error
flowchart TD
A1[A1 - fetch] --> B1[B1 - waiting]
A2[A1 - fetch] --> B1[B1 - waiting]
B1 --> C1[C1 - forgotten]
There was a problem hiding this comment.
I don't think I understand from the above diagram how this can end up with a released state by the end of compute-task?
|
Nothing here concerns me. The biggest change is the transition change, and that was from Guido anyway. The failing test is concerning, but it seems like it's just shining a light on something that was broken before. Merging. |
gjoseph92
left a comment
There was a problem hiding this comment.
This was merged midway through my reviewing, but it also looks good to me. The failing test is just the 0.4% of cases where test_stress_scatter_death is known to still fail, which is a regression that still needs to be addressed #6305.
|
Ah, my apologies for jumping ahead. |
Uh oh!
There was an error while loading. Please reload this page.