Validate and debug state machine on handle_compute_task by crusaderky · Pull Request #6327 · dask/distributed

crusaderky · 2022-05-11T20:19:14Z

Partially closes test_stress_scatter_death #6305

github-actions · 2022-05-12T00:30:33Z

Unit Test Results

      15 files +      3       15 suites +3 6h 58m 2s ⏱️ + 1h 9m 0s
  2 774 tests ±      0   2 694 ✔️ +    13   79 💤 -   12 1 ❌ - 1
20 580 runs +3 962 19 672 ✔️ +3 850 907 💤 +113 1 ❌ - 1

For more details on these failures, see this check.

Results for commit 79402f2. ± Comparison against base commit 4b81f06.

♻️ This comment has been updated with latest results.

crusaderky · 2022-05-12T15:06:56Z

distributed/worker.py

            ("released", "error"): self.transition_generic_error,
            ("released", "fetch"): self.transition_released_fetch,
-            ("released", "missing"): self.transition_released_fetch,
+            ("released", "missing"): self.transition_generic_missing,


Read: https://github.com/dask/distributed/pull/6248/files#r870780929
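As a rough illustration of why a single entry in a `(start, finish)` dispatch table can matter, here is a toy table-driven state machine. It is not the Worker code: the handlers `released_fetch` and `generic_missing`, the dict-based task, and the `run` driver with its `max_steps` guard are all hypothetical stand-ins. It only shows one way a handler that recommends a new state without changing the current one can make the dispatcher loop forever.

```python
def released_fetch(task):
    """Hypothetical handler: go to fetch, or recommend missing if nobody holds the key."""
    if not task["who_has"]:
        return "missing"  # recommendation only; task["state"] is left untouched
    task["state"] = "fetch"
    return None


def generic_missing(task):
    """Hypothetical handler: actually move the task to the missing state."""
    task["state"] = "missing"
    return None


def run(table, task, finish, max_steps=10):
    """Follow recommendations until none is left; raise if it never settles."""
    steps = 0
    while finish is not None:
        steps += 1
        if steps > max_steps:
            raise RuntimeError("transition loop detected")
        handler = table[(task["state"], finish)]
        finish = handler(task)
    return task["state"]


# Table shaped like the "-" line of the diff: missing is routed to released_fetch.
old_table = {
    ("released", "fetch"): released_fetch,
    ("released", "missing"): released_fetch,
}
# Table shaped like the "+" line of the diff: missing has its own handler.
new_table = {
    ("released", "fetch"): released_fetch,
    ("released", "missing"): generic_missing,
}


def new_task():
    return {"state": "released", "who_has": set()}
```

In this toy, `run(new_table, new_task(), "fetch")` settles in `missing`, while the same call with `old_table` keeps bouncing between the same handler and the same recommendation until the guard trips.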

crusaderky · 2022-05-12T15:19:24Z

distributed/worker.py

+            for dep_key, value in nbytes.items():
+                self.tasks[dep_key].nbytes = value
+
+            self.update_who_has(who_has)


This move prevents deps from being created in resumed state by ensure_task_exists and then remaining there because nothing actually needs them.

crusaderky · 2022-05-12T15:31:46Z

distributed/worker.py

-                self.tasks[key].nbytes = value

-        if ts.state in READY | {"executing", "waiting", "resumed"}:
+        if ts.state in READY | {"executing", "long-running", "waiting", "resumed"}:


I omitted a unit test for this - something to write after the state machine refactor for sure

crusaderky · 2022-05-12T15:32:51Z

distributed/worker.py

                self.scheduler.who_has,
                keys=[ts.key for ts in self._missing_dep_flight],
            )
-            who_has = {k: v for k, v in who_has.items() if v}


Redundant - update_who_has already throws away empty lists of workers
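To make the redundancy concrete, here is a small sketch. `update_who_has_sketch` is a hypothetical stand-in, not the real `update_who_has`; the only behaviour it borrows from the comment above is that keys with an empty list of workers are skipped. Under that assumption, filtering the mapping beforehand changes nothing.

```python
def update_who_has_sketch(known, who_has):
    """Hypothetical stand-in: merge who_has into known, skipping empty worker lists."""
    for key, workers in who_has.items():
        if not workers:
            continue  # empty lists of workers are thrown away here
        known.setdefault(key, set()).update(workers)
    return known


who_has = {"x": ["tcp://a:1234"], "y": [], "z": ["tcp://b:1234", "tcp://c:1234"]}

# With the explicit pre-filter that the diff removes...
filtered = update_who_has_sketch({}, {k: v for k, v in who_has.items() if v})
# ...and without it.
unfiltered = update_who_has_sketch({}, who_has)
```

Both calls produce the same mapping, and `"y"` appears in neither.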

crusaderky · 2022-05-12T15:34:01Z

distributed/tests/test_stress.py



-@gen_cluster(nthreads=[("127.0.0.1", 1)] * 10, client=True, timeout=60)
+@gen_cluster(nthreads=[("", 1)] * 10, client=True)


This test is functionally identical to before - all changes are just cosmetic.


crusaderky · 2022-05-12T20:28:36Z

The CI failure is on these new lines at the end of handle_compute_task:

            for dep_ts in ts.dependencies:
                assert dep_ts.state != "released", self.story(dep_ts)

I can reproduce it in 0.4% (4 out of 1000) of runs on my desktop. I'll investigate over the next few days.
In the meantime, I think this PR can be reviewed and merged as is.
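For readers who want to measure an intermittent assertion like this one, a generic approach is to repeat the scenario many times and count how often `AssertionError` is raised. The sketch below assumes nothing about the real test: `estimate_failure_rate` and `flaky_scenario` are hypothetical names, and the stand-in scenario fails deterministically on 4 of every 1000 seeds only so that the arithmetic matches the figure quoted above.

```python
def estimate_failure_rate(scenario, runs):
    """Run scenario(seed) for seed in range(runs); return the fraction raising AssertionError."""
    failures = 0
    for seed in range(runs):
        try:
            scenario(seed)
        except AssertionError:
            failures += 1
    return failures / runs


def flaky_scenario(seed):
    """Hypothetical stand-in for a flaky test: fails for seeds 0, 250, 500, 750, ..."""
    state = "released" if seed % 250 == 0 else "fetch"
    assert state != "released", f"seed {seed}: dependency left in released"


rate = estimate_failure_rate(flaky_scenario, 1000)
```

Here `rate` is 4 failures out of 1000 runs. With a real flaky test the count varies from batch to batch, so a single batch of this size gives only a rough estimate.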

CC @fjetter @gjoseph92 @graingert

crusaderky · 2022-05-13T10:19:52Z

The one-liner fix for the infinite transition has been merged in #6331.
Explanation is here: https://github.com/dask/distributed/pull/6248/files#r870780929

What is left in this PR is a wealth of hardening, which removes some cases of corrupted state and makes others crop up sooner.

crusaderky · 2022-05-13T10:25:05Z

distributed/worker.py

+            # ensure_tasks_exists() have been transitioned to fetch or flight
+            assert all(
+                ts2.state != "released" for ts2 in (ts, *ts.dependencies)
+            ), self.story(ts, *ts.dependencies)


At the time of writing, this assertion fails in test_stress_scatter_death 0.4% of the time on a fast desktop.
Explanation in #6305. Resolution out of scope for this PR.

FYI this assert is not 100% correct. With cancelled/resumed tasks, there is a valid case in which tasks are left in released. I'll open a follow-up PR with a case reproducing this condition.

flowchart TD
    A1[A1 - forgotten / not known] --> B1[B1 - flight]
    A2[A1 - forgotten / not known] --> B1[B1 - flight]
    B1 --> C1[C1 - waiting]

free-keys / cancel B1

flowchart TD
    A1[A1 - forgotten / not known] --> B1[B1 - cancelled]
    A2[A1 - forgotten / not known] --> B1[B1 - cancelled]
    B1 --> C1[C1 - forgotten]

compute-task B1

flowchart TD
    A1[A1 - released] --> B1[B1 - resumed]
    A2[A1 - released] --> B1[B1 - resumed]
    B1 --> C1[C1 - forgotten]

gather-dep finishes w/ Error

flowchart TD
    A1[A1 - fetch] --> B1[B1 - waiting]
    A2[A1 - fetch] --> B1[B1 - waiting]
    B1 --> C1[C1 - forgotten]

I don't think I understand from the above diagram how this can end up with a released state by the end of compute-task?
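One literal reading of the four diagrams, offered only as a sketch, treats each as a snapshot of task states taken right after the named event. The code below encodes those snapshots and checks which keys among B1 and its dependencies are in `released` at each point. Several things here are assumptions rather than facts from the thread: that the second node is A2 (the diagrams label it "A1"), that arrows point from dependency to dependent, that "forgotten / not known" can be shortened to `forgotten`, and that the snapshots describe the state at the moment the assertion would run. `SNAPSHOTS`, `DEPENDENCIES` and `released_in_scope` are hypothetical names.

```python
# Task states after each event, transcribed from the diagrams (A2 is an assumed label).
SNAPSHOTS = [
    ("initial", {"A1": "forgotten", "A2": "forgotten", "B1": "flight", "C1": "waiting"}),
    ("free-keys / cancel B1", {"A1": "forgotten", "A2": "forgotten", "B1": "cancelled", "C1": "forgotten"}),
    ("compute-task B1", {"A1": "released", "A2": "released", "B1": "resumed", "C1": "forgotten"}),
    ("gather-dep finishes w/ Error", {"A1": "fetch", "A2": "fetch", "B1": "waiting", "C1": "forgotten"}),
]

# Assumed reading of the arrows: A1 and A2 are dependencies of B1.
DEPENDENCIES = {"B1": ["A1", "A2"]}


def released_in_scope(states, key="B1"):
    """Return the keys among `key` and its dependencies whose state is 'released'."""
    scope = [key, *DEPENDENCIES.get(key, [])]
    return [k for k in scope if states[k] == "released"]


for event, states in SNAPSHOTS:
    print(f"{event}: released = {released_in_scope(states)}")
```

Under these assumptions, only the `compute-task B1` snapshot has keys in `released` (A1 and A2), and they have moved to `fetch` by the last snapshot.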

mrocklin · 2022-05-13T19:51:17Z

Nothing here concerns me. The biggest change is the transition change, and that was from Guido anyway. The failing test is concerning, but it seems like it's just shining a light on something that was broken before. Merging.

gjoseph92

This was merged midway through my review, but it also looks good to me. The failing test is just the 0.4% of cases where test_stress_scatter_death is known to still fail, which is a regression that still needs to be addressed (#6305).

mrocklin · 2022-05-13T20:17:09Z

Ah, my apologies for jumping ahead.

crusaderky changed the title ~~Infinite transition loop~~ Infinite released->missing transition loop May 11, 2022

crusaderky self-assigned this May 11, 2022

crusaderky force-pushed the test_scatter_death branch from 0a088bf to 91886ad Compare May 11, 2022 20:54

crusaderky added 2 commits May 12, 2022 12:32

test cleanup

87ff61b

Fix infinite transition loop

3ef5ba7

crusaderky force-pushed the test_scatter_death branch from ce5476c to 4de1517 Compare May 12, 2022 14:23

crusaderky commented May 12, 2022

View reviewed changes

Validation

9408cab

Self-review Self-review self-review

crusaderky force-pushed the test_scatter_death branch from f0425fa to 9408cab Compare May 12, 2022 15:40

crusaderky marked this pull request as ready for review May 12, 2022 20:28

This was referenced May 12, 2022

Release 2022.05.1 dask/community#245

Closed

test_stress_scatter_death #6305

Closed

Transition table as a ClassVar #6331

Merged

Prevent infinite transition loops; more aggressive validate_state() #6318

Merged

Merge branch 'main' into test_scatter_death

49d9a5b

validate

8adc739

crusaderky commented May 13, 2022

View reviewed changes

crusaderky added 2 commits May 13, 2022 12:08

handle_compute_task to log ts state when it finds it

070d2ef

more readable keys

329d1b0

crusaderky changed the title ~~Infinite released->missing transition loop~~ Validate and debug state machine on handle_compute_task May 13, 2022

Merge branch 'main' into test_scatter_death

79402f2

mrocklin merged commit 79d5a77 into dask:main May 13, 2022

gjoseph92 reviewed May 13, 2022

View reviewed changes

crusaderky deleted the test_scatter_death branch May 13, 2022 20:41

fjetter mentioned this pull request May 18, 2022

Remove wrong assert in handle compute #6370

Merged


