@fail_hard can kill the whole test suite; hide errors #6474

crusaderky merged 1 commit into dask:main

Conversation
```python
total = c.submit(sum, L)
result = await total
assert result == sum(map(inc, range(10)))
kill_task = asyncio.create_task(n.kill())
```
- Do not leave uncollected task at the end of the test
- Test separately the use case of failed worker -> other workers vs. other workers -> failed worker
This test seems very timing-sensitive: how far through is the `n.kill()` task, and what state is the Nanny in, by the time `Client._gather_remote()` gets called?
E.g., is the test still valid if `asyncio.gather()` is used to communicate the concurrency here?
```python
compute_addr = n.worker_address if compute_on_failed else a.address

async def compute_total():
    return await c.submit(sum, L, workers=[compute_addr], allow_other_workers=True)

total, _ = await asyncio.gather(compute_total(), n.kill())
assert total == sum(range(1, 11))
```
The test is indeed very timing-sensitive, and as I discovered elsewhere it can take a good 100+ runs to trigger the unexpected behaviour. I don't plan to fix this in the scope of this PR; the change here is just to clean up the code.
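Since the race only surfaces intermittently, one way to probe it is to stress-run the scenario many times, as described above. A minimal sketch of that idea (all names here are hypothetical stand-ins, not the actual distributed test):

```python
import asyncio
import random

async def racy_scenario():
    # Stand-in for the real submit/kill race: the random sleeps mimic
    # scheduling jitter between the compute task and the kill task.
    async def compute():
        await asyncio.sleep(random.uniform(0, 0.001))
        return sum(range(1, 11))

    async def kill():
        await asyncio.sleep(random.uniform(0, 0.001))

    total, _ = await asyncio.gather(compute(), kill())
    return total

async def stress(n=100):
    # Timing-sensitive races may need "a good 100+" repetitions to surface,
    # so run the scenario many times and collect every outcome.
    return [await racy_scenario() for _ in range(n)]

results = asyncio.run(stress())
assert all(r == 55 for r in results)
```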
```python
    or not self.batched_stream.comm
    or self.batched_stream.comm.closed()
):
    return  # pragma: nocover
```
Unit Test Results

15 files ±0, 15 suites ±0, 6h 12m 15s ⏱️ -21m 39s

For more details on these failures, see this check.

Results for commit 3efcf2c. ± Comparison against base commit 5feb171.

♻️ This comment has been updated with latest results.
```python
# deadlocks the cluster.
if not self.nanny:
    # We're likely in a unit test. Don't kill the whole test suite!
    raise
```
There are many cases where folks don't use a nanny in production, but we would still want to raise. "no nanny" doesn't imply "in tests".
Instead, I recommend checking a config value here and then setting that config value in testing infrastructure.
Alternatively, maybe we set a global in utils_test for TESTING or something like that.
Also alternatively, self.validate is probably a good signal that we're in tests today.
Added check for self.validate
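The resulting guard might look roughly like this toy sketch (`MiniWorker` and `handle_fatal_error` are hypothetical stand-ins, not the real `distributed.Worker` code):

```python
class MiniWorker:
    """Toy stand-in for distributed.Worker (names and logic are illustrative)."""

    def __init__(self, validate=False, nanny=None):
        self.validate = validate  # set to True by the testing infrastructure
        self.nanny = nanny

    def handle_fatal_error(self, exc):
        # In tests, re-raise so pytest reports the real traceback instead of
        # the whole suite dying with an opaque `exit 1`.
        if self.validate:
            raise exc
        # In production, rely on the nanny (if any) to restart the process.
        return "restart-via-nanny" if self.nanny else "exit"

# Production-like worker: the error is swallowed and the nanny restarts it.
prod = MiniWorker(validate=False, nanny=object())
assert prod.handle_fatal_error(ValueError("boom")) == "restart-via-nanny"

# Test-suite worker: the error is surfaced to the caller.
caught = False
try:
    MiniWorker(validate=True).handle_fatal_error(ValueError("boom"))
except ValueError:
    caught = True
assert caught
```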
```python
) from e


def validate_state(self):
    if self.status not in WORKER_ANY_RUNNING:
```
Can this not happen anymore?
Yes, it does happen after @fail_hard closes the worker, and that's why I'm removing it.
With the early return in place, a test may be green even if a worker has suicided.
In production, the cluster must be transparently resilient to failures on individual workers (that's the philosophy behind fail_hard); in tests, it really really shouldn't be.
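The effect of dropping that early return can be sketched with a toy model (`MiniWorkerState` and its fields are hypothetical, not the real `distributed` internals):

```python
class MiniWorkerState:
    """Illustrative stand-in for worker state validation."""

    def __init__(self, status, tasks=()):
        self.status = status
        self.tasks = dict.fromkeys(tasks)

    def validate_state(self):
        # With no early `if self.status not in WORKER_ANY_RUNNING: return`
        # guard, validation also runs on closed workers, so a worker that
        # suicided via fail_hard fails the test instead of passing silently.
        if self.status == "closed" and self.tasks:
            raise AssertionError("closed worker still holds tasks")
        return True

# A healthy running worker validates fine.
assert MiniWorkerState("running", ["x"]).validate_state()

# A closed worker with leftover state now surfaces as a test failure.
caught_invalid = False
try:
    MiniWorkerState("closed", ["x"]).validate_state()
except AssertionError:
    caught_invalid = True
assert caught_invalid
```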
Before this change:

- When `@fail_hard` fires in the test suite, the whole test suite is killed with an opaque `exit 1`.
- When a worker under `@gen_cluster` has been closed for whatever reason - namely, by `@fail_hard` - it is spared from state validation, potentially resulting in a green test.