@fail_hard can kill the whole test suite; hide errors #6474

crusaderky merged 1 commit into dask:main

Conversation
```python
total = c.submit(sum, L)
result = await total
assert result == sum(map(inc, range(10)))
kill_task = asyncio.create_task(n.kill())
```
- Do not leave uncollected task at the end of the test
- Test separately the use case of failed worker -> other workers vs. other workers -> failed worker
This test seems very timing-sensitive: how far through is the `n.kill()` task, and what state is the Nanny in, by the time `Client._gather_remote()` gets called?
E.g., is the test still valid if `asyncio.gather()` is used to communicate the concurrency here?
```python
compute_addr = n.worker_address if compute_on_failed else a.address

async def compute_total():
    return await c.submit(sum, L, workers=[compute_addr], allow_other_workers=True)

total, _ = await asyncio.gather(compute_total(), n.kill())
assert total == sum(range(1, 11))
```
The test is indeed very timing-sensitive, and as I discovered elsewhere it can take a good 100+ runs to trigger the unexpected behaviour. I don't plan to fix this in the scope of this PR; the change here is just to clean up the code.
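Since the race only surfaces intermittently, one way to probe it is to stress-run the scenario many times, as described above. A minimal sketch of that idea (all names here are hypothetical stand-ins, not the actual distributed test):

```python
import asyncio
import random

async def racy_scenario():
    # Stand-in for the real submit/kill race: the random sleeps mimic
    # scheduling jitter between the compute task and the kill task.
    async def compute():
        await asyncio.sleep(random.uniform(0, 0.001))
        return sum(range(1, 11))

    async def kill():
        await asyncio.sleep(random.uniform(0, 0.001))

    total, _ = await asyncio.gather(compute(), kill())
    return total

async def stress(n=100):
    # Timing-sensitive races may need "a good 100+" repetitions to surface,
    # so run the scenario many times and collect every outcome.
    return [await racy_scenario() for _ in range(n)]

results = asyncio.run(stress())
assert all(r == 55 for r in results)
```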
```python
    or not self.batched_stream.comm
    or self.batched_stream.comm.closed()
):
    return  # pragma: nocover
```
Unit Test Results

15 files ±0, 15 suites ±0, 6h 12m 15s ⏱️ -21m 39s

For more details on these failures, see this check.

Results for commit 3efcf2c. ± Comparison against base commit 5feb171.

♻️ This comment has been updated with latest results.
```python
# deadlocks the cluster.
if not self.nanny:
    # We're likely in a unit test. Don't kill the whole test suite!
    raise
```
There are many cases where folks don't use a nanny in production, but we would still want to raise. "no nanny" doesn't imply "in tests".
Instead, I recommend checking a config value here and then setting that config value in testing infrastructure.
Alternatively, maybe we set a global in utils_test for TESTING or something like that.
Also alternatively, self.validate is probably a good signal that we're in tests today.
Added check for self.validate
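The resulting guard might look roughly like this toy sketch (`MiniWorker` and `handle_fatal_error` are hypothetical stand-ins, not the real `distributed.Worker` code):

```python
class MiniWorker:
    """Toy stand-in for distributed.Worker (names and logic are illustrative)."""

    def __init__(self, validate=False, nanny=None):
        self.validate = validate  # set to True by the testing infrastructure
        self.nanny = nanny

    def handle_fatal_error(self, exc):
        # In tests, re-raise so pytest reports the real traceback instead of
        # the whole suite dying with an opaque `exit 1`.
        if self.validate:
            raise exc
        # In production, rely on the nanny (if any) to restart the process.
        return "restart-via-nanny" if self.nanny else "exit"

# Production-like worker: the error is swallowed and the nanny restarts it.
prod = MiniWorker(validate=False, nanny=object())
assert prod.handle_fatal_error(ValueError("boom")) == "restart-via-nanny"

# Test-suite worker: the error is surfaced to the caller.
caught = False
try:
    MiniWorker(validate=True).handle_fatal_error(ValueError("boom"))
except ValueError:
    caught = True
assert caught
```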
```python
) from e


def validate_state(self):
    if self.status not in WORKER_ANY_RUNNING:
```
Can this not happen anymore?
Yes, it does happen after @fail_hard closes the worker, and that's why I'm removing it.
With the early return in place, a test may be green even if a worker has suicided.
In production, the cluster must be transparently resilient to failures on individual workers (that's the philosophy behind fail_hard); in tests, it really really shouldn't be.
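The effect of dropping that early return can be sketched with a toy model (`MiniWorkerState` and its fields are hypothetical, not the real `distributed` internals):

```python
class MiniWorkerState:
    """Illustrative stand-in for worker state validation."""

    def __init__(self, status, tasks=()):
        self.status = status
        self.tasks = dict.fromkeys(tasks)

    def validate_state(self):
        # With no early `if self.status not in WORKER_ANY_RUNNING: return`
        # guard, validation also runs on closed workers, so a worker that
        # suicided via fail_hard fails the test instead of passing silently.
        if self.status == "closed" and self.tasks:
            raise AssertionError("closed worker still holds tasks")
        return True

# A healthy running worker validates fine.
assert MiniWorkerState("running", ["x"]).validate_state()

# A closed worker with leftover state now surfaces as a test failure.
caught_invalid = False
try:
    MiniWorkerState("closed", ["x"]).validate_state()
except AssertionError:
    caught_invalid = True
assert caught_invalid
```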
Before this change:

- When `@fail_hard` fires in the test suite, the whole test suite is killed with an opaque `exit 1`.
- When a worker under `@gen_cluster` has been closed for whatever reason - namely, by `@fail_hard` - it is spared from state validation, potentially resulting in a green test.