fail_hard on BaseExceptions in user tasks#7330
Conversation
|
I was curious why this test passed on main with |
|
@graingert curious if you have any ideas of better ways to do this. The path is basically: distributed/distributed/worker_state_machine.py Lines 3651 to 3654 in 69b74f9 distributed/distributed/worker_state_machine.py Line 3667 in 69b74f9 The distributed/distributed/worker_state_machine.py Lines 3608 to 3613 in 69b74f9 So the problem is that Broadly, the problem is: distributed/distributed/worker_state_machine.py Line 3609 in 69b74f9 Any unhandled exception in a worker async instruction would be dropped, and probably cause a deadlock. If feels to me like handling these errors should be the responsibility of the |
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 18 files + 3 18 suites +3 0s ⏱️ - 6h 36m 59s For more details on these failures, see this check. Results for commit 65c78f0. ± Comparison against base commit 69b74f9. |
|
As you can see, this breaks CI because it kills the pytest process at a couple of tests: I think this is indicating some weakness in how the worker state machine design handles errors in its async tasks. For For But broadly, since we don't have a structured way of handling exceptions in async tasks in the worker state machine (such as |
|
I think #5997 may be a better approach. Closing for now. |
If user code raises a
BaseException, it kills the whole worker (viafail_hard).Semantically, I'm still back and forth whether this makes more sense or #5997. But this is probably less controversial.
Closes #5958
pre-commit run --all-files