Prevent infinite transition loops; more aggressive validate_state()#6318
Prevent infinite transition loops; more aggressive validate_state()#6318crusaderky merged 4 commits intodask:mainfrom
Conversation
Unit Test Results 15 files - 1 15 suites - 1 7h 2m 0s ⏱️ - 27m 11s For more details on these failures, see this check. Results for commit 35a5568. ± Comparison against base commit 9f02e7a. ♻️ This comment has been updated with latest results. |
be9bbe2 to
3cececc
Compare
|
https://github.com/dask/distributed/runs/6386832455?check_suite_focus=true now shows the issue in #6305 Note: this PR is likely to increase flakiness, and I think it's a good thing, as it will crop up a lot of previously undetected issues which are likely to cause deadlocks somewhere else. |
d84b201 to
35a5568
Compare
gjoseph92
left a comment
There was a problem hiding this comment.
Though a little odd to expose in top-level config, this seems like a good thing to have and great to have in all tests.
To be clear, this doesn't actually fix #6248 (comment), just causes it to be caught more obviously in tests?
| # catch potential infinite recursions | ||
| self.transition_counter += 1 | ||
| if self.validate and self.transition_counter_max: | ||
| assert self.transition_counter < self.transition_counter_max |
There was a problem hiding this comment.
Might be nice if this raised something like a TransitionCounterMaxExceeded error, to be consistent with workers
There was a problem hiding this comment.
yeah, but scheduler doesn't have anything like InvalidTransitionError on the worker
Correct |
|
@crusaderky can we make the 1-line fix for #6305 now, or do you want to see this take effect in CI first to confirm it's working? |
See #6327 |
fjetter
left a comment
There was a problem hiding this comment.
I would like to have a conversation about this. I think this limit should be removed again.
I don't see a huge value in it and would prefer having fewer configuration options, even for tests alone. Whenever we may write stress tests, etc. we'd need to adjust this limit (we can already see a few cases where the "default" limit appears to not be sufficient).
| # Cause scheduler and workers to break if they reach this many transitions. | ||
| # Used to debug infinite transition loops. | ||
| # Note: setting this will cause healthy long-running services to eventually break. | ||
| transition-counter-max: False |
There was a problem hiding this comment.
I think we should try to not clutter this file with stuff we only use for internal tests. Apart from tests I don't see the usefulness of having a global limit on the number of transitions
There was a problem hiding this comment.
Indeed, this is for tests only.
Partially closes #6305
#6305 introduces an infinite loop in the worker transitions.
Such loop never releases the GIL,
which means that the
@gen_clustertimeout doesn't work,which means that after 5 minutes pytest_timeout kicks in,
which means that you get no logs and no cluster dump whatsoever.
This PR: