Close workers if an exception occurs during transition#4735
Close workers if an exception occurs during transition#4735
Conversation
|
During worker refactoring, I've just stared at the stdout of the tests to catch transition errors, or inserted breakpoints into troublesome spots (I mention this only in support of tests catching these errors because the manual method is obviously prone to failures). My assumption was that the nanny was restarting worker processes that errored out and that allowed tests to succeed even if 20% of the time a given task entered an invalid state. Maybe we should parametrize some of the worker tests to explicitly disable the nanny? |
Everything using So far, I do not believe that corrupt workers are actually closing. Maybe there are code paths where this is is true but I am not aware of any. I think there must be another reason why we do not spot this. My suspicion is at the moment that we're abusing Regardless of why tests do not catch this properly, I believe closing like this is also a valid behaviour in real world situations to avoid deadlocks since the "drop exception (almost) silently" just masks the problem. As I stated above, I think we can no longer guarantee a valid state if a transition fails and must therefore close (or otherwise reconstruct the proper state but that's also just prone to failure) |
|
Was implemented in #6210 |
I'm still not sure why things like #4721 do not pop up in our tests. These things are merely logged but we somehow loose the exception. Maybe that's because workers restart or other workers pick up and we're lucky. Anyhow, I realised that we should probably not even try to recover anything in these cases and just draw a very hard line. Most of our transition methods are not atomic and if something during the transaction breaks, we cannot really guarantee the validity of the state. Easiest way to start from scratch is to close gracefully in these situations.
Opinions?
cc @gforsyth
xref #4413