Close workers if an exception occurs during transition by fjetter · Pull Request #4735 · dask/distributed

fjetter · 2021-04-22T13:11:26Z

I'm still not sure why things like #4721 do not pop up in our tests. These things are merely logged but we somehow loose the exception. Maybe that's because workers restart or other workers pick up and we're lucky. Anyhow, I realised that we should probably not even try to recover anything in these cases and just draw a very hard line. Most of our transition methods are not atomic and if something during the transaction breaks, we cannot really guarantee the validity of the state. Easiest way to start from scratch is to close gracefully in these situations.

Opinions?

cc @gforsyth

xref #4413

gforsyth · 2021-04-22T13:27:28Z

During worker refactoring, I've just stared at the stdout of the tests to catch transition errors, or inserted breakpoints into troublesome spots (I mention this only in support of tests catching these errors because the manual method is obviously prone to failures).

My assumption was that the nanny was restarting worker processes that errored out and that allowed tests to succeed even if 20% of the time a given task entered an invalid state.

Maybe we should parametrize some of the worker tests to explicitly disable the nanny?

fjetter · 2021-04-22T13:36:51Z

Maybe we should parametrize some of the worker tests to explicitly disable the nanny?

Everything using gen_cluster by default isn't using Nanny since it is much easier to work with it all if there is no extra process in the way. Default is also False for the cluster ctx manager and these two are mostly used for tests. If anyhing, I would've argued that we're not covering Nanny enough :)

So far, I do not believe that corrupt workers are actually closing. Maybe there are code paths where this is is true but I am not aware of any. I think there must be another reason why we do not spot this. My suspicion is at the moment that we're abusing loop.add_callback too often without handling exception properly, see also #4734

Regardless of why tests do not catch this properly, I believe closing like this is also a valid behaviour in real world situations to avoid deadlocks since the "drop exception (almost) silently" just masks the problem. As I stated above, I think we can no longer guarantee a valid state if a transition fails and must therefore close (or otherwise reconstruct the proper state but that's also just prone to failure)

fjetter · 2022-04-29T16:10:08Z

Was implemented in #6210

Close workers if an exception occurs during transition

2f60857

fjetter mentioned this pull request Apr 13, 2022

Add back Worker.transition_fetch_missing #6112

Merged

3 tasks

fjetter closed this Apr 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Close workers if an exception occurs during transition#4735

Close workers if an exception occurs during transition#4735
fjetter wants to merge 1 commit intodask:mainfrom
fjetter:close_worker_transition_exception

fjetter commented Apr 22, 2021

Uh oh!

gforsyth commented Apr 22, 2021

Uh oh!

fjetter commented Apr 22, 2021

Uh oh!

fjetter commented Apr 29, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

fjetter commented Apr 22, 2021

Uh oh!

gforsyth commented Apr 22, 2021

Uh oh!

fjetter commented Apr 22, 2021

Uh oh!

fjetter commented Apr 29, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants