Open
Labels
bug: Something is broken
Description
There are a number of problems in `SpecCluster._correct_state_internal`, to the point that I believe it should be rewritten:
- If one worker fails during startup, all workers are rejected. This can cause the cluster to spin up too many workers
- Similarly, while closing, if one worker fails but others properly shut down, they are not removed from the internal state
- `correct_state_internal` is only called once, i.e. any kind of exception would abort the entire up/downscaling without further attempt to correct the state.
- If the cluster is closing while `correct_state` is running, nothing is actually cancelled.
- `self._correct_state_waiting` is actually never cancelled.
- `SpecCluster.scale` schedules a callback to `self._correct_state`. This can cause all sorts of race conditions, e.g. by creating more futures even if the cluster is already closing
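
To illustrate the first point, here is a minimal sketch in plain asyncio. It is not the actual `distributed` code; `start_worker` and `start_all` are hypothetical names. It shows how startup could keep the workers that did come up instead of rejecting all of them when one fails:

```python
import asyncio


async def start_worker(name, fail=False):
    """Hypothetical stand-in for starting a single worker."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"worker {name} failed to start")
    return name


async def start_all(specs):
    """Start every worker in `specs` (name -> should_fail).

    Returns (started, failed) rather than raising on the first error,
    so one bad worker does not discard the ones that started fine.
    """
    names = list(specs)
    results = await asyncio.gather(
        *(start_worker(n, fail=specs[n]) for n in names),
        return_exceptions=True,  # collect exceptions instead of propagating
    )
    started, failed = {}, {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failed[name] = result
        else:
            started[name] = result
    return started, failed


started, failed = asyncio.run(start_all({"w0": False, "w1": True, "w2": False}))
```

With this shape, only the failed entries would need to be retried or dropped, which should avoid requesting more workers than the spec asks for.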
This list could probably go on, and most issues could be addressed individually, but I believe we're better off rewriting this section.
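
As one possible direction for such a rewrite, the sketch below (plain asyncio, with a hypothetical `Reconciler` class that is not part of `distributed`) tracks the single in-flight state-correction task so that closing can cancel it, and refuses to schedule new work once closing has begun:

```python
import asyncio


class Reconciler:
    """Hypothetical sketch: one tracked 'correct state' task, cancellable on close."""

    def __init__(self):
        self._task = None
        self.closing = False
        self.completed = 0

    async def _correct_state(self):
        await asyncio.sleep(3600)  # stands in for slow worker startup
        self.completed += 1

    def scale(self):
        # Do not create new futures once the cluster is closing.
        if self.closing:
            return None
        # Reuse the in-flight task instead of stacking up callbacks.
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._correct_state())
        return self._task

    async def close(self):
        self.closing = True
        if self._task is not None and not self._task.done():
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass  # expected: we cancelled it ourselves


async def demo():
    r = Reconciler()
    task = r.scale()
    await asyncio.sleep(0)  # let the task start running
    await r.close()
    return task.cancelled(), r.scale(), r.completed


cancelled, rescheduled, completed = asyncio.run(demo())
```

This only covers the cancellation and scheduling races; retrying after an exception would need to be layered on top.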
cc @graingert