SpecCluster correct state can cause inconsistencies #5919

There are enough problems in `SpecCluster._correct_state_internal` that I believe it should be rewritten:

* If one worker fails during startup, all workers are rejected, including those that started successfully. This can cause the cluster to spin up too many workers.
* Similarly, while closing, if one worker fails but others shut down properly, the ones that did shut down are not removed from the internal state.
* `_correct_state_internal` is only called once, i.e. any kind of exception aborts the entire up/downscaling without any further attempt to correct the state.
* If the cluster is closing while `_correct_state` is running, nothing is actually cancelled.
* `self._correct_state_waiting` is never cancelled.
* `SpecCluster.scale` schedules a callback to `self._correct_state`. This can cause all sorts of race conditions, e.g. by creating more futures even though the cluster is already closing.
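
To illustrate the first two points, here is a minimal sketch of per-worker outcome handling using plain `asyncio`. It is not the `SpecCluster` code: `FakeWorker`, `start_workers` and `demo` are hypothetical names made up for this example. The idea is only that `asyncio.gather(..., return_exceptions=True)` lets us keep the workers that succeeded and record the ones that failed, instead of rejecting the whole batch.

```python
import asyncio


class FakeWorker:
    """Hypothetical stand-in for a worker object; not part of distributed."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.started = False

    async def start(self):
        await asyncio.sleep(0)  # placeholder for the real startup work
        if self.fail:
            raise RuntimeError(f"worker {self.name} failed to start")
        self.started = True
        return self


async def start_workers(workers):
    """Start all workers and keep the ones that succeed.

    Returns (started, failed): name -> worker and name -> exception.
    """
    # return_exceptions=True means one failure does not discard the
    # results of the workers that started fine.
    results = await asyncio.gather(
        *(w.start() for w in workers), return_exceptions=True
    )
    started, failed = {}, {}
    for worker, result in zip(workers, results):
        if isinstance(result, BaseException):
            failed[worker.name] = result
        else:
            started[worker.name] = worker
    return started, failed


def demo():
    workers = [FakeWorker("a"), FakeWorker("b", fail=True), FakeWorker("c")]
    return asyncio.run(start_workers(workers))
```

The same pattern would apply to closing: workers that shut down cleanly can be dropped from the internal state even if a sibling raised.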



This list could probably go on. Most of these issues _could_ be addressed individually, but I believe we're better off rewriting this section.
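
For discussion, a rough sketch of the shape a rewrite could take, again in plain `asyncio` and under my own assumptions. `Reconciler` and all of its attributes are hypothetical and not an existing distributed API. The points it tries to show: `scale` only records the desired state and wakes a single long-lived task, that task keeps reconciling until the actual state matches the target, and `close` cancels it explicitly.

```python
import asyncio


class Reconciler:
    """Hypothetical sketch; names are illustrative, not the distributed API."""

    def __init__(self):
        self.target = 0   # desired number of workers
        self.running = 0  # actual number of workers
        self._wakeup = asyncio.Event()
        self._task = None

    def start(self):
        # One long-lived task owns all state changes.
        self._task = asyncio.create_task(self._loop())

    def scale(self, n):
        # Only record intent and wake the loop; no futures are created here.
        self.target = n
        self._wakeup.set()

    async def _loop(self):
        while True:
            await self._wakeup.wait()
            self._wakeup.clear()
            while self.running != self.target:
                # Placeholder for starting or stopping a single worker.
                await asyncio.sleep(0)
                self.running += 1 if self.running < self.target else -1

    async def close(self):
        # Closing cancels the reconcile task instead of leaving it pending.
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass


async def demo():
    r = Reconciler()
    r.start()
    r.scale(3)
    for _ in range(20):  # give the loop a few turns to converge
        await asyncio.sleep(0)
    reached = r.running
    await r.close()
    return reached, r._task.cancelled()
```

This leaves out error handling and retries on purpose; it is only meant to show a single owner for the state plus explicit cancellation on close.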

cc @graingert 
