SpecCluster correct state can cause inconsistencies #5919

@fjetter

There are a number of problems in SpecCluster._correct_state_internal, to the point that I believe it should be rewritten:

  • If one worker fails during startup, all workers are rejected. This can cause the cluster to spin up too many workers.
  • Similarly, while closing, if one worker fails but the others shut down properly, the ones that did shut down are not removed from the internal state.
  • _correct_state_internal is only called once, i.e. any exception aborts the entire up- or downscaling without any further attempt to correct the state.
  • If the cluster is closing while correct_state is running, nothing is actually cancelled.
  • self._correct_state_waiting is never actually cancelled.
  • SpecCluster.scale schedules a callback to self._correct_state. This can cause all sorts of race conditions, e.g. by creating more futures even if the cluster is already closing.
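
To illustrate the first two points, here is a minimal sketch of per-worker failure handling using plain asyncio. It is not the SpecCluster implementation: `FakeWorker`, `start_workers` and `demo` are hypothetical names made up for this example. The idea is that one failing worker should not discard the workers that did start.

```python
import asyncio


class FakeWorker:
    """Hypothetical stand-in for a worker object; not part of distributed."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.started = False

    async def start(self):
        await asyncio.sleep(0)  # yield to the event loop once
        if self.fail:
            raise RuntimeError(f"{self.name} failed to start")
        self.started = True
        return self


async def start_workers(workers):
    """Start all workers concurrently and keep the ones that succeed."""
    # return_exceptions=True makes gather return exceptions as results
    # instead of raising the first one and hiding the other outcomes.
    results = await asyncio.gather(
        *(w.start() for w in workers), return_exceptions=True
    )
    running, failed = {}, {}
    for worker, result in zip(workers, results):
        if isinstance(result, BaseException):
            failed[worker.name] = result
        else:
            running[worker.name] = worker
    return running, failed


def demo():
    workers = [FakeWorker("a"), FakeWorker("b", fail=True), FakeWorker("c")]
    return asyncio.run(start_workers(workers))
```

With this shape, the caller can record the workers in `running` in its internal state and decide separately whether to retry the ones in `failed`. The same pattern would apply to closing workers.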

This list could probably go on, and most issues could be addressed individually, but I believe we're better off rewriting this section.
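
For the cancellation and race-condition points, a rewrite could keep track of a single pending task and refuse to schedule new work once closing has started. The following is only a sketch under that assumption: `Reconciler`, `schedule` and `close` are hypothetical names, not existing distributed APIs.

```python
import asyncio


class Reconciler:
    """Hypothetical helper that owns at most one state-correction task."""

    def __init__(self):
        self._task = None
        self.closing = False

    def schedule(self, coro_func):
        # Refuse new work once closing has begun, so no futures are
        # created for a cluster that is already shutting down.
        if self.closing:
            return None
        # Reuse the in-flight task instead of stacking up new ones.
        if self._task is None or self._task.done():
            self._task = asyncio.get_running_loop().create_task(coro_func())
        return self._task

    async def close(self):
        self.closing = True
        if self._task is not None and not self._task.done():
            self._task.cancel()
            # Wait for the cancellation to be processed; the task's
            # CancelledError is returned as a result rather than raised.
            await asyncio.gather(self._task, return_exceptions=True)


async def demo():
    reconciler = Reconciler()

    async def slow_correct_state():
        await asyncio.sleep(3600)

    task = reconciler.schedule(slow_correct_state)
    await asyncio.sleep(0)  # let the task start running
    await reconciler.close()
    late = reconciler.schedule(slow_correct_state)  # rejected after close
    return task.cancelled(), late
```

The point of the design is that there is exactly one place that owns the task, so closing has something concrete to cancel and `scale`-style callers cannot create work after shutdown has begun.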

cc @graingert

Labels: bug (Something is broken)
