Nexus should restart `Failed` instances when `boot_on_fault` says to #6491

Depends on #6455 (and probably also #6490). 

Per RFD 486:
> An instance’s `boot_on_fault` discipline tells Nexus whether to try to recover after retiring a failed VMM. The options are to do nothing (the default) or to try to restart the instance automatically.

We should implement that.

We could potentially schedule a new start saga for an instance as part of the update saga that transitions it to `Failed`. Whether or not we do that, there should definitely be an RPW responsible for periodically listing instances that are in the `Failed` state and whose `boot_on_fault` discipline indicates they should be restarted, and for ensuring that a start saga is started for each of them. Update sagas that have transitioned an instance to `Failed` could then just activate that background task.
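To make the shape of that background task concrete, here is a rough sketch of what one activation might do. Everything in it is hypothetical and not taken from the Nexus codebase: the `BootOnFault`, `InstanceState`, and `Instance` types, the `needs_restart` and `activate` functions, and the in-memory `in_flight` set standing in for whatever mechanism would actually track running start sagas. The real task would query the database and kick off sagas instead of pushing IDs into a `Vec`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an instance's `boot_on_fault` discipline.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BootOnFault {
    /// Do nothing after a failure (the default).
    DoNothing,
    /// Try to restart the instance automatically.
    Restart,
}

/// Hypothetical, heavily simplified set of instance states.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
    Failed,
}

/// Hypothetical instance record; a real one would carry far more.
#[allow(dead_code)]
#[derive(Clone, Debug)]
struct Instance {
    id: u32,
    state: InstanceState,
    boot_on_fault: BootOnFault,
}

/// An instance is a restart candidate only if it is `Failed` *and* its
/// discipline asks for an automatic restart.
fn needs_restart(instance: &Instance) -> bool {
    instance.state == InstanceState::Failed
        && instance.boot_on_fault == BootOnFault::Restart
}

/// One activation of the task. Returns the IDs for which a start saga
/// would be newly requested. `in_flight` (an assumption of this sketch)
/// records instances that already have a start saga pending, so that
/// repeated activations do not request duplicates.
fn activate(instances: &[Instance], in_flight: &mut HashSet<u32>) -> Vec<u32> {
    let mut started = Vec::new();
    for instance in instances.iter().filter(|i| needs_restart(i)) {
        // `HashSet::insert` returns false if the ID was already present.
        if in_flight.insert(instance.id) {
            started.push(instance.id);
        }
    }
    started
}
```

The point of the sketch is that an activation is idempotent: running it periodically and also having update sagas poke it after moving an instance to `Failed` should be harmless, since a second activation finds nothing new to start.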
