Depends on #6455 (and probably also #6490).
Per RFD 486:
An instance’s boot_on_fault discipline tells Nexus whether to try to recover after retiring a failed VMM. The options are to do nothing (the default) or to try to restart the instance automatically.
We should implement that.
We could potentially schedule a new start saga for an instance as part of the update saga that transitions it to Failed. Regardless of whether we do that, there should definitely be an RPW responsible for periodically listing instances that are in the Failed state and have boot_on_fault disciplines indicating that they should be restarted, and for ensuring that a start saga is started for each of those instances. Update sagas that have transitioned an instance to Failed could then just activate that background task.
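As a rough illustration of what one activation of such a background task might do, here is a minimal, self-contained sketch. All of the names in it (`Instance`, `InstanceState`, `BootOnFault`, `activate_restart_task`) are hypothetical stand-ins and not Nexus's actual types or APIs, and the `in_flight` set only models the idea of not starting a second start saga for an instance that already has one. A real task would presumably query the database and hand work to the saga executor.

```rust
use std::collections::HashSet;

// Hypothetical, simplified instance states; only `Failed` matters here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
    Failed,
}

// Hypothetical encoding of the two boot_on_fault options from RFD 486:
// do nothing (the default) or try to restart automatically.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BootOnFault {
    DoNothing,
    Restart,
}

#[derive(Clone, Debug)]
struct Instance {
    id: u64,
    state: InstanceState,
    boot_on_fault: BootOnFault,
}

// An instance is a restart candidate only if it is Failed *and* its
// discipline asks for an automatic restart.
fn needs_restart(instance: &Instance) -> bool {
    instance.state == InstanceState::Failed
        && instance.boot_on_fault == BootOnFault::Restart
}

// One activation of the (hypothetical) background task: list the candidates
// and record a start saga for each one that does not already have one in
// flight. Returns the ids for which a new start saga was requested.
fn activate_restart_task(
    instances: &[Instance],
    in_flight: &mut HashSet<u64>,
) -> Vec<u64> {
    let mut started = Vec::new();
    for instance in instances.iter().filter(|i| needs_restart(i)) {
        // `insert` returns false if the id was already present, so a
        // repeated activation does not request a duplicate saga.
        if in_flight.insert(instance.id) {
            started.push(instance.id);
        }
    }
    started
}
```

Because the task re-derives its work list on every activation, it does not matter whether it runs on its periodic timer or because an update saga explicitly activated it after moving an instance to Failed; either way, the same set of instances ends up with a start saga.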