When an instance-start saga unwinds, any VMM it created transitions to the SagaUnwound state. This causes the instance's effective state to appear as Failed in the external API. PR #6503 added functionality to Nexus to automatically restart instances that are in the Failed state ("instance reincarnation"). However, this code will not automatically restart instances whose instance-start sagas have unwound, because such instances are not actually in the Failed state from Nexus' perspective.
We should be automatically restarting instances whose start sagas have failed. From the user's perspective, the purpose of instance reincarnation is "if I say that I would like this instance to be running, the system should make sure it is running" --- it doesn't really matter if the reason it's not running is because it couldn't be started, or because it was running and then something happened to it. Furthermore, it seems not great to say "instances in the Failed state will be automatically restarted" and then have some class of instances which appear to be in the Failed state in the UI but aren't actually automatically restarted. Therefore, we should add code to the instance-reincarnation RPW to reincarnate instances with SagaUnwound VMMs.
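To make the mismatch concrete, here is a minimal sketch in Rust. All of the types and function names below are hypothetical simplifications invented for illustration, not the actual Nexus definitions. It assumes that the external API derives the visible state from both the instance record and its active VMM, while the current reincarnation check looks only at the instance record.

```rust
// Hypothetical, simplified stand-ins for the Nexus state types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InstanceState {
    NoVmm,
    Vmm,
    Failed,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VmmState {
    Starting,
    Running,
    Stopped,
    Failed,
    SagaUnwound,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ExternalState {
    Starting,
    Running,
    Stopped,
    Failed,
}

struct Instance {
    state: InstanceState,
    active_vmm: Option<VmmState>,
    auto_restart: bool,
}

// Assumed mapping: a SagaUnwound VMM is presented externally as Failed.
fn effective_state(i: &Instance) -> ExternalState {
    match (i.state, i.active_vmm) {
        (InstanceState::Failed, _) => ExternalState::Failed,
        (_, Some(VmmState::Failed)) | (_, Some(VmmState::SagaUnwound)) => {
            ExternalState::Failed
        }
        (_, Some(VmmState::Starting)) => ExternalState::Starting,
        (_, Some(VmmState::Running)) => ExternalState::Running,
        (_, Some(VmmState::Stopped)) | (_, None) => ExternalState::Stopped,
    }
}

// Assumed shape of the current check: only the instance record is consulted.
fn reincarnates_today(i: &Instance) -> bool {
    i.auto_restart && i.state == InstanceState::Failed
}

// Proposed check: also pick up instances whose active VMM is SagaUnwound.
fn reincarnates_proposed(i: &Instance) -> bool {
    i.auto_restart
        && (i.state == InstanceState::Failed
            || i.active_vmm == Some(VmmState::SagaUnwound))
}
```

Under these assumptions, an instance with a SagaUnwound VMM reports `Failed` externally but is skipped by `reincarnates_today`, whereas `reincarnates_proposed` selects it.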
I don't really think it makes sense for the auto-restart cooldown period to apply to SagaUnwound instances, even if they have failed previously. The purpose of the cooldown period is to prevent "repeat offenders" that fail repeatedly from compromising overall availability, especially when those instances fail in ways that affect other instances on the same sled. However, if an instance's start saga fails, it was never actually running [1], so the failure was not due to a problem with the instance. Therefore, I think the cooldown period need not apply here. We may want to consider a separate cooldown period between start failures, to prevent an instance that, for some reason, consistently fails to start (e.g. due to requesting an allocation of virtual resources which aren't actually available) from constantly being restarted in a hot loop...
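One way the cooldown decision could look is sketched below in Rust. Every name here (`FailureKind`, `Candidate`, `Policy`, `should_restart_now`) is hypothetical, and the optional start-failure backoff is only the possibility floated above, not an existing Nexus setting.

```rust
use std::time::Duration;

// Hypothetical classification of why an instance is eligible for reincarnation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FailureKind {
    // The instance was running and then failed.
    Crashed,
    // The start saga unwound, so the instance never ran.
    StartUnwound,
}

struct Candidate {
    kind: FailureKind,
    // Time since the last automatic restart, if there has been one.
    since_last_auto_restart: Option<Duration>,
    // Time since the most recent failure (crash or unwound start).
    since_failure: Duration,
}

struct Policy {
    // Existing auto-restart cooldown, applied to crashed instances.
    crash_cooldown: Duration,
    // Assumed, optional backoff between start failures to avoid a hot loop.
    start_retry_backoff: Option<Duration>,
}

fn should_restart_now(c: &Candidate, p: &Policy) -> bool {
    match c.kind {
        // Crashed instances must wait out the normal cooldown.
        FailureKind::Crashed => c
            .since_last_auto_restart
            .map_or(true, |elapsed| elapsed >= p.crash_cooldown),
        // Unwound starts skip the normal cooldown; only the separate
        // backoff, if one is configured, delays the retry.
        FailureKind::StartUnwound => match p.start_retry_backoff {
            Some(backoff) => c.since_failure >= backoff,
            None => true,
        },
    }
}
```

Keeping the two delays separate means a crash-looping instance is still throttled by the existing cooldown, while an instance whose start merely unwound is retried promptly, subject only to whatever backoff is chosen for start failures.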
Footnotes
[1] The last action in the instance-start saga is to send a request to the sled-agent, and thus Propolis, to tell it to start trying to boot the instance. If the instance boots and then crashes, that will go through the normal cpapi_instances_put -> process_instance_update path that transitions the VMM to Failed rather than putting it in SagaUnwound.