Add VmmState::SagaUnwound#5855
Conversation
gjcolombo
left a comment
There was a problem hiding this comment.
This looks right to me (I'm happy to recheck it once the state enum split merges and the PR is targeting main). Just a couple of small comments.
0c55ef3 to
1d91b71
Compare
This branch is part of ongoing work on the `instance-update` saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely when these changes remain un-merged for a long time. --- Depends on #5854. As part of #5749, it is desirable to distinguish between *why* a VMM record was marked as `Destroyed`, in order to determine which saga is responsible for cleaning up that VMM's resources. The `instance-update` saga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, the `instance-start` and `instance-migrate` sagas may also do this when they unwind. This is a requirement to avoid situations where a failed `instance-update` saga cannot unwind, because the instance's generation number has changed while the saga was executing. We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without having to also guarantee that an instance update has occurred, we introduce a new VMM state, `SagaUnwound`, with the following rules: - It is legal to start or migrate an instance if the `active_propolis_id` or `destination_propolis_id` (respectively) is either `NULL` or refers to a VMM that’s in the `SagaUnwound` state (the new VMM ID directly replaces the old ID). - If an instance has an `active_propolis_id` in the `SagaUnwound` state, then the instance appears to be `Stopped`. - If an instance has a `destination_propolis_id` in the `SagaUnwound` state, nothing special happens–the instance’s state is still derived from its active VMM’s state. - The update saga treats `SagaUnwound` VMMs as identical to `Destroyed` VMMs for purposes of deciding whether to remove a VMM ID from an instance. This branch adds a new `VmmState::SagaUnwound` variant. The `SagaUnwound` state is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set my the `instance-start` and `instance-migrate` sagas when they unwind. When determining the API instance state from an instance and active VMM query so that the `SagaUnwound` state is mapped to `Destroyed`. Closes #5848, which this replaces.
gjcolombo
left a comment
There was a problem hiding this comment.
LGTM, thanks for the change!
I think I still accept my past self's reason for introducing this state (to recap for other readers: we want to introduce an "instance state update" task; that task requires the exclusive right to clear an instance's Propolis ID fields; unwinding start/migrate sagas therefore need a way to mark that a VMM needs to be GC'ed by the update task). That said, it's sort of a bummer that there are now two states that represent a stopped instance (InstanceState::NoVmm and InstanceState::Vmm + VmmState::SagaUnwound). As we author and review the update task, it might be worth a little extra elbow grease to make absolutely sure we can't avoid the "updater task is the only thing that can set Propolis IDs to NULL" restriction. But I'm confident enough in that requirement that I'm good with merging this PR.
This branch is part of ongoing work on the
instance-updatesaga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely when these changes remain un-merged for a long time.Depends on #5854 and should merge only after that PR.
As part of #5749, it is desirable to distinguish between why a VMM record was marked as
Destroyed, in order to determine which saga is responsible for cleaning up that VMM's resources. Theinstance-updatesaga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, theinstance-startandinstance-migratesagas may also do this when they unwind. This is a requirement to avoid situations where a failedinstance-updatesaga cannot unwind, because the instance's generation number has changed while the saga was executing.We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without having to also guarantee that an instance update has occurred, we introduce a new VMM state,
SagaUnwound, with the following rules:active_propolis_idordestination_propolis_id(respectively) is eitherNULLor refers to a VMM that’s in theSagaUnwoundstate (the new VMM ID directly replaces the old ID).active_propolis_idin theSagaUnwoundstate, then the instance appears to beStopped.destination_propolis_idin theSagaUnwoundstate, nothing special happens–the instance’s state is still derived from its active VMM’s state.SagaUnwoundVMMs as identical toDestroyedVMMs for purposes of deciding whether to remove a VMM ID from an instance.This branch adds a new
VmmState::SagaUnwoundvariant. TheSagaUnwoundstate is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set my theinstance-startandinstance-migratesagas when they unwind. When determining the API instance state from an instance and active VMM query so that theSagaUnwoundstate is mapped toDestroyed.Closes #5848, which this replaces.