Split instance state enum into instance/VMM state enums#5854
Conversation
|
In addition to fixing up clippy, I want to see if I can add some DB constraints to require that an instance is in the |
| @@ -733,7 +735,7 @@ impl super::Nexus { | |||
| _ => { | |||
| return Err(Error::conflict(&format!( | |||
| "instance is in state {} but must be {} to be started", | |||
There was a problem hiding this comment.
nit: perhaps this should explicitly say
| "instance is in state {} but must be {} to be started", | |
| "VMM is in state {} but must be {} to be started", |
given that we're no longer referencing the public instance state type here?
There was a problem hiding this comment.
My concern here is that I think this message is externally-visible (i.e. you will get it back as context if you make a start call that hits this error path), and we don't really have a first-class notion of a "VMM" in the external API. In cases like these the active VMM's state is the instance's state by definition, so I think it makes sense to refer to it as an instance state.
(That said, I know I've written other error messages that don't hold to the "don't talk about VMMs" rule, so it's probably just a suggestion at most...)
There was a problem hiding this comment.
Hmm, that's fair. I guess I feel like it's a bit weird to be talking about an "instance" being in a state which is different from the externally visible list of "instance states" in the OpenAPI spec, but...I dunno.
There was a problem hiding this comment.
Good point--if this is an external message it should refer to the externally-visible state names, not our inside-baseball data model names.
It's a bit crude, but we could just wrap the VMM state in an InstanceState::from before displaying it; that will at least produce the state name that the caller would have seen if it had done an instance_get that read the same instance/VMM states that the match operated on to make this decision. WDYT?
| | DbVmmState::Stopped | ||
| | DbVmmState::Failed => { | ||
| Err(Error::invalid_request(format!( | ||
| "cannot connect to serial console of instance in state \"{}\"", |
There was a problem hiding this comment.
nit: again, perhaps this should say something like
| "cannot connect to serial console of instance in state \"{}\"", | |
| "cannot connect to serial console of instance with VMM in in state \"{}\"", |
given that we are referencing the VMM's state here rather than the instance's state?
There was a problem hiding this comment.
See above--in this case the VMM's state is the instance's state from an external caller's point of view.
| PropolisState::Failed => VmmState::Failed, | ||
| PropolisState::Destroyed => VmmState::Destroyed, | ||
| PropolisState::Repairing => { | ||
| unreachable!("Propolis doesn't use the Repairing state") |
There was a problem hiding this comment.
this match is kind of making me wonder whether we really even need to have separate PropolisInstanceState and VmmState types --- they're almost entirely the same except that PropolisInstanceState distinguishes between Creating and Starting, and has a Repairing state that apparently doesn't exist...
There was a problem hiding this comment.
I don't plan to make any changes along these lines in this PR, but a few things came to mind as I chewed on this comment:
-
I think it's worthwhile to have some play in the joints between the way Omicron represents VMM states and the way Propolis represents its own state--they're currently designed to track each other very closely, but I think I'd rather have flexibility (of being able to change the types Propolis exposes without changing the Omicron data model) we don't need than to need flexibility we don't have.
-
This specific case in the simulated sled agent is... weird. I'd say 90% of it is emulating Propolis's state machine, and 10% of it is enforcing rules about the state transitions Nexus is expected to request (see e.g. lines 106-118 of this file). I've long dreamt of getting rid of most of this logic and replacing it with a library/stub version of propolis-server's actual state machine (so that we don't have to keep the two state machines in sync) but haven't figured out how to do that yet.
-
The Repairing state is indeed unused. We (Oxide writ large, not just the two of us) should revisit RFD 315 to figure out whether we want to have this state at all. If we don't need it, we should cut it from Propolis's API definition, after which it can disappear here. (I used
unreachable!in this case because this is the simulation--IIRC in the real sled agent code we're prepared to accept a Repairing state from Propolis but will just convert it to a VMM state of Running.)
There was a problem hiding this comment.
Yup, okay, that makes sense. Thanks for summarizing your thoughts on this for me --- carry on!
I suppose it wouldn't make sense to add a SagaUnwound state to Propolis' type, now that I think of it!
|
The new test failure appears to be a minor test bug (there's a test that assumes it can move an instance to the |
| @@ -733,7 +735,7 @@ impl super::Nexus { | |||
| _ => { | |||
| return Err(Error::conflict(&format!( | |||
| "instance is in state {} but must be {} to be started", | |||
There was a problem hiding this comment.
My concern here is that I think this message is externally-visible (i.e. you will get it back as context if you make a start call that hits this error path), and we don't really have a first-class notion of a "VMM" in the external API. In cases like these the active VMM's state is the instance's state by definition, so I think it makes sense to refer to it as an instance state.
(That said, I know I've written other error messages that don't hold to the "don't talk about VMMs" rule, so it's probably just a suggestion at most...)
| | DbVmmState::Stopped | ||
| | DbVmmState::Failed => { | ||
| Err(Error::invalid_request(format!( | ||
| "cannot connect to serial console of instance in state \"{}\"", |
There was a problem hiding this comment.
See above--in this case the VMM's state is the instance's state from an external caller's point of view.
| "instance {} is in vmm state but has no valid vmm id", | ||
| authz_instance.id(), | ||
| ), | ||
| ))); | ||
| } | ||
| (InstanceState::Vmm, Some(VmmState::Destroyed)) => { | ||
| return Err(ActionError::action_failed(Error::internal_error( | ||
| &format!( | ||
| "instance {} points to destroyed vmm", |
hawkw
left a comment
There was a problem hiding this comment.
This looks good to me!
I do kind of think that we might want to have the ability to distinguish between "VMM processes" and "VM instances" in operator-facing error messages etc, given that things like migration are concepts that the operator is aware of. If migration is an operator-facing concept and not a totally hidden implementation detail, then the operator is implicitly aware that there is "something" that a VM migrates from/migrates to. But, that feels like a discussion for a later date, and I'm fine with leaving the error messages the way they are for now.
| PropolisState::Failed => VmmState::Failed, | ||
| PropolisState::Destroyed => VmmState::Destroyed, | ||
| PropolisState::Repairing => { | ||
| unreachable!("Propolis doesn't use the Repairing state") |
There was a problem hiding this comment.
Yup, okay, that makes sense. Thanks for summarizing your thoughts on this for me --- carry on!
I suppose it wouldn't make sense to add a SagaUnwound state to Propolis' type, now that I think of it!
This branch is part of ongoing work on the `instance-update` saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely when these changes remain un-merged for a long time. --- Depends on #5854. As part of #5749, it is desirable to distinguish between *why* a VMM record was marked as `Destroyed`, in order to determine which saga is responsible for cleaning up that VMM's resources. The `instance-update` saga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, the `instance-start` and `instance-migrate` sagas may also do this when they unwind. This is a requirement to avoid situations where a failed `instance-update` saga cannot unwind, because the instance's generation number has changed while the saga was executing. We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without having to also guarantee that an instance update has occurred, we introduce a new VMM state, `SagaUnwound`, with the following rules: - It is legal to start or migrate an instance if the `active_propolis_id` or `destination_propolis_id` (respectively) is either `NULL` or refers to a VMM that’s in the `SagaUnwound` state (the new VMM ID directly replaces the old ID). - If an instance has an `active_propolis_id` in the `SagaUnwound` state, then the instance appears to be `Stopped`. - If an instance has a `destination_propolis_id` in the `SagaUnwound` state, nothing special happens–the instance’s state is still derived from its active VMM’s state. - The update saga treats `SagaUnwound` VMMs as identical to `Destroyed` VMMs for purposes of deciding whether to remove a VMM ID from an instance. This branch adds a new `VmmState::SagaUnwound` variant. The `SagaUnwound` state is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set my the `instance-start` and `instance-migrate` sagas when they unwind. When determining the API instance state from an instance and active VMM query so that the `SagaUnwound` state is mapped to `Destroyed`. Closes #5848, which this replaces.
| } | ||
|
|
||
| match collection.runtime_state.nexus_state { | ||
| state if SAFE_TRANSIENT_INSTANCE_STATES.contains(&state) => Error::unavail(&format!( |
There was a problem hiding this comment.
It looks like we're now delegating this part of the check to sagas::instance_common::instance_ip_get_instance_state, so this removal is fine I think. It allows through Rebooting now, but to be fair I only disallowed it because I wasn't sure how the VMM would handle that case -- happy to see that working!
There was a problem hiding this comment.
I meant to leave a review note here--thanks for looking at this part!
I reckoned that this check could never actually pass, because instances (in the DB) can never actually reach any of the SAFE_TRANSIENT_INSTANCE_STATES (only their VMMs do), but wasn't sure if it was actually safe not to check for this case (beyond the fact that we weren't already checking for it). If instance_ip_get_instance_state will cover it, I'll leave this case alone for now. (Otherwise we'll need to figure out how to get detach_resource to query and return VMM state along with the instance state, a problem that I've thought about before but have so far managed to defer actually solving...)
gjcolombo
left a comment
There was a problem hiding this comment.
Thanks @FelixMcFelix!
| } | ||
|
|
||
| match collection.runtime_state.nexus_state { | ||
| state if SAFE_TRANSIENT_INSTANCE_STATES.contains(&state) => Error::unavail(&format!( |
There was a problem hiding this comment.
I meant to leave a review note here--thanks for looking at this part!
I reckoned that this check could never actually pass, because instances (in the DB) can never actually reach any of the SAFE_TRANSIENT_INSTANCE_STATES (only their VMMs do), but wasn't sure if it was actually safe not to check for this case (beyond the fact that we weren't already checking for it). If instance_ip_get_instance_state will cover it, I'll leave this case alone for now. (Otherwise we'll need to figure out how to get detach_resource to query and return VMM state along with the instance state, a problem that I've thought about before but have so far managed to defer actually solving...)
This defines the new instance_state_v2/vmm_state enums and the update process to reach them. The dbinit_equals_sum_of_all_up test passes but I haven't checked anything else yet. Notably, the order of column definitions in the instance/vmm tables changed; this needs to be fixed in the Rust definitions.
Note that the *types* are still wrong (still need to define the enum types for instance_state_v2 and vmm); this just gets the columns into the right places.
As usual, some things are TODO'ed out around the openapi definitions. Other little issues here: - there are some bits of logic (tagged with REVIEW) where the code wasn't being crisp about whether it was looking at instance or VMM states. these will need some review to make sure I haven't broken them - is omicron_common::api::internal::nexus (and not ::shared) the correct place for the "internal API" version of the VMM state enum? Nevertheless: next thing here is to fix up all the OpenAPI stuff and then get sled agent building again.
The helper function that determines whether an instance IP attach/detach operation should send the IP change to the instance's current sled was a bit too permissive about how it handled the "Starting" state. Tighten this up, then fix the attach/detach tests so that they drive instances to the Running state (as they intend) before attaching/detaching any IPs.
The new "instances are in the VMM state iff they have a Propolis ID" check breaks db::queries::network_interface::tests::test_insert_running_instance_fails. This test suite assumes it can move an instance to the Vmm state and assign it a Propolis ID in separate statements. That's no longer allowed; do it in one statement instead.
0c55ef3 to
1d91b71
Compare
|
Hmm, not sure what's up with this test failure: https://buildomat.eng.oxide.computer/wg/0/details/01HZMKJFJEPQRXVX8KHZV391KT/ID1W3HcSco3isAWva2AKktziTeRN1d6V1vxdX2OXvM9BLTMZ/01HZMKJSQ5ZG32EC6JCAHZJ7CK#S4574 |
|
I missed one bit of schema change when merging with latest main (and then optimistically pushed without actually running tests on my local). Will have a fix up for that shortly (actually re-running the tests this time). |
This branch is part of ongoing work on the `instance-update` saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely when these changes remain un-merged for a long time. --- Depends on #5854. As part of #5749, it is desirable to distinguish between *why* a VMM record was marked as `Destroyed`, in order to determine which saga is responsible for cleaning up that VMM's resources. The `instance-update` saga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, the `instance-start` and `instance-migrate` sagas may also do this when they unwind. This is a requirement to avoid situations where a failed `instance-update` saga cannot unwind, because the instance's generation number has changed while the saga was executing. We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without having to also guarantee that an instance update has occurred, we introduce a new VMM state, `SagaUnwound`, with the following rules: - It is legal to start or migrate an instance if the `active_propolis_id` or `destination_propolis_id` (respectively) is either `NULL` or refers to a VMM that’s in the `SagaUnwound` state (the new VMM ID directly replaces the old ID). - If an instance has an `active_propolis_id` in the `SagaUnwound` state, then the instance appears to be `Stopped`. - If an instance has a `destination_propolis_id` in the `SagaUnwound` state, nothing special happens–the instance’s state is still derived from its active VMM’s state. - The update saga treats `SagaUnwound` VMMs as identical to `Destroyed` VMMs for purposes of deciding whether to remove a VMM ID from an instance. This branch adds a new `VmmState::SagaUnwound` variant. The `SagaUnwound` state is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set my the `instance-start` and `instance-migrate` sagas when they unwind. When determining the API instance state from an instance and active VMM query so that the `SagaUnwound` state is mapped to `Destroyed`. Closes #5848, which this replaces.
This branch is part of ongoing work on the `instance-update` saga (see PR #5749); I've factored it out into a separate PR. This is largely because this branch makes mechanical changes to a bunch of different files that aren't directly related to the core change of #5749, and I'd like to avoid the additional merge conflicts that are likely when these changes remain un-merged for a long time. --- Depends on #5854 and should merge only after that PR. As part of #5749, it is desirable to distinguish between *why* a VMM record was marked as `Destroyed`, in order to determine which saga is responsible for cleaning up that VMM's resources. The `instance-update` saga must be the only entity that can set an instance's VMM IDs (active and target) to NULL. Presently, however, the `instance-start` and `instance-migrate` sagas may also do this when they unwind. This is a requirement to avoid situations where a failed `instance-update` saga cannot unwind, because the instance's generation number has changed while the saga was executing. We want to ensure that if a start or migrate request fails, the instance will appear to be in the correct post-state as soon as the relevant API call returns. In order to do this, without having to also guarantee that an instance update has occurred, we introduce a new VMM state, `SagaUnwound`, with the following rules: - It is legal to start or migrate an instance if the `active_propolis_id` or `destination_propolis_id` (respectively) is either `NULL` or refers to a VMM that’s in the `SagaUnwound` state (the new VMM ID directly replaces the old ID). - If an instance has an `active_propolis_id` in the `SagaUnwound` state, then the instance appears to be `Stopped`. - If an instance has a `destination_propolis_id` in the `SagaUnwound` state, nothing special happens–the instance’s state is still derived from its active VMM’s state. - The update saga treats `SagaUnwound` VMMs as identical to `Destroyed` VMMs for purposes of deciding whether to remove a VMM ID from an instance. This branch adds a new `VmmState::SagaUnwound` variant. The `SagaUnwound` state is an internal implementation detail that shouldn't be exposed to an operator or in the external API. Sled-agents will never report that a VMM is in this state. Instead, this state is set my the `instance-start` and `instance-migrate` sagas when they unwind. When determining the API instance state from an instance and active VMM query so that the `SagaUnwound` state is mapped to `Destroyed`. Closes #5848, which this replaces.
A number of bugs relating to guest instance lifecycle management have been observed. These include: - Instances getting "stuck" in a transient state, such as `Starting` or `Stopping`, with no way to forcibly terminate them (#4004) - Race conditions between instances starting and receiving state updates, which cause provisioning counters to underflow (#5042) - Instances entering and exiting the `Failed` state when nothing is actually wrong with them, potentially leaking virtual resources (#4226) These typically require support intervention to resolve. Broadly , these issues exist because the control plane's current mechanisms for understanding and managing an instance's lifecycle state machine are "kind of a mess". In particular: - **(Conceptual) ownership of the CRDB `instance` record is currently split between Nexus and sled-agent(s).** Although Nexus is the only entity that actually reads or writes to the database, the instance's runtime state is also modified by the sled-agents that manage its active Propolis (and, if it's migrating, it's target Propolis), and written to CRDB on their behalf by Nexus. This means that there are multiple copies of the instance's state in different places at the same time, which can potentially get out of sync. When an instance is migrating, its state is updated by two different sled-agents, and they may potentially generate state updates that conflict with each other. And, splitting the responsibility between Nexus and sled-agent makes the code more complex and harder to understand: there is no one place where all instance state machine transitions are performed. - **Nexus doesn't ensure that instance state updates are processed reliably.** Instance state transitions triggered by user actions, such as `instance-start` and `instance-delete`, are performed by distributed sagas, ensuring that they run to completion even if the Nexus instance executing them comes to an untimely end. This is *not* the case for operations that result from instance state transitions reported by sled-agents, which just happen in the HTTP APIs for reporting instance states. If the Nexus processing such a transition crashes, gets network partition'd, or encountering a transient error, the instance is left in an incomplete state and the remainder of the operation will not be performed. This branch rewrites much of the control plane's instance state management subsystem to resolve these issues. At a high level, it makes the following high-level changes: - **Nexus is now the sole owner of the `instance` record.** Sled-agents no longer have their own copies of an instance's `InstanceRuntimeState`, and do not generate changes to that state when reporting instance observations to Nexus. Instead, the sled-agent only publishes updates to the `vmm` and `migration` records (which are never modified by Nexus directly) and Nexus is the only entity responsible for determining how an instance's state should change in response to a VMM or migration state update. - **When an instance has an active VMM, its effective external state is determined primarily by the active `vmm` record**, so that fewer state transitions *require* changes to the `instance` record. PR #5854 laid the ground work for this change, but it's relevant here as well. - **All updates to an `instance` record (and resources conceptually owned by that instance) are performed by a distributed saga.** I've introduced a new `instance-update` saga, which is responsible for performing all changes to the `instance` record, virtual provisioning resources, and instance network config that are performed as part of a state transition. Moving this to a saga helps us to ensure that these operations are always run to completion, even in the event of a sudden Nexus death. - **Consistency of instance state changes is ensured by distributed locking.** State changes may be published by multiple sled-agents to different Nexus replicas. If one Nexus replica is processing a state change received from a sled-agent, and then the instance's state changes again, and the sled-agent publishes that state change to a *different* Nexus...lots of bad things can happen, since the second state change may be performed from the previous initial state, when it *should* have a "happens-after" relationship with the other state transition. And, some operations may contradict each other when performed concurrently. To prevent these race conditions, this PR has the dubious honor of using the first _distributed lock_ in the Oxide control plane, the "instance updater lock". I introduced the locking primitives in PR #5831 --- see that branch for more discussion of locking. - **Background tasks are added to prevent missed updates**. To ensure we cannot accidentally miss an instance update even if a Nexus dies, hits a network partition, or just chooses to eat the state update accidentally, we add a new `instance-updater` background task, which queries the database for instances that are in states that require an update saga without such a saga running, and starts the requisite sagas. Currently, the instance update saga runs in the following cases: - An instance's active VMM transitions to `Destroyed`, in which case the instance's virtual resources are cleaned up and the active VMM is unlinked. - Either side of an instance's live migration reports that the migration has completed successfully. - Either side of an instance's live migration reports that the migration has failed. The inner workings of the instance-update saga itself is fairly complex, and has some kind of interesting idiosyncrasies relative to the existing sagas. I've written up a [lengthy comment] that provides an overview of the theory behind the design of the saga and its principles of operation, so I won't reproduce that in this commit message. [lengthy comment]: https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
Split the existing
instance_stateenum in the CRDB schema into separate instance state and VMM state enums and remove unused enum variants. Instances now have five states: Creating, NoVmm, Vmm, Failed, and Destroyed. VMMs have most of the states they had before, except that the unused Creating and Repairing states have been removed. This change makes it easier to add new states (e.g. SagaUnwound, see #5848) that apply only to instances/VMMs without having to update code that's working with the other type of DB record.Update the routines the IP attach/detach sagas use to decide where (if anywhere) to dispatch IP change messages. These routines were looking for instance states that weren't actually being used (the states of interest appear in the instances' active VMMs instead).
Add a data migration test to make sure that the Stopped and Running instance states are converted as expected.
Tests:
cargo nextest.