Skip to content

Split instance state enum into instance/VMM state enums#5854

Merged
gjcolombo merged 22 commits into
mainfrom
gjcolombo/separate-instance-vmm-state-enums
Jun 5, 2024
Merged

Split instance state enum into instance/VMM state enums#5854
gjcolombo merged 22 commits into
mainfrom
gjcolombo/separate-instance-vmm-state-enums

Conversation

@gjcolombo

Copy link
Copy Markdown
Contributor

Split the existing instance_state enum in the CRDB schema into separate instance state and VMM state enums and remove unused enum variants. Instances now have five states: Creating, NoVmm, Vmm, Failed, and Destroyed. VMMs have most of the states they had before, except that the unused Creating and Repairing states have been removed. This change makes it easier to add new states (e.g. SagaUnwound, see #5848) that apply only to instances/VMMs without having to update code that's working with the other type of DB record.

Update the routines the IP attach/detach sagas use to decide where (if anywhere) to dispatch IP change messages. These routines were looking for instance states that weren't actually being used (the states of interest appear in the instances' active VMMs instead).

Add a data migration test to make sure that the Stopped and Running instance states are converted as expected.

Tests: cargo nextest.

@gjcolombo

Copy link
Copy Markdown
Contributor Author

In addition to fixing up clippy, I want to see if I can add some DB constraints to require that an instance is in the Vmm state only if it has a non-NULL active_propolis_id. I'll move this out of draft once that's done.

@hawkw hawkw self-requested a review June 4, 2024 18:15
Comment thread nexus/src/app/instance.rs
@@ -733,7 +735,7 @@ impl super::Nexus {
_ => {
return Err(Error::conflict(&format!(
"instance is in state {} but must be {} to be started",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps this should explicitly say

Suggested change
"instance is in state {} but must be {} to be started",
"VMM is in state {} but must be {} to be started",

given that we're no longer referencing the public instance state type here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern here is that I think this message is externally-visible (i.e. you will get it back as context if you make a start call that hits this error path), and we don't really have a first-class notion of a "VMM" in the external API. In cases like these the active VMM's state is the instance's state by definition, so I think it makes sense to refer to it as an instance state.

(That said, I know I've written other error messages that don't hold to the "don't talk about VMMs" rule, so it's probably just a suggestion at most...)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's fair. I guess I feel like it's a bit weird to be talking about an "instance" being in a state which is different from the externally visible list of "instance states" in the OpenAPI spec, but...I dunno.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point--if this is an external message it should refer to the externally-visible state names, not our inside-baseball data model names.

It's a bit crude, but we could just wrap the VMM state in an InstanceState::from before displaying it; that will at least produce the state name that the caller would have seen if it had done an instance_get that read the same instance/VMM states that the match operated on to make this decision. WDYT?

Comment thread nexus/db-queries/src/db/queries/network_interface.rs Outdated
Comment thread nexus/src/app/instance.rs
| DbVmmState::Stopped
| DbVmmState::Failed => {
Err(Error::invalid_request(format!(
"cannot connect to serial console of instance in state \"{}\"",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: again, perhaps this should say something like

Suggested change
"cannot connect to serial console of instance in state \"{}\"",
"cannot connect to serial console of instance with VMM in in state \"{}\"",

given that we are referencing the VMM's state here rather than the instance's state?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above--in this case the VMM's state is the instance's state from an external caller's point of view.

Comment thread nexus/src/app/sagas/instance_common.rs Outdated
Comment thread nexus/src/app/sagas/instance_common.rs Outdated
Comment thread nexus/tests/integration_tests/schema.rs
Comment thread schema/crdb/separate-instance-and-vmm-states/README.adoc Outdated
Comment thread schema/crdb/separate-instance-and-vmm-states/README.adoc Outdated
Comment thread schema/crdb/separate-instance-and-vmm-states/up04.sql
Comment on lines +303 to +315
PropolisState::Failed => VmmState::Failed,
PropolisState::Destroyed => VmmState::Destroyed,
PropolisState::Repairing => {
unreachable!("Propolis doesn't use the Repairing state")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this match is kind of making me wonder whether we really even need to have separate PropolisInstanceState and VmmState types --- they're almost entirely the same except that PropolisInstanceState distinguishes between Creating and Starting, and has a Repairing state that apparently doesn't exist...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't plan to make any changes along these lines in this PR, but a few things came to mind as I chewed on this comment:

  • I think it's worthwhile to have some play in the joints between the way Omicron represents VMM states and the way Propolis represents its own state--they're currently designed to track each other very closely, but I think I'd rather have flexibility (of being able to change the types Propolis exposes without changing the Omicron data model) we don't need than to need flexibility we don't have.

  • This specific case in the simulated sled agent is... weird. I'd say 90% of it is emulating Propolis's state machine, and 10% of it is enforcing rules about the state transitions Nexus is expected to request (see e.g. lines 106-118 of this file). I've long dreamt of getting rid of most of this logic and replacing it with a library/stub version of propolis-server's actual state machine (so that we don't have to keep the two state machines in sync) but haven't figured out how to do that yet.

  • The Repairing state is indeed unused. We (Oxide writ large, not just the two of us) should revisit RFD 315 to figure out whether we want to have this state at all. If we don't need it, we should cut it from Propolis's API definition, after which it can disappear here. (I used unreachable! in this case because this is the simulation--IIRC in the real sled agent code we're prepared to accept a Repairing state from Propolis but will just convert it to a VMM state of Running.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, okay, that makes sense. Thanks for summarizing your thoughts on this for me --- carry on!

I suppose it wouldn't make sense to add a SagaUnwound state to Propolis' type, now that I think of it!

@gjcolombo gjcolombo marked this pull request as ready for review June 4, 2024 19:18
@gjcolombo gjcolombo requested a review from FelixMcFelix June 4, 2024 19:18
@gjcolombo

Copy link
Copy Markdown
Contributor Author

The new test failure appears to be a minor test bug (there's a test that assumes it can move an instance to the Vmm state and then assign its active Propolis ID in separate operations, but the new instance table constraint forbids this; evidently I missed running this test before pushing). Will fix presently.

@gjcolombo gjcolombo left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing, @hawkw! (I have one more comment to get to but most of them are addressed in this batch.)

Comment thread nexus/src/app/instance.rs
@@ -733,7 +735,7 @@ impl super::Nexus {
_ => {
return Err(Error::conflict(&format!(
"instance is in state {} but must be {} to be started",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern here is that I think this message is externally-visible (i.e. you will get it back as context if you make a start call that hits this error path), and we don't really have a first-class notion of a "VMM" in the external API. In cases like these the active VMM's state is the instance's state by definition, so I think it makes sense to refer to it as an instance state.

(That said, I know I've written other error messages that don't hold to the "don't talk about VMMs" rule, so it's probably just a suggestion at most...)

Comment thread nexus/db-queries/src/db/queries/network_interface.rs Outdated
Comment thread nexus/src/app/instance.rs
| DbVmmState::Stopped
| DbVmmState::Failed => {
Err(Error::invalid_request(format!(
"cannot connect to serial console of instance in state \"{}\"",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above--in this case the VMM's state is the instance's state from an external caller's point of view.

Comment thread nexus/src/app/sagas/instance_common.rs Outdated
Comment thread nexus/src/app/sagas/instance_common.rs Outdated
Comment thread nexus/src/app/sagas/instance_common.rs Outdated
Comment on lines +321 to +329
"instance {} is in vmm state but has no valid vmm id",
authz_instance.id(),
),
)));
}
(InstanceState::Vmm, Some(VmmState::Destroyed)) => {
return Err(ActionError::action_failed(Error::internal_error(
&format!(
"instance {} points to destroyed vmm",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread schema/crdb/separate-instance-and-vmm-states/README.adoc Outdated
Comment thread schema/crdb/separate-instance-and-vmm-states/README.adoc Outdated

@hawkw hawkw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

I do kind of think that we might want to have the ability to distinguish between "VMM processes" and "VM instances" in operator-facing error messages etc, given that things like migration are concepts that the operator is aware of. If migration is an operator-facing concept and not a totally hidden implementation detail, then the operator is implicitly aware that there is "something" that a VM migrates from/migrates to. But, that feels like a discussion for a later date, and I'm fine with leaving the error messages the way they are for now.

Comment on lines +303 to +315
PropolisState::Failed => VmmState::Failed,
PropolisState::Destroyed => VmmState::Destroyed,
PropolisState::Repairing => {
unreachable!("Propolis doesn't use the Repairing state")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, okay, that makes sense. Thanks for summarizing your thoughts on this for me --- carry on!

I suppose it wouldn't make sense to add a SagaUnwound state to Propolis' type, now that I think of it!

hawkw added a commit that referenced this pull request Jun 4, 2024
This branch is part of ongoing work on the `instance-update` saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.

---

Depends on #5854.

As part of #5749, it is desirable to distinguish between *why* a VMM
record was marked as `Destroyed`, in order to determine which saga is
responsible for cleaning up that VMM's resources. The `instance-update`
saga must be the only entity that can set an instance's VMM IDs (active
and target) to NULL. Presently, however, the `instance-start` and
`instance-migrate` sagas may also do this when they unwind. This is a
requirement to avoid situations where a failed `instance-update` saga
cannot unwind, because the instance's generation number has changed
while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance
will appear to be in the correct post-state as soon as the relevant API
call returns. In order to do this, without having to also guarantee that
an instance update has occurred, we introduce a new VMM state,
`SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the
  `active_propolis_id` or `destination_propolis_id` (respectively) is
  either `NULL` or refers to a VMM that’s in the `SagaUnwound` state
  (the new VMM ID directly replaces the old ID).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state,
  then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound`
  state, nothing special happens–the instance’s state is still derived
  from its active VMM’s state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed`
  VMMs for purposes of deciding whether to remove a VMM ID from an
  instance.

This branch adds a new `VmmState::SagaUnwound` variant. The
`SagaUnwound` state is an internal implementation detail that shouldn't
be exposed to an operator or in the external API. Sled-agents will never
report that a VMM is in this state. Instead, this state is set my the
`instance-start` and `instance-migrate` sagas when they unwind. When
determining the API instance state from an instance and active VMM query
so that the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
@hawkw hawkw mentioned this pull request Jun 4, 2024

@FelixMcFelix FelixMcFelix left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for putting this together -- nits and truncated docs aside, I'm happy with how this interacts with IP attach/detach. The expanded commentary on the Starting state in particular is great.

}

match collection.runtime_state.nexus_state {
state if SAFE_TRANSIENT_INSTANCE_STATES.contains(&state) => Error::unavail(&format!(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we're now delegating this part of the check to sagas::instance_common::instance_ip_get_instance_state, so this removal is fine I think. It allows through Rebooting now, but to be fair I only disallowed it because I wasn't sure how the VMM would handle that case -- happy to see that working!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to leave a review note here--thanks for looking at this part!

I reckoned that this check could never actually pass, because instances (in the DB) can never actually reach any of the SAFE_TRANSIENT_INSTANCE_STATES (only their VMMs do), but wasn't sure if it was actually safe not to check for this case (beyond the fact that we weren't already checking for it). If instance_ip_get_instance_state will cover it, I'll leave this case alone for now. (Otherwise we'll need to figure out how to get detach_resource to query and return VMM state along with the instance state, a problem that I've thought about before but have so far managed to defer actually solving...)

Comment thread nexus/src/app/sagas/instance_common.rs
Comment thread nexus/src/app/sagas/instance_common.rs Outdated
Comment thread schema/crdb/dbinit.sql Outdated

@gjcolombo gjcolombo left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @FelixMcFelix!

Comment thread nexus/src/app/sagas/instance_common.rs
}

match collection.runtime_state.nexus_state {
state if SAFE_TRANSIENT_INSTANCE_STATES.contains(&state) => Error::unavail(&format!(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to leave a review note here--thanks for looking at this part!

I reckoned that this check could never actually pass, because instances (in the DB) can never actually reach any of the SAFE_TRANSIENT_INSTANCE_STATES (only their VMMs do), but wasn't sure if it was actually safe not to check for this case (beyond the fact that we weren't already checking for it). If instance_ip_get_instance_state will cover it, I'll leave this case alone for now. (Otherwise we'll need to figure out how to get detach_resource to query and return VMM state along with the instance state, a problem that I've thought about before but have so far managed to defer actually solving...)

Comment thread nexus/src/app/sagas/instance_common.rs Outdated
Comment thread schema/crdb/dbinit.sql Outdated
gjcolombo added 18 commits June 5, 2024 15:54
This defines the new instance_state_v2/vmm_state enums and the update
process to reach them. The dbinit_equals_sum_of_all_up test passes but I
haven't checked anything else yet. Notably, the order of column
definitions in the instance/vmm tables changed; this needs to be fixed
in the Rust definitions.
Note that the *types* are still wrong (still need to define the enum
types for instance_state_v2 and vmm); this just gets the columns into
the right places.
As usual, some things are TODO'ed out around the openapi definitions.

Other little issues here:

- there are some bits of logic (tagged with REVIEW) where the code
  wasn't being crisp about whether it was looking at instance or VMM
  states. these will need some review to make sure I haven't broken them
- is omicron_common::api::internal::nexus (and not ::shared) the correct
  place for the "internal API" version of the VMM state enum?

Nevertheless: next thing here is to fix up all the OpenAPI stuff and
then get sled agent building again.
The helper function that determines whether an instance IP attach/detach
operation should send the IP change to the instance's current sled was a
bit too permissive about how it handled the "Starting" state. Tighten
this up, then fix the attach/detach tests so that they drive instances
to the Running state (as they intend) before attaching/detaching any
IPs.
The new "instances are in the VMM state iff they have a Propolis ID"
check breaks
db::queries::network_interface::tests::test_insert_running_instance_fails. This
test suite assumes it can move an instance to the Vmm state and assign it a
Propolis ID in separate statements. That's no longer allowed; do it in one
statement instead.
@gjcolombo gjcolombo force-pushed the gjcolombo/separate-instance-vmm-state-enums branch from 0c55ef3 to 1d91b71 Compare June 5, 2024 16:26
@hawkw

hawkw commented Jun 5, 2024

Copy link
Copy Markdown
Member

@gjcolombo

Copy link
Copy Markdown
Contributor Author

I missed one bit of schema change when merging with latest main (and then optimistically pushed without actually running tests on my local). Will have a fix up for that shortly (actually re-running the tests this time).

@gjcolombo gjcolombo enabled auto-merge (squash) June 5, 2024 18:38
@gjcolombo gjcolombo merged commit e488d57 into main Jun 5, 2024
@gjcolombo gjcolombo deleted the gjcolombo/separate-instance-vmm-state-enums branch June 5, 2024 20:16
hawkw added a commit that referenced this pull request Jun 5, 2024
This branch is part of ongoing work on the `instance-update` saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.

---

Depends on #5854.

As part of #5749, it is desirable to distinguish between *why* a VMM
record was marked as `Destroyed`, in order to determine which saga is
responsible for cleaning up that VMM's resources. The `instance-update`
saga must be the only entity that can set an instance's VMM IDs (active
and target) to NULL. Presently, however, the `instance-start` and
`instance-migrate` sagas may also do this when they unwind. This is a
requirement to avoid situations where a failed `instance-update` saga
cannot unwind, because the instance's generation number has changed
while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance
will appear to be in the correct post-state as soon as the relevant API
call returns. In order to do this, without having to also guarantee that
an instance update has occurred, we introduce a new VMM state,
`SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the
  `active_propolis_id` or `destination_propolis_id` (respectively) is
  either `NULL` or refers to a VMM that’s in the `SagaUnwound` state
  (the new VMM ID directly replaces the old ID).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state,
  then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound`
  state, nothing special happens–the instance’s state is still derived
  from its active VMM’s state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed`
  VMMs for purposes of deciding whether to remove a VMM ID from an
  instance.

This branch adds a new `VmmState::SagaUnwound` variant. The
`SagaUnwound` state is an internal implementation detail that shouldn't
be exposed to an operator or in the external API. Sled-agents will never
report that a VMM is in this state. Instead, this state is set my the
`instance-start` and `instance-migrate` sagas when they unwind. When
determining the API instance state from an instance and active VMM query
so that the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
hawkw added a commit that referenced this pull request Jun 5, 2024
This branch is part of ongoing work on the `instance-update` saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.

---

Depends on #5854 and should merge only after that PR.

As part of #5749, it is desirable to distinguish between *why* a VMM
record was marked as `Destroyed`, in order to determine which saga is
responsible for cleaning up that VMM's resources. The `instance-update`
saga must be the only entity that can set an instance's VMM IDs (active
and target) to NULL. Presently, however, the `instance-start` and
`instance-migrate` sagas may also do this when they unwind. This is a
requirement to avoid situations where a failed `instance-update` saga
cannot unwind, because the instance's generation number has changed
while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance
will appear to be in the correct post-state as soon as the relevant API
call returns. In order to do this, without having to also guarantee that
an instance update has occurred, we introduce a new VMM state,
`SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the
`active_propolis_id` or `destination_propolis_id` (respectively) is
either `NULL` or refers to a VMM that’s in the `SagaUnwound` state (the
new VMM ID directly replaces the old ID).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state,
then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound`
state, nothing special happens–the instance’s state is still derived
from its active VMM’s state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed`
VMMs for purposes of deciding whether to remove a VMM ID from an
instance.

This branch adds a new `VmmState::SagaUnwound` variant. The
`SagaUnwound` state is an internal implementation detail that shouldn't
be exposed to an operator or in the external API. Sled-agents will never
report that a VMM is in this state. Instead, this state is set my the
`instance-start` and `instance-migrate` sagas when they unwind. When
determining the API instance state from an instance and active VMM query
so that the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
hawkw added a commit that referenced this pull request Aug 9, 2024
A number of bugs relating to guest instance lifecycle management have
been observed. These include:

- Instances getting "stuck" in a transient state, such as `Starting` or
`Stopping`, with no way to forcibly terminate them (#4004)
- Race conditions between instances starting and receiving state
updates, which cause provisioning counters to underflow (#5042)
- Instances entering and exiting the `Failed` state when nothing is
actually wrong with them, potentially leaking virtual resources (#4226)

These typically require support intervention to resolve.

Broadly , these issues exist because the control plane's current
mechanisms for understanding and managing an instance's lifecycle state
machine are "kind of a mess". In particular:

- **(Conceptual) ownership of the CRDB `instance` record is currently
split between Nexus and sled-agent(s).** Although Nexus is the only
entity that actually reads or writes to the database, the instance's
runtime state is also modified by the sled-agents that manage its active
Propolis (and, if it's migrating, it's target Propolis), and written to
CRDB on their behalf by Nexus. This means that there are multiple copies
of the instance's state in different places at the same time, which can
potentially get out of sync. When an instance is migrating, its state is
updated by two different sled-agents, and they may potentially generate
state updates that conflict with each other. And, splitting the
responsibility between Nexus and sled-agent makes the code more complex
and harder to understand: there is no one place where all instance state
machine transitions are performed.
- **Nexus doesn't ensure that instance state updates are processed
reliably.** Instance state transitions triggered by user actions, such
as `instance-start` and `instance-delete`, are performed by distributed
sagas, ensuring that they run to completion even if the Nexus instance
executing them comes to an untimely end. This is *not* the case for
operations that result from instance state transitions reported by
sled-agents, which just happen in the HTTP APIs for reporting instance
states. If the Nexus processing such a transition crashes, gets network
partition'd, or encountering a transient error, the instance is left in
an incomplete state and the remainder of the operation will not be
performed.

This branch rewrites much of the control plane's instance state
management subsystem to resolve these issues. At a high level, it makes
the following high-level changes:

- **Nexus is now the sole owner of the `instance` record.** Sled-agents
no longer have their own copies of an instance's `InstanceRuntimeState`,
and do not generate changes to that state when reporting instance
observations to Nexus. Instead, the sled-agent only publishes updates to
the `vmm` and `migration` records (which are never modified by Nexus
directly) and Nexus is the only entity responsible for determining how
an instance's state should change in response to a VMM or migration
state update.
- **When an instance has an active VMM, its effective external state is
determined primarily by the active `vmm` record**, so that fewer state
transitions *require* changes to the `instance` record. PR #5854 laid
the ground work for this change, but it's relevant here as well.
- **All updates to an `instance` record (and resources conceptually
owned by that instance) are performed by a distributed saga.** I've
introduced a new `instance-update` saga, which is responsible for
performing all changes to the `instance` record, virtual provisioning
resources, and instance network config that are performed as part of a
state transition. Moving this to a saga helps us to ensure that these
operations are always run to completion, even in the event of a sudden
Nexus death.
- **Consistency of instance state changes is ensured by distributed
locking.** State changes may be published by multiple sled-agents to
different Nexus replicas. If one Nexus replica is processing a state
change received from a sled-agent, and then the instance's state changes
again, and the sled-agent publishes that state change to a *different*
Nexus...lots of bad things can happen, since the second state change may
be performed from the previous initial state, when it *should* have a
"happens-after" relationship with the other state transition. And, some
operations may contradict each other when performed concurrently.

To prevent these race conditions, this PR has the dubious honor of using
the first _distributed lock_ in the Oxide control plane, the "instance
updater lock". I introduced the locking primitives in PR #5831 --- see
that branch for more discussion of locking.
- **Background tasks are added to prevent missed updates**. To ensure we
cannot accidentally miss an instance update even if a Nexus dies, hits a
network partition, or just chooses to eat the state update accidentally,
we add a new `instance-updater` background task, which queries the
database for instances that are in states that require an update saga
without such a saga running, and starts the requisite sagas.

Currently, the instance update saga runs in the following cases:

- An instance's active VMM transitions to `Destroyed`, in which case the
instance's virtual resources are cleaned up and the active VMM is
unlinked.
- Either side of an instance's live migration reports that the migration
has completed successfully.
- Either side of an instance's live migration reports that the migration
has failed.

The inner workings of the instance-update saga itself is fairly complex,
and has some kind of interesting idiosyncrasies relative to the existing
sagas. I've written up a [lengthy comment] that provides an overview of
the theory behind the design of the saga and its principles of
operation, so I won't reproduce that in this commit message.

[lengthy comment]:
https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants