[nexus] add instance-updater lock#5831
Conversation
In order to simplify the instance lifecycle state machine and ensure that instance state updates are processed reliably, we intend to perform instance state updates in a saga (which will be added in PR #5749). This saga will require a notion of mutual exclusion between update sagas for the same instance, in order to avoid race conditions like the following: 1. Sagas `S1` and `S2` start at the same time and observe the same instance/VMM states, which indicate that the instance’s active VMM has shut down 2. `S1` clears all the resources/provisioning counters and marks the instance as `Stopped`` 3. User restarts the instance 4. `S2` clears the same instance provisioning counters again Presently, these races are avoided by the fact that instance state updates are performed partially in `sled-agent`, which serves as an "actor" with exclusive ownership over the state transition. Moving these state transitions to Nexus requires introducing mutual exclusion. This commit adds a distributed lock on instance state transitions to the datastore. We add the following fields to the `instance` table: - `updater_id`, which is the UUID of the saga currently holding the update lock on the instance (or `NULL` if no saga has locked the instance) - `updater_gen`, a generation counter that is incremented each time the lock is acquired by a new saga Using these fields, we can add new datastore methods to try and acquire an instance update lock by doing the following: 1. Generate a UUID for the saga, `saga_lock_id`. This will be performed in the saga itself and isn't part of this PR. 2. Read the instance record and interpret the value of the `updater_id` field as follows: - `NULL`: lock not held, we can acquire it by incrementing the `updater_gen` field and setting the `updater_id` field to the saga's UUID. - `updater_id == saga_id`: the saga already holds the lock, we can proceed with the update. - `updater_id != saga_id`: another saga holds the lock, we can't proceed with the update. Fail the operation. 3. Attempt to write back the updated instance record with generation incremented and the `updater_id` set to the saga's ID, conditional on the `updater_gen` field being equal to the ID that was read when read the instance record. This is equivalent to the atomic compare-and-swap operation that one might use to implement a non-distributed lock in a single address space. - If this fails because the generation number is outdated, try again (i.e. goto (2)). - If this succeeds, the lock was acquired successfully. Additionally, we can add a method for unlocking an instance record by clearing the `updater_id` field and incrementing the `updater_gen`. This query is conditional on the `updater_id` field being equal to the saga's UUID, to prevent cases where we accidentally unlock an instance that was locked by a different saga. Introducing this distributed lock is considered fairly safe, as it will only ever be acquired in a saga, and the reverse action for the saga action that acquires the lock will ensure that the lock is released if the saga unwinds. Essentially, this is equivalent to a RAII guard releasing a lock when a thread panics in a single-threaded Rust program. Presently, none of these methods are actually used. The saga that uses them will be added in PR #5749. I've factored out this change into its own PR so that we can merge the foundation needed for that branch. Hopefully this makes the diff a bit smaller and easier to review, as well as decreasing merge conflict churn with the schema changes.
smklein
left a comment
There was a problem hiding this comment.
Looks great. I know there has been a bunch of hesitation around the usage of distributed locks, but this implementation is clear, narrowly focused around sagas, and well thought-out!
gjcolombo
left a comment
There was a problem hiding this comment.
LGTM with one minor comment fix. I also left a comment discussing what should happen if the unlock statement gets a NotUpdatedButExists error, but until there are some callers of this function I'm not totally committed to making that case return an error (or blocking the PR for it).
| } if found.runtime_state.updater_gen > current_gen => Ok(false), | ||
| // The instance exists, but the lock ID doesn't match our lock ID. | ||
| // This means we were trying to release a lock we never held, whcih | ||
| // is almost certainly a programmer error. |
There was a problem hiding this comment.
Or that someone stole the lock from the logically correct owner. (Should we also assert/log an error in instance_updater_lock if it observes that its attempt to lock returned UpdateStatus::Updated but the lock owner wasn't previously NULL?)
There was a problem hiding this comment.
We could probably just make .filter(dsl::updater_id.is_null()) a condition of the query as well, honestly. The generation number should guard it, but there's probably no reason not to also check that the ID is null in the query.
| /// # Arguments | ||
| /// | ||
| /// - `authz_instance`: the instance to attempt to unlock | ||
| /// - `current_gen`: the generation number for the lock being released. |
There was a problem hiding this comment.
I would possibly rename this to something like after_lock_acquired_gen and note explicitly that it needs to be the lock generation number that was produced by acquiring the lock. (For many sagas I expect it will be very easy to reach for the generation number that was in the instance record when the saga started.)
Maybe we should leverage the type system here: have instance_updater_lock return an opaque struct that contains the lock ID and the post-acquisition generation and make instance_updater_unlock take that type. Harder to mishandle the values that way. WDYT?
There was a problem hiding this comment.
Maybe we should leverage the type system here: have
instance_updater_lockreturn an opaque struct that contains the lock ID and the post-acquisition generation and makeinstance_updater_unlocktake that type. Harder to mishandle the values that way. WDYT?
I actually did this, originally, on the other branch! I removed it because I wasn't sure if a saga node's reverse action would be able to access the output of its forward action, so I wasn't sure if there was a way for the "release lock on unwind" reverse action to get the opaque token or not. If that's possible, I'll definitely try this approach.
There was a problem hiding this comment.
I removed it because I wasn't sure if a saga node's reverse action would be able to access the output of its forward action
I believe you can; see e.g. the CREATE_VMM_RECORD step in the instance start saga, where sis_destroy_vmm_record refers to the vmm_record output from sis_create_vmm_record.
There was a problem hiding this comment.
Ah, wonderful. In that case, I'll see about making this change.
8c8da83 to
060c50d
Compare
In #5831, I added the `updater_gen` and `updater_id` fields for the instance-updater lock to the `InstanceRuntimeState` struct, based purely on vibes: "They change at runtime, right? Therefore, they ought to be 'runtime state'...". It turns out that this is actually Super Ultra Wrong, because any query which writes an `InstanceRuntimeState` without paying attention to the lock fields can inadvertantly clobber them, either releasing the lock or setting it back to a previous lock holder. These fields should be on the `Instance` struct instead, to avoid this kind of thing. This commit moves them to the correct place. I figured it would be nice to land this separately from #5749, purely for the sake of keeping that diff small.
A number of bugs relating to guest instance lifecycle management have been observed. These include: - Instances getting "stuck" in a transient state, such as `Starting` or `Stopping`, with no way to forcibly terminate them (#4004) - Race conditions between instances starting and receiving state updates, which cause provisioning counters to underflow (#5042) - Instances entering and exiting the `Failed` state when nothing is actually wrong with them, potentially leaking virtual resources (#4226) These typically require support intervention to resolve. Broadly , these issues exist because the control plane's current mechanisms for understanding and managing an instance's lifecycle state machine are "kind of a mess". In particular: - **(Conceptual) ownership of the CRDB `instance` record is currently split between Nexus and sled-agent(s).** Although Nexus is the only entity that actually reads or writes to the database, the instance's runtime state is also modified by the sled-agents that manage its active Propolis (and, if it's migrating, it's target Propolis), and written to CRDB on their behalf by Nexus. This means that there are multiple copies of the instance's state in different places at the same time, which can potentially get out of sync. When an instance is migrating, its state is updated by two different sled-agents, and they may potentially generate state updates that conflict with each other. And, splitting the responsibility between Nexus and sled-agent makes the code more complex and harder to understand: there is no one place where all instance state machine transitions are performed. - **Nexus doesn't ensure that instance state updates are processed reliably.** Instance state transitions triggered by user actions, such as `instance-start` and `instance-delete`, are performed by distributed sagas, ensuring that they run to completion even if the Nexus instance executing them comes to an untimely end. This is *not* the case for operations that result from instance state transitions reported by sled-agents, which just happen in the HTTP APIs for reporting instance states. If the Nexus processing such a transition crashes, gets network partition'd, or encountering a transient error, the instance is left in an incomplete state and the remainder of the operation will not be performed. This branch rewrites much of the control plane's instance state management subsystem to resolve these issues. At a high level, it makes the following high-level changes: - **Nexus is now the sole owner of the `instance` record.** Sled-agents no longer have their own copies of an instance's `InstanceRuntimeState`, and do not generate changes to that state when reporting instance observations to Nexus. Instead, the sled-agent only publishes updates to the `vmm` and `migration` records (which are never modified by Nexus directly) and Nexus is the only entity responsible for determining how an instance's state should change in response to a VMM or migration state update. - **When an instance has an active VMM, its effective external state is determined primarily by the active `vmm` record**, so that fewer state transitions *require* changes to the `instance` record. PR #5854 laid the ground work for this change, but it's relevant here as well. - **All updates to an `instance` record (and resources conceptually owned by that instance) are performed by a distributed saga.** I've introduced a new `instance-update` saga, which is responsible for performing all changes to the `instance` record, virtual provisioning resources, and instance network config that are performed as part of a state transition. Moving this to a saga helps us to ensure that these operations are always run to completion, even in the event of a sudden Nexus death. - **Consistency of instance state changes is ensured by distributed locking.** State changes may be published by multiple sled-agents to different Nexus replicas. If one Nexus replica is processing a state change received from a sled-agent, and then the instance's state changes again, and the sled-agent publishes that state change to a *different* Nexus...lots of bad things can happen, since the second state change may be performed from the previous initial state, when it *should* have a "happens-after" relationship with the other state transition. And, some operations may contradict each other when performed concurrently. To prevent these race conditions, this PR has the dubious honor of using the first _distributed lock_ in the Oxide control plane, the "instance updater lock". I introduced the locking primitives in PR #5831 --- see that branch for more discussion of locking. - **Background tasks are added to prevent missed updates**. To ensure we cannot accidentally miss an instance update even if a Nexus dies, hits a network partition, or just chooses to eat the state update accidentally, we add a new `instance-updater` background task, which queries the database for instances that are in states that require an update saga without such a saga running, and starts the requisite sagas. Currently, the instance update saga runs in the following cases: - An instance's active VMM transitions to `Destroyed`, in which case the instance's virtual resources are cleaned up and the active VMM is unlinked. - Either side of an instance's live migration reports that the migration has completed successfully. - Either side of an instance's live migration reports that the migration has failed. The inner workings of the instance-update saga itself is fairly complex, and has some kind of interesting idiosyncrasies relative to the existing sagas. I've written up a [lengthy comment] that provides an overview of the theory behind the design of the saga and its principles of operation, so I won't reproduce that in this commit message. [lengthy comment]: https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
In order to simplify the instance lifecycle state machine and ensure that instance state updates are processed reliably, we intend to perform instance state updates in a saga (which will be added in PR #5749). This saga will require a notion of mutual exclusion between update sagas for the same instance, in order to avoid race conditions like the following:
S1andS2start at the same time and observe the same instance/VMM states, which indicate that the instance’s active VMM has shut downS1clears all the resources/provisioning counters and marks the instance as `Stopped``S2clears the same instance provisioning counters againPresently, these races are avoided by the fact that instance state updates are performed partially in
sled-agent, which serves as an "actor" with exclusive ownership over the state transition. Moving these state transitions to Nexus requires introducing mutual exclusion.This commit adds a distributed lock on instance state transitions to the datastore. We add the following fields to the
instancetable:updater_id, which is the UUID of the saga currently holding the update lock on the instance (orNULLif no saga has locked the instance)updater_gen, a generation counter that is incremented each time the lock is acquired by a new sagaUsing these fields, we can add new datastore methods to try and acquire an instance update lock by doing the following:
saga_lock_id. This will be performed in the saga itself and isn't part of this PR.updater_idfield as follows:NULL: lock not held, we can acquire it by incrementing theupdater_genfield and setting theupdater_idfield to the saga's UUID.updater_id == saga_id: the saga already holds the lock, we can proceed with the update.updater_id != saga_id: another saga holds the lock, we can't proceed with the update. Fail the operation.updater_idset to the saga's ID, conditional on theupdater_genfield being equal to the ID that was read when read the instance record. This is equivalent to the atomic compare-and-swap operation that one might use to implement a non-distributed lock in a single address space.Additionally, we can add a method for unlocking an instance record by clearing the
updater_idfield and incrementing theupdater_gen. This query is conditional on theupdater_idfield being equal to the saga's UUID, to prevent cases where we accidentally unlock an instance that was locked by a different saga.Introducing this distributed lock is considered fairly safe, as it will only ever be acquired in a saga, and the reverse action for the saga action that acquires the lock will ensure that the lock is released if the saga unwinds. Essentially, this is equivalent to a RAII guard releasing a lock when a thread panics in a single-threaded Rust program.
Presently, none of these methods are actually used. The saga that uses them will be added in PR #5749. I've factored out this change into its own PR so that we can merge the foundation needed for that branch. Hopefully this makes the diff a bit smaller and easier to review, as well as decreasing merge conflict churn with the schema changes.