Skip to content

[nexus] add instance-updater lock#5831

Merged
hawkw merged 8 commits into
mainfrom
eliza/just-the-locks
May 30, 2024
Merged

[nexus] add instance-updater lock#5831
hawkw merged 8 commits into
mainfrom
eliza/just-the-locks

Conversation

@hawkw

@hawkw hawkw commented May 29, 2024

Copy link
Copy Markdown
Member

In order to simplify the instance lifecycle state machine and ensure that instance state updates are processed reliably, we intend to perform instance state updates in a saga (which will be added in PR #5749). This saga will require a notion of mutual exclusion between update sagas for the same instance, in order to avoid race conditions like the following:

  1. Sagas S1 and S2 start at the same time and observe the same instance/VMM states, which indicate that the instance’s active VMM has shut down
  2. S1 clears all the resources/provisioning counters and marks the instance as `Stopped``
  3. User restarts the instance
  4. S2 clears the same instance provisioning counters again

Presently, these races are avoided by the fact that instance state updates are performed partially in sled-agent, which serves as an "actor" with exclusive ownership over the state transition. Moving these state transitions to Nexus requires introducing mutual exclusion.

This commit adds a distributed lock on instance state transitions to the datastore. We add the following fields to the instance table:

  • updater_id, which is the UUID of the saga currently holding the update lock on the instance (or NULL if no saga has locked the instance)
  • updater_gen, a generation counter that is incremented each time the lock is acquired by a new saga

Using these fields, we can add new datastore methods to try and acquire an instance update lock by doing the following:

  1. Generate a UUID for the saga, saga_lock_id. This will be performed in the saga itself and isn't part of this PR.
  2. Read the instance record and interpret the value of the updater_id field as follows:
  • NULL: lock not held, we can acquire it by incrementing the updater_gen field and setting the updater_id field to the saga's UUID.
  • updater_id == saga_id: the saga already holds the lock, we can proceed with the update.
  • updater_id != saga_id: another saga holds the lock, we can't proceed with the update. Fail the operation.
  1. Attempt to write back the updated instance record with generation incremented and the updater_id set to the saga's ID, conditional on the updater_gen field being equal to the ID that was read when read the instance record. This is equivalent to the atomic compare-and-swap operation that one might use to implement a non-distributed lock in a single address space.
  • If this fails because the generation number is outdated, try again (i.e. goto (2)).
  • If this succeeds, the lock was acquired successfully.

Additionally, we can add a method for unlocking an instance record by clearing the updater_id field and incrementing the updater_gen. This query is conditional on the updater_id field being equal to the saga's UUID, to prevent cases where we accidentally unlock an instance that was locked by a different saga.

Introducing this distributed lock is considered fairly safe, as it will only ever be acquired in a saga, and the reverse action for the saga action that acquires the lock will ensure that the lock is released if the saga unwinds. Essentially, this is equivalent to a RAII guard releasing a lock when a thread panics in a single-threaded Rust program.

Presently, none of these methods are actually used. The saga that uses them will be added in PR #5749. I've factored out this change into its own PR so that we can merge the foundation needed for that branch. Hopefully this makes the diff a bit smaller and easier to review, as well as decreasing merge conflict churn with the schema changes.

In order to simplify the instance lifecycle state machine and ensure
that instance state updates are processed reliably, we intend to perform
instance state updates in a saga (which will be added in PR #5749). This
saga will require a notion of mutual exclusion between update sagas for
the same instance, in order to avoid race conditions like the following:

1. Sagas `S1` and `S2` start at the same time and observe the same
   instance/VMM states, which indicate that the instance’s active VMM
   has shut down
2. `S1` clears all the resources/provisioning counters and marks the
   instance as `Stopped``
3. User restarts the instance
4. `S2` clears the same instance provisioning counters again

Presently, these races are avoided by the fact that instance state
updates are performed partially in `sled-agent`, which serves as an
"actor" with exclusive ownership over the state transition. Moving these
state transitions to Nexus requires introducing mutual exclusion.

This commit adds a distributed lock on instance state transitions to the
datastore. We add the following fields to the `instance` table:

- `updater_id`, which is the UUID of the saga currently holding the
  update lock on the instance (or `NULL` if no saga has locked the
  instance)
- `updater_gen`, a generation counter that is incremented each time the
  lock is acquired by a new saga

Using these fields, we can add new datastore methods to try and acquire
an instance update lock by doing the following:

1. Generate a UUID for the saga, `saga_lock_id`. This will be performed
   in the saga itself and isn't part of this PR.
2. Read the instance record and interpret the value of the `updater_id`
  field as follows:
  - `NULL`: lock not held, we can acquire it by incrementing the
    `updater_gen` field and setting the `updater_id` field to the saga's
    UUID.
  - `updater_id == saga_id`: the saga already holds the lock, we can
    proceed with the update.
  - `updater_id != saga_id`: another saga holds the lock, we can't
    proceed with the update. Fail the operation.
3. Attempt to write back the updated instance record with generation
   incremented and the `updater_id` set to the saga's ID, conditional on
   the `updater_gen` field being equal to the ID that was read when read
   the instance record. This is equivalent to the atomic compare-and-swap
   operation that one might use to implement a non-distributed lock in a
   single address space.
  - If this fails because the generation number is outdated, try again
    (i.e. goto (2)).
  - If this succeeds, the lock was acquired successfully.

Additionally, we can add a method for unlocking an instance record by
clearing the `updater_id` field and incrementing the `updater_gen`. This
query is conditional on the `updater_id` field being equal to the saga's
UUID, to prevent cases where we accidentally unlock an instance that was
locked by a different saga.

Introducing this distributed lock is considered fairly safe, as it will
only ever be acquired in a saga, and the reverse action for the saga
action that acquires the lock will ensure that the lock is released if
the saga unwinds. Essentially, this is equivalent to a RAII guard
releasing a lock when a thread panics in a single-threaded Rust program.

Presently, none of these methods are actually used. The saga that uses
them will be added in PR #5749. I've factored out this change into its
own PR so that we can merge the foundation needed for that branch.
Hopefully this makes the diff a bit smaller and easier to review, as
well as decreasing merge conflict churn with the schema changes.

@smklein smklein left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I know there has been a bunch of hesitation around the usage of distributed locks, but this implementation is clear, narrowly focused around sagas, and well thought-out!

Comment thread nexus/db-model/src/instance.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/instance.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/instance.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/instance.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/instance.rs Outdated
Comment thread nexus/db-model/src/instance.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/instance.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/instance.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/instance.rs

@gjcolombo gjcolombo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one minor comment fix. I also left a comment discussing what should happen if the unlock statement gets a NotUpdatedButExists error, but until there are some callers of this function I'm not totally committed to making that case return an error (or blocking the PR for it).

Comment thread nexus/db-model/src/instance.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/instance.rs Outdated
} if found.runtime_state.updater_gen > current_gen => Ok(false),
// The instance exists, but the lock ID doesn't match our lock ID.
// This means we were trying to release a lock we never held, whcih
// is almost certainly a programmer error.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or that someone stole the lock from the logically correct owner. (Should we also assert/log an error in instance_updater_lock if it observes that its attempt to lock returned UpdateStatus::Updated but the lock owner wasn't previously NULL?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably just make .filter(dsl::updater_id.is_null()) a condition of the query as well, honestly. The generation number should guard it, but there's probably no reason not to also check that the ID is null in the query.

/// # Arguments
///
/// - `authz_instance`: the instance to attempt to unlock
/// - `current_gen`: the generation number for the lock being released.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would possibly rename this to something like after_lock_acquired_gen and note explicitly that it needs to be the lock generation number that was produced by acquiring the lock. (For many sagas I expect it will be very easy to reach for the generation number that was in the instance record when the saga started.)

Maybe we should leverage the type system here: have instance_updater_lock return an opaque struct that contains the lock ID and the post-acquisition generation and make instance_updater_unlock take that type. Harder to mishandle the values that way. WDYT?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should leverage the type system here: have instance_updater_lock return an opaque struct that contains the lock ID and the post-acquisition generation and make instance_updater_unlock take that type. Harder to mishandle the values that way. WDYT?

I actually did this, originally, on the other branch! I removed it because I wasn't sure if a saga node's reverse action would be able to access the output of its forward action, so I wasn't sure if there was a way for the "release lock on unwind" reverse action to get the opaque token or not. If that's possible, I'll definitely try this approach.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it because I wasn't sure if a saga node's reverse action would be able to access the output of its forward action

I believe you can; see e.g. the CREATE_VMM_RECORD step in the instance start saga, where sis_destroy_vmm_record refers to the vmm_record output from sis_create_vmm_record.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, wonderful. In that case, I'll see about making this change.

@hawkw hawkw force-pushed the eliza/just-the-locks branch from 8c8da83 to 060c50d Compare May 30, 2024 19:26
@hawkw hawkw enabled auto-merge (squash) May 30, 2024 19:28
@hawkw hawkw merged commit 7cbb7da into main May 30, 2024
@hawkw hawkw deleted the eliza/just-the-locks branch May 30, 2024 20:55
hawkw added a commit that referenced this pull request Jun 12, 2024
In #5831, I added the `updater_gen` and `updater_id` fields for the
instance-updater lock to the `InstanceRuntimeState` struct, based purely
on vibes: "They change at runtime, right? Therefore, they ought to be
'runtime state'...". It turns out that this is actually Super Ultra
Wrong, because any query which writes an `InstanceRuntimeState` without
paying attention to the lock fields can inadvertantly clobber them,
either releasing the lock or setting it back to a previous lock holder.

These fields should be on the `Instance` struct instead, to avoid this
kind of thing. This commit moves them to the correct place. I figured it
would be nice to land this separately from #5749, purely for the sake of
keeping that diff small.
hawkw added a commit that referenced this pull request Aug 9, 2024
A number of bugs relating to guest instance lifecycle management have
been observed. These include:

- Instances getting "stuck" in a transient state, such as `Starting` or
`Stopping`, with no way to forcibly terminate them (#4004)
- Race conditions between instances starting and receiving state
updates, which cause provisioning counters to underflow (#5042)
- Instances entering and exiting the `Failed` state when nothing is
actually wrong with them, potentially leaking virtual resources (#4226)

These typically require support intervention to resolve.

Broadly , these issues exist because the control plane's current
mechanisms for understanding and managing an instance's lifecycle state
machine are "kind of a mess". In particular:

- **(Conceptual) ownership of the CRDB `instance` record is currently
split between Nexus and sled-agent(s).** Although Nexus is the only
entity that actually reads or writes to the database, the instance's
runtime state is also modified by the sled-agents that manage its active
Propolis (and, if it's migrating, it's target Propolis), and written to
CRDB on their behalf by Nexus. This means that there are multiple copies
of the instance's state in different places at the same time, which can
potentially get out of sync. When an instance is migrating, its state is
updated by two different sled-agents, and they may potentially generate
state updates that conflict with each other. And, splitting the
responsibility between Nexus and sled-agent makes the code more complex
and harder to understand: there is no one place where all instance state
machine transitions are performed.
- **Nexus doesn't ensure that instance state updates are processed
reliably.** Instance state transitions triggered by user actions, such
as `instance-start` and `instance-delete`, are performed by distributed
sagas, ensuring that they run to completion even if the Nexus instance
executing them comes to an untimely end. This is *not* the case for
operations that result from instance state transitions reported by
sled-agents, which just happen in the HTTP APIs for reporting instance
states. If the Nexus processing such a transition crashes, gets network
partition'd, or encountering a transient error, the instance is left in
an incomplete state and the remainder of the operation will not be
performed.

This branch rewrites much of the control plane's instance state
management subsystem to resolve these issues. At a high level, it makes
the following high-level changes:

- **Nexus is now the sole owner of the `instance` record.** Sled-agents
no longer have their own copies of an instance's `InstanceRuntimeState`,
and do not generate changes to that state when reporting instance
observations to Nexus. Instead, the sled-agent only publishes updates to
the `vmm` and `migration` records (which are never modified by Nexus
directly) and Nexus is the only entity responsible for determining how
an instance's state should change in response to a VMM or migration
state update.
- **When an instance has an active VMM, its effective external state is
determined primarily by the active `vmm` record**, so that fewer state
transitions *require* changes to the `instance` record. PR #5854 laid
the ground work for this change, but it's relevant here as well.
- **All updates to an `instance` record (and resources conceptually
owned by that instance) are performed by a distributed saga.** I've
introduced a new `instance-update` saga, which is responsible for
performing all changes to the `instance` record, virtual provisioning
resources, and instance network config that are performed as part of a
state transition. Moving this to a saga helps us to ensure that these
operations are always run to completion, even in the event of a sudden
Nexus death.
- **Consistency of instance state changes is ensured by distributed
locking.** State changes may be published by multiple sled-agents to
different Nexus replicas. If one Nexus replica is processing a state
change received from a sled-agent, and then the instance's state changes
again, and the sled-agent publishes that state change to a *different*
Nexus...lots of bad things can happen, since the second state change may
be performed from the previous initial state, when it *should* have a
"happens-after" relationship with the other state transition. And, some
operations may contradict each other when performed concurrently.

To prevent these race conditions, this PR has the dubious honor of using
the first _distributed lock_ in the Oxide control plane, the "instance
updater lock". I introduced the locking primitives in PR #5831 --- see
that branch for more discussion of locking.
- **Background tasks are added to prevent missed updates**. To ensure we
cannot accidentally miss an instance update even if a Nexus dies, hits a
network partition, or just chooses to eat the state update accidentally,
we add a new `instance-updater` background task, which queries the
database for instances that are in states that require an update saga
without such a saga running, and starts the requisite sagas.

Currently, the instance update saga runs in the following cases:

- An instance's active VMM transitions to `Destroyed`, in which case the
instance's virtual resources are cleaned up and the active VMM is
unlinked.
- Either side of an instance's live migration reports that the migration
has completed successfully.
- Either side of an instance's live migration reports that the migration
has failed.

The inner workings of the instance-update saga itself is fairly complex,
and has some kind of interesting idiosyncrasies relative to the existing
sagas. I've written up a [lengthy comment] that provides an overview of
the theory behind the design of the saga and its principles of
operation, so I won't reproduce that in this commit message.

[lengthy comment]:
https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants