Skip to content

[nexus] add InstanceState::SagaUnwound#5848

Closed
hawkw wants to merge 6 commits into
mainfrom
eliza/saga-unwound
Closed

[nexus] add InstanceState::SagaUnwound#5848
hawkw wants to merge 6 commits into
mainfrom
eliza/saga-unwound

Conversation

@hawkw

@hawkw hawkw commented May 31, 2024

Copy link
Copy Markdown
Member

This branch is part of ongoing work on the instance-update saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.


As part of #5749, it is desirable to distinguish between why a VMM
record was marked as Destroyed, in order to determine which saga is
responsible for cleaning up that VMM's resources. The instance-update
saga must be the only entity that can set an instance's VMM IDs (active
and target) to NULL. Presently, however, the instance-start and
instance-migrate sagas may also do this when they unwind. This is a
requirement to avoid situations where a failed instance-update saga
cannot unwind, because the instance's generation number has changed
while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance
will appear to be in the correct post-state as soon as the relevant API
call returns. In order to do this, without having to also guarantee that
an instance update has occurred, we introduce a new VMM state,
SagaUnwound, with the following rules:

  • It is legal to start or migrate an instance if the
    active_propolis_id or destination_propolis_id (respectively) is
    either NULL or refers to a VMM that’s in the SagaUnwound state
    (the new VMM ID directly replaces the old ID).
  • If an instance has an active_propolis_id in the SagaUnwound state,
    then the instance appears to be Stopped.
  • If an instance has a destination_propolis_id in the SagaUnwound
    state, nothing special happens–the instance’s state is still derived
    from its active VMM’s state.
  • The update saga treats SagaUnwound VMMs as identical to Destroyed
    VMMs for purposes of deciding whether to remove a VMM ID from an
    instance.

This branch introduces the SagaUnwound state. The SagaUnwound state
is an internal implementation detail that shouldn't be exposed to an
operator or in the external API. Therefore, I've chosen to separate the
enums representing the external API InstanceState type and the
internal (database) InstanceState, so that we don't have to add an
InstanceState::SagaUnwound variant to the public OpenAPI spec which
will never actually be returned. This, unfortunately, requires touching
a bunch of code, but the change is mostly mechanical. I've also updated
the code for determining the external API instance state from an
instance and active VMM query so that the SagaUnwound state is mapped
to Destroyed publicly, and instances whose active VMMs are Stopped
remain in the Stopping state (rather than Stopped), until a new
start saga for that instance is actually able to execute.

hawkw and others added 6 commits May 31, 2024 12:05
This branch is part of ongoing work on the `instance-update` saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.
We needed to add a filter to the view that prevents v2p mappings for
inactive instances from showing up in the v2p_mapping view.
@hawkw hawkw requested review from gjcolombo and smklein May 31, 2024 23:45
@hawkw hawkw changed the title [WIP][nexus] add InstanceState::SagaUnwound [nexus] add InstanceState::SagaUnwound May 31, 2024
@hawkw hawkw marked this pull request as ready for review May 31, 2024 23:46
hawkw added a commit that referenced this pull request Jun 4, 2024
This branch is part of ongoing work on the `instance-update` saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.

---

Depends on #5854.

As part of #5749, it is desirable to distinguish between *why* a VMM
record was marked as `Destroyed`, in order to determine which saga is
responsible for cleaning up that VMM's resources. The `instance-update`
saga must be the only entity that can set an instance's VMM IDs (active
and target) to NULL. Presently, however, the `instance-start` and
`instance-migrate` sagas may also do this when they unwind. This is a
requirement to avoid situations where a failed `instance-update` saga
cannot unwind, because the instance's generation number has changed
while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance
will appear to be in the correct post-state as soon as the relevant API
call returns. In order to do this, without having to also guarantee that
an instance update has occurred, we introduce a new VMM state,
`SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the
  `active_propolis_id` or `destination_propolis_id` (respectively) is
  either `NULL` or refers to a VMM that’s in the `SagaUnwound` state
  (the new VMM ID directly replaces the old ID).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state,
  then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound`
  state, nothing special happens–the instance’s state is still derived
  from its active VMM’s state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed`
  VMMs for purposes of deciding whether to remove a VMM ID from an
  instance.

This branch adds a new `VmmState::SagaUnwound` variant. The
`SagaUnwound` state is an internal implementation detail that shouldn't
be exposed to an operator or in the external API. Sled-agents will never
report that a VMM is in this state. Instead, this state is set my the
`instance-start` and `instance-migrate` sagas when they unwind. When
determining the API instance state from an instance and active VMM query
so that the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
@hawkw hawkw mentioned this pull request Jun 4, 2024
gjcolombo added a commit that referenced this pull request Jun 5, 2024
Split the existing `instance_state` enum in the CRDB schema into
separate instance state and VMM state enums and remove unused enum
variants. Instances now have five states: Creating, NoVmm, Vmm, Failed,
and Destroyed. VMMs have most of the states they had before, except that
the unused Creating and Repairing states have been removed. This change
makes it easier to add new states (e.g. SagaUnwound, see #5848) that
apply only to instances/VMMs without having to update code that's
working with the other type of DB record.

Update the routines the IP attach/detach sagas use to decide where (if
anywhere) to dispatch IP change messages. These routines were looking
for instance states that weren't actually being used (the states of
interest appear in the instances' active VMMs instead).

Add a data migration test to make sure that the Stopped and Running
instance states are converted as expected.

Tests: `cargo nextest`.
hawkw added a commit that referenced this pull request Jun 5, 2024
This branch is part of ongoing work on the `instance-update` saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.

---

Depends on #5854.

As part of #5749, it is desirable to distinguish between *why* a VMM
record was marked as `Destroyed`, in order to determine which saga is
responsible for cleaning up that VMM's resources. The `instance-update`
saga must be the only entity that can set an instance's VMM IDs (active
and target) to NULL. Presently, however, the `instance-start` and
`instance-migrate` sagas may also do this when they unwind. This is a
requirement to avoid situations where a failed `instance-update` saga
cannot unwind, because the instance's generation number has changed
while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance
will appear to be in the correct post-state as soon as the relevant API
call returns. In order to do this, without having to also guarantee that
an instance update has occurred, we introduce a new VMM state,
`SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the
  `active_propolis_id` or `destination_propolis_id` (respectively) is
  either `NULL` or refers to a VMM that’s in the `SagaUnwound` state
  (the new VMM ID directly replaces the old ID).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state,
  then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound`
  state, nothing special happens–the instance’s state is still derived
  from its active VMM’s state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed`
  VMMs for purposes of deciding whether to remove a VMM ID from an
  instance.

This branch adds a new `VmmState::SagaUnwound` variant. The
`SagaUnwound` state is an internal implementation detail that shouldn't
be exposed to an operator or in the external API. Sled-agents will never
report that a VMM is in this state. Instead, this state is set my the
`instance-start` and `instance-migrate` sagas when they unwind. When
determining the API instance state from an instance and active VMM query
so that the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
@hawkw hawkw closed this in #5855 Jun 5, 2024
hawkw added a commit that referenced this pull request Jun 5, 2024
This branch is part of ongoing work on the `instance-update` saga (see
PR #5749); I've factored it out into a separate PR. This is largely
because this branch makes mechanical changes to a bunch of different
files that aren't directly related to the core change of #5749, and I'd
like to avoid the additional merge conflicts that are likely when these
changes remain un-merged for a long time.

---

Depends on #5854 and should merge only after that PR.

As part of #5749, it is desirable to distinguish between *why* a VMM
record was marked as `Destroyed`, in order to determine which saga is
responsible for cleaning up that VMM's resources. The `instance-update`
saga must be the only entity that can set an instance's VMM IDs (active
and target) to NULL. Presently, however, the `instance-start` and
`instance-migrate` sagas may also do this when they unwind. This is a
requirement to avoid situations where a failed `instance-update` saga
cannot unwind, because the instance's generation number has changed
while the saga was executing.

We want to ensure that if a start or migrate request fails, the instance
will appear to be in the correct post-state as soon as the relevant API
call returns. In order to do this, without having to also guarantee that
an instance update has occurred, we introduce a new VMM state,
`SagaUnwound`, with the following rules:

- It is legal to start or migrate an instance if the
`active_propolis_id` or `destination_propolis_id` (respectively) is
either `NULL` or refers to a VMM that’s in the `SagaUnwound` state (the
new VMM ID directly replaces the old ID).
- If an instance has an `active_propolis_id` in the `SagaUnwound` state,
then the instance appears to be `Stopped`.
- If an instance has a `destination_propolis_id` in the `SagaUnwound`
state, nothing special happens–the instance’s state is still derived
from its active VMM’s state.
- The update saga treats `SagaUnwound` VMMs as identical to `Destroyed`
VMMs for purposes of deciding whether to remove a VMM ID from an
instance.

This branch adds a new `VmmState::SagaUnwound` variant. The
`SagaUnwound` state is an internal implementation detail that shouldn't
be exposed to an operator or in the external API. Sled-agents will never
report that a VMM is in this state. Instead, this state is set my the
`instance-start` and `instance-migrate` sagas when they unwind. When
determining the API instance state from an instance and active VMM query
so that the `SagaUnwound` state is mapped to `Destroyed`.

Closes #5848, which this replaces.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants