Skip to content

[nexus] Reincarnate instances with SagaUnwound VMMs#6669

Merged
hawkw merged 13 commits into
mainfrom
eliza/saga-rewound
Sep 27, 2024
Merged

[nexus] Reincarnate instances with SagaUnwound VMMs#6669
hawkw merged 13 commits into
mainfrom
eliza/saga-rewound

Conversation

@hawkw

@hawkw hawkw commented Sep 25, 2024

Copy link
Copy Markdown
Member

When an instance-start saga unwinds, any VMM it created transitions to
the SagaUnwound state. This causes the instance's effective state to
appear as Failed in the external API. PR #6503 added functionality to
Nexus to automatically restart instances that are in the Failed state
("instance reincarnation"). However, the current instance-reincarnation
task will not automatically restart instances whose instance-start
sagas have unwound, because such instances are not actually in the
Failed state from Nexus' perspective.

This PR implements reincarnation for instances whose instance-start
sagas have failed. This is done by changing the instance_reincarnation
background task to query the database for instances which have
SagaUnwound active VMMs, and then run instance-start sagas for them
identically to how it runs start sagas for Failed instances.

I decided to perform two separate queries to list Failed instances and
to list instances with SagaUnwound VMMs, because the SagaUnwound
query requires a join with the vmm table, and I thought it was a bit
nicer to be able to find Failed instances without having to do the
join, and only do it when looking for SagaUnwound ones. Also, having
two queries makes it easier to distinguish between Failed and
SagaUnwound instances in logging and the OMDB status output. This
ended up being implemented by adding a parameter to the
DataStore::find_reincarnatable_instances method that indicates which
category of instances to select; I had previously considered making the
method on the InstanceReincarnation struct that finds instances and
reincarnates them take the query as a Fn taking the datastore and
DataPageParams and returning an impl Future outputting
Result<Vec<Instance>, ...>,but figuring out generic lifetimes for the
pagination stuff was annoying enough that this felt like the simpler
choice.

Fixes #6638

@hawkw hawkw added the nexus Related to nexus label Sep 25, 2024
@hawkw hawkw requested a review from gjcolombo September 25, 2024 20:25
@hawkw hawkw self-assigned this Sep 25, 2024
@hawkw

hawkw commented Sep 25, 2024

Copy link
Copy Markdown
Member Author

When merging this, we should also be sure to merge #6658, since otherwise,
SagaUnwound instances will appear as Stopped in the external API even though
we will automatically restart them, which feels weird

since we print these in OMDB, it breaks the success cases expectorate
tests to use unordered hashmaps...
i dont know whats wrong with me
Comment thread nexus/src/app/background/tasks/instance_reincarnation.rs Outdated
Co-authored-by: Greg Colombo <greg@oxidecomputer.com>
@hawkw hawkw enabled auto-merge (squash) September 27, 2024 17:00
@hawkw

hawkw commented Sep 27, 2024

Copy link
Copy Markdown
Member Author

Well that's extremely spooky, it looks like this worked fine on commit 0b7f72e but then somehow broke on commit 8f89106: https://buildomat.eng.oxide.computer/wg/0/details/01J8T6F6B4TYVZVGS9NVY6RXJ8/m4ivC9CI7YNrcLE1S1dUTosEDmIl3bfax3fd4qNVIe7XiKua/01J8T6G2PKB9TADGAMV5DAR8R8

@hawkw

hawkw commented Sep 27, 2024

Copy link
Copy Markdown
Member Author

(also, it occurred to me that we probably want to make unwinding start sagas check if they should immediately kick the reincarnation task...)

@hawkw

hawkw commented Sep 27, 2024

Copy link
Copy Markdown
Member Author

Aaaand it passes on my machine:

    Finished `test` profile [unoptimized + debuginfo] target(s) in 3m 35s
------------
 Nextest run ID a90e00f1-faac-4db4-82a3-f32ca87dd2bd with nextest profile: default
    Starting 3 tests across 162 binaries (1536 tests and 5 binaries skipped, including 5 binaries via profile.default.default-filter)
       SETUP [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
             [ 00:00:00] [                   ]    0/1539:      
   Compiling nexus-config v0.1.0 (/home/eliza/Code/oxide/omicron/nexus-config)
   Compiling omicron-test-utils v0.1.0 (/home/eliza/Code/oxide/omicron/test-utils)
   Compiling crdb-seed v0.1.0 (/home/eliza/Code/oxide/omicron/dev-tools/crdb-seed)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 3.76s
     Running `target/debug/crdb-seed`
Sep 27 18:54:25.474 INFO Using existing CRDB seed tarball: `/tmp/crdb-base-eliza/7888c2fb782f3500cf5404b9680c8152f4554f551acee683d7a98ec218b57e57.tar`
  SETUP PASS [      1/1] crdb-seed: cargo run -p crdb-seed --profile test
        PASS [  14.789s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_reincarnates_failed_instances
        PASS [  28.533s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_cooldown_on_subsequent_reincarnations
        PASS [  29.511s] omicron-nexus app::background::tasks::instance_reincarnation::test::test_only_reincarnates_eligible_instances
------------
     Summary [  33.341s] 3 tests run: 3 passed, 1536 skipped

I bet this is a race between periodic and explicit activations of the reincarnation task. Cool.

@hawkw hawkw force-pushed the eliza/saga-rewound branch from 0645a37 to 19f9f16 Compare September 27, 2024 20:37
@hawkw hawkw merged commit d7235b8 into main Sep 27, 2024
@hawkw hawkw deleted the eliza/saga-rewound branch September 27, 2024 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nexus Related to nexus

Projects

None yet

Development

Successfully merging this pull request may close these issues.

reincarnate instances with SagaUnwound VMMs

2 participants