Skip to content

[nexus] Allow stopping Failed instances#6652

Merged
hawkw merged 2 commits into
mainfrom
eliza/stop-failing
Sep 24, 2024
Merged

[nexus] Allow stopping Failed instances#6652
hawkw merged 2 commits into
mainfrom
eliza/stop-failing

Conversation

@hawkw

@hawkw hawkw commented Sep 24, 2024

Copy link
Copy Markdown
Member

PR #6503 changed Nexus to attempt to automatically restart instances which are in the Failed state. Now that we do this, we should probably change the allowable instance state transitions to permit a user to stop an instance that is Failed, as a way to say "stop trying to restart this instance" (as Stopped instances are not restarted). This branch changes Nexus::instance_request_state and
select_instance_change_action to permit stopping a Failed instance.

Fixes #6640
Fixes #2825, along with #6455 (which allowed restarting Failed instances).

PR #6503 changed Nexus to attempt to automatically restart instances
which are in the `Failed` state. Now that we do this, we should probably
change the allowable instance state transitions to permit a user to stop
an instance that is `Failed`, as a way to say "stop trying to restart
this instance" (as `Stopped` instances are not restarted). This branch
changes `Nexus::instance_request_state` and
`select_instance_change_action` to permit stopping a `Failed` instance.

Fixes #6640

I believe this also fixes #2825, along with #6455 (which allowed
restarting `Failed` instances).
@hawkw hawkw requested a review from gjcolombo September 24, 2024 17:27
@hawkw hawkw self-assigned this Sep 24, 2024
Comment thread nexus/src/app/instance.rs
@hawkw

hawkw commented Sep 24, 2024

Copy link
Copy Markdown
Member Author

@david-crespo I was just about to leave a comment letting you know about this change, but it looks like you were way ahead of me with oxidecomputer/console#2468. Nice :)

@david-crespo

Copy link
Copy Markdown
Contributor

My email inbox is a nightmare but I am up to date

@hawkw

hawkw commented Sep 24, 2024

Copy link
Copy Markdown
Member Author

The CI failure is due to an unexpected Tokio task cancellation in test_omdb_success_cases, which is a known test flake --- see #6505. I'm going to restart that run.

hawkw added a commit that referenced this pull request Sep 24, 2024
Currently, instances whose active VMM is `SagaUnwound` appear externally
as `Stopped`. We decided to report them as `Stopped` because start sagas
are permitted to run for instances with `SagaUnwound` active VMMs, and
--- at the time when the `SagaUnwound` VMM state was introduced,
`Failed` instances could not be started. However, #6455 added the
ability to restart `Failed` instances, and #6652 will permit them to be
stopped. Therefore, we should recast instances with `SagaUnwound` active
VMMs as `Failed`: they weren't asked politely to stop; instead, we
attempted to start them and something went wrong...which sounds like
`Failed` to me.

This becomes more important in light of #6638: if we will attempt
automatically restart such instances, they should definitely appear
to be `Failed`. The distinction between `Failed` and `Stopped` becomes
that `Failed` means "this thing isn't running, but it's supposed to be;
we may try to fix that for you if permitted to do so", while `Stopped`
means "this thing isn't running and that's fine, because you asked for
it to no longer be running". Thus, this commit changes `SagaUnwound`
VMMs to appear `Failed` externally.
Comment thread nexus/src/app/instance.rs
Comment thread nexus/src/app/instance.rs
Comment thread nexus/src/app/instance.rs
@hawkw hawkw enabled auto-merge (squash) September 24, 2024 22:22
@hawkw hawkw merged commit 0c7fb27 into main Sep 24, 2024
@hawkw hawkw deleted the eliza/stop-failing branch September 24, 2024 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[nexus] want to allow stopping Failed instances Failed instances should be allowed to stop and restart

3 participants