Skip to content

A VMM in state 'starting' can receive requests#7398

Open
jmpesp wants to merge 1 commit into
oxidecomputer:mainfrom
jmpesp:vmm_in_starting_can_receive_requests
Open

A VMM in state 'starting' can receive requests#7398
jmpesp wants to merge 1 commit into
oxidecomputer:mainfrom
jmpesp:vmm_in_starting_can_receive_requests

Conversation

@jmpesp

@jmpesp jmpesp commented Jan 24, 2025

Copy link
Copy Markdown
Contributor

If a VMM is in state 'starting' and cannot activate a given disk, it will hang. Region replacement or region snapshot replacement may alter the VCR of that disk to one that could be activated, but the current drive saga will not attempt to send requests to a VMM in 'starting'. Fix this - any time a Propolis is expected to be there, it should be ok to receive these requests. Otherwise a repair can be stuck, and the VMM could fail to stop if it gets stuck deactivating too!

If a VMM is in state 'starting' and cannot activate a given disk, it
will hang. Region replacement or region snapshot replacement may alter
the VCR of that disk to one that *could* be activated, but the current
drive saga will not attempt to send requests to a VMM in 'starting'. Fix
this - any time a Propolis is expected to be there, it should be ok to
receive these requests. Otherwise a repair can be stuck, and the VMM
could fail to stop if it gets stuck deactivating too!
@gjcolombo

Copy link
Copy Markdown
Contributor

Have we verified empirically that region snapshot/replacement work as intended on a Propolis that hasn't yet reached Running? I'm not certain they will (and would need to go look carefully at the relevant Propolis code to check).

@hawkw hawkw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @gjcolombo --- it's worth making sure that propolis-server can actually handle region replacement requests in that state.

| VmmState::Rebooting
| VmmState::Starting => {
// Propolis server is expected to be there
// (eventually, in the case of "Starting"), and

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that a VMM isn't in the Starting state unless the Propolis server process does exist, FWIW --- if memory serves, that's the distinction between Starting and Creating. It may not yet be incarnating an instance, though.

@gjcolombo

Copy link
Copy Markdown
Contributor

I looked at this a bit. In Propolis there are at least two things we need to consider:

  • Sled agent maps Propolis's "Creating" state to Omicron's "Starting" state, so it's possible that the Propolis API call will return a "no instance" error (it hasn't gotten far enough along to believe it has an active VMM yet). This is probably fine; if you can detect this error you can just retry.
  • Probably more significantly: Propolis queues VCR change requests to the worker task that's otherwise responsible for initializing VMs and changing their state. (This is so that we don't have to reason about what happens if you try to mutate an instance's configuration while it's migrating.) If a VM is stuck in Starting because it can't activate a Crucible disk, the region replacement that might fix the situation will queue up behind the "start the VM" task and won't actually resolve.

It might be possible to fix the latter issue, but it's probably going to take a fair amount of effort on the Propolis side of the house. (We'd also need to be sure that if we have a Crucible upstairs that's stuck in a Volume::activate, and we call Volume::target_replace on it, then that succeeds and allows the activate call to resolve.)

@gjcolombo

Copy link
Copy Markdown
Contributor

Filed oxidecomputer/propolis#841 to track the relevant Propolis enhancement.

@leftwo leftwo added the expunge expunge sled or disk issues label Mar 19, 2025
@leftwo leftwo added the Update System Replacing old bits with newer, cooler bits label Mar 22, 2025
@askfongjojo askfongjojo assigned jmpesp and hawkw and unassigned jmpesp Mar 25, 2025
@hawkw

hawkw commented Mar 26, 2025

Copy link
Copy Markdown
Member

@askfongjojo I believe this issue was resolved by @gjcolombo's work in oxidecomputer/propolis#873 --- right, Greg?

@hawkw

hawkw commented Mar 26, 2025

Copy link
Copy Markdown
Member

@askfongjojo I believe this issue was resolved by @gjcolombo's work in oxidecomputer/propolis#873 --- right, Greg?

Oh, whoops, this is a PR, not an issue. I'll take a look.

@hawkw hawkw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I believe that now that @gjcolombo's merged oxidecomputer/propolis#873 and oxidecomputer/propolis#880, a VMM in state 'starting' now actually can receive requests. :)

@hawkw

hawkw commented Jul 24, 2025

Copy link
Copy Markdown
Member

@jmpesp is this change still needed? It looks like the merge conflict is just rustfmt churn. As of Greg's work on oxidecomputer/propolis#880, Propolis should be able to handle region-replacement requests when in Starting without misbehaving --- do we send those region-replacement requests elsewhere now? If not, I think we should merge this change, right?

@davepacheco davepacheco added the Add/expunge sled (dap: using this for import into new add/expunge sled project) label Jul 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Add/expunge sled (dap: using this for import into new add/expunge sled project) expunge expunge sled or disk issues Update System Replacing old bits with newer, cooler bits

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants