This was seen in stress testing of PR #5749. That PR seems to hit the problem more often than main does, for reasons we're still investigating, but it looks like it's also a bug on current main so I'm filing a separate issue for it.
This is a regression I introduced in #4194. Nexus::instance_start previously rejected attempts to start a Creating instance:
|
// If the instance is already starting or running, succeed immediately |
|
// for idempotency. If the instance is stopped, try to start it. In all |
|
// other cases return an error describing the state conflict. |
|
// |
|
// The "Creating" state is not permitted here (even though a request to |
|
// create can include a request to start the instance) because an |
|
// instance that is still being created may not be ready to start yet |
|
// (e.g. its disks may not yet be attached). |
|
// |
|
// If the instance is stopped, the start saga will try to change the |
|
// instance's state to Starting and increment the instance's state |
|
// generation number. If this increment fails (because someone else has |
|
// changed the state), the saga fails. See the saga comments for more |
|
// details on how this synchronization works. |
|
match db_instance.runtime_state.state.0 { |
|
InstanceState::Starting | InstanceState::Running => { |
|
return Ok(db_instance) |
|
} |
|
InstanceState::Stopped => {} |
|
_ => { |
|
return Err(Error::conflict(&format!( |
|
"instance is in state {} but must be {} to be started", |
|
db_instance.runtime_state.state.0, |
|
InstanceState::Stopped |
|
))) |
|
} |
|
} |
|
|
|
let saga_params = sagas::instance_start::Params { |
|
serialized_authn: authn::saga::Serialized::for_opctx(opctx), |
|
instance: db_instance, |
|
ensure_network: true, |
|
}; |
|
|
|
self.execute_saga::<sagas::instance_start::SagaInstanceStart>( |
|
saga_params, |
|
) |
|
.await?; |
In 4194 I changed this path so that it checks the state of the instance's VMM but does not pay attention to the state of the instance:
|
let state = self |
|
.db_datastore |
|
.instance_fetch_with_vmm(opctx, &authz_instance) |
|
.await?; |
|
let (instance, vmm) = (state.instance(), state.vmm()); |
|
|
|
if let Some(vmm) = vmm { |
|
match vmm.runtime.state.0 { |
|
InstanceState::Starting |
|
| InstanceState::Running |
|
| InstanceState::Rebooting => { |
|
debug!(self.log, "asked to start an active instance"; |
|
"instance_id" => %authz_instance.id()); |
|
|
|
return Ok(state); |
|
} |
|
InstanceState::Stopped => { |
|
let propolis_id = instance |
|
.runtime() |
|
.propolis_id |
|
.expect("needed a VMM ID to fetch a VMM record"); |
|
error!(self.log, |
|
"instance is stopped but still has an active VMM"; |
|
"instance_id" => %authz_instance.id(), |
|
"propolis_id" => %propolis_id); |
|
|
|
return Err(Error::internal_error( |
|
"instance is stopped but still has an active VMM", |
|
)); |
|
} |
|
_ => { |
|
return Err(Error::conflict(&format!( |
|
"instance is in state {} but must be {} to be started", |
|
vmm.runtime.state.0, |
|
InstanceState::Stopped |
|
))); |
|
} |
|
} |
|
} |
|
|
|
let saga_params = sagas::instance_start::Params { |
|
serialized_authn: authn::saga::Serialized::for_opctx(opctx), |
|
db_instance: instance.clone(), |
|
}; |
|
|
|
self.execute_saga::<sagas::instance_start::SagaInstanceStart>( |
|
saga_params, |
|
) |
|
.await?; |
The if let Some(vmm) clause should have an else that checks to make sure that the instance is actually in a state where it's ready to be started (i.e. it should be in NoVmm).
At first glance the impact of this seems pretty limited: to hit it, you need to create an instance, then try to start it by name before the create saga is finished. The result that we're seeing is that the start request 500s; the other possible result, I think, is that the VM will start without all the virtual hardware you might expect, which you can recover from by stopping and restarting the instance.
This was seen in stress testing of PR #5749. That PR seems to hit the problem more often than main does, for reasons we're still investigating, but it looks like it's also a bug on current main so I'm filing a separate issue for it.
This is a regression I introduced in #4194.
Nexus::instance_startpreviously rejected attempts to start a Creating instance:omicron/nexus/src/app/instance.rs
Lines 483 to 520 in 79fee6a
In 4194 I changed this path so that it checks the state of the instance's VMM but does not pay attention to the state of the instance:
omicron/nexus/src/app/instance.rs
Lines 533 to 581 in acbaf43
The
if let Some(vmm)clause should have anelsethat checks to make sure that the instance is actually in a state where it's ready to be started (i.e. it should be inNoVmm).At first glance the impact of this seems pretty limited: to hit it, you need to create an instance, then try to start it by name before the create saga is finished. The result that we're seeing is that the start request 500s; the other possible result, I think, is that the VM will start without all the virtual hardware you might expect, which you can recover from by stopping and restarting the instance.