You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If Nexus calls sled agent to do something that registers an instance or changes its state, sled agent will try to return a Result<Option<SledInstanceState>, Error>. Nexus will then get back one of the following values:
Ok(None) if the call succeeded and this didn't change what sled agent thinks the instance/VMM state is
Ok(Some(new_state)) if the call succeeded and sled agent wants Nexus to try to write some new state back to CRDB
Err(e) if the call failed, where e: progenitor_client::Error<sled_agent_client::Error>
The first two cases are handled by simply trying to write a state update to CRDB and aren't important here. The error case, however, walks a tightrope:
// The sled-agent has told us that it can't do what we
// requested, but does that mean a failure? One example would be
// if we try to "reboot" a stopped instance. That shouldn't
// transition the instance to failed. But if the sled-agent
// *can't* boot a stopped instance, that should transition
// to failed.
//
// Without a richer error type, let the sled-agent tell Nexus
// what to do with status codes.
error!(self.log,"saw {} from instance_put!", e);
// Convert to the Omicron API error type.
//
// N.B. The match below assumes that this conversion will turn
// any 400-level error status from sled agent into an
// `Error::InvalidRequest`.
let e = e.into();
match&e {
// Bad request shouldn't change the instance state.
Error::InvalidRequest{ .. } => Err(e),
// Internal server error (or anything else) should change
// the instance state to failed, we don't know what state
// the instance is in.
_ => {
let new_runtime = db::model::InstanceRuntimeState{
state: db::model::InstanceState::new(
InstanceState::Failed,
),
gen: db_instance.runtime_state.gen.next().into(),
..db_instance.runtime_state.clone()
};
// XXX what if this fails?
let result = self
.db_datastore
.instance_update_runtime(
&db_instance.id(),
&new_runtime,
)
.await;
error!(
self.log,
"saw {:?} from setting InstanceState::Failed after bad instance_put",
result,
);
Err(e)
}
}
Nexus will take the Progenitor error it got back, feed it through a From impl to convert it to an Omicron external API error, and then see if that error is a Bad Request error or something else. In the former case, the update is dropped; in the latter case, Nexus assumes some catastrophe has happened and that the instance has to be marked as Failed.
Progenitor errors that didn't produce a response from sled agent--e.g. disconnections due to timeouts or network weather--all get turned into internal errors. These will cause an instance to move to Failed even if the problem was completely transient.
This creates a great many ways for an instance to be marked as Failed, even if the VM is actually perfectly healthy. The Failed state is also a pain to deal with because a Failed instance can only be deleted and not recovered in any other way (see #2825 and especially #2825 (comment)).
Apart from all of this, an instance can be marked as Failed because of a Propolis failure: for example, if Propolis hits an error while setting up a VM component, it will publish the Failed state and that will bubble back up to Nexus via sled agent.
We should consider a subtler approach to all this. One possible starting place is to distinguish Propolis failures (which really do mean "this VM is dead") from errors accessing the instance from the control plane. We could then, e.g., indicate the latter faults without necessarily changing the instance state, allow further attempts to try to change the instance's state from its last-known state, and/or try to recover connectivity to the instance from a background task.
Follow-up from the refactoring in #4194.
If Nexus calls sled agent to do something that registers an instance or changes its state, sled agent will try to return a
Result<Option<SledInstanceState>, Error>. Nexus will then get back one of the following values:Ok(None)if the call succeeded and this didn't change what sled agent thinks the instance/VMM state isOk(Some(new_state))if the call succeeded and sled agent wants Nexus to try to write some new state back to CRDBErr(e)if the call failed, wheree: progenitor_client::Error<sled_agent_client::Error>The first two cases are handled by simply trying to write a state update to CRDB and aren't important here. The error case, however, walks a tightrope:
omicron/nexus/src/app/instance.rs
Lines 891 to 943 in 2bb0fdc
Nexus will take the Progenitor error it got back, feed it through a
Fromimpl to convert it to an Omicron external API error, and then see if that error is a Bad Request error or something else. In the former case, the update is dropped; in the latter case, Nexus assumes some catastrophe has happened and that the instance has to be marked as Failed.The relevant From impl is here:
omicron/common/src/api/external/error.rs
Lines 355 to 418 in c030293
Notice the following:
This creates a great many ways for an instance to be marked as Failed, even if the VM is actually perfectly healthy. The Failed state is also a pain to deal with because a Failed instance can only be deleted and not recovered in any other way (see #2825 and especially #2825 (comment)).
Apart from all of this, an instance can be marked as Failed because of a Propolis failure: for example, if Propolis hits an error while setting up a VM component, it will publish the Failed state and that will bubble back up to Nexus via sled agent.
We should consider a subtler approach to all this. One possible starting place is to distinguish Propolis failures (which really do mean "this VM is dead") from errors accessing the instance from the control plane. We could then, e.g., indicate the latter faults without necessarily changing the instance state, allow further attempts to try to change the instance's state from its last-known state, and/or try to recover connectivity to the instance from a background task.