Skip to content

Revisit moving instances to Failed in handle_instance_put_result #4226

@gjcolombo

Description

@gjcolombo

Follow-up from the refactoring in #4194.

If Nexus calls sled agent to do something that registers an instance or changes its state, sled agent will try to return a Result<Option<SledInstanceState>, Error>. Nexus will then get back one of the following values:

  • Ok(None) if the call succeeded and this didn't change what sled agent thinks the instance/VMM state is
  • Ok(Some(new_state)) if the call succeeded and sled agent wants Nexus to try to write some new state back to CRDB
  • Err(e) if the call failed, where e: progenitor_client::Error<sled_agent_client::Error>

The first two cases are handled by simply trying to write a state update to CRDB and aren't important here. The error case, however, walks a tightrope:

Err(e) => {
// The sled-agent has told us that it can't do what we
// requested, but does that mean a failure? One example would be
// if we try to "reboot" a stopped instance. That shouldn't
// transition the instance to failed. But if the sled-agent
// *can't* boot a stopped instance, that should transition
// to failed.
//
// Without a richer error type, let the sled-agent tell Nexus
// what to do with status codes.
error!(self.log, "saw {} from instance_put!", e);
// Convert to the Omicron API error type.
//
// N.B. The match below assumes that this conversion will turn
// any 400-level error status from sled agent into an
// `Error::InvalidRequest`.
let e = e.into();
match &e {
// Bad request shouldn't change the instance state.
Error::InvalidRequest { .. } => Err(e),
// Internal server error (or anything else) should change
// the instance state to failed, we don't know what state
// the instance is in.
_ => {
let new_runtime = db::model::InstanceRuntimeState {
state: db::model::InstanceState::new(
InstanceState::Failed,
),
gen: db_instance.runtime_state.gen.next().into(),
..db_instance.runtime_state.clone()
};
// XXX what if this fails?
let result = self
.db_datastore
.instance_update_runtime(
&db_instance.id(),
&new_runtime,
)
.await;
error!(
self.log,
"saw {:?} from setting InstanceState::Failed after bad instance_put",
result,
);
Err(e)
}
}

Nexus will take the Progenitor error it got back, feed it through a From impl to convert it to an Omicron external API error, and then see if that error is a Bad Request error or something else. In the former case, the update is dropped; in the latter case, Nexus assumes some catastrophe has happened and that the instance has to be marked as Failed.

The relevant From impl is here:

// TODO-security this `From` may give us a shortcut in situtations where we
// want more robust consideration of errors. For example, while some errors
// from other services may directly result in errors that percolate to the
// external client, others may require, for example, retries with an alternate
// service instance or additional interpretation to sanitize the output error.
// This should be removed to avoid leaking data.
impl<T: ClientError> From<progenitor::progenitor_client::Error<T>> for Error {
fn from(e: progenitor::progenitor_client::Error<T>) -> Self {
match e {
// This error indicates that the inputs were not valid for this API
// call. It's reflective of either a client-side programming error.
progenitor::progenitor_client::Error::InvalidRequest(msg) => {
Error::internal_error(&format!("InvalidRequest: {}", msg))
}
// This error indicates a problem with the request to the remote
// service that did not result in an HTTP response code, but rather
// pertained to local (i.e. client-side) encoding or network
// communication.
progenitor::progenitor_client::Error::CommunicationError(ee) => {
Error::internal_error(&format!("CommunicationError: {}", ee))
}
// This error represents an expected error from the remote service.
progenitor::progenitor_client::Error::ErrorResponse(rv) => {
let message = rv.message();
match rv.status() {
http::StatusCode::SERVICE_UNAVAILABLE => {
Error::unavail(&message)
}
status if status.is_client_error() => {
Error::invalid_request(&message)
}
_ => Error::internal_error(&message),
}
}
// This error indicates that the body returned by the client didn't
// match what was documented in the OpenAPI description for the
// service. This could only happen for us in the case of a severe
// logic/encoding bug in the remote service or due to a failure of
// our version constraints (i.e. that the call was to a newer
// service with an incompatible response).
progenitor::progenitor_client::Error::InvalidResponsePayload(
ee,
) => Error::internal_error(&format!(
"InvalidResponsePayload: {}",
ee,
)),
// This error indicates that the client generated a response code
// that was not described in the OpenAPI description for the
// service; this could be a success or failure response, but either
// way it indicates a logic or version error as above.
progenitor::progenitor_client::Error::UnexpectedResponse(r) => {
Error::internal_error(&format!(
"UnexpectedResponse: status code {}",
r.status(),
))
}
}
}
}

Notice the following:

  • Progenitor errors that didn't produce a response from sled agent--e.g. disconnections due to timeouts or network weather--all get turned into internal errors. These will cause an instance to move to Failed even if the problem was completely transient.
  • Sled agent responses get flattened into either 400 Bad Request (for all 4XX errors) or 500 Internal Server Error (everything else) unless they happen to be 503 Service Unavailable, which is preserved. (See sled agent and Nexus frequently flatten errors into 500 Internal Server Error #3238.)

This creates a great many ways for an instance to be marked as Failed, even if the VM is actually perfectly healthy. The Failed state is also a pain to deal with because a Failed instance can only be deleted and not recovered in any other way (see #2825 and especially #2825 (comment)).

Apart from all of this, an instance can be marked as Failed because of a Propolis failure: for example, if Propolis hits an error while setting up a VM component, it will publish the Failed state and that will bubble back up to Nexus via sled agent.

We should consider a subtler approach to all this. One possible starting place is to distinguish Propolis failures (which really do mean "this VM is dead") from errors accessing the instance from the control plane. We could then, e.g., indicate the latter faults without necessarily changing the instance state, allow further attempts to try to change the instance's state from its last-known state, and/or try to recover connectivity to the instance from a background task.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Sled AgentRelated to the Per-Sled Configuration and ManagementnexusRelated to nexusvirtualizationPropolis Integration & VM Management

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions