Skip to content

want a way for Omicron code to report when supposedly impossible situations happen #10118

@davepacheco

Description

@davepacheco

It would be really handy to have some kind of interface either in Nexus or in Omicron at-large to report a situation that we thought was practically impossible has happened. Here are some examples:

// This error means that we found a row in
// bp_omicron_zone that references a NIC by id but
// there's no corresponding row in
// bp_omicron_zone_nic with that id. This should be
// impossible and reflects either a bug or database
// corruption.
omicron_zone_nics.remove(&id).ok_or_else(|| {
Error::internal_error(&format!(
"zone {:?}: expected to find NIC {:?}, \
but didn't",
z.id, z.bp_nic_id
))
})

// This should be impossible unless we shipped a TUF repo with multiple
// artifacts for the same board and with the same signature. But it
// doesn't prevent us from picking one and proceeding.
// Make a note and proceed.
warn!(
log,
"found more than one matching artifact for RoT bootloader update"
);

// This case should generally be impossible. We *are* a working Nexus
// zone, so how do we not exist in the blueprint? Anyway, there's
// really not much we can do here. Sleep a little bit just to avoid
// spinning rapidly, then try again in hopes that this was some
// transient thing.
//
// (Panicking would not be better. That would likely put us into a
// restart loop until SMF gives up. Then nothing would clear that
// condition.)
let error = anyhow!(
"could not find self ({my_nexus_id}) in blueprint {}",
current_blueprint.id
);
error!(
log,
"try_saga_quiesce(): impossible condition";
InlineErrorChain::new(&*error)
);
return Err(error);

These are operational errors. They're not under the program's control and so we definitely don't want to panic. But we also expect them to never happen in a deployed system. If it did, we'd want to know and we'd want to debug it. It would be great if we had an interface for reporting these. Maybe it could emit an ereport and the fault management subsystem could turn that into an active problem? Or we could drop a little report into the RFD 613 debug dropbox (but then we'd need some way to find these and generate an active problem).

Relatedly, although we have historically advised panicking on all programmer errors (that is, non-operational errors -- things that shouldn't be possible in a working program), that guidance was oriented around situations where the blast radius cannot be well-understood at development-time: unwrap()s that the programmer thought literally couldn't fail, invariant violations (assertion failures), memory corruption, division by zero, a true null pointer dereference, etc. In practice, we've run into many programmer error-type scenarios (where the problem is fully in control of the program and really shouldn't ever be possible) where the blast radius is well-understood. Some examples from sled agent:

// Validate the path that we were given. We're only ever given zone
// root filesystems, whose basename is always a zonename, and we always
// prefix our zone names with `oxz_`. If that's not what we find here,
// log an error and bail out. These error cases should be impossible to
// hit in practice.
let Some(file_name) = zone_path.file_name() else {
error!(
log,
"cannot archive former zone root";
"error" => "path has no filename part",
);
return;
};
if !file_name.starts_with("oxz_") {
error!(
log,
"cannot archive former zone root";
"error" => "filename does not start with \"oxz_\"",
);
return;
}

// It should be impossible for this `try_from()` to fail,
// but it's easy enough to handle gracefully.
Filename::try_from(entry.file_name().to_owned())
.with_context(|| {
format!(
"processing as a file name: {:?}",
entry.file_name(),
)
})

Particularly in Nexus, it's usually better not to panic in these cases because it's not necessary (if the blast radius is understood and bounded) and the impact of doing that in Nexus is potentially huge. (A single restart isn't a big deal, but often the state that caused it is persistent, meaning Nexus will enter a restart loop and eventually be removed from service. That's pretty bad by itself, but if another Nexus winds up taking over the work, we can take out the whole control plane this way.) But we still want to know if these things happen and we want to have the information needed to debug them.

Metadata

Metadata

Assignees

No one assigned

    Labels

    DebuggingFor when you want better data in debugging an issue (log messages, post mortem debugging, and more)fault-managementEverything related to the fault-management initiative (RFD480 and others)

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions