want a way for Omicron code to report when supposedly impossible situations happen

It would be really handy to have some kind of interface either in Nexus or in Omicron at-large to report a situation that we thought was practically impossible has happened.  Here are some examples:
https://github.com/oxidecomputer/omicron/blob/8d7f973204c6fbe22a04ee0136eee9f475f34a57/nexus/db-queries/src/db/datastore/deployment.rs#L1380-L1392
https://github.com/oxidecomputer/omicron/blob/8d7f973204c6fbe22a04ee0136eee9f475f34a57/nexus/reconfigurator/planning/src/mgs_updates/rot_bootloader.rs#L150-L157
https://github.com/oxidecomputer/omicron/blob/8d7f973204c6fbe22a04ee0136eee9f475f34a57/nexus/src/app/quiesce.rs#L362-L380

These are operational errors.  They're not under the program's control and so we definitely don't want to panic.  But we also expect them to _never_ happen in a deployed system.  If it did, we'd want to know and we'd want to debug it.  It would be great if we had an interface for reporting these.  Maybe it could emit an ereport and the fault management subsystem could turn that into an active problem?  Or we could drop a little report into the RFD 613 debug dropbox (but then we'd need some way to find these and generate an active problem).

Relatedly, although we have historically advised panicking on all programmer errors (that is, non-operational errors -- things that shouldn't be possible in a working program), that guidance was oriented around situations where the blast radius cannot be well-understood at development-time: `unwrap()`s that the programmer thought literally couldn't fail, invariant violations (assertion failures), memory corruption, division by zero, a true null pointer dereference, etc.  In practice, we've run into many programmer error-type scenarios (where the problem is fully in control of the program and really shouldn't _ever_ be possible) where the blast radius _is_ well-understood.  Some examples from sled agent:
https://github.com/oxidecomputer/omicron/blob/8d7f973204c6fbe22a04ee0136eee9f475f34a57/sled-agent/config-reconciler/src/debug_collector/handle.rs#L167-L188
https://github.com/oxidecomputer/omicron/blob/8d7f973204c6fbe22a04ee0136eee9f475f34a57/sled-agent/config-reconciler/src/debug_collector/file_archiver/filesystem.rs#L79-L87

Particularly in Nexus, it's usually better _not_ to panic in these cases because it's not necessary (if the blast radius is understood and bounded) and the impact of doing that in Nexus is potentially huge.  (A single restart isn't a big deal, but often the state that caused it is persistent, meaning Nexus will enter a restart loop and eventually be removed from service.  That's pretty bad by itself, but if another Nexus winds up taking over the work, we can take out the whole control plane this way.)  But we still want to know if these things happen and we want to have the information needed to debug them.

	// This error means that we found a row in
	// bp_omicron_zone that references a NIC by id but
	// there's no corresponding row in
	// bp_omicron_zone_nic with that id. This should be
	// impossible and reflects either a bug or database
	// corruption.
	omicron_zone_nics.remove(&id).ok_or_else(\|\| {
	Error::internal_error(&format!(
	"zone {:?}: expected to find NIC {:?}, \
	but didn't",
	z.id, z.bp_nic_id
	))
	})

	// This should be impossible unless we shipped a TUF repo with multiple
	// artifacts for the same board and with the same signature. But it
	// doesn't prevent us from picking one and proceeding.
	// Make a note and proceed.
	warn!(
	log,
	"found more than one matching artifact for RoT bootloader update"
	);

	// This case should generally be impossible. We are a working Nexus
	// zone, so how do we not exist in the blueprint? Anyway, there's
	// really not much we can do here. Sleep a little bit just to avoid
	// spinning rapidly, then try again in hopes that this was some
	// transient thing.
	//
	// (Panicking would not be better. That would likely put us into a
	// restart loop until SMF gives up. Then nothing would clear that
	// condition.)
	let error = anyhow!(
	"could not find self ({my_nexus_id}) in blueprint {}",
	current_blueprint.id
	);
	error!(
	log,
	"try_saga_quiesce(): impossible condition";
	InlineErrorChain::new(&*error)
	);
	return Err(error);

	// Validate the path that we were given. We're only ever given zone
	// root filesystems, whose basename is always a zonename, and we always
	// prefix our zone names with `oxz_`. If that's not what we find here,
	// log an error and bail out. These error cases should be impossible to
	// hit in practice.
	let Some(file_name) = zone_path.file_name() else {
	error!(
	log,
	"cannot archive former zone root";
	"error" => "path has no filename part",
	);
	return;
	};

	if !file_name.starts_with("oxz_") {
	error!(
	log,
	"cannot archive former zone root";
	"error" => "filename does not start with \"oxz_\"",
	);
	return;
	}

	// It should be impossible for this `try_from()` to fail,
	// but it's easy enough to handle gracefully.
	Filename::try_from(entry.file_name().to_owned())
	.with_context(\|\| {
	format!(
	"processing as a file name: {:?}",
	entry.file_name(),
	)
	})

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

want a way for Omicron code to report when supposedly impossible situations happen #10118

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

want a way for Omicron code to report when supposedly impossible situations happen #10118

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions