`Nexus::instance_start` doesn't check that the target instance is fully created

This was seen in stress testing of PR #5749. That PR seems to hit the problem more often than main does, for reasons we're still investigating, but it looks like it's also a bug on current main so I'm filing a separate issue for it.

This is a regression I introduced in #4194. `Nexus::instance_start` previously rejected attempts to start a Creating instance: https://github.com/oxidecomputer/omicron/blob/79fee6a2c9f0099fb9cca6effbb862fd75c0a4e4/nexus/src/app/instance.rs#L483-L520

In 4194 I changed this path so that it checks the state of the instance's *VMM* but does not pay attention to the state of the *instance*: https://github.com/oxidecomputer/omicron/blob/acbaf4370b876bd8f27b71bbf41fd12abafc09dc/nexus/src/app/instance.rs#L533-L581

The `if let Some(vmm)` clause should have an `else` that checks to make sure that the instance is actually in a state where it's ready to be started (i.e. it should be in `NoVmm`).

At first glance the impact of this seems pretty limited: to hit it, you need to create an instance, then try to start it by name before the create saga is finished. The result that we're seeing is that the start request 500s; the other possible result, I think, is that the VM will start without all the virtual hardware you might expect, which you can recover from by stopping and restarting the instance.

	// If the instance is already starting or running, succeed immediately
	// for idempotency. If the instance is stopped, try to start it. In all
	// other cases return an error describing the state conflict.
	//
	// The "Creating" state is not permitted here (even though a request to
	// create can include a request to start the instance) because an
	// instance that is still being created may not be ready to start yet
	// (e.g. its disks may not yet be attached).
	//
	// If the instance is stopped, the start saga will try to change the
	// instance's state to Starting and increment the instance's state
	// generation number. If this increment fails (because someone else has
	// changed the state), the saga fails. See the saga comments for more
	// details on how this synchronization works.
	match db_instance.runtime_state.state.0 {
	InstanceState::Starting \| InstanceState::Running => {
	return Ok(db_instance)
	}
	InstanceState::Stopped => {}
	_ => {
	return Err(Error::conflict(&format!(
	"instance is in state {} but must be {} to be started",
	db_instance.runtime_state.state.0,
	InstanceState::Stopped
	)))
	}
	}

	let saga_params = sagas::instance_start::Params {
	serialized_authn: authn::saga::Serialized::for_opctx(opctx),
	instance: db_instance,
	ensure_network: true,
	};

	self.execute_saga::<sagas::instance_start::SagaInstanceStart>(
	saga_params,
	)
	.await?;

	let state = self
	.db_datastore
	.instance_fetch_with_vmm(opctx, &authz_instance)
	.await?;
	let (instance, vmm) = (state.instance(), state.vmm());

	if let Some(vmm) = vmm {
	match vmm.runtime.state.0 {
	InstanceState::Starting
	\| InstanceState::Running
	\| InstanceState::Rebooting => {
	debug!(self.log, "asked to start an active instance";
	"instance_id" => %authz_instance.id());

	return Ok(state);
	}
	InstanceState::Stopped => {
	let propolis_id = instance
	.runtime()
	.propolis_id
	.expect("needed a VMM ID to fetch a VMM record");
	error!(self.log,
	"instance is stopped but still has an active VMM";
	"instance_id" => %authz_instance.id(),
	"propolis_id" => %propolis_id);

	return Err(Error::internal_error(
	"instance is stopped but still has an active VMM",
	));
	}
	_ => {
	return Err(Error::conflict(&format!(
	"instance is in state {} but must be {} to be started",
	vmm.runtime.state.0,
	InstanceState::Stopped
	)));
	}
	}
	}

	let saga_params = sagas::instance_start::Params {
	serialized_authn: authn::saga::Serialized::for_opctx(opctx),
	db_instance: instance.clone(),
	};

	self.execute_saga::<sagas::instance_start::SagaInstanceStart>(
	saga_params,
	)
	.await?;

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Nexus::instance_start` doesn't check that the target instance is fully created #6274

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Nexus::instance_start doesn't check that the target instance is fully created #6274

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`Nexus::instance_start` doesn't check that the target instance is fully created #6274