Tracking: Instance Lifecycle Overhaul

- [ ] **Updating Instance State Information within Nexus**
  - [ ] When the Sled Agent registers itself with Nexus, it should also report the set of instances it knows about. This set can start out empty. See https://github.com/oxidecomputer/omicron/issues/3633 for a lot more detail here.
  - [ ] The Sled Agent should refuse to handle instance requests until it has successfully registered itself with Nexus. This would help avoid the following race: Nexus sends a request to a rebooting sled at the same time as the sled registers with Nexus and reports that "all instances are dead now", which inadvertently marks a brand-new instance as failed.
  - [x] Nexus should look up all instances that **should have** been running on the sled and mark them failed.
  - [x] **Later** Nexus can use an RPW to look for instances that are marked as "failed + auto_boot_on_fault", and re-provision them in the background.
  - [ ] Idea: We could plausibly update the "normal" instance provisioning workflow to rely on this RPW for provisioning, too. This would let "instance create" return much faster, and leave the work of "finding an appropriate sled and starting the instance" to a background task that could tolerate slower APIs to the backend.
  - [x] Ensuring metric registration: As part of the above RPW, we would also like to ensure that running instances have an assignment to an `oximeter` collector recorded in the `omicron.public.metric_producer` table. When instances are stopped, that assignment needs to be removed by the cleanup portion of that RPW.
- [ ] **Instances without Sleds**
  - [x] We need to make it possible for Instances to **not** have a Propolis ID / sled ID, in the case that they are stopped.
  - [x] We also have cleanup to do: when an instance is stopped but not deleted, the virtual resources it was consuming must be released.
- [ ] **Handling Failed Instances**
  - [x] Confirm that instances can be forcefully deleted after being marked failed.
  - [ ] Plumb the sled agent API that @gjcolombo mentioned for "force-stopping an instance" through to the public-facing API for this failed case, to ensure that the instance is truly destroyed.
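
To make the registration and recovery items above more concrete, here is a minimal, self-contained sketch of the reconciliation logic. It is not omicron code: `State`, `InstanceRecord`, `reconcile_on_registration`, and `restart_candidates` are hypothetical names, integer IDs stand in for UUIDs, and an in-memory map stands in for the database. It assumes that, on registration, Nexus compares the set of instances the sled reports against the instances it believes are running there, marks the missing ones failed, and that a later background pass picks out the failed instances with `auto_boot_on_fault` set.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified instance states (the real state machine has more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Running,
    Stopped,
    Failed,
}

/// Hypothetical Nexus-side record of an instance.
#[derive(Debug, Clone, PartialEq, Eq)]
struct InstanceRecord {
    state: State,
    /// `None` models a stopped instance that has no sled assignment.
    sled_id: Option<u32>,
    auto_boot_on_fault: bool,
}

/// On sled registration, mark as failed every instance that Nexus believes is
/// running on `sled_id` but that the sled did not report. Returns the IDs of
/// the instances that were marked, in ascending order.
fn reconcile_on_registration(
    db: &mut BTreeMap<u32, InstanceRecord>,
    sled_id: u32,
    reported: &BTreeSet<u32>,
) -> Vec<u32> {
    let mut newly_failed = Vec::new();
    for (id, rec) in db.iter_mut() {
        let expected_here =
            rec.sled_id == Some(sled_id) && rec.state == State::Running;
        if expected_here && !reported.contains(id) {
            rec.state = State::Failed;
            newly_failed.push(*id);
        }
    }
    newly_failed
}

/// The instances a background (RPW) pass might select for re-provisioning:
/// those marked failed with `auto_boot_on_fault` set.
fn restart_candidates(db: &BTreeMap<u32, InstanceRecord>) -> Vec<u32> {
    db.iter()
        .filter(|(_, rec)| rec.state == State::Failed && rec.auto_boot_on_fault)
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    let mut db = BTreeMap::new();
    let running = |sled, auto| InstanceRecord {
        state: State::Running,
        sled_id: Some(sled),
        auto_boot_on_fault: auto,
    };
    db.insert(1, running(7, true));
    db.insert(2, running(7, false));
    db.insert(3, running(8, true));
    db.insert(
        4,
        InstanceRecord { state: State::Stopped, sled_id: None, auto_boot_on_fault: true },
    );

    // Sled 7 rebooted and registers, reporting only instance 2.
    let reported: BTreeSet<u32> = [2].into_iter().collect();
    let failed = reconcile_on_registration(&mut db, 7, &reported);
    println!("marked failed: {:?}", failed);
    println!("restart candidates: {:?}", restart_candidates(&db));
}
```

The sketch also shows why the Sled Agent should not accept instance requests before registration completes: an instance started on the sled between the reboot and the report would be absent from `reported` and would be marked failed by mistake.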

Tracking: Instance Lifecycle Overhaul #3742
