need a way to trigger cleanup and next steps for vanished instances #4872

There are various cases where the control plane might discover that Instances that we thought were running are, in fact, not.  Examples (from concrete to speculative):

- when Nexus is notified by Sled Agent (in turn notified from Propolis) that a guest is no longer running (because it was stopped from inside the guest)
- when a sled reboots, its instances are all gone and this process needs to be applied to each of them (this is #3633)
- when we implement support for (ungraceful) sled removal, we'll need to apply this to all of the instances that we thought were on the system that's been removed
- we probably want some periodic task that asks Sled Agent what instances it's running and compares that to what we think it should be running
- we _may_ want a periodic task that monitors sleds and, if we find that they seem to be unexpectedly not responding, takes steps to make sure that they're offline (i.e., power cycles them via Ignition) and then applies this process for any instances on them
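The periodic comparison in the fourth case comes down to a set difference between what the control plane expects a sled to be running and what Sled Agent reports. Here is a minimal sketch of that idea. The `InstanceId` alias and the `vanished_instances` function are hypothetical names used for illustration; they are not existing Nexus or Sled Agent APIs.

```rust
use std::collections::HashSet;

/// Hypothetical instance identifier, used only for this illustration.
type InstanceId = u32;

/// Returns the instances that the control plane believes are running on a
/// sled but that the sled did not report, sorted for stable output.
/// (Hypothetical helper, not an existing Nexus function.)
fn vanished_instances(
    expected: &HashSet<InstanceId>,
    reported: &HashSet<InstanceId>,
) -> Vec<InstanceId> {
    // Anything expected but not reported is a candidate for the
    // cleanup-and-next-steps process described in this issue.
    let mut gone: Vec<InstanceId> =
        expected.difference(reported).copied().collect();
    gone.sort_unstable();
    gone
}
```

Instances that the sled reports but that the control plane does not expect are a separate problem, and this sketch deliberately ignores them.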

Relaying some notes from @gjcolombo (my apologies where I've got these details wrong):

- this code path exists today where we handle the first case above (where Nexus handles a message from Sled Agent saying the instance is no longer running)
- this code path has some issues and is not well factored for use in other contexts, such as the ones we need here
- all of this would be reworked by #3742
- the "cleanup" referred to here includes: foreign key from instance to vmm table needs to be reaped; networking state needs to be cleaned up; provisioning counters and sled resources need to be cleaned up; etc.
- the "next steps" referred to here include: looking at some per-instance "fault discipline" (currently the "boot_on_fault" boolean) to determine if we should start the instance running again somewhere else and, if so, doing that
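The "cleanup" and "next steps" described in the notes above could be modeled roughly as follows. This is only a sketch under assumptions: every type and function name is hypothetical, the cleanup steps are just the ones listed above (the "etc." is not filled in), and the only fault discipline modeled is the `boot_on_fault` boolean. It also ignores why the instance vanished, which a real implementation would probably have to consider.

```rust
/// Hypothetical cleanup steps, mirroring the list in the notes above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CleanupStep {
    /// Reap the foreign key from the instance to the vmm table.
    ClearVmmReference,
    /// Clean up networking state for the instance.
    ReleaseNetworkingState,
    /// Clean up provisioning counters.
    ReleaseProvisioningCounters,
    /// Clean up sled resources held for the instance.
    ReleaseSledResources,
}

/// Hypothetical outcome of consulting the per-instance fault discipline.
#[derive(Debug, PartialEq, Eq)]
enum NextStep {
    /// Start the instance running again somewhere else.
    Restart,
    /// Leave the instance stopped.
    LeaveStopped,
}

/// Hypothetical view of an instance that was found to be gone.
struct VanishedInstance {
    /// The current per-instance fault discipline.
    boot_on_fault: bool,
}

/// The cleanup steps to apply to a vanished instance. The order here is
/// illustrative only; the issue does not specify one.
fn cleanup_plan() -> [CleanupStep; 4] {
    [
        CleanupStep::ClearVmmReference,
        CleanupStep::ReleaseNetworkingState,
        CleanupStep::ReleaseProvisioningCounters,
        CleanupStep::ReleaseSledResources,
    ]
}

/// Decide what to do after cleanup, based only on `boot_on_fault`.
fn next_step(instance: &VanishedInstance) -> NextStep {
    if instance.boot_on_fault {
        NextStep::Restart
    } else {
        NextStep::LeaveStopped
    }
}
```

Keeping the cleanup plan separate from the restart decision would let each of the discovery paths listed at the top (guest stop, sled reboot, sled removal, periodic reconciliation) call the same two functions.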

We may view this issue as a dup of #3742 but I wanted to file this separate ticket to reflect more precisely what's needed for sled removal.  The implementation may well just be #3742.
