
need a way to trigger cleanup and next steps for vanished instances #4872

@davepacheco

There are various cases where the control plane might discover that Instances that we thought were running are, in fact, not. Examples (from concrete to speculative):

  • when Nexus is notified by Sled Agent (in turn notified by Propolis) that a guest is no longer running (because it was stopped from inside the guest)
  • when a sled reboots, its instances are all gone and this process needs to be applied to each of them (this is "restart customer Instances after sled reboot" #3633)
  • when we implement support for (ungraceful) sled removal, we'll need to apply this to all of the instances that we thought were on the system that's been removed
  • we probably want some periodic task that asks Sled Agent what instances it's running and compares that to what we think it should be running
  • we may want a periodic task that monitors sleds and, if we find that they seem to be unexpectedly not responding, takes steps to make sure that they're offline (i.e., power cycles them via Ignition) and then applies this process for any instances on them
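
A minimal sketch of the comparison that the periodic-task idea above implies, assuming the control plane keeps a map from instance to the sled it believes the instance is on, and that Sled Agent can report the set of instances it is actually running. The function name `vanished_instances` and the `InstanceId` and `SledId` aliases are hypothetical and are not taken from the existing code:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins for the real identifier types.
type InstanceId = u32;
type SledId = u32;

/// Returns the instances that the control plane believes are on `sled`
/// but that the sled did not report as running.
fn vanished_instances(
    expected: &BTreeMap<InstanceId, SledId>,
    sled: SledId,
    reported: &BTreeSet<InstanceId>,
) -> Vec<InstanceId> {
    expected
        .iter()
        // Keep only instances assigned to this sled that it did not report.
        .filter(|(id, s)| **s == sled && !reported.contains(*id))
        .map(|(id, _)| *id)
        .collect()
}
```

Under this framing, a rebooted or removed sled is just the case where the reported set is empty, so every instance assigned to it is returned.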

Relaying some notes from @gjcolombo (my apologies where I've got these details wrong):

  • this code path exists today where we handle the first case above (where Nexus handles a message from Sled Agent saying the instance is no longer running)
  • this code path has some issues and is not well factored for use in other contexts like the ones we need here
  • all of this would be reworked by "Tracking: Instance Lifecycle Overhaul" #3742
  • the "cleanup" referred to here includes: foreign key from instance to vmm table needs to be reaped; networking state needs to be cleaned up; provisioning counters and sled resources need to be cleaned up; etc.
  • the "next steps" referred to here include: looking at some per-instance "fault discipline" (currently the "boot_on_fault" boolean) to determine if we should start the instance running again somewhere else and, if so, doing that
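
The "cleanup" and "next steps" described above could be sketched roughly as follows. This only illustrates the shape of the decision. Every type and function name here (`CleanupStep`, `NextStep`, `Instance`, `plan_for_vanished`) is hypothetical, the ordering of cleanup steps is an assumption, and only the `boot_on_fault` field name comes from the notes:

```rust
/// Cleanup items named in the notes above (the notes end with "etc.",
/// so this list is not exhaustive).
#[derive(Debug, PartialEq, Clone, Copy)]
enum CleanupStep {
    ReapVmmReference,
    CleanUpNetworking,
    ReleaseProvisioningCounters,
    ReleaseSledResources,
}

/// What to do with the instance once cleanup is finished.
#[derive(Debug, PartialEq)]
enum NextStep {
    RestartElsewhere,
    LeaveStopped,
}

/// Hypothetical view of an instance carrying only its fault discipline.
struct Instance {
    boot_on_fault: bool,
}

/// Builds the plan for one vanished instance: the cleanup to perform,
/// followed by the next step chosen from the fault discipline.
fn plan_for_vanished(instance: &Instance) -> (Vec<CleanupStep>, NextStep) {
    let cleanup = vec![
        CleanupStep::ReapVmmReference,
        CleanupStep::CleanUpNetworking,
        CleanupStep::ReleaseProvisioningCounters,
        CleanupStep::ReleaseSledResources,
    ];
    let next = if instance.boot_on_fault {
        NextStep::RestartElsewhere
    } else {
        NextStep::LeaveStopped
    };
    (cleanup, next)
}
```

Returning a plan rather than performing the work keeps the decision separate from how the instance was found to be gone, which is the kind of reuse across contexts this issue asks for.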

We may view this issue as a dup of #3742 but I wanted to file this separate ticket to reflect more precisely what's needed for sled removal. The implementation may well just be #3742.

Labels

  • known issue: To include in customer documentation and training
  • virtualization: Propolis Integration & VM Management
