There are various cases where the control plane might discover that Instances that we thought were running are, in fact, not. Examples (from concrete to speculative):
when Nexus is notified by Sled Agent (in turn notified from Propolis) that a guest is no longer running (because it was stopped from inside the guest)
when we implement support for (ungraceful) sled removal, we'll need to apply this to all of the instances that we thought were on the system that's been removed
we probably want some periodic task that asks Sled Agent what instances it's running and compares that to what we think it should be running
we may want a periodic task that monitors sleds and, if we find that they seem to be unexpectedly not responding, takes steps to make sure that they're offline (i.e., power cycles them via Ignition) and then applies this process for any instances on them
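The periodic comparison described above (what Sled Agent says it is running versus what we think it should be running) amounts to two set differences. Here is a minimal sketch of that idea, not actual Nexus or Sled Agent code: `InstanceId`, `Reconciliation`, and `reconcile` are hypothetical names, and a real implementation would presumably use UUIDs and query the database and Sled Agent rather than take sets as arguments.

```rust
use std::collections::BTreeSet;

/// Hypothetical instance identifier (assumption: the real control plane
/// would use a UUID-based type here).
type InstanceId = u32;

/// Outcome of comparing the control plane's view of a sled with the
/// sled's own report. Both field names are illustrative.
#[derive(Debug, PartialEq, Eq)]
struct Reconciliation {
    /// Instances we thought were running on the sled but that it did not
    /// report: candidates for the cleanup and next steps in this issue.
    missing: BTreeSet<InstanceId>,
    /// Instances the sled reported that we did not expect to be there.
    unexpected: BTreeSet<InstanceId>,
}

/// Compare the expected set of instances with the reported set.
fn reconcile(
    expected: &BTreeSet<InstanceId>,
    reported: &BTreeSet<InstanceId>,
) -> Reconciliation {
    Reconciliation {
        missing: expected.difference(reported).copied().collect(),
        unexpected: reported.difference(expected).copied().collect(),
    }
}
```

For example, with expected instances {1, 2, 3} and reported instances {2, 3, 4}, instance 1 lands in `missing` and instance 4 in `unexpected`. How to handle `unexpected` is outside the scope of this issue.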
Relaying some notes from @gjcolombo (my apologies where I've got these details wrong):
this code path exists today where we handle the first case above (where Nexus handles a message from Sled Agent saying the instance is no longer running)
this code path has some issues and is not well factored for use in other contexts, which is what we need here
the "cleanup" referred to here includes: foreign key from instance to vmm table needs to be reaped; networking state needs to be cleaned up; provisioning counters and sled resources need to be cleaned up; etc.
the "next steps" referred to here include: looking at some per-instance "fault discipline" (currently the "boot_on_fault" boolean) to determine if we should start the instance running again somewhere else and, if so, doing that
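To make the shape of a refactored code path more concrete, here is a rough sketch of the "cleanup" and "next steps" split described in the notes above. Everything in it is hypothetical: the type names, the variant names, and the idea of returning a plan are assumptions for illustration, not the existing Nexus code. Only the `boot_on_fault` field comes from the notes, and whether a guest-initiated stop should count as a fault is left open here.

```rust
/// Cleanup actions listed in the notes above (variant names are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CleanupStep {
    /// Reap the foreign key from the instance to the vmm table.
    ClearVmmForeignKey,
    /// Clean up networking state for the instance.
    ReleaseNetworkingState,
    /// Release provisioning counters.
    ReleaseProvisioningCounters,
    /// Release the sled resources held for the instance.
    ReleaseSledResources,
}

/// What to do with the instance once cleanup is done.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NextStep {
    /// Start the instance running again somewhere else.
    RestartElsewhere,
    /// Leave the instance stopped.
    LeaveStopped,
}

/// Minimal stand-in for an instance record (assumption: the real record
/// has many more fields).
struct Instance {
    /// The per-instance "fault discipline", currently a boolean.
    boot_on_fault: bool,
}

/// The cleanup steps to run, in an arbitrary illustrative order.
fn cleanup_plan() -> Vec<CleanupStep> {
    vec![
        CleanupStep::ClearVmmForeignKey,
        CleanupStep::ReleaseNetworkingState,
        CleanupStep::ReleaseProvisioningCounters,
        CleanupStep::ReleaseSledResources,
    ]
}

/// Decide the next step from the instance's fault discipline.
fn next_step(instance: &Instance) -> NextStep {
    if instance.boot_on_fault {
        NextStep::RestartElsewhere
    } else {
        NextStep::LeaveStopped
    }
}
```

Keeping the decision (`next_step`) separate from the cleanup itself is one way each of the cases listed at the top of this issue could share the same path regardless of how the control plane learned the instance was gone.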
We may view this issue as a dup of #3742 but I wanted to file this separate ticket to reflect more precisely what's needed for sled removal. The implementation may well just be #3742.