"Sled Agent registering itself with Nexus" should also transfer the set of instances the sled agent knows about. This set can start out empty. See "restart customer Instances after sled reboot" (#3633) for a lot more detail here.
The Sled Agent should refuse to handle instance requests until it has successfully registered itself with Nexus. This would help avoid a race condition in which Nexus sends a request to a rebooting sled at the same time as the sled registers with Nexus and reports that "all instances are dead now", inadvertently marking a very new instance as failed.
Nexus should look up all instances that should have been running on the sled and mark them failed.
Later, Nexus can use an RPW to look for instances that are marked as "failed + auto_boot_on_fault" and re-provision them in the background.
Idea: We could plausibly update the "normal" instance provisioning workflow to rely on this RPW for provisioning, too. This would let "instance create" return much faster, and leave the work of "finding an appropriate sled and starting the instance" to a background task that could tolerate slower backend APIs.
Ensuring metric registration: As part of the above RPW, we would also like to ensure that running instances have an assignment to an oximeter collector recorded in the omicron.public.metric_producer table. When instances are stopped, that assignment needs to be removed by the cleanup portion of that RPW.
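The registration and recovery steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual Nexus or sled agent code: the types `Nexus`, `InstanceRecord`, and `SledAgent`, the methods `register_sled`, `reprovision_candidates`, and `handle_instance_request`, and the use of plain `u32` IDs are all hypothetical names chosen for the example. The oximeter assignment step is left out.

```rust
use std::collections::{HashMap, HashSet};

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
    Failed,
}

#[derive(Debug, Clone)]
struct InstanceRecord {
    state: InstanceState,
    // Hypothetical stand-in for the "auto_boot_on_fault" setting.
    auto_boot_on_fault: bool,
    // Sled the control plane believes the instance is running on, if any.
    sled_id: Option<u32>,
}

// Hypothetical Nexus-side view: instance ID -> record.
struct Nexus {
    instances: HashMap<u32, InstanceRecord>,
}

impl Nexus {
    // Handle a sled registration that carries the set of instances the sled
    // agent knows about. Any instance recorded as running on that sled but
    // absent from the reported set is marked failed. Returns the IDs that
    // were marked failed, sorted for determinism.
    fn register_sled(&mut self, sled_id: u32, known: &HashSet<u32>) -> Vec<u32> {
        let mut failed = Vec::new();
        for (id, rec) in self.instances.iter_mut() {
            let on_this_sled = rec.sled_id == Some(sled_id);
            if on_this_sled && rec.state == InstanceState::Running && !known.contains(id) {
                rec.state = InstanceState::Failed;
                failed.push(*id);
            }
        }
        failed.sort_unstable();
        failed
    }

    // Selection step of the background task: instances that are both failed
    // and configured to boot again after a fault.
    fn reprovision_candidates(&self) -> Vec<u32> {
        let mut ids: Vec<u32> = self
            .instances
            .iter()
            .filter(|(_, r)| r.state == InstanceState::Failed && r.auto_boot_on_fault)
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}

// Hypothetical sled agent that rejects instance requests until it has
// registered with Nexus.
struct SledAgent {
    registered: bool,
}

impl SledAgent {
    fn handle_instance_request(&self, instance_id: u32) -> Result<u32, &'static str> {
        if !self.registered {
            return Err("sled agent has not registered with Nexus yet");
        }
        Ok(instance_id)
    }
}
```

In this sketch a rebooted sled would register with an empty set, so every instance previously recorded as running there is marked failed, and only those with `auto_boot_on_fault` set are returned for re-provisioning.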
Instances without Sleds
We need to make it possible for Instances to not have a propolis ID / sled ID, in the case that they are stopped.
We also have cleanup to do: when an instance is stopped but not deleted, the virtual resources it consumed must be released.
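One way to picture both points is to make the placement optional and release resources at the moment it is cleared. The sketch below is illustrative only: `Placement`, `Instance`, `SledUsage`, `stop_instance`, and the field names are hypothetical and do not reflect the actual database schema or resource accounting.

```rust
// Where a running instance lives. A stopped instance has no placement.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Placement {
    sled_id: u32,
    propolis_id: u32,
}

#[derive(Debug, PartialEq, Eq)]
struct Instance {
    ncpus: u32,
    memory_gib: u32,
    // None while the instance is stopped (hypothetical representation).
    placement: Option<Placement>,
}

// Hypothetical accounting of virtual resources consumed on a sled.
#[derive(Debug, Default, PartialEq, Eq)]
struct SledUsage {
    cpus_used: u32,
    memory_gib_used: u32,
}

// Stop an instance: clear its placement and release the resources it
// consumed. Returns the old placement, or None if the instance was already
// stopped, in which case nothing is released a second time.
fn stop_instance(instance: &mut Instance, usage: &mut SledUsage) -> Option<Placement> {
    let placement = instance.placement.take()?;
    usage.cpus_used = usage.cpus_used.saturating_sub(instance.ncpus);
    usage.memory_gib_used = usage.memory_gib_used.saturating_sub(instance.memory_gib);
    Some(placement)
}
```

Tying the release to `Option::take` means a repeated stop request is a no-op, which is the property a retried cleanup step would need.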
Handling Failed Instances
Confirm that instances can be forcefully deleted after being marked failed
Plumb the sled agent API that @gjcolombo mentioned for "force-stopping an instance" through to the public-facing API for this failed case, to ensure that the instance is truly destroyed.