Repro steps: No specific steps yet (that I know of). This was observed in the dogfood rack, where sled agent's attempts to update an instance were getting 404 Not Found back from Nexus. The instance was previously destroyed, so I suspect a race like the following:
- Instance begins to stop
- Propolis publishes the Stopping and Stopped states to sled agent
- Sled agent publishes the Stopped state to Nexus, which records it
- Deletion saga begins
- Deletion saga reaches sid_delete_instance_record, which calls project_delete_instance, which sets time_deleted on the instance
- Propolis publishes the Destroyed state
- Sled agent converts this to a second "Stopped" transition and tries to send another update to Nexus
- But instance_update_runtime fails to find the instance because its time_deleted is no longer NULL
- Sled agent loops forever trying to send this update
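To make the suspected race concrete, here is a minimal in-memory sketch of the soft-delete interaction. It is not the Nexus datastore code: `Db`, `InstanceRecord`, `UpdateError`, and the method names are hypothetical stand-ins, and the only behavior modeled is the one described above (a runtime update only matches records whose time_deleted is unset).

```rust
use std::collections::HashMap;

/// Hypothetical error; stands in for the 404 Not Found that Nexus returned.
#[derive(Debug, PartialEq)]
enum UpdateError {
    NotFound,
}

/// Hypothetical, heavily simplified instance record.
struct InstanceRecord {
    state: &'static str,
    /// `None` models a NULL time_deleted column.
    time_deleted: Option<u64>,
}

/// Hypothetical in-memory stand-in for the instance table.
struct Db {
    instances: HashMap<u32, InstanceRecord>,
}

impl Db {
    /// Models the soft delete: the row stays, but time_deleted is set.
    fn delete_instance(&mut self, id: u32, now: u64) {
        if let Some(record) = self.instances.get_mut(&id) {
            record.time_deleted = Some(now);
        }
    }

    /// Models a runtime-state update that only matches live
    /// (not soft-deleted) records.
    fn update_runtime(&mut self, id: u32, state: &'static str) -> Result<(), UpdateError> {
        match self.instances.get_mut(&id) {
            Some(record) if record.time_deleted.is_none() => {
                record.state = state;
                Ok(())
            }
            // Missing or soft-deleted: the caller sees "not found".
            _ => Err(UpdateError::NotFound),
        }
    }
}
```

In this model, the first "Stopped" update succeeds, the delete then sets time_deleted, and the second "Stopped" update fails with NotFound no matter how many times it is retried.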
The "loop forever" behavior is new. As noted above, when sled agent observes that a Propolis has changed state, it publishes the updated state to Nexus using its (sled agent's) publish_state_to_nexus function. #3127 added a retry-with-backoff loop to this function to try to avoid one of the failure modes described in #2727: a transient error when contacting Nexus would cause the Propolis state monitor to exit completely, causing the instance to be stuck more or less forever.
The new retry procedure doesn't try to distinguish between transient and permanent errors. It should be more discerning and should not repeatedly send messages to Nexus that are permanently bound to fail.
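As a rough illustration of what "more discerning" could look like, the sketch below retries only errors classified as transient and gives up immediately on permanent ones. It is not the actual publish_state_to_nexus code: `PublishError`, `classify`, and `publish_with_retry` are hypothetical names, the status-code policy is an assumption, and the loop is synchronous with only the standard library so that it stays self-contained.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type standing in for whatever the Nexus client returns.
#[derive(Debug, Clone, PartialEq)]
enum PublishError {
    /// Worth retrying, e.g. Nexus was unreachable or returned a 5xx.
    Transient(String),
    /// Retrying cannot help, e.g. 404 because the instance was deleted.
    Permanent(String),
}

/// Assumed policy: 4xx responses are permanent, except 408 and 429,
/// which usually indicate a condition that may clear up on its own.
fn classify(status: u16) -> PublishError {
    let msg = format!("HTTP {status}");
    match status {
        408 | 429 => PublishError::Transient(msg),
        400..=499 => PublishError::Permanent(msg),
        _ => PublishError::Transient(msg),
    }
}

/// Calls `send` until it succeeds or fails permanently, doubling the
/// delay after each transient failure up to `max_delay`.
fn publish_with_retry<F>(mut send: F, max_delay: Duration) -> Result<(), PublishError>
where
    F: FnMut() -> Result<(), PublishError>,
{
    let mut delay = Duration::from_millis(1);
    loop {
        match send() {
            Ok(()) => return Ok(()),
            // Permanent failure: stop retrying and report it to the caller.
            Err(PublishError::Permanent(msg)) => {
                return Err(PublishError::Permanent(msg));
            }
            // Transient failure: back off, then try again.
            Err(PublishError::Transient(_)) => {
                sleep(delay);
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```

With this shape, a 404 from Nexus ends the loop after a single attempt, while a transient failure is still retried indefinitely, which preserves the behavior #3127 was after for the #2727 failure mode.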