`publish_state_to_nexus` should distinguish between transient and permanent update failures

Repro steps: No specific steps yet (that I know of). This was observed in the dogfood rack, where sled agent's attempts to update an instance were getting 404 Not Found back from Nexus. The instance was previously destroyed, so I suspect a race like the following:

1. Instance begins to stop
2. Propolis publishes the Stopping and Stopped states to sled agent
3. Sled agent publishes the Stopped state to Nexus, which records it
4. Deletion saga begins
5. Deletion saga reaches `sid_delete_instance_record`, which calls `project_delete_instance`, which sets `time_deleted` on the instance
6. Propolis publishes the Destroyed state
7. Sled agent converts this to a second "Stopped" transition and tries to send another update to Nexus
8. But `instance_update_runtime` fails to find the instance because its `time_deleted` is no longer NULL
9. Sled agent loops forever trying to send this update
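
To make step 8 concrete, here is a minimal in-memory sketch of why the second update cannot succeed once the record is soft-deleted. Everything in it (`Db`, `InstanceRecord`, `update_runtime`, `soft_delete`) is a hypothetical stand-in rather than the real Nexus datastore code; the only behavior taken from the description above is that the update looks up instances whose `time_deleted` is still unset.

```rust
use std::collections::HashMap;

// Hypothetical, simplified model; not the actual Nexus types.
#[derive(Debug, Clone, PartialEq)]
enum InstanceState {
    Running,
    Stopped,
}

#[derive(Debug)]
struct InstanceRecord {
    state: InstanceState,
    // `None` plays the role of a NULL `time_deleted` column.
    time_deleted: Option<u64>,
}

#[derive(Debug, PartialEq)]
enum UpdateError {
    // Stands in for the 404 Not Found that sled agent receives.
    NotFound,
}

struct Db {
    instances: HashMap<u32, InstanceRecord>,
}

impl Db {
    // Stand-in for `instance_update_runtime`: only live (not soft-deleted)
    // records are eligible for the update.
    fn update_runtime(&mut self, id: u32, new_state: InstanceState) -> Result<(), UpdateError> {
        match self.instances.get_mut(&id) {
            Some(rec) if rec.time_deleted.is_none() => {
                rec.state = new_state;
                Ok(())
            }
            // Missing and soft-deleted records look the same to the caller.
            _ => Err(UpdateError::NotFound),
        }
    }

    // Stand-in for the effect of `project_delete_instance`: mark the record
    // as deleted instead of removing it.
    fn soft_delete(&mut self, id: u32, now: u64) {
        if let Some(rec) = self.instances.get_mut(&id) {
            rec.time_deleted = Some(now);
        }
    }
}
```

In this model, the first `update_runtime(id, InstanceState::Stopped)` succeeds, `soft_delete` then sets `time_deleted`, and every later `update_runtime` call returns `NotFound` regardless of how many times it is retried.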

The "loop forever" behavior is new. As noted above, when sled agent observes that a Propolis has changed state, it publishes the updated state to Nexus using its (sled agent's) `publish_state_to_nexus` function. #3127 added a retry-with-backoff loop to this function to try to avoid one of the failure modes described in #2727: a transient error when contacting Nexus would cause the Propolis state monitor to exit completely, causing the instance to be stuck more or less forever.

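The new retry procedure doesn't distinguish between transient and permanent errors. It should be more discerning: it should keep retrying errors that might clear up on their own, but it should not repeatedly send Nexus a message that is bound to fail every time (such as an update for an instance that has already been deleted).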
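
One possible shape for a more discerning loop is sketched below. This is a synchronous, standard-library-only illustration, not the code from #3127: `PublishError`, `classify_status`, and `publish_with_retry` are hypothetical names, and the mapping from HTTP status codes to error classes is an assumption that would need to be checked against what Nexus actually returns. The sketch keeps retrying transient errors with capped exponential backoff (so a brief Nexus outage still doesn't kill the state monitor) and gives up immediately on a permanent error.

```rust
use std::time::Duration;

// Hypothetical error classification; not the actual sled agent types.
#[derive(Debug, Clone, PartialEq)]
enum PublishError {
    // Might succeed later, e.g. Nexus is unreachable or overloaded.
    Transient(String),
    // Retrying cannot help, e.g. 404 because the instance was deleted.
    Permanent(String),
}

// Assumed mapping from an HTTP status code to an error class.
fn classify_status(status: u16) -> Result<(), PublishError> {
    match status {
        200..=299 => Ok(()),
        408 | 429 | 500..=599 => Err(PublishError::Transient(format!("status {status}"))),
        _ => Err(PublishError::Permanent(format!("status {status}"))),
    }
}

// Calls `send` until it succeeds or fails permanently. Transient failures
// are retried indefinitely, doubling the delay up to `max_delay`.
// On success, returns the number of attempts that were made.
fn publish_with_retry<F>(
    mut send: F,
    initial_delay: Duration,
    max_delay: Duration,
) -> Result<u32, PublishError>
where
    F: FnMut() -> Result<(), PublishError>,
{
    let mut delay = initial_delay;
    let mut attempts: u32 = 0;
    loop {
        attempts += 1;
        match send() {
            Ok(()) => return Ok(attempts),
            // Permanent failure: surface it to the caller instead of looping.
            Err(e @ PublishError::Permanent(_)) => return Err(e),
            // Transient failure: wait, then try again with a longer delay.
            Err(PublishError::Transient(_)) => {
                std::thread::sleep(delay);
                delay = std::cmp::min(delay.saturating_mul(2), max_delay);
            }
        }
    }
}
```

With this structure, the caller decides what a permanent failure means. In the scenario above, a 404 for a deleted instance would presumably be logged and dropped so that the state monitor can move on rather than retry forever.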

`publish_state_to_nexus` should distinguish between transient and permanent update failures #3230

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

publish_state_to_nexus should distinguish between transient and permanent update failures #3230

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`publish_state_to_nexus` should distinguish between transient and permanent update failures #3230