publish_state_to_nexus should distinguish between transient and permanent update failures #3230

@gjcolombo

Repro steps: No specific steps yet (that I know of). This was observed in the dogfood rack, where sled agent's attempts to update an instance were getting 404 Not Found back from Nexus. The instance was previously destroyed, so I suspect a race like the following:

  1. Instance begins to stop
  2. Propolis publishes the Stopping and Stopped states to sled agent
  3. Sled agent publishes the Stopped state to Nexus, which records it
  4. Deletion saga begins
  5. Deletion saga reaches sid_delete_instance_record, which calls project_delete_instance, which sets time_deleted on the instance
  6. Propolis publishes the Destroyed state
  7. Sled agent converts this to a second "Stopped" transition and tries to send another update to Nexus
  8. But instance_update_runtime fails to find the instance because its time_deleted is no longer NULL
  9. Sled agent loops forever trying to send this update
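Steps 5 through 8 can be illustrated with a toy in-memory model. This is only a sketch of the suspected race, not the actual Nexus datastore code: `Store`, `InstanceRecord`, `UpdateError`, `update_runtime`, and `soft_delete` are hypothetical names, and the only behavior taken from the description above is that an update cannot find a record whose `time_deleted` is set.

```rust
use std::collections::HashMap;

/// Hypothetical error returned when an update cannot find a live record.
#[derive(Debug, PartialEq)]
enum UpdateError {
    NotFound,
}

/// Hypothetical, heavily simplified instance record.
#[derive(Debug)]
struct InstanceRecord {
    state: String,
    /// `None` while the instance is live; `Some(timestamp)` once soft-deleted.
    time_deleted: Option<u64>,
}

/// Toy stand-in for the instance table.
struct Store {
    instances: HashMap<u32, InstanceRecord>,
}

impl Store {
    /// Models step 8: only records that are not soft-deleted are visible
    /// to updates, so a late update looks like "404 Not Found".
    fn update_runtime(&mut self, id: u32, new_state: &str) -> Result<(), UpdateError> {
        match self.instances.get_mut(&id) {
            Some(rec) if rec.time_deleted.is_none() => {
                rec.state = new_state.to_string();
                Ok(())
            }
            _ => Err(UpdateError::NotFound),
        }
    }

    /// Models step 5: deletion marks the record instead of removing it.
    fn soft_delete(&mut self, id: u32, now: u64) {
        if let Some(rec) = self.instances.get_mut(&id) {
            rec.time_deleted = Some(now);
        }
    }
}
```

In this model, a first "stopped" update succeeds, a `soft_delete` follows, and a second "stopped" update for the same ID fails with `NotFound` no matter how many times it is retried.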

The "loop forever" behavior is new. As noted above, when sled agent observes that a Propolis has changed state, it publishes the updated state to Nexus using its (sled agent's) publish_state_to_nexus function. #3127 added a retry-with-backoff loop to this function to avoid one of the failure modes described in #2727: a transient error when contacting Nexus would cause the Propolis state monitor to exit completely, leaving the instance stuck more or less forever.

The new retry procedure doesn't try to distinguish between transient and permanent errors. It should be more discerning and should not repeatedly send messages to Nexus that are permanently bound to fail.
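One possible shape for a more discerning loop is sketched below. It is not the existing publish_state_to_nexus code: `PublishError`, `Disposition`, `classify`, and `publish_with_retry` are hypothetical names, the delays are arbitrary, and treating 429 and 5xx responses as retryable while treating every other status (including 404) as permanent is an assumption rather than established Nexus policy. The `wait` callback stands in for whatever sleep or backoff mechanism the caller uses.

```rust
use std::time::Duration;

/// Hypothetical error from one attempt to send an update to Nexus.
#[derive(Debug, Clone, PartialEq)]
enum PublishError {
    /// Nexus answered with a non-success HTTP status code.
    Status(u16),
    /// No response was received (connection refused, timeout, ...).
    Communication(String),
}

#[derive(Debug, PartialEq)]
enum Disposition {
    Transient,
    Permanent,
}

/// Assumed policy: communication failures, 429, and 5xx may succeed later;
/// any other status (for example 404 for a deleted instance) will not.
fn classify(err: &PublishError) -> Disposition {
    match err {
        PublishError::Communication(_) => Disposition::Transient,
        PublishError::Status(429 | 500..=599) => Disposition::Transient,
        PublishError::Status(_) => Disposition::Permanent,
    }
}

/// Calls `send` until it succeeds or fails permanently. Transient failures
/// are retried with a doubling delay capped at `max_delay`. Returns the
/// number of attempts on success, or the permanent error.
fn publish_with_retry<S, W>(mut send: S, mut wait: W) -> Result<u32, PublishError>
where
    S: FnMut() -> Result<(), PublishError>,
    W: FnMut(Duration),
{
    let max_delay = Duration::from_secs(60);
    let mut delay = Duration::from_millis(250);
    let mut attempts = 0;
    loop {
        attempts += 1;
        match send() {
            Ok(()) => return Ok(attempts),
            Err(e) => match classify(&e) {
                // Give up immediately: retrying cannot change the outcome.
                Disposition::Permanent => return Err(e),
                Disposition::Transient => {
                    wait(delay);
                    delay = (delay * 2).min(max_delay);
                }
            },
        }
    }
}
```

With this split, the scenario above (a 404 for an already-deleted instance) ends after a single attempt, while a Nexus outage is still retried until it clears, which preserves the intent of #3127.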

Labels

Sled Agent: Related to the Per-Sled Configuration and Management
