Skip to content

sled agent: improve instance state update retry loop#3256

Merged
gjcolombo merged 1 commit into
mainfrom
gjcolombo/better-sled-agent-retry
Jun 2, 2023
Merged

sled agent: improve instance state update retry loop#3256
gjcolombo merged 1 commit into
mainfrom
gjcolombo/better-sled-agent-retry

Conversation

@gjcolombo

Copy link
Copy Markdown
Contributor

Refine the instance state update retry loop as follows:

  • Check return codes from Nexus much more carefully. In particular, treat client errors as permanent errors that should stop the update loop. This avoids looping forever if an instance is deleted (in Nexus) when its Propolis is stopped but not yet fully torn down (see publish_state_to_nexus should distinguish between transient and permanent update failures #3230 for a description of how this occurs).
  • Log information about the updates that are being pushed. Also be sure to log when an update attempt fails permanently.
  • Switch the retry policy from "local" (meant for operations that don't leave the sled) to the much less aggressive "internal service." The 'caller is invoking other internal services' semantics are especially appropriate here because post-migration state updates can cause Nexus to do a lot of work that contacts other rack services.
  • Don't bail out of the instance state monitor task merely because it failed to push an update to Nexus, since this will cause future updates from the task's Propolis to be ignored.

Tested: cargo test; ran an Omicron dev cluster and created/destroyed a few instances; reproduced #3230 and #3231 and observed that the notification worker stops retrying if an intervening instance destruction causes a notification to return 404 Not Found.

Fixes #3230. Fixes #3231.

@gjcolombo gjcolombo marked this pull request as ready for review May 30, 2023 18:53

@jmpesp jmpesp left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to reproduce a case where a sled agent received an instance stop request, but didn't tear down the corresponding propolis zone because Nexus was returning 404s for the instance state update. I deployed and tested with this PR and now the propolis zones are properly being torn down. LGTM 🚀

@gjcolombo gjcolombo force-pushed the gjcolombo/better-sled-agent-retry branch from 3ecbdb4 to 2162dbd Compare May 31, 2023 15:53
@gjcolombo

Copy link
Copy Markdown
Contributor Author

I've been running this under automated stress testing and have not seen this issue recur. Will rebase and get this merged shortly.

Refine the instance state update retry loop as follows:

- Check return codes from Nexus much more carefully. In particular, treat client
  errors as permanent errors that should stop the update loop. This avoids
  looping forever if an instance is deleted (in Nexus) when its Propolis is
  stopped but not yet fully torn down (see #3230).
- Log information about the updates that are being pushed. Also be sure to log
  when an update attempt fails permanently.
- Switch the retry policy from "local" (meant for operations that don't leave
  the sled) to the much less aggressive "internal service." The 'caller is
  invoking other internal services' semantics are especially appropriate here
  because post-migration state updates can cause Nexus to do a lot of work that
  contacts other rack services.
- Don't bail out of the instance state monitor task merely because it failed to
  push an update to Nexus, since this will cause future updates from the task's
  Propolis to be ignored.

Tested: cargo test; ran an Omicron dev cluster and created/destroyed a few
instances; reproduced #3230 and #3231 and observed that the notification worker
stops retrying if an intervening instance destruction causes a notification to
return 404 Not Found.

Fixes #3230. Fixes #3231.
@gjcolombo gjcolombo force-pushed the gjcolombo/better-sled-agent-retry branch from 2162dbd to 537d646 Compare June 2, 2023 16:12
@gjcolombo gjcolombo merged commit cb09a84 into main Jun 2, 2023
@gjcolombo gjcolombo deleted the gjcolombo/better-sled-agent-retry branch June 2, 2023 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nexus mysteriously reports 404 for destroyed instance publish_state_to_nexus should distinguish between transient and permanent update failures

2 participants