sled agent: improve instance state update retry loop by gjcolombo · Pull Request #3256 · oxidecomputer/omicron

gjcolombo · 2023-05-30T17:20:56Z

Refine the instance state update retry loop as follows:

- Check return codes from Nexus much more carefully. In particular, treat client errors as permanent errors that should stop the update loop. This avoids looping forever if an instance is deleted (in Nexus) when its Propolis is stopped but not yet fully torn down (see #3230, "publish_state_to_nexus should distinguish between transient and permanent update failures", for a description of how this occurs).
- Log information about the updates that are being pushed. Also be sure to log when an update attempt fails permanently.
- Switch the retry policy from "local" (meant for operations that don't leave the sled) to the much less aggressive "internal service." The "caller is invoking other internal services" semantics are especially appropriate here because post-migration state updates can cause Nexus to do a lot of work that contacts other rack services.
- Don't bail out of the instance state monitor task merely because it failed to push an update to Nexus, since this will cause future updates from the task's Propolis to be ignored.
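
To illustrate the first two points, here is a minimal, std-only sketch of a retry loop that treats client errors as permanent and everything else as transient. It is not the code in this PR: `PushError`, `classify_status`, and `push_with_retry` are hypothetical names, the update call is modeled as a closure returning a bare HTTP status code, and the doubling backoff is only a stand-in for whatever the "internal service" retry policy actually does.

```rust
use std::time::Duration;

/// How a failed attempt to push a state update should be treated.
/// (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
enum PushError {
    /// Worth retrying, e.g. a 5xx response.
    Transient(String),
    /// Retrying cannot help, e.g. 404 Not Found for a deleted instance.
    Permanent(String),
}

/// Classify an HTTP status code: 2xx succeeds, 4xx is a permanent
/// failure, and anything else is assumed to be transient.
fn classify_status(status: u16) -> Result<(), PushError> {
    match status {
        200..=299 => Ok(()),
        400..=499 => Err(PushError::Permanent(format!("client error {status}"))),
        _ => Err(PushError::Transient(format!("status {status}"))),
    }
}

/// Call `attempt` until it succeeds or fails permanently.
/// Returns the number of attempts made on success, or the permanent
/// error message. `sleep` is injected so the loop can be exercised
/// without real delays.
fn push_with_retry(
    mut attempt: impl FnMut() -> u16,
    initial: Duration,
    max: Duration,
    mut sleep: impl FnMut(Duration),
) -> Result<u32, String> {
    let mut delay = initial;
    let mut attempts = 0;
    loop {
        attempts += 1;
        match classify_status(attempt()) {
            Ok(()) => return Ok(attempts),
            Err(PushError::Permanent(msg)) => {
                // Log the permanent failure and stop the loop.
                eprintln!("update failed permanently: {msg}");
                return Err(msg);
            }
            Err(PushError::Transient(msg)) => {
                eprintln!("update failed, retrying in {delay:?}: {msg}");
                sleep(delay);
                // Double the delay, capped at `max`.
                delay = delay.saturating_mul(2).min(max);
            }
        }
    }
}
```

In line with the last point, a caller that gets `Err` back from `push_with_retry` would log it and keep processing later state changes rather than exiting its monitor task.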

Tested: cargo test; ran an Omicron dev cluster and created/destroyed a few instances; reproduced #3230 and #3231 and observed that the notification worker stops retrying if an intervening instance destruction causes a notification to return 404 Not Found.

Fixes #3230. Fixes #3231.

jmpesp

I was able to reproduce a case where a sled agent received an instance stop request but didn't tear down the corresponding Propolis zone because Nexus was returning 404s for the instance state update. I deployed and tested with this PR, and the Propolis zones are now properly torn down. LGTM 🚀

gjcolombo · 2023-06-02T16:04:23Z

I've been running this under automated stress testing and have not seen this issue recur. Will rebase and get this merged shortly.

gjcolombo marked this pull request as ready for review May 30, 2023 18:53

jmpesp approved these changes May 30, 2023

gjcolombo force-pushed the gjcolombo/better-sled-agent-retry branch from 3ecbdb4 to 2162dbd May 31, 2023 15:53

gjcolombo force-pushed the gjcolombo/better-sled-agent-retry branch from 2162dbd to 537d646 June 2, 2023 16:12

gjcolombo merged commit cb09a84 into main Jun 2, 2023

gjcolombo deleted the gjcolombo/better-sled-agent-retry branch June 2, 2023 17:58
