sled agent: improve instance state update retry loop#3256
Merged
Conversation
jmpesp
approved these changes
May 30, 2023
jmpesp
left a comment
Contributor
There was a problem hiding this comment.
I was able to reproduce a case where a sled agent received an instance stop request, but didn't tear down the corresponding propolis zone because Nexus was returning 404s for the instance state update. I deployed and tested with this PR and now the propolis zones are properly being torn down. LGTM 🚀
3ecbdb4 to
2162dbd
Compare
Contributor
Author
|
I've been running this under automated stress testing and have not seen this issue recur. Will rebase and get this merged shortly. |
Refine the instance state update retry loop as follows: - Check return codes from Nexus much more carefully. In particular, treat client errors as permanent errors that should stop the update loop. This avoids looping forever if an instance is deleted (in Nexus) when its Propolis is stopped but not yet fully torn down (see #3230). - Log information about the updates that are being pushed. Also be sure to log when an update attempt fails permanently. - Switch the retry policy from "local" (meant for operations that don't leave the sled) to the much less aggressive "internal service." The 'caller is invoking other internal services' semantics are especially appropriate here because post-migration state updates can cause Nexus to do a lot of work that contacts other rack services. - Don't bail out of the instance state monitor task merely because it failed to push an update to Nexus, since this will cause future updates from the task's Propolis to be ignored. Tested: cargo test; ran an Omicron dev cluster and created/destroyed a few instances; reproduced #3230 and #3231 and observed that the notification worker stops retrying if an intervening instance destruction causes a notification to return 404 Not Found. Fixes #3230. Fixes #3231.
2162dbd to
537d646
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Refine the instance state update retry loop as follows:
publish_state_to_nexusshould distinguish between transient and permanent update failures #3230 for a description of how this occurs).Tested: cargo test; ran an Omicron dev cluster and created/destroyed a few instances; reproduced #3230 and #3231 and observed that the notification worker stops retrying if an intervening instance destruction causes a notification to return 404 Not Found.
Fixes #3230. Fixes #3231.