[SRE-27144] Short-term fix for agents stuck in pending #243
Merged
Zakaria-Kofiro merged 2 commits into master on Jun 2, 2023
shawn-h-park
approved these changes
Jun 1, 2023
Short-term fix for agents stuck in pending
As part of this RCA Action Item, a small temporary fix in the agent retry logic has been implemented and tested in QA-Tank.
Agents in `pending` status will no longer be relaunched by the agent retry logic. These agents have already reported back to the controller and wait in `pending` status for a start command from the controller. They will no longer be terminated if the process from agent initialization to receiving this start command takes more than three minutes. Instead, agents that are still stuck in `starting` status and have not reported back to the controller will be relaunched.

This is a short-term solution that has been validated in QA-Tank, and it will eventually be overhauled as part of updating the agent retry logic. This change also adds `jobId` values to the existing AgentWatchdog logs and adds two new logs to the controller (JobManager) to track agent behavior between registering agents and making the start call.

Reference: Investigating Delay in Agent Start-Up
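The relaunch rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `AgentRetrySketch`, the `AgentStatus` enum, the `shouldRelaunch` method, and the assumption that the same three-minute window applies to agents in `starting` status are all hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the relaunch decision; names are illustrative
// and are not taken from the actual AgentWatchdog implementation.
class AgentRetrySketch {

    // Assumed set of statuses; only STARTING and PENDING appear in the PR description.
    enum AgentStatus { STARTING, PENDING, RUNNING }

    // Assumption: the three-minute limit from the description is reused for STARTING agents.
    static final Duration START_TIMEOUT = Duration.ofMinutes(3);

    /**
     * Returns true only for agents that are still in STARTING status
     * (have not reported back to the controller) after the timeout.
     * PENDING agents have already reported back and are never relaunched.
     */
    static boolean shouldRelaunch(AgentStatus status, Instant launchedAt, Instant now) {
        if (status != AgentStatus.STARTING) {
            return false;
        }
        Duration elapsed = Duration.between(launchedAt, now);
        return elapsed.compareTo(START_TIMEOUT) > 0;
    }
}
```

Under these assumptions, an agent launched four minutes ago is relaunched if it is still in `starting` status and left alone if it has reached `pending`.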
Example of the fix relaunching an agent stuck in `starting` to start the job:

Please make sure these check boxes are checked before submitting

- `mvn clean test -P default`

**PR review process**