[SRE-27144] Short-term fix for agents stuck in pending#243

Merged
Zakaria-Kofiro merged 2 commits into master from zkofiro/temp-agent-fix
Jun 2, 2023

@Zakaria-Kofiro
Collaborator

Short-term fix for agents stuck in pending

As part of this RCA Action Item, a small temporary fix to the agent retry logic has been implemented and tested in QA-Tank.

The agent retry logic no longer relaunches agents in pending status. An agent in pending status has already reported back to the controller and is waiting for the controller's start command, so it is no longer terminated when the time from agent initialization to receipt of the start command exceeds three minutes. Only agents that are still stuck in starting status and have not reported back to the controller are relaunched.
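
The decision rule described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class `AgentRetrySketch`, the `AgentStatus` enum, and the `shouldRelaunch` method are hypothetical names, and applying the three-minute limit to agents in starting status is an assumption drawn from the description.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the relaunch decision; not the actual AgentWatchdog code.
public class AgentRetrySketch {

    // Assumed status names; the description only mentions "starting" and "pending".
    enum AgentStatus { STARTING, PENDING, RUNNING }

    // Three-minute limit taken from the PR description (assumed to apply to starting agents).
    static final Duration START_TIMEOUT = Duration.ofMinutes(3);

    // Returns true only for agents that are still in STARTING (have not reported
    // back to the controller) and have exceeded the timeout since launch.
    static boolean shouldRelaunch(AgentStatus status, Instant launchedAt, Instant now) {
        if (status != AgentStatus.STARTING) {
            // PENDING agents have already reported back and are waiting for the
            // start command, so they are left alone.
            return false;
        }
        return Duration.between(launchedAt, now).compareTo(START_TIMEOUT) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant fourMinutesAgo = now.minus(Duration.ofMinutes(4));
        System.out.println(shouldRelaunch(AgentStatus.STARTING, fourMinutesAgo, now)); // true
        System.out.println(shouldRelaunch(AgentStatus.PENDING, fourMinutesAgo, now));  // false
    }
}
```

Keeping the status check ahead of the elapsed-time check means a slow start command from the controller can never cause a pending agent to be terminated, which is the behavior this fix is meant to guarantee.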

This short-term solution has been validated in QA-Tank and will eventually be replaced when the agent retry logic is overhauled. This change also adds jobId values to the existing AgentWatchdog logs and adds two new logs to the controller (JobManager) that track agent behavior between agent registration and the start call.

Reference: Investigating Delay in Agent Start-Up

Example of the fix relaunching an agent stuck in starting status so that the job can start:
Better Relaunch Example

Please make sure these check boxes are checked before submitting

  • **Squashed Commits**
  • **All Tests Passed** - mvn clean test -P default

**PR review process**

  • Requires one +1 from a reviewer
  • Repository owners will merge your PR once it is approved.

@shawn-h-park shawn-h-park self-requested a review June 1, 2023 22:35
@Zakaria-Kofiro Zakaria-Kofiro merged commit d82b69c into master Jun 2, 2023
@Zakaria-Kofiro Zakaria-Kofiro deleted the zkofiro/temp-agent-fix branch June 2, 2023 17:49
2 participants