[testing/integration] Fix flaky test cases #5359
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
This pull request does not have a backport label. Could you fix it @VihasMakwana? 🙏
pchila left a comment
Some small optimization on the new completedChan (optional), but the main question is: why do we need to slow down the ticker that checks the status of the agent/components? I thought the purpose of fixing the flaky tests was to avoid the DEGRADED state entirely...
You’re correct that the goal of fixing flaky issues is to avoid the DEGRADED state altogether. However, sometimes the monitoring components take longer than 10s to report as healthy. For instance, in the diagnostics failure observed here, the I opted for a 60s interval to be cautious and ensure the system has enough time to stabilize. But I think |
If we need the status to stabilize, I would have extended the period during setup in which we wait for everything to become healthy. After that, I would have kept the check running every few seconds instead of waiting a full minute between status checks (basically verifying that once we reach the HEALTHY state, we stay healthy for the duration of the test). As a first fix, I guess we can leave the 60s interval between ticks; maybe we can improve this in following iterations...
I agree with you. I'll work on a follow-up PR with this fix. For now, I'll merge this to unblock other PRs.
* fix: fix concurrency issue with the TestUpgrade* tests
* fix: flaky long running test cases
* chore: minor optimization
(cherry picked from commit d3ba638)





What does this PR do?
This PR fixes the following test cases:
TestLongRunningAgentForLeaks
TestUpgradeHandlerNewVersion
Why is it important?
TestUpgradeHandlerNewVersion fails due to a potential race condition. We rely on time.Sleep(..) to wait, which produces undesirable results if the goroutine sleeps for longer than the desired time. Instead of sleeping, make use of channels to indicate success.
TestLongRunningAgentForLeaks failure is explained in Failing test - TestLongRunningAgentForLeaks #5279
Checklist
./changelog/fragments using the changelog tool
Related issues
Questions to ask yourself