[testing/integration] Fix flaky test cases by VihasMakwana · Pull Request #5359 · elastic/elastic-agent

VihasMakwana · 2024-08-26T12:40:36Z

What does this PR do?

This PR fixes the following test cases:

TestLongRunningAgentForLeaks
TestUpgradeHandlerNewVersion

Why is it important?

TestUpgradeHandlerNewVersion fails because of a potential race condition. The test relies on time.Sleep(..) to wait, which produces an undesirable result if the goroutine sleeps longer than expected.
Instead of sleeping, use channels to signal success.
The TestLongRunningAgentForLeaks failure is explained in Failing test - TestLongRunningAgentForLeaks #5279

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool
I have added an integration test or an E2E test

Related issues

Questions to ask yourself

How are we going to support this in production?
How are we going to measure its adoption?
How are we going to debug this?
What are the metrics I should take care of?
...

elasticmachine · 2024-08-26T12:41:01Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

mergify · 2024-08-26T12:41:15Z

This pull request does not have a backport label. Could you fix it @VihasMakwana? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

blakerouse

Looks good.

pchila

Some small optimizations on the new completedChan (optional), but the main question is: why do we need to slow down the ticker that checks the status of the agent/components? I thought the purpose of the flaky-test fix was to avoid the DEGRADED state entirely...

VihasMakwana · 2024-08-26T13:37:49Z

why do we need to slow down the ticker that checks the status of the agent/components? I thought the purpose of the flaky-test fix was to avoid the DEGRADED state entirely...

You’re correct that the goal of fixing flaky issues is to avoid the DEGRADED state altogether. However, sometimes the monitoring components take longer than 10s to report as healthy.

For instance, in the diagnostics failure observed here, the metrics-monitoring-metrics-monitoring-agent reported as degraded until libbeat started the metrics endpoint on the iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock file (which took more than 10s).

I opted for a 60s interval to be cautious and ensure the system has enough time to stabilize, but I think a 30s wait would be sufficient. Let me know what you think.

elastic-sonarqube · 2024-08-26T14:31:00Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube

pchila · 2024-08-26T15:11:36Z

I opted for a 60s interval to be cautious and ensure the system has enough time to stabilize, but I think a 30s wait would be sufficient. Let me know what you think.

If we need the status to stabilize, I would have extended the period during setup in which we wait for everything to be healthy, and after that kept the check at every few seconds instead of waiting a full minute between status checks (basically checking that, once we reach the HEALTHY state, we stay healthy for the duration of the test).

As a first fix I guess we can leave the 60s interval between ticks; maybe we can improve this in following iterations...

VihasMakwana · 2024-08-26T18:32:45Z

If we need the status to stabilize, I would have extended the period during setup in which we wait for everything to be healthy, and after that kept the check at every few seconds instead of waiting a full minute between status checks (basically checking that, once we reach the HEALTHY state, we stay healthy for the duration of the test).

I agree with you. I'll work on a follow-up PR with this fix. For now, I'll merge this to unblock other PRs.

VihasMakwana added 2 commits August 26, 2024 17:20

fix: fix concurrency issue with the TestUpgrade* tests

e1da90c

fix: flaky long running test cases

38f8ba5

VihasMakwana requested a review from a team as a code owner August 26, 2024 12:40

VihasMakwana requested review from blakerouse and pchila August 26, 2024 12:40

VihasMakwana self-assigned this Aug 26, 2024

VihasMakwana added the skip-changelog and Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) labels Aug 26, 2024

mergify bot added the backport-skip label Aug 26, 2024

blakerouse approved these changes Aug 26, 2024

pchila reviewed Aug 26, 2024

VihasMakwana and others added 2 commits August 26, 2024 19:14

chore: minor optimization

4851b1d

Merge branch 'main' into fix-flaky-handler-tests

5a3171d

VihasMakwana requested a review from pchila August 26, 2024 13:45

pchila approved these changes Aug 26, 2024

pierrehilbert added the backport-8.15 (Automated backport to the 8.15 branch with mergify) label Aug 26, 2024

mergify bot removed the backport-skip label Aug 26, 2024

VihasMakwana merged commit d3ba638 into elastic:main Aug 26, 2024

mergify bot pushed a commit that referenced this pull request Aug 26, 2024

[testing/integration] Fix flaky test cases (#5359)

7fcd717

* fix: fix concurrency issue with the TestUpgrade* tests
* fix: flaky long running test cases
* chore: minor optimization
(cherry picked from commit d3ba638)

mergify bot mentioned this pull request Aug 26, 2024

[8.15](backport #5359) [testing/integration] Fix flaky test cases #5360

Closed

7 tasks

VihasMakwana mentioned this pull request Aug 29, 2024

[testing/integration] make sure all unit are healthy for agent_long_running_leak_test #5384

Merged

7 tasks

mergify bot mentioned this pull request Sep 4, 2024

[8.15](backport #5384) [testing/integration] make sure all unit are healthy for agent_long_running_leak_test #5424

Merged

7 tasks

VihasMakwana merged 4 commits into elastic:main from VihasMakwana:fix-flaky-handler-tests

5 participants
