Skip to content

[testing/integration] Fix flaky test cases#5359

Merged
VihasMakwana merged 4 commits intoelastic:mainfrom
VihasMakwana:fix-flaky-handler-tests
Aug 26, 2024
Merged

[testing/integration] Fix flaky test cases#5359
VihasMakwana merged 4 commits intoelastic:mainfrom
VihasMakwana:fix-flaky-handler-tests

Conversation

@VihasMakwana
Copy link
Copy Markdown
Contributor

What does this PR do?

This PR fixes following test cases:

  • TestLongRunningAgentForLeaks
  • TestUpgradeHandlerNewVersion

Why is it important?

  • TestUpgradeHandlerNewVersion fails due to a potential race condition. We rely on time.Sleep(..) to wait, which produces undesirable result if the goroutine sleeps for more than desired time.
    Instead of sleeping, make use of channels to indicate the sucess.
  • TestLongRunningAgentForLeaks failure is explained in Failing test - TestLongRunningAgentForLeaks #5279

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@VihasMakwana VihasMakwana requested a review from a team as a code owner August 26, 2024 12:40
@VihasMakwana VihasMakwana self-assigned this Aug 26, 2024
@VihasMakwana VihasMakwana added skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Aug 26, 2024
@elasticmachine
Copy link
Copy Markdown
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Aug 26, 2024

This pull request does not have a backport label. Could you fix it @VihasMakwana? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

Copy link
Copy Markdown
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Copy link
Copy Markdown
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small optimization on the new completedChan (optional) but the main question is: why do we need to slow down the ticker for checking the status of agent/components ? I thought the purpose of the fix flaky issue was to avoid the DEGRADED state entirely...

Comment thread internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go Outdated
Comment thread internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go Outdated
Comment thread internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go Outdated
Comment thread internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go Outdated
Comment thread internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go Outdated
Comment thread internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go Outdated
Comment thread internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go Outdated
Comment thread testing/integration/agent_long_running_leak_test.go
@VihasMakwana
Copy link
Copy Markdown
Contributor Author

VihasMakwana commented Aug 26, 2024

why do we need to slow down the ticker for checking the status of agent/components ? I thought the purpose of the fix flaky issue was to avoid the DEGRADED state entirely...

You’re correct that the goal of fixing flaky issues is to avoid the DEGRADED state altogether. However, sometimes the monitoring components take longer than 10s to report as healthy.

For instance, in the diagnostics failure observed here, the metrics-monitoring-metrics-monitoring-agent reported as degraded until libbeat started the metrics endpoint on the iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock file (which took more than 10s).

I opted for a 60s interval to be cautious and ensure the system has enough time to stabilize. But I think 30s wait would be sufficient. Let me know what do you think

@VihasMakwana VihasMakwana requested a review from pchila August 26, 2024 13:45
@elastic-sonarqube
Copy link
Copy Markdown

@pchila
Copy link
Copy Markdown
Member

pchila commented Aug 26, 2024

I opted for a 60s interval to be cautious and ensure the system has enough time to stabilize. But I think 30s wait would be sufficient. Let me know what do you think

If we need for the status to stabilize I would have extended the period where we wait for everything to be healthy during the setup but after that I would have kept the check every few seconds instead of waiting a full minute for the status checks (basically checking that once we get the HEALTHY state we keep being healthy for the duration of the test).

As a first fix I guess we can leave 60s interval between ticks, maybe we can improve this in following iterations...

@pierrehilbert pierrehilbert added the backport-8.15 Automated backport to the 8.15 branch with mergify label Aug 26, 2024
@mergify mergify bot removed the backport-skip label Aug 26, 2024
@VihasMakwana
Copy link
Copy Markdown
Contributor Author

If we need for the status to stabilize I would have extended the period where we wait for everything to be healthy during the setup but after that I would have kept the check every few seconds instead of waiting a full minute for the status checks (basically checking that once we get the HEALTHY state we keep being healthy for the duration of the test).

I agree with you. I'll work on a follow up PR with this fix. For now, I'll merge this to unblock other PRs.

@VihasMakwana VihasMakwana merged commit d3ba638 into elastic:main Aug 26, 2024
mergify bot pushed a commit that referenced this pull request Aug 26, 2024
* fix: fix concurrency issue with the TestUpgrade* tests

* fix: flaky long running test cases

* chore: minor optimization

(cherry picked from commit d3ba638)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-8.15 Automated backport to the 8.15 branch with mergify skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failing test - TestLongRunningAgentForLeaks [Flaky Test]: TestUpgradeHandlerNewVersion – Not equal

5 participants