
[Security Solution][Cypress] Fix flaky related integrations test caused by Fleet race condition #261128

Merged
banderror merged 3 commits into elastic:main from maximpn:fix/related-integrations-fleet-timeout-flakiness
Apr 10, 2026

@maximpn
Contributor

@maximpn maximpn commented Apr 3, 2026

Resolves: #259831
Resolves: #239356

Summary

Reduces the chance of related_integrations.cy.ts failing by easing pressure on Kibana's Fleet plugin: the test setup now waits for the integration packages to finish installing before it creates agent policies.

Details

Addresses the flakiness in related_integrations.cy.ts, where cy.request() timed out waiting for POST /api/fleet/agent_policies?sys_monitoring=true.

Root cause: installIntegrations() fired the bulk package install request and immediately followed with the agent policy creation request. The Fleet bulk install endpoint returns a response once the request is accepted, but processes package assets asynchronously. Under CI load, Fleet was still indexing large packages (aws, system) when the agent policy POST arrived, causing the API to become unresponsive and exceed the 30s default timeout.
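
To make the race concrete, here is a toy model in TypeScript. It is not Fleet's real behavior: `FakeFleet`, its methods, and the timings are hypothetical names and values introduced for illustration. It only shows how an endpoint that acknowledges a request before finishing the work lets a follow-up request arrive mid-install.

```typescript
// Toy stand-in for Fleet (hypothetical, for illustration only).
class FakeFleet {
  private installing = 0;

  // Acknowledges immediately; the "install" finishes workMs later in the background.
  bulkInstall(packages: string[], workMs: number): Promise<void> {
    this.installing += packages.length;
    setTimeout(() => {
      this.installing -= packages.length;
    }, workMs);
    return Promise.resolve();
  }

  // Reports whether this request arrived while installs were still running.
  createAgentPolicy(): Promise<{ arrivedDuringInstall: boolean }> {
    return Promise.resolve({ arrivedDuringInstall: this.installing > 0 });
  }
}

// Mirrors the old ordering: fire the install, then create the policy right away.
async function racySetup(fleet: FakeFleet): Promise<boolean> {
  await fleet.bulkInstall(["aws", "system"], 20);
  const result = await fleet.createAgentPolicy();
  return result.arrivedDuringInstall;
}
```

In this model `racySetup` resolves to `true`, because awaiting the acknowledgement says nothing about whether the background work has completed.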

Fix:

  • Chain the agent policy creation inside .then() after the bulk install, polling waitForPackageInstalled for each package before proceeding.
  • Increase the agent policy creation timeout from 30s to 60s, as this endpoint is inherently slow with ?sys_monitoring=true.
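
The sequencing in the two points above can be sketched outside Cypress as plain async TypeScript. This is a minimal sketch, not the PR's actual code: `FleetClient`, `getInstallStatus`, `createAgentPolicy`, `makeFakeFleet`, the polling interval, and the attempt limit are hypothetical and introduced for illustration. Only the names `installIntegrations` and `waitForPackageInstalled` and the 60s timeout come from the description; their signatures here are assumed.

```typescript
// Hypothetical client interface; the real test talks to Fleet via cy.request().
interface FleetClient {
  bulkInstall(packages: string[]): Promise<void>;
  getInstallStatus(pkg: string): Promise<"installing" | "installed">;
  createAgentPolicy(timeoutMs: number): Promise<string>;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Poll until the package reports 'installed', or give up after maxAttempts polls.
async function waitForPackageInstalled(
  client: FleetClient,
  pkg: string,
  intervalMs = 10,
  maxAttempts = 50
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if ((await client.getInstallStatus(pkg)) === "installed") return;
    await sleep(intervalMs);
  }
  throw new Error(`Package ${pkg} was not installed after ${maxAttempts} polls`);
}

// Install, wait for every package, and only then create the agent policy.
async function installIntegrations(
  client: FleetClient,
  packages: string[]
): Promise<string> {
  await client.bulkInstall(packages);
  for (const pkg of packages) {
    await waitForPackageInstalled(client, pkg);
  }
  return client.createAgentPolicy(60_000); // 60s instead of the 30s default
}

// In-memory fake: each package reports 'installed' after a few status polls.
function makeFakeFleet(pollsUntilInstalled: number): FleetClient & { log: string[] } {
  const remaining = new Map<string, number>();
  const log: string[] = [];
  return {
    log,
    async bulkInstall(packages: string[]): Promise<void> {
      packages.forEach((p) => remaining.set(p, pollsUntilInstalled));
      log.push("bulk-install-accepted");
    },
    async getInstallStatus(pkg: string): Promise<"installing" | "installed"> {
      const left = remaining.get(pkg) ?? 0;
      if (left > 0) {
        remaining.set(pkg, left - 1);
        return "installing";
      }
      log.push(`installed:${pkg}`);
      return "installed";
    },
    async createAgentPolicy(timeoutMs: number): Promise<string> {
      log.push(`create-policy:${timeoutMs}`);
      return "policy-1";
    },
  };
}
```

Calling `installIntegrations(makeFakeFleet(2), ["aws", "system"])` records the policy request only after both packages have reported `installed`. Bounding the poll with `maxAttempts` is a design choice of this sketch so that a stuck install fails with a clear error rather than hanging.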

Flaky test runner

TBD

…waiting package installation before creating agent policy

Wait for each package to reach 'installed' status before calling the
Fleet agent policies API, and increase the agent policy creation
timeout from 30s to 60s. Previously the bulk install was fired and the
agent policy request followed immediately; Fleet was still processing
large packages (aws, system) in the background, causing the agent
policy POST to time out under CI load.

Closes elastic#259831

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@maximpn maximpn self-assigned this Apr 3, 2026
@maximpn maximpn added the test, release_note:skip, Team:Detections and Resp, Team: SecuritySolution, Team:Detection Rule Management, backport:version, v9.4.0, v9.3.3, v9.2.8, and v8.19.14 labels Apr 3, 2026
@maximpn
Contributor Author

maximpn commented Apr 3, 2026

/ci

1 similar comment
@maximpn
Contributor Author

maximpn commented Apr 3, 2026

/ci

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #21 / Entity Analytics - Risk Score Maintainer @ess @serverless @serverlessQA Risk Score Maintainer Task Lifecycle with auditbeat data @skipInServerlessMKI produces additional scores after stop and restart

Metrics [docs]

✅ unchanged

cc @maximpn

@kibanamachine
Contributor

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#11365

[❌] [Serverless] Security Solution Rule Management - Cypress: 72/100 tests passed.

see run history

@maximpn maximpn marked this pull request as ready for review April 7, 2026 07:06
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/security-detection-rule-management (Team:Detection Rule Management)

@banderror banderror requested review from a team and dplumlee April 7, 2026 10:30
@banderror
Contributor

@maximpn Will it fix #239356 as well? I linked this ticket to the PR for now.

@dplumlee
Contributor

dplumlee commented Apr 9, 2026

Running flaky test runner again since the last one seemed to fail quite a bit

@kibanamachine
Contributor

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#11493

[✅] [Serverless] Security Solution Rule Management - Cypress: 100/100 tests passed.

see run history

Contributor

@dplumlee dplumlee left a comment

Code makes sense to me, looks like it's passing flaky test runner too, let's see how it does in main

@banderror
Contributor

@elasticmachine merge upstream

@banderror banderror enabled auto-merge (squash) April 10, 2026 09:13
@banderror banderror merged commit 4aefe9c into elastic:main Apr 10, 2026
17 checks passed
@kibanamachine
Contributor

Starting backport for target branches: 8.19, 9.2, 9.3

https://github.com/elastic/kibana/actions/runs/24238335684

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Apr 10, 2026
…ed by Fleet race condition (elastic#261128)

**Resolves: elastic#259831**
**Resolves: elastic#239356**

## Summary

Reduces the chance of `related_integrations.cy.ts` failing by easing
pressure on Kibana's Fleet plugin: the test setup now waits for the
integration packages to finish installing before it creates agent
policies.

## Details

Addresses the flakiness in `related_integrations.cy.ts`, where
`cy.request()` timed out waiting for `POST
/api/fleet/agent_policies?sys_monitoring=true`.

**Root cause**: `installIntegrations()` fired the bulk package install
request and immediately followed with the agent policy creation request.
The Fleet bulk install endpoint returns a response once the request is
accepted, but processes package assets asynchronously. Under CI load,
Fleet was still indexing large packages (`aws`, `system`) when the agent
policy POST arrived, causing the API to become unresponsive and exceed
the 30s default timeout.

**Fix**:
- Chain the agent policy creation inside `.then()` after the bulk
install, polling `waitForPackageInstalled` for each package before
proceeding.
- Increase the agent policy creation timeout from 30s to 60s, as this
endpoint is inherently slow with `?sys_monitoring=true`.

## Flaky test runner

TBD

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
(cherry picked from commit 4aefe9c)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Apr 10, 2026
…ed by Fleet race condition (elastic#261128)

(cherry picked from commit 4aefe9c)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Apr 10, 2026
…ed by Fleet race condition (elastic#261128)

(cherry picked from commit 4aefe9c)
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.19
9.2
9.3

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

@maximpn maximpn deleted the fix/related-integrations-fleet-timeout-flakiness branch April 11, 2026 11:29

Labels

backport:version, release_note:skip, Team:Detection Rule Management, Team:Detections and Resp, Team: SecuritySolution, test, v8.19.14, v9.2.8, v9.3.3, v9.4.0

5 participants