Agentless on serverless FTR agent deployment fix#201783
Agentless on serverless FTR agent deployment fix#201783jeniawhite merged 10 commits intoelastic:mainfrom
Conversation
acorretti
left a comment
There was a problem hiding this comment.
Agreed to go for 5s and see if the async process is stable enough. This may flake out, but we'll see.
seanrathier
left a comment
There was a problem hiding this comment.
Thanks for taking this on for us.
seanrathier
left a comment
There was a problem hiding this comment.
Sorry for the reverse approval, I just realized this could be flaky with just healthy.
|
|
||
| // wait for eventually healthy status | ||
| await retry.tryForTime(5000, async () => { | ||
| expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy'); |
There was a problem hiding this comment.
| expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy'); | |
| expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Healthy' || 'Pending' ); |
There was a problem hiding this comment.
Maybe we should wait for a Healthy state? Or is the purpose of the test case just to see that an agent was created and its state doesn't matter?
There was a problem hiding this comment.
Or is the purpose of the test case just to see that an agent was created and its state doesn't matter?
The state doesn't matter for this test, but we should be concerned about the state. We want to test that the UX of creating an agentless agent is working.
It could take longer than 5 seconds for the healthy state to reach ES from Fleet and the agentless agent, but it all depends on the performance of the test environment, so we could experience instances of this test failing when in fact it is not.
Perhaps we need a separate test in the Quality Gates here for the Agentless agents? A synthetic test that all the agents are eventually healthy for these QG.
There was a problem hiding this comment.
Changed code to verify Pending or Healthy.
Added comment with explanation.
x-pack/test/cloud_security_posture_functional/agentless/create_agent.ts
Outdated
Show resolved
Hide resolved
…e/kibana into evgb-AgentlessServerlessFTRFix
💔 Build Failed
Failed CI StepsHistory |
…e/kibana into evgb-AgentlessServerlessFTRFix
| expect(await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus()).to.be('Pending'); | ||
|
|
||
| // wait for eventually Pending or Healthy status | ||
| // purpose of this retry is to wait for the agent to be created and the status to be updated | ||
| // not to wait for the agent to be healthy | ||
| await retry.tryForTime(agentCreationTimeout, async () => { | ||
| const resStatus = await cisIntegration.getFirstCspmIntegrationPageAgentlessStatus(); | ||
| expect(resStatus == 'Healthy' || resStatus == 'Pending').to.be(true); | ||
| }); |
There was a problem hiding this comment.
I honestly don't think an assertion that can have two possible states is going to be reliable in the long-term, having resolve as successful with either 'Pending' or 'Healthy' will lead to a false sensation of coverage when some bad configuration makes it stuck on Pending and we end up not catching it.
There was a problem hiding this comment.
Initially implemented with a Healthy assertion, and I agree with what you've said.
But @seanrathier raised that we do not care about the actual agent state, and we just want to make sure that we have a value that makes sense, which indicates that the agent was initiated (which means I assume that the request trigger worked).
I think that it would be great if you both could sync and decide on something, and I will make the relevant changes (I've been implementing back and forth).
One thing that I would like us to do is to verify that the agent got deployed successfully and it works, I won't go into if this test should verify that or some other test (will let you both decide @opauloh @seanrathier).
jen-huang
left a comment
There was a problem hiding this comment.
The prior version of these tests before #199567 simply asserted on the successful creation of an agent policy that contains these agentless integrations.
But after this work, we hide this complexity of "agent policy" for agentless integrations from the user, so I changed it to assert on the existence of an agent status instead (I did not realized that in some environments, a healthy agent could spin up in time?).
I suggest we simply assert on the existence of the status badge (see getFirstCspmIntegrationPageAgentlessStatus) rather than what the actual status reports as, and leave E2E testing elsewhere.
|
@jeniawhite, thanks for taking on this initial work. I can take over this task right now. More changes for this are needed to fix the test and remove some confusion about what is running in Serverless Quality Gates. @jen-huang I will change some of the test so that it is clear what is running in MKI by duplicating the test with a directory called something like |
|
FYI once #199567 (comment) is backported, this will likely need to be backported as well. |
## Summary Noticed that FTR tests are failing: https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8 Following PR: elastic#199567 Looking at the code, I saw that we look at the status and expect a `Pending` status, yet we get a `Healthy` status. It looks like this should be an async flow. ### Checklist Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well. - [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [ ] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [ ] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss. Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging. - [ ] [See some risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) - [ ] ... --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
## Summary Noticed that FTR tests are failing: https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8 Following PR: elastic#199567 Looking at the code, I saw that we look at the status and expect a `Pending` status, yet we get a `Healthy` status. It looks like this should be an async flow. ### Checklist Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well. - [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [ ] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations. - [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed - [ ] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process) ### Identify risks Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss. Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging. - [ ] [See some risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) - [ ] ... --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Summary
Noticed that FTR tests are failing:
https://buildkite.com/elastic/appex-qa-serverless-kibana-ftr-tests/builds/3452#01936859-5a54-46bb-8dc5-bc18992bc3b8
Following PR:
#199567
Looking at the code, I saw that we look at the status and expect a
Pendingstatus, yet we get aHealthystatus.It looks like this should be an async flow.
Checklist
Check the PR satisfies following conditions.
Reviewers should verify this PR satisfies this list as well.
release_note:breakinglabel should be applied in these situations.release_note:*label is applied per the guidelinesIdentify risks
Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss.
Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.