Kubernetes autodiscover suite#1064
Conversation
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
Trends 🧪💚 Flaky test reportTests succeeded. Expand to view the summary
Test stats 🧪
|
|
cc-ing @mtojek who might be interested into this too |
|
This PR is great @jsoriano! Thanks for the contribution 👏 I added a few comments related to Gherkin style, in order to standardise the creation of the scenarios. I did not change the step functions, so please take a look at them when reviewing my comments for the feature file. I only miss adding the tests to the CI. You can add the new stage here: https://github.com/elastic/e2e-testing/blob/master/.ci/.e2e-tests.yaml, adding a new suite entry (including changes in https://github.com/elastic/e2e-testing/pull/1064/files#r617639126): - suite: "beats-kubernetes"
platforms:
- "ubuntu-18.04"
scenarios:
- name: "k8s autodiscover"
tags: "autodiscover"In a follow up PR, once this is merged, I'll add APM spans and traces for the suite, as we are capturing that data for other test suites. |
|
@adam-stokes I see code here that could be extracted to the new project layout in #1008. I.e. the kubernetes controller |
|
@jsoriano Thanks for the PR, just a heads up we merged a refactor in that moved a lot of the modules into the Thanks again for the contribution! |
Yep, it's possible to reproduce the errors with |
|
Linting hopefully fixed. Some rules seem a bit counter-intuitive to me, such as limiting the name of the feature, forcing the use of "Background" if there are common "Givens" in a feature, or allowing a single "When", what forces to move things to the "Given", or the "Then". But probably I need to work a little more with Gherkin to think more this way 🙂 |
| Given "a short-living cronjob" is deployed | ||
| When "60s" have passed |
There was a problem hiding this comment.
I'd keep it like this, because the When clause is considered the real test, being the Given the requirements/preparation for the scenario. The Given is optional, When, mandatory.
| Given "a short-living cronjob" is deployed | |
| When "60s" have passed | |
| When "a short-living cronjob" is deployed | |
| And "60s" have passed |
There was a problem hiding this comment.
I agree with this, this is what I had before, but gherkin-lint doesn't think the same:
AvoidScripting - Multiple Actions
e2e/_suites/kubernetes-autodiscover/features/filebeat.feature (19): Filebeat.Logs collection from short-living cronjobs
There was a problem hiding this comment.
Let's go back to what the linter wants, I don't see it so bad and this way we don't need to add an exception.
Co-authored-by: Manuel de la Peña <social.mdelapenya@gmail.com>
|
Would it be possible/make sense to disable |
I agree in the |
Ok, I have added |
|
I'd say that the lint is telling us that the
And about the number, why 60, or 40? Should we update that number in slower/faster machines? It seems to me that this is an implementation detail that must be hidden from spec. You could use the Nevertheless, if the waiting period is a strong requirement, wdyt about this refactor for that block: When "a short-living cronjob" is deployed
And "60s" have passedto When "a short-living cronjob" is deployed with "60s" of waiting |
The rule must be excluded at the pre-commit level: https://github.com/elastic/e2e-testing/blob/master/.pre-commit-config.yaml#L48 The one you updated is a helper for development purpose to easily reproduce what pre-commit does. |
Yes, we have to wait 60s to avoid flakiness. We are checking here that we are collecting logs from a cronjob, the minimum scheduling available here is one job per minute, and Kubernetes only guarantees that one job is executed every 60 seconds, so since the job is configured till it is executed the first time can effectively pass up to 60 seconds. One alternative I considered was to use the "is running" version of the rule, but the problem that we needed to handle in beats is that short living pods are a bit special and never reach the running state, they can go directly from waiting/pending to succeed/failed or even to be deleted. We handle these things in Beats, and I wouldn't want to copy this logic into the tests too, so I thought that it would be better to model what a user would expect: the job has to run every 60 seconds, so after this time I should have the logs from the first pod. I also considered to don't declare the wait explicitly, the next step
It is not related to the performance of the machine, but about the kubernetes controller for cronjobs.
Yep, I also considered to have specific steps for cronjobs, but all of them sounded to me as "the cronjob is deployed and the first period has passed", or "the cronjob is deployed and it has executed at least once", so it sounded to me that better do it as two separate steps. |
mdelapenya
left a comment
There was a problem hiding this comment.
Thank you for your patience in this review @jsoriano. I like it a lot how the new test suite is specified, being the scenarios and steps very definite.
Good job!
|
@jsoriano please check what branches we should backport this code. Please add the |
|
Thanks for the reviews and suggestions! This has been a great exercise 🙂 |
Adds a new suite to test Kubernetes autodiscover features. Reproduces some issues with short-living containers affecting Beats before 7.13 and for what we didn't have tests coverage. It allows to define the scenarios using the gherkin language for feature files. Scenarios use templates that are mostly kubernetes manifests. A simple system of options is included to allow to select configuration blocks in these templates. It uses kubectl to interact with a kubernetes cluster. If no cluster is available, it starts one with kind. Each scenario is executed in a different namespace, namespace is destroyed after the scenario is run to clean all created resources. It currently works only with loads in a single namespace, in a single node, this is enough to test many autodiscover cases, but it could be extended in the future to cover multi-node or multi-namespace scenarios. Co-authored-by: Manuel de la Peña <mdelapenya@gmail.com> Co-authored-by: Adam Stokes <51892+adam-stokes@users.noreply.github.com> (cherry picked from commit c52395d)
Adds a new suite to test Kubernetes autodiscover features. Reproduces some issues with short-living containers affecting Beats before 7.13 and for what we didn't have tests coverage. It allows to define the scenarios using the gherkin language for feature files. Scenarios use templates that are mostly kubernetes manifests. A simple system of options is included to allow to select configuration blocks in these templates. It uses kubectl to interact with a kubernetes cluster. If no cluster is available, it starts one with kind. Each scenario is executed in a different namespace, namespace is destroyed after the scenario is run to clean all created resources. It currently works only with loads in a single namespace, in a single node, this is enough to test many autodiscover cases, but it could be extended in the future to cover multi-node or multi-namespace scenarios. Co-authored-by: Manuel de la Peña <mdelapenya@gmail.com> Co-authored-by: Adam Stokes <51892+adam-stokes@users.noreply.github.com> (cherry picked from commit c52395d) Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
* master: Support fleet-server-service-token (elastic#1096) fix: use wider selector for ARM workers (elastic#1103) feat: bootstrap fleet-server for the deployment of regular elastic-agents (elastic#1078) fix: use proper variable name (elastic#1102) fix: branch_specifier is needed (elastic#1097) Move kubernetes/kubectl/kind code to internal project layout (elastic#1092) fix: update JJBB with proper values (elastic#1093) feat: support building centos/debian Docker images in multiplatform format (elastic#1091) Kubernetes autodiscover suite (elastic#1064)
What does this PR do?
It adds a new suite to test Kubernetes autodiscover features.
It reproduces elastic/beats#22718, for what we didn't have tests before.
It allows to define the scenarios using the gherkin language for feature files. Scenarios use templates that are mostly kubernetes manifests. A simple system of options is included to allow to select configuration blocks in these templates.
The objective is to make it easy to add more cases only by adding them to the feature files and providing new templates.
It uses kubectl to interact with a kubernetes cluster, if no cluster is available, it starts one with kind. Each scenario is executed in a different namespace, namespace is destroyed after the scenario is run to clean all created resources.
It currently works only with loads in a single namespace, in a single node, this is enough to test many autodiscover cases, but it could be extended in the future to cover multi-node or multi-namespace cases.
It currently tests beats only, but it is not tied to beats in any way, it could be extended in the future to be used with agent.
Why is it important?
Kubernetes autodiscover depends on the states of several components with many corner cases, and the transitions of these states. This complicates its implementation. And it is difficult to maintain a complete test suite.
The lack of a full featured test suite has caused the introduction of regressions in the past.
Checklist
make noticein the proper directory)Author's Checklist
shell.Executeis too verbose in case of error, consider alternatives.Consider reusing more features available in this repo.TBD in follow ups.How to test this PR locally
kindandkubectl.godogfrom thee2e/_suites/kubernetes-autodiscoverdirectory.Related issues
Logs
With
BEAT_VERSION=7.12.0 godogit is possible to reproduce elastic/beats#22718:And confirm that it passes with current master after elastic/beats#24742.