Kubernetes autodiscover suite by jsoriano · Pull Request #1064 · elastic/e2e-testing

jsoriano · 2021-04-21T11:37:02Z

What does this PR do?

It adds a new suite to test Kubernetes autodiscover features.

It reproduces elastic/beats#22718, for what we didn't have tests before.

It allows to define the scenarios using the gherkin language for feature files. Scenarios use templates that are mostly kubernetes manifests. A simple system of options is included to allow to select configuration blocks in these templates.

The objective is to make it easy to add more cases only by adding them to the feature files and providing new templates.

It uses kubectl to interact with a kubernetes cluster, if no cluster is available, it starts one with kind. Each scenario is executed in a different namespace, namespace is destroyed after the scenario is run to clean all created resources.

It currently works only with loads in a single namespace, in a single node, this is enough to test many autodiscover cases, but it could be extended in the future to cover multi-node or multi-namespace cases.

It currently tests beats only, but it is not tied to beats in any way, it could be extended in the future to be used with agent.

Why is it important?

Kubernetes autodiscover depends on the states of several components with many corner cases, and the transitions of these states. This complicates its implementation. And it is difficult to maintain a complete test suite.

The lack of a full featured test suite has caused the introduction of regressions in the past.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have run the Unit tests for the CLI, and they are passing locally
I have run the End-2-End tests for the suite I'm working on, and they are passing locally
I have noticed new Go dependencies (run make notice in the proper directory)

Author's Checklist

Ensure that this is executed in CI jobs
Use proper beats images when running for PRs
shell.Execute is too verbose in case of error, consider alternatives.
~~Consider reusing more features available in this repo.~~ TBD in follow ups.
Add some docs.

How to test this PR locally

Install kind and kubectl.
Run godog from the e2e/_suites/kubernetes-autodiscover directory.

Related issues

Tests Refactor kubernetes autodiscover to avoid skipping short-living pods beats#24742

Logs

With BEAT_VERSION=7.12.0 godog it is possible to reproduce elastic/beats#22718:

  Scenario: Short-living cronjob                                                       # features/beats.feature:30
    Given a cluster is available                                                       # autodiscover_test.go:335 -> InitializeScenario.func3
    And configuration for "filebeat" has "hints enabled"                               # autodiscover_test.go:75 -> *podsManager
    And "filebeat" is running                                                          # autodiscover_test.go:130 -> *podsManager
    When "a short-living cronjob" is deployed                                          # autodiscover_test.go:116 -> *podsManager
    And "60s" have passed                                                              # autodiscover_test.go:336 -> InitializeScenario.func4
    Then "filebeat" collects events with "kubernetes.container.name:cronjob-container" # autodiscover_test.go:146 -> *podsManager
    timeout waiting for events with kubernetes.container.name:cronjob-container

And confirm that it passes with current master after elastic/beats#24742.

  Scenario: Short-living cronjob                                                       # features/beats.feature:28
    Given a cluster is available                                                       # autodiscover_test.go:335 -> InitializeScenario.func3
    And configuration for "filebeat" has "hints enabled"                               # autodiscover_test.go:75 -> *podsManager
    And "filebeat" is running                                                          # autodiscover_test.go:130 -> *podsManager
    When "a short-living cronjob" is deployed                                          # autodiscover_test.go:116 -> *podsManager
    And "10s" have passed                                                              # autodiscover_test.go:336 -> InitializeScenario.func4
    Then "filebeat" collects events with "kubernetes.container.name:cronjob-container" # autodiscover_test.go:146 -> *podsManager

e2e/_suites/kubernetes-autodiscover/features/beats.feature

elasticmachine · 2021-04-21T12:03:22Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Build Cause: Pull request #1064 updated
Start Time: 2021-04-26T08:36:44.189+0000
Duration: 21 min 46 sec
Commit: 164ae9d

Test stats 🧪

Test	Results
Failed	0
Passed	153
Skipped	0
Total	153

Trends 🧪

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test	Results
Failed	0
Passed	153
Skipped	0
Total	153

ChrsMark · 2021-04-21T12:22:25Z

cc-ing @mtojek who might be interested into this too

cli/shell/shell.go

e2e/_suites/kubernetes-autodiscover/kubernetes_test.go

e2e/_suites/kubernetes-autodiscover/features/beats.feature

mdelapenya · 2021-04-21T15:12:33Z

This PR is great @jsoriano! Thanks for the contribution 👏

I added a few comments related to Gherkin style, in order to standardise the creation of the scenarios. I did not change the step functions, so please take a look at them when reviewing my comments for the feature file.

I only miss adding the tests to the CI. You can add the new stage here: https://github.com/elastic/e2e-testing/blob/master/.ci/.e2e-tests.yaml, adding a new suite entry (including changes in https://github.com/elastic/e2e-testing/pull/1064/files#r617639126):

  - suite: "beats-kubernetes"
    platforms:
      - "ubuntu-18.04"
    scenarios:
      - name: "k8s autodiscover"
        tags: "autodiscover"

In a follow up PR, once this is merged, I'll add APM spans and traces for the suite, as we are capturing that data for other test suites.

mdelapenya · 2021-04-21T15:19:25Z

@adam-stokes I see code here that could be extracted to the new project layout in #1008. I.e. the kubernetes controller

e2e/_suites/kubernetes-autodiscover/features/beats.feature

adam-stokes · 2021-04-21T16:36:56Z

@jsoriano Thanks for the PR, just a heads up we merged a refactor in that moved a lot of the modules into the internal directory. A few of your references to cli/shell/shell.go will need to be updated. I'd also like to extract out any general code reuse into the internal directory as well where it makes sense.

Thanks again for the contribution!

mdelapenya · 2021-04-23T19:34:26Z

@mdelapenya It's mentioning gherkinci lint is failing here but I can't figure out what is failing. Other than that this PR is ready to be merged

Yep, it's possible to reproduce the errors with make -C e2e lint (name is misleading, we should change to lint-gherkin)

jsoriano · 2021-04-23T20:36:47Z

Linting hopefully fixed.

Some rules seem a bit counter-intuitive to me, such as limiting the name of the feature, forcing the use of "Background" if there are common "Givens" in a feature, or allowing a single "When", what forces to move things to the "Given", or the "Then". But probably I need to work a little more with Gherkin to think more this way 🙂

mdelapenya · 2021-04-24T05:39:37Z

e2e/_suites/kubernetes-autodiscover/features/filebeat.feature

+  Given "a short-living cronjob" is deployed
+   When "60s" have passed


I'd keep it like this, because the When clause is considered the real test, being the Given the requirements/preparation for the scenario. The Given is optional, When, mandatory.

Suggested change

Given "a short-living cronjob" is deployed

When "60s" have passed

When "a short-living cronjob" is deployed

And "60s" have passed

I agree with this, this is what I had before, but gherkin-lint doesn't think the same:

AvoidScripting - Multiple Actions e2e/_suites/kubernetes-autodiscover/features/filebeat.feature (19): Filebeat.Logs collection from short-living cronjobs

Let's go back to what the linter wants, I don't see it so bad and this way we don't need to add an exception.

e2e/_suites/kubernetes-autodiscover/features/filebeat.feature

Co-authored-by: Manuel de la Peña <social.mdelapenya@gmail.com>

jsoriano · 2021-04-26T07:56:49Z

Would it be possible/make sense to disable AvoidScripting and UseBackground in the linter if we don't agree with it?

mdelapenya · 2021-04-26T07:58:18Z

Would it be possible/make sense to disable AvoidScripting and UseBackground in the linter if we don't agree with it?

I agree in the UseBackground one, although I think the UseScripting rule helps scenario writers to focus in more straightforward scenarios

jsoriano · 2021-04-26T08:11:13Z

Would it be possible/make sense to disable AvoidScripting and UseBackground in the linter if we don't agree with it?

I agree in the UseBackground one, although I think the UseScripting rule helps scenario writers to focus in more straightforward scenarios

Ok, I have added UseBackground to the list of exceptions and reverted the related changes. I have also fixed the UseScripting case. I would prefer to have the And under the When too as discussed in #1064 (comment), but not so strongly to need to add an exception to the linter.

mdelapenya · 2021-04-26T08:11:49Z

I'd say that the lint is telling us that the 60s have passed sounds weird from a specification stand point. Is it required to wait a number of secs to collect events?

I know we have something similar in Metricbeat, which we have considered to replace for long time.

And about the number, why 60, or 40? Should we update that number in slower/faster machines? It seems to me that this is an implementation detail that must be hidden from spec. You could use the TIMEOUT_FACTOR variable to define this waiting period internally, representing a multiplier of 3 to a default timeout you set in your codebase.

Nevertheless, if the waiting period is a strong requirement, wdyt about this refactor for that block:

  When "a short-living cronjob" is deployed
   And "60s" have passed

to

  When "a short-living cronjob" is deployed with "60s" of waiting

mdelapenya · 2021-04-26T08:17:03Z

Ok, I have added UseBackground to the list of exceptions and reverted the related changes.

The rule must be excluded at the pre-commit level: https://github.com/elastic/e2e-testing/blob/master/.pre-commit-config.yaml#L48

The one you updated is a helper for development purpose to easily reproduce what pre-commit does.

jsoriano · 2021-04-26T08:32:34Z

I'd say that the lint is telling us that the 60s have passed sounds weird from a specification stand point. Is it required to wait a number of secs to collect events?

Yes, we have to wait 60s to avoid flakiness. We are checking here that we are collecting logs from a cronjob, the minimum scheduling available here is one job per minute, and Kubernetes only guarantees that one job is executed every 60 seconds, so since the job is configured till it is executed the first time can effectively pass up to 60 seconds.

One alternative I considered was to use the "is running" version of the rule, but the problem that we needed to handle in beats is that short living pods are a bit special and never reach the running state, they can go directly from waiting/pending to succeed/failed or even to be deleted. We handle these things in Beats, and I wouldn't want to copy this logic into the tests too, so I thought that it would be better to model what a user would expect: the job has to run every 60 seconds, so after this time I should have the logs from the first pod.
In this case this wait models what the user would do. This is also why I think that having When "60s" have passed maybe not so bad.

I also considered to don't declare the wait explicitly, the next step "filebeat" collects events with "kubernetes.container.name:cronjob-container" has a bigger timeout, so the tests currently pass the same. But if at some moment we decide to reduce the timeout of the collects events step, we may introduce unrelated flakiness.

And about the number, why 60, or 40? Should we update that number in slower/faster machines?

It is not related to the performance of the machine, but about the kubernetes controller for cronjobs.

Nevertheless, if the waiting period is a strong requirement, wdyt about this refactor for that block:
  When "a short-living cronjob" is deployed
   And "60s" have passed
to
  When "a short-living cronjob" is deployed with "60s" of waiting

Yep, I also considered to have specific steps for cronjobs, but all of them sounded to me as "the cronjob is deployed and the first period has passed", or "the cronjob is deployed and it has executed at least once", so it sounded to me that better do it as two separate steps.
This may be a matter of style, so if you think that it'd be better to have a single more specific step, I can do it.

e2e/_suites/kubernetes-autodiscover/features/filebeat.feature

mdelapenya

Thank you for your patience in this review @jsoriano. I like it a lot how the new test suite is specified, being the scenarios and steps very definite.

Good job!

mdelapenya · 2021-04-26T09:11:13Z

@jsoriano please check what branches we should backport this code. Please add the v7.X.0 labels and mergify will do the rest (You can take a look at mergify config file for rules)

jsoriano · 2021-04-26T09:43:29Z

Thanks for the reviews and suggestions! This has been a great exercise 🙂

Adds a new suite to test Kubernetes autodiscover features. Reproduces some issues with short-living containers affecting Beats before 7.13 and for what we didn't have tests coverage. It allows to define the scenarios using the gherkin language for feature files. Scenarios use templates that are mostly kubernetes manifests. A simple system of options is included to allow to select configuration blocks in these templates. It uses kubectl to interact with a kubernetes cluster. If no cluster is available, it starts one with kind. Each scenario is executed in a different namespace, namespace is destroyed after the scenario is run to clean all created resources. It currently works only with loads in a single namespace, in a single node, this is enough to test many autodiscover cases, but it could be extended in the future to cover multi-node or multi-namespace scenarios. Co-authored-by: Manuel de la Peña <mdelapenya@gmail.com> Co-authored-by: Adam Stokes <51892+adam-stokes@users.noreply.github.com> (cherry picked from commit c52395d)

Adds a new suite to test Kubernetes autodiscover features. Reproduces some issues with short-living containers affecting Beats before 7.13 and for what we didn't have tests coverage. It allows to define the scenarios using the gherkin language for feature files. Scenarios use templates that are mostly kubernetes manifests. A simple system of options is included to allow to select configuration blocks in these templates. It uses kubectl to interact with a kubernetes cluster. If no cluster is available, it starts one with kind. Each scenario is executed in a different namespace, namespace is destroyed after the scenario is run to clean all created resources. It currently works only with loads in a single namespace, in a single node, this is enough to test many autodiscover cases, but it could be extended in the future to cover multi-node or multi-namespace scenarios. Co-authored-by: Manuel de la Peña <mdelapenya@gmail.com> Co-authored-by: Adam Stokes <51892+adam-stokes@users.noreply.github.com> (cherry picked from commit c52395d) Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>

* master: Support fleet-server-service-token (elastic#1096) fix: use wider selector for ARM workers (elastic#1103) feat: bootstrap fleet-server for the deployment of regular elastic-agents (elastic#1078) fix: use proper variable name (elastic#1102) fix: branch_specifier is needed (elastic#1097) Move kubernetes/kubectl/kind code to internal project layout (elastic#1092) fix: update JJBB with proper values (elastic#1093) feat: support building centos/debian Docker images in multiplatform format (elastic#1091) Kubernetes autodiscover suite (elastic#1064)

jsoriano added 6 commits April 20, 2021 23:22

Initial kubernetes cluster

308124f

Skeleton for pods manager

99d43ef

Complete some scenarios

b7c5f3b

Fix issues with namespace creation

9659a54

Wait for beats to be running

93be1a2

Implement test for short living cronjob

ebab3ac

jsoriano requested review from ChrsMark and mdelapenya April 21, 2021 11:37

jsoriano self-assigned this Apr 21, 2021