Skip to content
This repository was archived by the owner on Sep 17, 2024. It is now read-only.

Kubernetes autodiscover suite#1064

Merged
jsoriano merged 36 commits intoelastic:masterfrom
jsoriano:kubernetes-autodiscover-suite
Apr 26, 2021
Merged

Kubernetes autodiscover suite#1064
jsoriano merged 36 commits intoelastic:masterfrom
jsoriano:kubernetes-autodiscover-suite

Conversation

@jsoriano
Copy link
Copy Markdown
Member

@jsoriano jsoriano commented Apr 21, 2021

What does this PR do?

It adds a new suite to test Kubernetes autodiscover features.

It reproduces elastic/beats#22718, for what we didn't have tests before.

It allows to define the scenarios using the gherkin language for feature files. Scenarios use templates that are mostly kubernetes manifests. A simple system of options is included to allow to select configuration blocks in these templates.

The objective is to make it easy to add more cases only by adding them to the feature files and providing new templates.

It uses kubectl to interact with a kubernetes cluster, if no cluster is available, it starts one with kind. Each scenario is executed in a different namespace, namespace is destroyed after the scenario is run to clean all created resources.

It currently works only with loads in a single namespace, in a single node, this is enough to test many autodiscover cases, but it could be extended in the future to cover multi-node or multi-namespace cases.

It currently tests beats only, but it is not tied to beats in any way, it could be extended in the future to be used with agent.

Why is it important?

Kubernetes autodiscover depends on the states of several components with many corner cases, and the transitions of these states. This complicates its implementation. And it is difficult to maintain a complete test suite.

The lack of a full featured test suite has caused the introduction of regressions in the past.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have run the Unit tests for the CLI, and they are passing locally
  • I have run the End-2-End tests for the suite I'm working on, and they are passing locally
  • I have noticed new Go dependencies (run make notice in the proper directory)

Author's Checklist

  • Ensure that this is executed in CI jobs
  • Use proper beats images when running for PRs
  • shell.Execute is too verbose in case of error, consider alternatives.
  • Consider reusing more features available in this repo. TBD in follow ups.
  • Add some docs.

How to test this PR locally

  • Install kind and kubectl.
  • Run godog from the e2e/_suites/kubernetes-autodiscover directory.

Related issues

Logs

With BEAT_VERSION=7.12.0 godog it is possible to reproduce elastic/beats#22718:

  Scenario: Short-living cronjob                                                       # features/beats.feature:30
    Given a cluster is available                                                       # autodiscover_test.go:335 -> InitializeScenario.func3
    And configuration for "filebeat" has "hints enabled"                               # autodiscover_test.go:75 -> *podsManager
    And "filebeat" is running                                                          # autodiscover_test.go:130 -> *podsManager
    When "a short-living cronjob" is deployed                                          # autodiscover_test.go:116 -> *podsManager
    And "60s" have passed                                                              # autodiscover_test.go:336 -> InitializeScenario.func4
    Then "filebeat" collects events with "kubernetes.container.name:cronjob-container" # autodiscover_test.go:146 -> *podsManager
    timeout waiting for events with kubernetes.container.name:cronjob-container

And confirm that it passes with current master after elastic/beats#24742.

  Scenario: Short-living cronjob                                                       # features/beats.feature:28
    Given a cluster is available                                                       # autodiscover_test.go:335 -> InitializeScenario.func3
    And configuration for "filebeat" has "hints enabled"                               # autodiscover_test.go:75 -> *podsManager
    And "filebeat" is running                                                          # autodiscover_test.go:130 -> *podsManager
    When "a short-living cronjob" is deployed                                          # autodiscover_test.go:116 -> *podsManager
    And "10s" have passed                                                              # autodiscover_test.go:336 -> InitializeScenario.func4
    Then "filebeat" collects events with "kubernetes.container.name:cronjob-container" # autodiscover_test.go:146 -> *podsManager

@jsoriano jsoriano requested review from ChrsMark and mdelapenya April 21, 2021 11:37
@jsoriano jsoriano self-assigned this Apr 21, 2021
@elasticmachine
Copy link
Copy Markdown
Contributor

elasticmachine commented Apr 21, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #1064 updated

  • Start Time: 2021-04-26T08:36:44.189+0000

  • Duration: 21 min 46 sec

  • Commit: 164ae9d

Test stats 🧪

Test Results
Failed 0
Passed 153
Skipped 0
Total 153

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 153
Skipped 0
Total 153

@ChrsMark
Copy link
Copy Markdown
Member

cc-ing @mtojek who might be interested into this too

@mdelapenya
Copy link
Copy Markdown
Contributor

This PR is great @jsoriano! Thanks for the contribution 👏

I added a few comments related to Gherkin style, in order to standardise the creation of the scenarios. I did not change the step functions, so please take a look at them when reviewing my comments for the feature file.

I only miss adding the tests to the CI. You can add the new stage here: https://github.com/elastic/e2e-testing/blob/master/.ci/.e2e-tests.yaml, adding a new suite entry (including changes in https://github.com/elastic/e2e-testing/pull/1064/files#r617639126):

  - suite: "beats-kubernetes"
    platforms:
      - "ubuntu-18.04"
    scenarios:
      - name: "k8s autodiscover"
        tags: "autodiscover"

In a follow up PR, once this is merged, I'll add APM spans and traces for the suite, as we are capturing that data for other test suites.

@mdelapenya
Copy link
Copy Markdown
Contributor

@adam-stokes I see code here that could be extracted to the new project layout in #1008. I.e. the kubernetes controller

@adam-stokes
Copy link
Copy Markdown
Contributor

@jsoriano Thanks for the PR, just a heads up we merged a refactor in that moved a lot of the modules into the internal directory. A few of your references to cli/shell/shell.go will need to be updated. I'd also like to extract out any general code reuse into the internal directory as well where it makes sense.

Thanks again for the contribution!

@mdelapenya
Copy link
Copy Markdown
Contributor

@mdelapenya It's mentioning gherkinci lint is failing here but I can't figure out what is failing. Other than that this PR is ready to be merged

Yep, it's possible to reproduce the errors with make -C e2e lint (name is misleading, we should change to lint-gherkin)

@jsoriano
Copy link
Copy Markdown
Member Author

Linting hopefully fixed.

Some rules seem a bit counter-intuitive to me, such as limiting the name of the feature, forcing the use of "Background" if there are common "Givens" in a feature, or allowing a single "When", what forces to move things to the "Given", or the "Then". But probably I need to work a little more with Gherkin to think more this way 🙂

Comment on lines +20 to +21
Given "a short-living cronjob" is deployed
When "60s" have passed
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep it like this, because the When clause is considered the real test, being the Given the requirements/preparation for the scenario. The Given is optional, When, mandatory.

Suggested change
Given "a short-living cronjob" is deployed
When "60s" have passed
When "a short-living cronjob" is deployed
And "60s" have passed

Copy link
Copy Markdown
Member Author

@jsoriano jsoriano Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this, this is what I had before, but gherkin-lint doesn't think the same:

AvoidScripting - Multiple Actions
  e2e/_suites/kubernetes-autodiscover/features/filebeat.feature (19): Filebeat.Logs collection from short-living cronjobs

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go back to what the linter wants, I don't see it so bad and this way we don't need to add an exception.

@jsoriano
Copy link
Copy Markdown
Member Author

Would it be possible/make sense to disable AvoidScripting and UseBackground in the linter if we don't agree with it?

@mdelapenya
Copy link
Copy Markdown
Contributor

Would it be possible/make sense to disable AvoidScripting and UseBackground in the linter if we don't agree with it?

I agree in the UseBackground one, although I think the UseScripting rule helps scenario writers to focus in more straightforward scenarios

@jsoriano
Copy link
Copy Markdown
Member Author

jsoriano commented Apr 26, 2021

Would it be possible/make sense to disable AvoidScripting and UseBackground in the linter if we don't agree with it?

I agree in the UseBackground one, although I think the UseScripting rule helps scenario writers to focus in more straightforward scenarios

Ok, I have added UseBackground to the list of exceptions and reverted the related changes. I have also fixed the UseScripting case. I would prefer to have the And under the When too as discussed in #1064 (comment), but not so strongly to need to add an exception to the linter.

@mdelapenya
Copy link
Copy Markdown
Contributor

I'd say that the lint is telling us that the 60s have passed sounds weird from a specification stand point. Is it required to wait a number of secs to collect events?

I know we have something similar in Metricbeat, which we have considered to replace for long time.

And about the number, why 60, or 40? Should we update that number in slower/faster machines? It seems to me that this is an implementation detail that must be hidden from spec. You could use the TIMEOUT_FACTOR variable to define this waiting period internally, representing a multiplier of 3 to a default timeout you set in your codebase.

Nevertheless, if the waiting period is a strong requirement, wdyt about this refactor for that block:

  When "a short-living cronjob" is deployed
   And "60s" have passed

to

  When "a short-living cronjob" is deployed with "60s" of waiting

@mdelapenya
Copy link
Copy Markdown
Contributor

Ok, I have added UseBackground to the list of exceptions and reverted the related changes.

The rule must be excluded at the pre-commit level: https://github.com/elastic/e2e-testing/blob/master/.pre-commit-config.yaml#L48

The one you updated is a helper for development purpose to easily reproduce what pre-commit does.

@jsoriano
Copy link
Copy Markdown
Member Author

jsoriano commented Apr 26, 2021

I'd say that the lint is telling us that the 60s have passed sounds weird from a specification stand point. Is it required to wait a number of secs to collect events?

Yes, we have to wait 60s to avoid flakiness. We are checking here that we are collecting logs from a cronjob, the minimum scheduling available here is one job per minute, and Kubernetes only guarantees that one job is executed every 60 seconds, so since the job is configured till it is executed the first time can effectively pass up to 60 seconds.

One alternative I considered was to use the "is running" version of the rule, but the problem that we needed to handle in beats is that short living pods are a bit special and never reach the running state, they can go directly from waiting/pending to succeed/failed or even to be deleted. We handle these things in Beats, and I wouldn't want to copy this logic into the tests too, so I thought that it would be better to model what a user would expect: the job has to run every 60 seconds, so after this time I should have the logs from the first pod.
In this case this wait models what the user would do. This is also why I think that having When "60s" have passed maybe not so bad.

I also considered to don't declare the wait explicitly, the next step "filebeat" collects events with "kubernetes.container.name:cronjob-container" has a bigger timeout, so the tests currently pass the same. But if at some moment we decide to reduce the timeout of the collects events step, we may introduce unrelated flakiness.

And about the number, why 60, or 40? Should we update that number in slower/faster machines?

It is not related to the performance of the machine, but about the kubernetes controller for cronjobs.

Nevertheless, if the waiting period is a strong requirement, wdyt about this refactor for that block:

  When "a short-living cronjob" is deployed
   And "60s" have passed

to

  When "a short-living cronjob" is deployed with "60s" of waiting

Yep, I also considered to have specific steps for cronjobs, but all of them sounded to me as "the cronjob is deployed and the first period has passed", or "the cronjob is deployed and it has executed at least once", so it sounded to me that better do it as two separate steps.
This may be a matter of style, so if you think that it'd be better to have a single more specific step, I can do it.

Copy link
Copy Markdown
Contributor

@mdelapenya mdelapenya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patience in this review @jsoriano. I like it a lot how the new test suite is specified, being the scenarios and steps very definite.

Good job!

@mdelapenya
Copy link
Copy Markdown
Contributor

mdelapenya commented Apr 26, 2021

@jsoriano please check what branches we should backport this code. Please add the v7.X.0 labels and mergify will do the rest (You can take a look at mergify config file for rules)

@jsoriano
Copy link
Copy Markdown
Member Author

Thanks for the reviews and suggestions! This has been a great exercise 🙂

@jsoriano jsoriano merged commit c52395d into elastic:master Apr 26, 2021
@jsoriano jsoriano deleted the kubernetes-autodiscover-suite branch April 26, 2021 09:47
mergify bot pushed a commit that referenced this pull request Apr 26, 2021
Adds a new suite to test Kubernetes autodiscover features.
Reproduces some issues with short-living containers affecting Beats before 7.13
and for what we didn't have tests coverage.

It allows to define the scenarios using the gherkin language for feature files.
Scenarios use templates that are mostly kubernetes manifests. A simple system of
options is included to allow to select configuration blocks in these templates.

It uses kubectl to interact with a kubernetes cluster. If no cluster is available, it
starts one with kind. Each scenario is executed in a different namespace, namespace
is destroyed after the scenario is run to clean all created resources.

It currently works only with loads in a single namespace, in a single node, this is
enough to test many autodiscover cases, but it could be extended in the future to cover
multi-node or multi-namespace scenarios.

Co-authored-by: Manuel de la Peña <mdelapenya@gmail.com>
Co-authored-by: Adam Stokes <51892+adam-stokes@users.noreply.github.com>
(cherry picked from commit c52395d)
adam-stokes pushed a commit that referenced this pull request Apr 26, 2021
Adds a new suite to test Kubernetes autodiscover features.
Reproduces some issues with short-living containers affecting Beats before 7.13
and for what we didn't have tests coverage.

It allows to define the scenarios using the gherkin language for feature files.
Scenarios use templates that are mostly kubernetes manifests. A simple system of
options is included to allow to select configuration blocks in these templates.

It uses kubectl to interact with a kubernetes cluster. If no cluster is available, it
starts one with kind. Each scenario is executed in a different namespace, namespace
is destroyed after the scenario is run to clean all created resources.

It currently works only with loads in a single namespace, in a single node, this is
enough to test many autodiscover cases, but it could be extended in the future to cover
multi-node or multi-namespace scenarios.

Co-authored-by: Manuel de la Peña <mdelapenya@gmail.com>
Co-authored-by: Adam Stokes <51892+adam-stokes@users.noreply.github.com>
(cherry picked from commit c52395d)

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
mdelapenya added a commit to mdelapenya/e2e-testing that referenced this pull request Apr 27, 2021
* master:
  Support fleet-server-service-token (elastic#1096)
  fix: use wider selector for ARM workers (elastic#1103)
  feat: bootstrap fleet-server for the deployment of regular elastic-agents (elastic#1078)
  fix: use proper variable name (elastic#1102)
  fix: branch_specifier is needed (elastic#1097)
  Move kubernetes/kubectl/kind code to internal project layout (elastic#1092)
  fix: update JJBB with proper values (elastic#1093)
  feat: support building centos/debian Docker images in multiplatform format (elastic#1091)
  Kubernetes autodiscover suite (elastic#1064)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants