
Cherry-pick #24742 to 7.x: Refactor kubernetes autodiscover to avoid skipping short-living pods#25167

Merged
jsoriano merged 1 commit into elastic:7.x from jsoriano:backport_24742_7.x on Apr 20, 2021
Conversation

@jsoriano
Member

@jsoriano jsoriano commented Apr 20, 2021

Cherry-pick of PR #24742 to 7.x branch. Original message:

What does this PR do?

Refactor logic in kubernetes autodiscover that decides when to generate events to try to address #22718.

Kubernetes autodiscover can now generate events without network information (without host or port/ports). This makes it possible to generate events for pods that haven't started yet, or that have succeeded/failed before generating a running event. These events still include the container ID, so they can be used to collect logs. Still, no start event is generated if neither a pod IP nor container IDs are available.

Some helpers have been added to obtain relevant information from pods and their containers.

Some additional small refactors are done to improve readability.
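The decision described above can be sketched roughly as follows. This is a minimal illustration of the rule, not the actual Beats code: the type and function names are hypothetical, simplified stand-ins for the Kubernetes pod/container status fields the provider inspects.

```go
package main

import "fmt"

// containerStatus is a simplified, hypothetical stand-in for a Kubernetes
// container status entry as seen by the autodiscover provider.
type containerStatus struct {
	ID      string
	Running bool
}

// pod is a simplified, hypothetical pod view: its IP (empty if not yet
// assigned) and the statuses of its containers.
type pod struct {
	IP         string
	Containers []containerStatus
}

// shouldEmitStartEvent sketches the rule from the PR description: an event
// is generated as long as there is either a pod IP or at least one known
// container ID, so logs can still be collected from short-living pods.
// Only when neither is available is the start event skipped.
func shouldEmitStartEvent(p pod) bool {
	if p.IP != "" {
		return true
	}
	for _, c := range p.Containers {
		if c.ID != "" {
			return true
		}
	}
	return false
}

func main() {
	// A pod that failed before getting an IP, but whose container has an
	// ID: an event is still emitted, so its logs can be collected.
	crashed := pod{Containers: []containerStatus{{ID: "containerd://abc", Running: false}}}
	fmt.Println(shouldEmitStartEvent(crashed))

	// A pod with neither an IP nor container IDs produces no start event.
	pending := pod{}
	fmt.Println(shouldEmitStartEvent(pending))
}
```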

Why is it important?

The current logic checks similar things at both the pod and container levels; this PR tries to simplify it by focusing on containers only.

No matter the state of the pod, if a container is running or trying to run, even if it is unhealthy, some configuration should be generated so its logs can be collected.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. But more tests would be needed.
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Fix update so it definitely stops configurations, respecting the cleanup timeout.
  • Double-check that nothing breaks with events without data.host.
    This is logged at the debug level:
    2021-04-08T12:53:34.453+0200	DEBUG	[autodiscover]	template/config.go:156	Configuration template cannot be resolved: field 'data.host' not available in event or environment accessing 'hosts' (source:'/home/jaime/tmp/metricbeat-autodiscover-k8s.yml')
    
  • Actually check that the refactor solves the issues and doesn't break anything else.
  • Try to add more testing. A couple of cases added to current tests.
  • Check that nothing breaks with hints including data.host or data.ports.
    • data.ports.* doesn't seem to be working, check if this is a regression. Named ports were not added to container events. They are now.
    • Multiple configurations generated for pods with multiple containers, expected? compare with master. False alarm.
  • Check autodiscover with heartbeat.
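The data.host behavior in the checklist above can be exercised with a template that interpolates ${data.host}, for example in a Metricbeat autodiscover configuration. This is an illustrative sketch (module, label, and port are made-up examples, not taken from the PR): when a pod has no IP yet, ${data.host} cannot be resolved and the template is skipped with a debug message like the one quoted above.

```yaml
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.labels.app: redis
          config:
            - module: redis
              hosts: ["${data.host}:6379"]
```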

Author's notes for the future

When using autodiscover templates, if the condition matches pod metadata, container events will match too, because they also include pod metadata. In pods with multiple containers this may lead to events carrying metadata of the wrong container. This could possibly be avoided by matching on pod metadata only in pod events. This also happens in master, so it is not solved in this PR.
  • data.ports is not included in container-level events, so it doesn't work when conditions for specific containers are used.
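The named-ports behavior mentioned in the notes above (data.ports.<name> now being populated for container events) could look roughly like this sketch. The types and helper are hypothetical simplifications, not the actual Beats implementation:

```go
package main

import "fmt"

// containerPort mirrors the relevant fields of a Kubernetes container
// port spec (hypothetical simplified type).
type containerPort struct {
	Name string
	Port int
}

// portsMap sketches how named container ports could be exposed for
// template hints like data.ports.<name>; unnamed ports are skipped in
// this sketch since they have no name to look up.
func portsMap(ports []containerPort) map[string]int {
	m := map[string]int{}
	for _, p := range ports {
		if p.Name != "" {
			m[p.Name] = p.Port
		}
	}
	return m
}

func main() {
	ports := []containerPort{
		{Name: "redis", Port: 6379},
		{Port: 9121}, // unnamed, not addressable by data.ports.<name>
	}
	fmt.Println(portsMap(ports)["redis"])
}
```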

How to test this PR locally

  • Check that everything keeps running with some autodiscover happy cases.
  • Check that logs are collected from a pod in CrashLoopBackOff (kubectl run --image=redis redis -- echo foo).

Related issues

Use cases

Collect logs from containers in short-living or failing pods.

…lastic#24742)

Refactor logic in kubernetes autodiscover that decides when
to generate events to try to address issues with short-living
containers.

Kubernetes autodiscover can now generate events without network
information (without host or port/ports). This makes it possible to
generate events for pods that haven't started yet, or that have
succeeded/failed before generating a running event. These events still
include the container ID, so they can be used to collect logs. Still, no
start event is generated if neither a pod IP nor container IDs are
available.

Some helpers have been added to obtain relevant information from
pods and their containers.

Some additional small refactors are done to improve readability.

(cherry picked from commit b4b6e6e)
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Apr 20, 2021
@elasticmachine
Contributor

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Apr 20, 2021
@elasticmachine
Contributor

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #25167 opened

  • Start Time: 2021-04-20T08:58:56.446+0000

  • Duration: 158 min 47 sec

  • Commit: 0a8c5b9

Test stats 🧪

Test Results
Failed 0
Passed 46992
Skipped 5176
Total 52168

Trends 🧪

[charts: build times and test trends]

💚 Flaky test report

Tests succeeded.


@jsoriano jsoriano merged commit 2f4b0ce into elastic:7.x Apr 20, 2021
@jsoriano jsoriano deleted the backport_24742_7.x branch April 20, 2021 13:23

Labels

backport Team:Integrations Label for the Integrations team
