
"Metric leak" which results in restarts caused by out-of-memory #925

@peterbueschel

Description

Describe the bug

The number of exposed metric samples for the aws-node-termination-handler increases continuously until it reaches the configured default sample_limit of 5000 (see the upstream Helm chart). At that point Prometheus marks the target as unhealthy and stops scraping it.

(Screenshot: number of exposed samples increasing over time until the sample_limit is reached.)

At the same time, CPU and memory utilization also increase:
(Screenshot: CPU and memory utilization rising over the same period.)

Memory usage increases until it reaches the configured limit and the container is killed with an OOM signal.

-> The actions_node metric appears to be the affected one: its time series keep accumulating.
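
For context, the pattern below reproduces this kind of growth: every node event is recorded with a unique node_event_id (and node_name) attribute, so each event creates a brand-new attribute set and therefore a new time series that is exported forever. This is only an illustrative sketch against the current OpenTelemetry Go API (NTH v1.19.0 ships the much older 0.20.0 SDK, and the actual instrumentation code may differ); the metric and attribute names are taken from the example entry further down.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// Illustrative setup only; no exporter is wired up here.
	provider := sdkmetric.NewMeterProvider()
	meter := provider.Meter("nth-sketch")

	actions, _ := meter.Int64Counter("actions_node")

	// Each drained node carries its own event ID, so every Add() below uses a
	// new attribute set and creates a new time series on the /metrics page.
	// Nothing ever removes these series again, which is why the exposition
	// keeps growing until the sample_limit and the memory limit are hit.
	for _, id := range []string{"asg-lifecycle-term-1", "asg-lifecycle-term-2"} {
		actions.Add(context.Background(), 1,
			metric.WithAttributes(
				attribute.String("node_action", "cordon-and-drain"),
				attribute.String("node_event_id", id),
				attribute.String("node_status", "success"),
			))
	}
}
```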

Related

Similar behavior was reported in #665, but that ticket was closed without any action or fix.

Steps to reproduce
Deploy the NTH in a K8s cluster with node events: the number of time series on the metrics endpoint will increase but never decrease. The entries for drained nodes that are no longer part of the cluster stay on the metrics page (see the helper sketched below the example entry).

Example entry:
actions_node{node_action="cordon-and-drain",node_event_id="asg-lifecycle-term-6231<removed>",node_name="<removed>",node_status="success",service_name="unknown_service:node-termination-handler",telemetry_sdk_language="go",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="0.20.0"} 1
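
To watch the growth directly, a throwaway helper like the following can be run repeatedly against the metrics endpoint; the count only ever goes up. This is a hypothetical snippet, not part of NTH, and the localhost:9092 address assumes a port-forward to the pod's Prometheus server port (adjust to your setup).

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// e.g. kubectl port-forward deploy/aws-node-termination-handler 9092:9092
	resp, err := http.Get("http://localhost:9092/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Count the exposed actions_node series; repeated runs show the number
	// growing monotonically as nodes are cordoned and drained.
	count := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "actions_node{") {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("actions_node series:", count)
}
```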

Expected outcome
The entries for drained nodes on the metrics endpoint should be cleaned up after some time.
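
One possible mitigation, purely as a sketch and not something the chart currently offers, would be to drop the unbounded attributes at the SDK level with a metrics view so that all events collapse into a small, fixed set of series. The API shown is the current go.opentelemetry.io/otel/sdk/metric one, not the 0.20.0 SDK used by v1.19.0, and it changes what the metric reports (aggregate counts per action/status instead of one sample per node event), so whether that trade-off is acceptable is up to the maintainers.

```go
package main

import (
	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newMeterProvider returns a provider whose view strips the per-event
// attributes from actions_node, so a drained node no longer leaves a
// permanent, unique time series behind.
func newMeterProvider() *sdkmetric.MeterProvider {
	dropUnbounded := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "actions_node"},
		sdkmetric.Stream{
			AttributeFilter: func(kv attribute.KeyValue) bool {
				// node_event_id and node_name are unique per event and are
				// what makes the series count grow without bound.
				return kv.Key != "node_event_id" && kv.Key != "node_name"
			},
		},
	)
	return sdkmetric.NewMeterProvider(sdkmetric.WithView(dropUnbounded))
}

func main() {
	provider := newMeterProvider()
	_ = provider // wire this into the exporter/meter setup instead of the default provider
}
```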

Environment

  • NTH App Version: v1.19.0
  • NTH Mode (IMDS/Queue processor): Queue processor
  • Kubernetes version: v1.26.9-eks
  • Installation method: helm chart

Metadata

Assignees: no one assigned
Labels: stalebot-ignore (to NOT let the stalebot update or close the Issue / PR)
Type: none
Projects: none
Milestone: none
Relationships: none yet
Development: no branches or pull requests