
"Metric leak" which results in restarts caused by out-of-memory #925

@peterbueschel

Description

Describe the bug

The number of exposed metric samples for the aws-node-termination-handler increases continuously until it reaches the configured default sample_limit of 5000 (see the upstream Helm chart). At that point Prometheus marks the target as unhealthy and stops scraping it.

(Screenshot: number of exposed samples increasing over time until the sample_limit is reached.)

At the same time, CPU and memory utilization also increase:
(Screenshot: CPU and memory utilization rising over the same period.)

Memory usage increases until it reaches the configured limit and the container is killed with an OOM signal.

-> The actions_node metric appears to be the affected one: its time series keep accumulating.
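
For context, the pattern below reproduces this kind of growth: every node event is recorded with a unique node_event_id (and node_name) attribute, so each event creates a brand-new attribute set and therefore a new time series that is exported forever. This is only an illustrative sketch against the current OpenTelemetry Go API (NTH v1.19.0 ships the much older 0.20.0 SDK, and the actual instrumentation code may differ); the metric and attribute names are taken from the example entry further down.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// Illustrative setup only; no exporter is wired up here.
	provider := sdkmetric.NewMeterProvider()
	meter := provider.Meter("nth-sketch")

	actions, _ := meter.Int64Counter("actions_node")

	// Each drained node carries its own event ID, so every Add() below uses a
	// new attribute set and creates a new time series on the /metrics page.
	// Nothing ever removes these series again, which is why the exposition
	// keeps growing until the sample_limit and the memory limit are hit.
	for _, id := range []string{"asg-lifecycle-term-1", "asg-lifecycle-term-2"} {
		actions.Add(context.Background(), 1,
			metric.WithAttributes(
				attribute.String("node_action", "cordon-and-drain"),
				attribute.String("node_event_id", id),
				attribute.String("node_status", "success"),
			))
	}
}
```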

Related

Similar behavior was reported in #665, but that ticket was closed without any action or fix.

Steps to reproduce
Deploy the NTH in a K8s cluster with node events: the number of time series on the metrics endpoint will increase but never decrease. The entries for drained nodes that are no longer part of the cluster stay on the metrics page (see the helper sketched below the example entry).

Example entry:
actions_node{node_action="cordon-and-drain",node_event_id="asg-lifecycle-term-6231<removed>",node_name="<removed>",node_status="success",service_name="unknown_service:node-termination-handler",telemetry_sdk_language="go",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="0.20.0"} 1
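
To watch the growth directly, a throwaway helper like the following can be run repeatedly against the metrics endpoint; the count only ever goes up. This is a hypothetical snippet, not part of NTH, and the localhost:9092 address assumes a port-forward to the pod's Prometheus server port (adjust to your setup).

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// e.g. kubectl port-forward deploy/aws-node-termination-handler 9092:9092
	resp, err := http.Get("http://localhost:9092/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Count the exposed actions_node series; repeated runs show the number
	// growing monotonically as nodes are cordoned and drained.
	count := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "actions_node{") {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("actions_node series:", count)
}
```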

Expected outcome
The entries for drained nodes on the metrics endpoint should be cleaned up after some time.
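
One possible mitigation, purely as a sketch and not something the chart currently offers, would be to drop the unbounded attributes at the SDK level with a metrics view so that all events collapse into a small, fixed set of series. The API shown is the current go.opentelemetry.io/otel/sdk/metric one, not the 0.20.0 SDK used by v1.19.0, and it changes what the metric reports (aggregate counts per action/status instead of one sample per node event), so whether that trade-off is acceptable is up to the maintainers.

```go
package main

import (
	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newMeterProvider returns a provider whose view strips the per-event
// attributes from actions_node, so a drained node no longer leaves a
// permanent, unique time series behind.
func newMeterProvider() *sdkmetric.MeterProvider {
	dropUnbounded := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "actions_node"},
		sdkmetric.Stream{
			AttributeFilter: func(kv attribute.KeyValue) bool {
				// node_event_id and node_name are unique per event and are
				// what makes the series count grow without bound.
				return kv.Key != "node_event_id" && kv.Key != "node_name"
			},
		},
	)
	return sdkmetric.NewMeterProvider(sdkmetric.WithView(dropUnbounded))
}

func main() {
	provider := newMeterProvider()
	_ = provider // wire this into the exporter/meter setup instead of the default provider
}
```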

Environment

  • NTH App Version: v1.19.0
  • NTH Mode (IMDS/Queue processor): Queue processor
  • Kubernetes version: v1.26.9-eks
  • Installation method: helm chart

Metadata

Assignees: no one assigned
Labels: stalebot-ignore (to NOT let the stalebot update or close the Issue / PR)
Type: none
Projects: none
Milestone: none
Relationships: none yet
Development: no branches or pull requests