As part of the work on a private SDH, we discovered that if the period used in the kubernetes module for metrics scraping is larger than the expiration time of an internal cache used to store those metrics, the cache can appear empty to some metricsets, with the side effect of computing incorrect metrics.
This internal cache is defined in metricbeat/module/kubernetes/util/metrics_cache.go with a hardcoded expiration time of 2 minutes. Say the scraping period of the k8s module is 10 minutes: the cache is refilled every 10 minutes by the scrapers but it expires after 2 minutes, meaning that it is empty for the remaining 8 minutes until the next scrape. There are multiple metricsets that fill this cache, so the order in which each metricset executes might cause some of them to find the cache empty.
This cache contains metrics for:
- node.memory.allocatable
- node.cores.allocatable
- container.mem.limit
- container.cores.limit
These metrics are filled by enrichers in the various metricsets.
Some metricsets, like kubernetes.container, use those metrics to calculate other metrics, such as the following percentages:
- container.memory.usage.limit.pct
- container.cpu.usage.limit.pct
The following pseudocode illustrates the issue:
if metric.Get("container.mem.limit") == 0:
    memLimit = metric.GetWithDefault("node.memory.allocatable", 0)
else:
    memLimit = metric.Get("container.mem.limit")
if memLimit > 0:
    metric.Put("container.memory.usage.limit.pct", metric.Get("container.memory.usage.bytes") / memLimit)
If the memory limit of the container (in the cache) is 0, memLimit falls back to the allocatable memory of the node (or 0 if that is missing as well). If the final value of memLimit is 0, the percentage of the memory usage of the container is not exported at all.
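The fallback logic above can be sketched in Go as a pure function. This is an illustrative translation of the pseudocode, not the actual metricset code; the function name and parameters are made up for the example. When the cache appears empty, both containerLimit and nodeAllocatable come back as 0, so the metric is silently dropped.

```go
package main

import "fmt"

// usageLimitPct mirrors the pseudocode: prefer the container's memory
// limit, fall back to the node's allocatable memory, and report whether
// the percentage would be exported at all. A cache miss (or an expired
// cache entry) shows up here as a 0 for both limit inputs.
func usageLimitPct(usageBytes, containerLimit, nodeAllocatable float64) (float64, bool) {
	memLimit := containerLimit
	if memLimit == 0 {
		memLimit = nodeAllocatable // fall back to node allocatable
	}
	if memLimit <= 0 {
		return 0, false // metric is not exported at all
	}
	return usageBytes / memLimit, true
}

func main() {
	// Healthy cache: limit missing, node allocatable available.
	pct, ok := usageLimitPct(4e9, 0, 16e9)
	fmt.Println(pct, ok) // 0.25 true

	// Expired cache: both inputs read as 0, metric is dropped.
	_, ok = usageLimitPct(4e9, 0, 0)
	fmt.Println(ok) // false
}
```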
The problem should be fixed by setting the cache expiration time to be strictly larger than the maximum period of all the scrapers, so that all the scrapers have the same view of the cache. We suggest setting the cache expiration to 2x the maximum period of all the scrapers.