As part of the work on a private SDH, we discovered that if the period used in the kubernetes module for metrics scraping is larger than the expiration time of an internal cache used to store those metrics, the cache can appear empty to some metricsets, with the side effect of computing incorrect metrics.
This internal cache is defined in metricbeat/module/kubernetes/util/metrics_cache.go with a hardcoded expiration time of 2 minutes. Say the scraping period of the k8s module is 10 minutes: the cache is refilled every 10 minutes by the scrapers but it expires after 2 minutes, meaning that it is empty for the remaining 8 minutes until the next scrape. There are multiple metricsets that fill this cache, so the order in which each metricset executes might cause some of them to find the cache empty.
This cache contains metrics for:
- node.memory.allocatable
- node.cores.allocatable
- container.mem.limit
- container.cores.limit
These metrics are filled by enrichers in the various metricsets.
Some metricsets, like kubernetes.container, use those metrics to calculate other metrics, such as the following percentages:
- container.memory.usage.limit.pct
- container.cpu.usage.limit.pct
The following pseudocode illustrates the issue:
if metric.Get("container.mem.limit") == 0:
    memLimit = metric.GetWithDefault("node.memory.allocatable", 0)
else:
    memLimit = metric.Get("container.mem.limit")
if memLimit > 0:
    metric.Put("container.memory.usage.limit.pct", metric.Get("container.memory.usage.bytes") / memLimit)
If the memory limit of the container (in the cache) is 0, memLimit falls back to the allocatable memory of the node (or 0 if that is missing as well). If the final value of memLimit is 0, the percentage of the memory usage of the container is not exported at all.
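The fallback logic above can be sketched in Go as a pure function. This is an illustrative translation of the pseudocode, not the actual metricset code; the function name and parameters are made up for the example. When the cache appears empty, both containerLimit and nodeAllocatable come back as 0, so the metric is silently dropped.

```go
package main

import "fmt"

// usageLimitPct mirrors the pseudocode: prefer the container's memory
// limit, fall back to the node's allocatable memory, and report whether
// the percentage would be exported at all. A cache miss (or an expired
// cache entry) shows up here as a 0 for both limit inputs.
func usageLimitPct(usageBytes, containerLimit, nodeAllocatable float64) (float64, bool) {
	memLimit := containerLimit
	if memLimit == 0 {
		memLimit = nodeAllocatable // fall back to node allocatable
	}
	if memLimit <= 0 {
		return 0, false // metric is not exported at all
	}
	return usageBytes / memLimit, true
}

func main() {
	// Healthy cache: limit missing, node allocatable available.
	pct, ok := usageLimitPct(4e9, 0, 16e9)
	fmt.Println(pct, ok) // 0.25 true

	// Expired cache: both inputs read as 0, metric is dropped.
	_, ok = usageLimitPct(4e9, 0, 0)
	fmt.Println(ok) // false
}
```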
The problem should be fixed by setting the cache expiration time to be strictly larger than the maximum period of all the scrapers, so that all the scrapers have the same view of the cache. We suggest setting the cache expiration to 2x the maximum period of all the scrapers.