Description of the issue
APM agents currently send system metrics whose keys and values are aligned with Metricbeat's metricsets. These cover the system.* metricsets and some platform-specific metrics (see the Java agent documentation for an example).
However, these system metrics are inaccurate when monitoring containers. The most obvious miscalculation is that agents currently collect the host's total memory rather than the effective cgroup limit. There are also considerable differences in the used bytes, depending on how they are retrieved, and in CPU usage relative to the cgroup quota.
Proposed solution
Introducing new cgroup metrics
As a first step, the new metrics will include:
system.process.cgroup.memory.mem.limit.bytes
system.process.cgroup.memory.mem.usage.bytes
Both are optional.
Both are numeric values representing a number of bytes.
When not available, these metrics should not be sent.
In the future, we may extend this to collect and show additional memory metrics, as well as CPU metrics.
APM UI
System memory usage will be calculated from the cgroup metrics when they are available, as mem.usage.bytes / mem.limit.bytes. Otherwise, the existing system.memory metrics are used.
NOTE: when a cgroup has no explicit memory limit, the limit read from the corresponding file may be 9223372036854771712 (equivalent to 0x7ffffffffffff000), which effectively means unlimited.
Agents conforming to the spec should not send this value; they should omit the limit metric (mem.limit.bytes) in that case.
Formalizing that in pseudocode:
// Prefer the cgroup limit; fall back to host total memory.
var total = system.process.cgroup.memory.mem.limit.bytes;
if (total == NA) {
    total = system.memory.total;
}
// Prefer the cgroup usage; fall back to host used memory.
var used = system.process.cgroup.memory.mem.usage.bytes;
if (used == NA) {
    used = system.memory.total - system.memory.actual.free;
}
var usage = used / total;
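As a rough illustration, the pseudocode above could be written as runnable Python. This is only a sketch: the function name memory_usage, the flat dict of metric values, and returning None when no ratio can be computed are assumptions made for the example, not part of the proposal.

```python
from typing import Mapping, Optional

CGROUP_LIMIT = "system.process.cgroup.memory.mem.limit.bytes"
CGROUP_USAGE = "system.process.cgroup.memory.mem.usage.bytes"
HOST_TOTAL = "system.memory.total"
HOST_FREE = "system.memory.actual.free"


def memory_usage(sample: Mapping[str, float]) -> Optional[float]:
    """Return used/total memory as a ratio, preferring cgroup metrics.

    `sample` is a hypothetical flat mapping of metric name to value;
    a metric that was not sent is simply absent from the mapping.
    """
    # Prefer the cgroup limit; fall back to host total memory.
    total = sample.get(CGROUP_LIMIT)
    if total is None:
        total = sample.get(HOST_TOTAL)

    # Prefer the cgroup usage; fall back to host used memory.
    used = sample.get(CGROUP_USAGE)
    if used is None:
        host_total = sample.get(HOST_TOTAL)
        host_free = sample.get(HOST_FREE)
        if host_total is not None and host_free is not None:
            used = host_total - host_free

    # Assumption: report "no value" instead of dividing by a missing or zero total.
    if not total or used is None:
        return None
    return used / total
```

Note that when an agent omits the limit because the cgroup is unlimited but still sends the usage, this computes cgroup usage over host total memory, which matches what the pseudocode does.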
Agent implementation details
https://github.com/elastic/apm/blob/master/specs/agents/metrics.md#cgroup-metrics
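For illustration only, here is a minimal Python sketch of the agent-side rule described in this issue: send each metric only when it is available, and omit the limit when it equals the unlimited value. The function names and the choice to pass raw file contents as strings are hypothetical; the linked spec remains the authoritative source for how agents locate and read the cgroup files.

```python
from typing import Dict, Optional

# Value from the note in this issue: read when the cgroup has no explicit memory limit.
UNLIMITED = 0x7FFFFFFFFFFFF000  # 9223372036854771712


def parse_bytes(raw: Optional[str]) -> Optional[int]:
    """Parse raw file content as a byte count; None if missing or not a plain number."""
    if raw is None:
        return None
    text = raw.strip()
    # Rejects empty content and non-numeric content such as the literal "max".
    if not (text.isascii() and text.isdigit()):
        return None
    return int(text)


def cgroup_memory_metrics(limit_raw: Optional[str],
                          usage_raw: Optional[str]) -> Dict[str, int]:
    """Build the metrics to send; unavailable or unlimited values are left out."""
    metrics: Dict[str, int] = {}

    limit = parse_bytes(limit_raw)
    if limit is not None and limit != UNLIMITED:
        metrics["system.process.cgroup.memory.mem.limit.bytes"] = limit

    usage = parse_bytes(usage_raw)
    if usage is not None:
        metrics["system.process.cgroup.memory.mem.usage.bytes"] = usage

    return metrics
```

Taking the file contents as arguments keeps the sketch independent of where the files live, so the omission rule can be checked without a container.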
Related issues