Add basic cgroup CPU metrics#21029
Add basic cgroup CPU metrics#21029jasontedor merged 8 commits intoelastic:masterfrom jasontedor:cgroups
Conversation
This commit adds basic cgroup CPU metrics to the node stats API.
This commit adds some forbidden API suppressions to methods in OsProbe that are reading from various cgroup virtual files.
This commit adds a file permission for /sys/fs/cgroup/cpu/- so that the OS probe can read from here when reading CPU cgroup stats.
This commit fixes the name of a method in OsProbe. In particular, a method that previously got the control group for the current process for the cpuacct sub-system has since evolved to get all control groups for the current process. This commit reflects this by renaming this method.
|
Since a review hasn't started, I'm going to rebase this on master to resolve merge conflicts introduced by #21037. |
This commit adds Javadocs for the code that obtains the basic cgroup stats.
s1monw
left a comment
There was a problem hiding this comment.
I left a bunch of comments look good though
| // note that Matcher#matches must be invoked as | ||
| // matching is lazy; this can not happen in an assert | ||
| // as assertions might not be enabled | ||
| if (!matcher.matches()) { |
There was a problem hiding this comment.
can we do it like this:
final boolean matches = matcher.matches();
assert matches : line;more readable?
| final List<String> lines = Files.readAllLines(path); | ||
| assert lines != null && lines.size() == 1; | ||
| return lines.get(0); | ||
| } catch (final IOException e) { |
There was a problem hiding this comment.
do we really need this leniency here everywhere? I mean can't we catch the expection on a higher level and skip the entire stats? I feel this kind of stuff will just hide problems from us and adds lots of unnecessary lenient code?
| out.writeString(cpuControlGroup); | ||
| out.writeLong(cpuCfsPeriodMicros); | ||
| out.writeLong(cpuCfsQuotaMicros); | ||
| if (cpuStat == null) { |
There was a problem hiding this comment.
use writeOptionalWriteable here?
| cpuControlGroup = in.readString(); | ||
| cpuCfsPeriodMicros = in.readLong(); | ||
| cpuCfsQuotaMicros = in.readLong(); | ||
| if (!in.readBoolean()) { |
There was a problem hiding this comment.
in.readOptionalWriteable(CpuStat::new);
| @Override | ||
| public XContentBuilder toXContent(final XContentBuilder builder, final Params params) throws IOException { | ||
| builder.startObject("cgroup"); | ||
| { |
| * @return the lines from {@code /proc/self/cgroup} or an empty | ||
| * list. | ||
| */ | ||
| @SuppressForbidden(reason = "access /proc/self/cgroup") |
There was a problem hiding this comment.
can we please resolve this in Environment and pass it through via MonitorService -> OsService -> OsStats I think we should really try to not use PathUtils.get at all.
There was a problem hiding this comment.
Can this be a follow-up? We have to push Environment all the way down from Bootstrap because we initialize the probes there before the security manager is installed.
There was a problem hiding this comment.
it can totally be a followup but MonitorService already has it no?
This commit improves the readability of control group matching in OsProbe.
This commit removes some leniency in the handling of I/O exceptions during the reading of cgroup stats.
|
Thanks for the review @s1monw; I've responded to your feedback. |
This commit fixes an issue when cgroups are not available and thus the cgroup field is null. This can happen on non-Linux systems, for example, or Linux kernels where cgroups are not compiled in.
|
LGTM provided the current commits CI run passes I really want to follow up on the PathUtils stuff mentioned, please open a followup for this, Progress over Perfection. |
|
test this please |
1 similar comment
|
test this please |
|
@s1monw CI passes; I'll merge soon and open a few follow-ups. Thanks for reviewing. |
This commit adds basic cgroup CPU metrics to the node stats API. Relates #21029
This commit fixes a silly doc bug where the field that represents the total CPU time consumed by all tasks in the same cgroup was mistakenly reported as "usage" instead of "usage_nanos". Relates #21029
This commit fixes a silly doc bug where the field that represents the total CPU time consumed by all tasks in the same cgroup was mistakenly reported as "usage" instead of "usage_nanos". Relates #21029
This commit fixes a silly doc bug where the field that represents the total CPU time consumed by all tasks in the same cgroup was mistakenly reported as "usage" instead of "usage_nanos". Relates #21029
* Add basic cgroup CPU metrics (elastic/elasticsearch#21029)
This commit adds basic cgroup CPU metrics to the node stats API.