CC @unixsurfer @elastic/cloud-infra @henrikno
Question 1: Dealing with breaking changes in CGroupsV2
So, I'm currently working on cgroups V2 support, which is a tad overdue. One conceptual problem I'm running into is what to do in cases where metrics differ between cgroups V1 and V2. The most obvious case is CPU metrics. In cgroups V1, we have separate `cpu` and `cpuacct` controllers. In V2, there is no `cpuacct` controller, and some of its metrics have moved to the `cpu` controller. In addition, the `blkio` controller has been renamed to `io`. This poses a bit of a problem, as we report metrics in a fairly transparent way:
    "cgroup": {
        "blkio": {
            ...
        },
        "cpu": {
            ....
        },
        "cpuacct": {
            ...
        },
        ....
    }
If we maintained this schema for cgroupsV2, we could effectively break metrics for users upgrading their infra to V2, as a variety of metrics would be in a different part of the event data structure. However, shoehorning V2 metrics into V1 names could be deceptive and confusing to users.
How do we feel about this? What's the best way to deal with what are effectively breaking changes in the underlying metrics we're reporting? A few ideas:
- Alias fields where appropriate.
- Add a `"version":"v2"` field somewhere, and report two different data structures.
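To make the two ideas concrete, here is a rough Go sketch of how they might combine. Everything in it is hypothetical: `BuildEvent`, the `v1Aliases` table, and the sample metric values are made up for illustration and are not existing Metricbeat code. The only alias listed is the `blkio` to `io` rename mentioned above; the `cpuacct` case would need field-level aliases rather than a whole-controller copy.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// v1Aliases maps a V2 controller name to the V1 name that existing
// dashboards may still query. Hypothetical table for illustration only.
var v1Aliases = map[string]string{
	"io": "blkio",
}

// BuildEvent wraps per-controller stats in a "cgroup" object and tags it
// with the cgroup version. When alias is true and the version is "v2",
// renamed controllers are also exposed under their old V1 name.
func BuildEvent(version string, controllers map[string]interface{}, alias bool) map[string]interface{} {
	cgroup := map[string]interface{}{"version": version}
	for name, stats := range controllers {
		cgroup[name] = stats
		if !alias || version != "v2" {
			continue
		}
		if old, ok := v1Aliases[name]; ok {
			// Never overwrite a controller that was actually reported.
			if _, exists := controllers[old]; !exists {
				cgroup[old] = stats
			}
		}
	}
	return map[string]interface{}{"cgroup": cgroup}
}

func main() {
	// Sample values are invented; the keys mirror fields found in
	// V2's cpu.stat and io.stat files.
	event := BuildEvent("v2", map[string]interface{}{
		"cpu": map[string]interface{}{"usage_usec": 1200},
		"io":  map[string]interface{}{"rbytes": 4096},
	}, true)

	out, err := json.MarshalIndent(event, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The trade-off is that aliasing keeps old queries working at the cost of duplicated data, while the `version` field lets consumers tell which layout they are looking at.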
Question 2: A CGroup Metricset?
This conversation also provides us with the opportunity to float the idea of (eventually) migrating all of the cgroup data over to its own metricset. CGroup data is absolutely massive, and it takes up the majority of a given `system/process` event. It might benefit us to simply move all of this over to its own metricset. In addition to cleaning up the huge amount of cgroup data in `system/process`, this would also allow us to report more cgroup data that isn't necessarily tied to a given process. This would obviously be done in a non-breaking way. There are a few different ways to do this, with per-process metrics reported either as separate events or as a map in a single per-controller event. The cgroup data in `system/process` would then be limited to the metadata needed to tie a process to a given cgroup controller. Is this something we've considered before?
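For the sake of discussion, here is a small Go sketch of the two event shapes described above. The `ProcStat` type, the `Event` alias, and both function names are hypothetical and only meant to show the difference in layout, not a proposed implementation.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// ProcStat is a hypothetical per-process sample for one cgroup controller.
type ProcStat struct {
	PID        int
	Controller string
	Metrics    map[string]uint64
}

// Event is a stand-in for a metricset event.
type Event map[string]interface{}

// PerProcessEvents emits one event per (process, controller) sample.
func PerProcessEvents(stats []ProcStat) []Event {
	events := make([]Event, 0, len(stats))
	for _, s := range stats {
		events = append(events, Event{
			"controller": s.Controller,
			"pid":        s.PID,
			"metrics":    s.Metrics,
		})
	}
	return events
}

// PerControllerEvents emits one event per controller, with per-process
// metrics stored in a map keyed by PID.
func PerControllerEvents(stats []ProcStat) []Event {
	byController := map[string]map[string]interface{}{}
	for _, s := range stats {
		if byController[s.Controller] == nil {
			byController[s.Controller] = map[string]interface{}{}
		}
		byController[s.Controller][strconv.Itoa(s.PID)] = s.Metrics
	}

	// Sort controller names so the output order is deterministic.
	names := make([]string, 0, len(byController))
	for name := range byController {
		names = append(names, name)
	}
	sort.Strings(names)

	events := make([]Event, 0, len(names))
	for _, name := range names {
		events = append(events, Event{
			"controller": name,
			"processes":  byController[name],
		})
	}
	return events
}

func main() {
	// Invented sample data.
	stats := []ProcStat{
		{PID: 101, Controller: "cpu", Metrics: map[string]uint64{"usage_usec": 1200}},
		{PID: 102, Controller: "cpu", Metrics: map[string]uint64{"usage_usec": 300}},
		{PID: 101, Controller: "io", Metrics: map[string]uint64{"rbytes": 4096}},
	}
	fmt.Println("separate events:", len(PerProcessEvents(stats)))
	fmt.Println("per-controller events:", len(PerControllerEvents(stats)))
}
```

Separate events are easier to filter and aggregate by PID, while the per-controller map produces fewer events but creates a field per PID in the document.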