node: cpumgr: metrics: add uncore cache alignment metrics#129529
node: cpumgr: metrics: add uncore cache alignment metrics#129529ffromani wants to merge 5 commits intokubernetes:masterfrom
Conversation
|
Skipping CI for Draft Pull Request. |
75ae4d0 to
8d1196c
Compare
|
/test pull-kubernetes-node-kubelet-serial-containerd |
|
/hold mistakes in the e2e tests |
8d1196c to
3945905
Compare
|
/hold cancel e2e tests should be fixed now |
|
/test pull-kubernetes-node-kubelet-serial-containerd |
|
known issue, investigating |
cbf7801 to
5eb4bba
Compare
|
/hold cancel needs more approval anyway :) |
ffromani
left a comment
There was a problem hiding this comment.
notes to self:
things I'm ok with but not completely happy still
| limitations under the License. | ||
| */ | ||
|
|
||
| package cpumanager |
There was a problem hiding this comment.
still not completely happy about where this code lives
There was a problem hiding this comment.
I think this all falls under topology alignment so topology seems like the right place.
There was a problem hiding this comment.
My only concern is that I want to merge the alignment logic with physical-cpus alignment logic once this PR merged and moving into topology could backfire on this.
| return fmt.Sprintf("aligned=<uncore:%v>", ca.UncoreCache) | ||
| } | ||
|
|
||
| func NewAlignment(topo *topology.CPUTopology, cpus cpuset.CPUSet) Alignment { |
There was a problem hiding this comment.
should probably be renamed CheckAlignment or so
| } | ||
| } | ||
|
|
||
| func isAlignedAtUncoreCache(topo *topology.CPUTopology, cpuList ...int) bool { |
There was a problem hiding this comment.
does this belong in topology?
We know we want to clean up the current code and merge the physical cpu alignment check logic with this one (makes no sense to split it)
There was a problem hiding this comment.
I tend to agree! I think this should be part of topology.
swatisehgal
left a comment
There was a problem hiding this comment.
Thanks for your work on this!
I left a few comments but will do another pass later.
I found it a bit hard to review this because there was quiet some refactoring coupled with new additions. Can we try to split the work into smaller commits to split the refactoring work and new metrics addition?
| "kubelet_container_aligned_compute_resources_count": gstruct.MatchElements(idFn, gstruct.IgnoreExtras, gstruct.Elements{ | ||
| "container::uncore_cache": timelessSample(1), | ||
| }), | ||
| }) |
There was a problem hiding this comment.
Can you help me understand this? We have one metric that captures alignment at various boundaries (e.g. NUMA, uncore_cache, physical CPUs) and here we check for an element (in this case uncore_cache) in that? Why this approach rather than separate alignment metrics altogether?
There was a problem hiding this comment.
Because AFAIK is more idiomatic in prometheus to have related logical metrics grouped in the same metric and differentiate by labels. There are other examples in the codebase, albeit our metrics are kinda inconsistent
| limitations under the License. | ||
| */ | ||
|
|
||
| package cpumanager |
There was a problem hiding this comment.
I think this all falls under topology alignment so topology seems like the right place.
| } | ||
| } | ||
|
|
||
| func isAlignedAtUncoreCache(topo *topology.CPUTopology, cpuList ...int) bool { |
There was a problem hiding this comment.
I tend to agree! I think this should be part of topology.
5eb4bba to
ac47d9d
Compare
|
/retest-required unrelared apiserver failure |
|
/test pull-kubernetes-node-kubelet-serial-containerd |
|
/triage accepted |
ac47d9d to
9107fdb
Compare
abort the test in the unlikely case we fail to create the Policy object. Exactly because this is unlikely we should fail loudly. In my case this happened because unrelated bad parameters, which was hard to catch Signed-off-by: Francesco Romani <fromani@redhat.com>
There's no need to use the generic reflect.DeepEqual, we can use the cpuset Equals method in tests, which is more idiomatic and a tad clearer. Signed-off-by: Francesco Romani <fromani@redhat.com>
get nicer output for free Signed-off-by: Francesco Romani <fromani@redhat.com>
fix test name. Cosmetic only. Signed-off-by: Francesco Romani <fromani@redhat.com>
9107fdb to
9b4d13a
Compare
add missing metric about uncore / L3 / Last-Level cache alignment, plus its e2e tests. Exposing uncore alignment requires a bit of refactoring in the static policy implementation because, differently from full PCPUs alignment and NUMA alignment, can't be easily and safely inferred by construction. The main reason for this is that uncore cache alignment is preferred, not mandatory, thus the cpu allocator can legally use cross-uncore allocation. Because of that, the final cpuset union step can create a final cpuset which is not uncore-aligned even though all its parts are uncore-aligned. The safest way seems thus to run just a final uncore-alignment check once the final cpuset is computed. Signed-off-by: Francesco Romani <fromani@redhat.com>
9b4d13a to
d0c7d68
Compare
|
/retest-required |
|
continuing on #130133 |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Add alignment metrics for uncore (aka last-level) cache (KEP-4800)
Which issue(s) this PR fixes:
Part of the work needed for KEP 4800: kubernetes/enhancements#4800
Special notes for your reviewer:
N/A
Does this PR introduce a user-facing change?