kubelet: improve CRI stats for resource metrics and testing #135604
|
Please note that we're already in Test Freeze for the Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Dec 5 03:34:55 UTC 2025. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dims. The full list of commands accepted by this bot can be found here. The pull request process is described here. |
|
(some testing done in containerd/containerd#12620 using branch https://github.com/dims/kubernetes/tree/add-logs-to-kubelet-metrics) |
42f4108 to e68d29f
|
/assign @SergeyKanzhelev @mrunalp |
|
/assign @haircommander @mrunalp |
properly support the resource metrics endpoint when `PodAndContainerStatsFromCRI` is enabled and fix the related e2e tests.

Stats Provider:
- add container-level CPU and memory stats to `ListPodCPUAndMemoryStats` so the resource metrics endpoint has complete data
- add `aggregatePodSwapStats` to compute pod-level swap from container stats (CRI doesn't provide pod-level swap directly)
- add missing memory stats fields: `AvailableBytes`, `PageFaults`, and `MajorPageFaults`
- add platform-specific implementations for Linux and Windows

Tests:
- skip cAdvisor metrics test when `PodAndContainerStatsFromCRI` is enabled (cAdvisor metrics aren't available in that mode)
- fix expected metrics in `ResourceMetricsAPI` test
  - `node_swap_usage_bytes` is only available with cAdvisor (need to verify!)
- add `dumpResourceMetricsForPods` helper to log actual metric values when tests fail, making debugging easier

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
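The idea behind `aggregatePodSwapStats` (summing per-container swap because CRI reports no pod-level value) can be illustrated with a minimal sketch. This is not the code from the PR: the `SwapStats` and `ContainerStats` types below are simplified, hypothetical stand-ins for the kubelet stats types, and `aggregateSwap` is an illustrative name. It assumes that containers without swap data are skipped and that a pod with no reporting containers yields `nil`.

```go
package main

import "fmt"

// SwapStats is a simplified, hypothetical stand-in for the kubelet's
// swap stats type; only the usage field is modeled here.
type SwapStats struct {
	SwapUsageBytes *uint64
}

// ContainerStats is a simplified, hypothetical stand-in for
// per-container stats as reported through CRI.
type ContainerStats struct {
	Name string
	Swap *SwapStats
}

// aggregateSwap sums swap usage over all containers that report it.
// It returns nil when no container reports swap usage, so callers can
// tell "no data" apart from "zero bytes used".
func aggregateSwap(containers []ContainerStats) *SwapStats {
	var total uint64
	found := false
	for _, c := range containers {
		if c.Swap == nil || c.Swap.SwapUsageBytes == nil {
			continue // this container has no swap data; skip it
		}
		total += *c.Swap.SwapUsageBytes
		found = true
	}
	if !found {
		return nil
	}
	return &SwapStats{SwapUsageBytes: &total}
}

func main() {
	a, b := uint64(100), uint64(50)
	pod := []ContainerStats{
		{Name: "app", Swap: &SwapStats{SwapUsageBytes: &a}},
		{Name: "sidecar", Swap: &SwapStats{SwapUsageBytes: &b}},
		{Name: "no-swap"}, // reports nothing and is ignored
	}
	if s := aggregateSwap(pod); s != nil {
		fmt.Println(*s.SwapUsageBytes)
	}
}
```

Returning `nil` rather than a zero value is a design choice made for this sketch; whether the actual implementation does the same is not stated in the commit message.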
e68d29f to 914ddf4
|
Eventually we could consider extending CRI so that the CRI implementation aggregates swap for the pod, if that's a value we want to rely on.
/lgtm |
|
LGTM label has been added. Git tree hash: 4b2915099db1e3c7b24a07bab557b86c9d35937a |
|
/triage accepted |
|
Retesting failed PR that otherwise appears ready for merge. Please help us fix flaky tests by following our Flaky Tests Guide. Prevent this bot from retesting with /retest-required |
properly support the resource metrics endpoint when `PodAndContainerStatsFromCRI` is enabled and fix the related e2e tests.

Stats Provider:
- add container-level CPU and memory stats to `ListPodCPUAndMemoryStats` so the resource metrics endpoint has complete data
- add `aggregatePodSwapStats` to compute pod-level swap from container stats (CRI doesn't provide pod-level swap directly)
- add missing memory stats fields: `AvailableBytes`, `PageFaults`, and `MajorPageFaults`

Tests:
- skip cAdvisor metrics test when `PodAndContainerStatsFromCRI` is enabled (cAdvisor metrics aren't available in that mode)
- fix expected metrics in `ResourceMetricsAPI` test
  - `node_swap_usage_bytes` is only available with cAdvisor (need to verify!)
- add `dumpResourceMetricsForPods` helper to log actual metric values when tests fail, making debugging easier

What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Which issue(s) this PR is related to:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: