node: cpumanager: add options to reject non SMT-aligned workload#101432
k8s-ci-robot merged 6 commits into kubernetes:master
Conversation
|
/cc @fromanirh |
|
For some more context, at the April 20 sig-node meeting we presented this work, but also talked in general about how to extend cpumanager. We thought about demonstrating a prototype, but this change is pretty small and self-contained, so we're presenting the actual complete PR alongside the KEP (kubernetes/enhancements#2626). To elaborate on the magnitude and the scope of the changes, the label (XXL) is a bit deceptive:
$ git log -6 -p | diffstat | grep -v test
b/pkg/kubelet/cm/container_manager_linux.go | 23
b/pkg/kubelet/cm/container_manager_stub.go | 2
b/pkg/kubelet/cm/cpumanager/cpu_manager.go | 90 +-
b/pkg/kubelet/cm/cpumanager/fake_cpu_manager.go | 10
b/pkg/kubelet/cm/cpumanager/policy.go | 5
b/pkg/kubelet/cm/cpumanager/policy_none.go | 9
b/pkg/kubelet/cm/cpumanager/policy_smtaware.go | 85 ++
b/pkg/kubelet/cm/cpumanager/policy_static.go | 7
b/pkg/kubelet/cm/fake_container_manager.go | 2
The vast majority of the changes are about tests:
$ git log -6 -p | diffstat | grep test
/pkg/kubelet/cm/cpumanager/policy_smtaware_test.go | 840 ++++++++++++++++++++
b/pkg/kubelet/cm/cpumanager/cpu_manager_test.go | 158 +++
b/pkg/kubelet/cm/cpumanager/policy_smtaware_test.go | 44 -
pkg/kubelet/cm/cpumanager/cpu_manager_test.go | 5
klueska
left a comment
Left a few comments. If you follow my suggestions, I think the surface area of the change could actually be reduced quite a bit.
That said -- how do you plan to expose this policy to the end user?
As a new value to the --cpu-manager-policy kubelet flag?
None of those changes seem to be part of this PR.
I'm asking because...
This is still technically a static cpu-manager-policy. The name is meant to represent the fact that the cpusets granted to pods in the guaranteed QoS class are static, i.e. they never change. Your new policy still meets that criterion, it just tweaks it slightly to force whole-numbered CPUs to be requested.
As such, does it maybe make sense to introduce a new flag for something like this?
At NVIDIA, we also have a small tweak to the static policy on our internal fork, which I never proposed upstream because I didn't know how best to fit it into the existing framework.
If we had a new flag (or a way of attaching attributes to the existing flag), I think that would work a lot better. It would also mean that we wouldn't need to define a whole new policy, but could instead pass an attributes struct to the existing static policy with knobs for how to customize it.
func (m *manager) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	klog.InfoS("CPUManager Admit Handler")
	pod := attrs.Pod
	return m.policy.Admit(pod)
}

func smtAlignmentError(requestedCPUs, cpusPerCore int) lifecycle.PodAdmitResult {
	errorMessage := fmt.Sprintf("Number of CPUs requested should be a multiple of number of CPUs on a core = %d on this system. Requested CPU count = %d", cpusPerCore, requestedCPUs)
	return lifecycle.PodAdmitResult{
		Message: errorMessage,
		Reason:  "SMTAlignmentError",
		Admit:   false,
	}
}

func admitPod() lifecycle.PodAdmitResult {
	return lifecycle.PodAdmitResult{
		Admit: true,
	}
}
I don't think we want a new Admit handler here.
We can consider reworking the way Allocate() propagates errors to the existing Admit handler, but I'm not in favor of adding a new one like this.
cpusPerCore := p.topology.CPUsPerCore()
for _, container := range append(pod.Spec.InitContainers, pod.Spec.Containers...) {
	if requestedCPUs := p.guaranteedCPUs(pod, &container); requestedCPUs > 0 && (requestedCPUs%cpusPerCore) != 0 {
		// Since CPU Manager has been enabled with the `smtaware` policy, a guaranteed pod can only be admitted
		// if the CPU request is a multiple of the number of virtual cpus per physical core.
		// If the CPU request is not a multiple of the number of virtual cpus per physical core, the Pod will be put
		// in the Failed state, with SMTAlignmentError as the reason. Since the allocation happens in terms of physical cores
		// and the scheduler is responsible for ensuring that the workload goes to a node that has enough CPUs,
		// the pod will be placed on a node where there are enough physical cores available to be allocated.
		// Just like the behaviour of the static policy, takeByTopology will first try to allocate CPUs from the same socket,
		// and only in case the request cannot be satisfied on a single socket will the allocation for a workload occupy all
		// CPUs on a physical core. Allocation of individual threads would never have to occur.
		return smtAlignmentError(requestedCPUs, cpusPerCore)
	}
}
I think this logic is better served as an error in the call to Allocate() (which then triggers an admission error since no allocation can be made).
Our initial approach was indeed to perform this check in the call to Allocate, but we pivoted to doing it at the admission stage, because doing it in Allocate implicitly meant we would be admitting pods that could eventually result in an SMTAlignment error. Having said that, we have no issues moving back to performing the check in Allocate; we just wanted to explain why we thought performing the check in the Admit handler was better.
It seems the only advantage having an explicit Admit() handler gives us, is the ability to more precisely set the “Reason”, which I think could be fixed by updating the way Allocate() propagates its errors back.
I’d also be fine only having an Admit() handler and not have failures to Allocate() trigger an admission error (but then I’d want that to be consistent across all of the components that currently rely on Allocate() for this).
It just seems weird to have an explicit Admit() handler for SMTAlignment, and then leave all other admission errors to the Allocate() call.
Got it! Thanks for the clarification.
Thanks Kevin for the quick feedback!
That's a very good point. Let me take a step back to fully answer. When we first received feedback about this work, you raised the question of what's the best option to extend cpumanager, and the majority of the feedback from the community went in the same direction: let's first check what's the direction forward for cpumanager (and, it soon emerged, for extending resource managers in general - memory and topology). So we discussed this in the sig-node meeting on April 20, and we dedicated a good chunk of the discussion to sorting out the extension approach first. While there was general consensus that allowing external policies is a nice property, various concerns emerged about (non-comprehensive list): state management, API surface (including the stability promises), how to properly delegate core resource management (cpu, memory) to external components, and the reliability of the aforementioned components (what if the "cpus" allocator enters a crashloop?). So from our (the proponents of this change) perspective it makes sense to narrow it down as much as possible to fit in the current framework, with focused minimal changes, which is the motivation for the current PR. Now, answering your question: we are indeed proposing a new policy which is intended to be a specialized version of the static policy. The best option seemed to be to encapsulate the changes in a new, as self-contained as possible, policy. But we have some new facts to consider now! (see below).
All considered, this seems a very nice idea. Making minimal changes and adding attributes to the existing static policy to fine-tune its behaviour (which is exactly what we want to do here) seems a nice evolution of the existing framework, and it also plays nicely with the emerging future plans. Let's talk about the UI though. For our use case, a new option seems sufficient. Do you believe this UI would be sufficient to cover the use case(s) you have at NVIDIA? I'll comment on the other points you raised shortly. (cc @kad and @swatisehgal) |
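For reference, based on the `cpuManagerPolicyOptions` name used in the commits of this PR, the kubelet configuration could look roughly like this (a sketch; the exact schema and option spelling may differ from what finally ships):

```yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
```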
This would work for us. In fact, your change + ours could work together (which is why I think it makes more sense as a set of attributes than a new policy). I'm OK with it being a new flag like this, though if we were to go back in time, I would likely add all Maybe it's still an option to do something like this instead? I'm not that up-to-speed on what the best practices around new |
This is a fair point and something we can explore. However I believe (and expect) that changing existing options breaks backward compatibility, so our options are likely limited here. But let's check indeed. |
|
Putting this in "Waiting on Author" as tests are not passing. |
|
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project. |
|
/test pull-kubernetes-e2e-gce-ubuntu-containerd |
|
/test pull-kubernetes-integration |
1 similar comment
|
/test pull-kubernetes-integration |
|
the integration lane failures seem to be way less likely in the EU (GMT+0/1) morning, aka the NA night. We had some success in the earlier runs today (around 9:00 GMT+1); now it is failing badly again. |
|
/retest |
1 similar comment
|
/retest |
|
Are we certain the integration test failures here are a flake? I haven't seen these sorts of repeated failures on other PRs. |
yes. see #103512 and https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=integration&test=TestMain |
In addition to what Jordan wrote, the last three uploads, which are failing pretty consistently, were rebases to address conflicts in |
|
@klueska just recording that I agree, and I feel as I did earlier during the KEP review that we did not need an additional flag at this time. Thanks for summarizing above and helping shepherd the review. |
|
This is fully approved before the deadline and blocked on a failing test so I don't think it will require an exception, but @fromanirh filed one just in case: https://groups.google.com/g/kubernetes-sig-node/c/19yfkAEAwWU/m/kFaaazlOAwAJ /milestone v1.22 |
|
/retest |
In this patch we enhance the kubelet configuration to support cpuManagerPolicyOptions. In order to introduce SMT-awareness in the CPU Manager, we introduce a new Kubelet flag, `cpumanager-policy-options`, to allow the user to modify the behaviour of the static policy to strictly guarantee allocation of whole cores. Co-authored-by: Francesco Romani <fromani@redhat.com> Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Files generated after running `make generated_files`. Co-authored-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Francesco Romani <fromani@redhat.com>
The CPUManagerPolicyOptions received from the kubelet config/command line args are propagated to the Container Manager. We defer the consumption of the options to a later patch(set). Co-authored-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Francesco Romani <fromani@redhat.com>
Introduce a new `admission` subpackage to factor out the responsibility of creating `PodAdmitResult` objects. This enables resource managers to report specific errors in Allocate() and to bubble them up in the relevant fields of the `PodAdmitResult`. To demonstrate the approach we refactor TopologyAffinityError as a proper error. Co-authored-by: Kevin Klues <kklues@nvidia.com> Co-authored-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Francesco Romani <fromani@redhat.com>
Consume the cpu manager policy options from the cpumanager instance in the static policy. In the none policy, validate that no option is given, and fail otherwise: this is almost surely a configuration mistake. Add a new cpumanager.Options type to hold the options and translate from user arguments to flags. Co-authored-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Francesco Romani <fromani@redhat.com>
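A rough sketch of how such an options type could translate the user-supplied string map into typed flags. The option name `full-pcpus-only` matches the e2e tests in this PR; every other name here is illustrative, not the actual kubelet code:

```go
package main

import "fmt"

// FullPCPUsOnlyOption is the option name exercised by the e2e tests in
// this PR; the remaining identifiers are assumptions for this sketch.
const FullPCPUsOnlyOption = "full-pcpus-only"

// Options holds the typed view of the user-supplied policy options.
type Options struct {
	FullPhysicalCPUsOnly bool
}

// NewOptions translates the raw string map into Options, rejecting
// unknown names and malformed values, which are almost surely
// configuration mistakes.
func NewOptions(policyOptions map[string]string) (Options, error) {
	opts := Options{}
	for name, value := range policyOptions {
		switch name {
		case FullPCPUsOnlyOption:
			switch value {
			case "true":
				opts.FullPhysicalCPUsOnly = true
			case "false":
				opts.FullPhysicalCPUsOnly = false
			default:
				return opts, fmt.Errorf("bad value %q for option %q", value, name)
			}
		default:
			return opts, fmt.Errorf("unsupported cpumanager option: %q", name)
		}
	}
	return opts, nil
}

func main() {
	opts, err := NewOptions(map[string]string{FullPCPUsOnlyOption: "true"})
	fmt.Println(opts, err)
}
```

Failing hard on unknown options keeps misconfigurations visible at kubelet startup instead of silently changing allocation behaviour.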
Add e2e tests to cover the basic flows for the `full-pcpus-only` option: a negative flow to ensure rejection with a proper error message, and a positive flow to verify the actual cpu allocation. Co-authored-by: Swati Sehgal <swsehgal@redhat.com> Signed-off-by: Francesco Romani <fromani@redhat.com>
|
rebased to fix the conflict |
|
/lgtm |
|
thanks @klueska :) |
|
/retest |
|
phew, the automated retry made it! It seems #103512 is load-related and less likely to happen overnight. |

What type of PR is this?
/kind feature
What this PR does / why we need it:
We propose a change in cpumanager to make the behaviour of latency-sensitive applications more predictable when running on SMT-enabled systems. The existing static policy is allowed to allocate single cpus (aka virtual threads on SMT-enabled systems), and this can cause physical core sharing, which in turn is a major cause of noisy neighbours, which harm the behaviour of latency-sensitive applications.
We add extra constraints to the aforementioned policy, while preserving all its properties, to avoid physical core sharing.
Which issue(s) this PR fixes:
Fixes # N/A
Special notes for your reviewer:
This PR contains E2E tests as well.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.