
node: cpumanager: add options to reject non SMT-aligned workload #101432

Merged
k8s-ci-robot merged 6 commits into kubernetes:master from swatisehgal:smtaware
Jul 9, 2021

Conversation

@swatisehgal
Contributor

@swatisehgal swatisehgal commented Apr 23, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

We propose a change in cpumanager to make the behaviour of latency-sensitive applications more predictable when running on SMT-enabled systems. The existing static policy is allowed to allocate single CPUs (aka virtual threads on SMT-enabled systems), which can cause physical core sharing; this sharing is a major source of noisy-neighbour effects and harms the behaviour of latency-sensitive applications.

We add extra constraints to the aforementioned policy, while preserving all its properties, to avoid physical core sharing.
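
For illustration (this is a sketch, not code from this PR): the constraint boils down to requiring that a guaranteed container's exclusive CPU request is a multiple of the number of hardware threads per physical core, so allocations always cover whole cores.

```go
package main

import "fmt"

// isSMTAligned reports whether an exclusive CPU request can be satisfied
// with whole physical cores only.
func isSMTAligned(requestedCPUs, threadsPerCore int) bool {
	return requestedCPUs%threadsPerCore == 0
}

func main() {
	// On a typical SMT-enabled x86 system there are 2 hardware threads per core.
	fmt.Println(isSMTAligned(3, 2)) // false -> such a pod would be rejected
	fmt.Println(isSMTAligned(4, 2)) // true  -> such a pod can be admitted
}
```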

Which issue(s) this PR fixes:

Fixes # N/A

Special notes for your reviewer:

This PR contains E2E tests as well.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet labels Apr 23, 2021
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 23, 2021
@swatisehgal
Contributor Author

/cc @fromanirh

@ffromani
Contributor

ffromani commented Apr 23, 2021

For some more context: at the April 20 sig-node meeting we presented this work, but also talked in general about how to extend cpumanager. We thought about demonstrating a prototype, but this change is pretty small and self-contained, so we're presenting the actual complete PR alongside the KEP (kubernetes/enhancements#2626).

To elaborate on the magnitude and scope of the changes: the size label (XXL) is a bit deceptive, because the code changes are actually small:

$ git log -6 -p | diffstat | grep -v test
 b/pkg/kubelet/cm/container_manager_linux.go         |   23 
 b/pkg/kubelet/cm/container_manager_stub.go          |    2 
 b/pkg/kubelet/cm/cpumanager/cpu_manager.go          |   90 +-
 b/pkg/kubelet/cm/cpumanager/fake_cpu_manager.go     |   10 
 b/pkg/kubelet/cm/cpumanager/policy.go               |    5 
 b/pkg/kubelet/cm/cpumanager/policy_none.go          |    9 
 b/pkg/kubelet/cm/cpumanager/policy_smtaware.go      |   85 ++
 b/pkg/kubelet/cm/cpumanager/policy_static.go        |    7 
 b/pkg/kubelet/cm/fake_container_manager.go          |    2 

The vast majority of the changes are about tests:

$ git log -6 -p | diffstat | grep test
 /pkg/kubelet/cm/cpumanager/policy_smtaware_test.go  |  840 ++++++++++++++++++++
 b/pkg/kubelet/cm/cpumanager/cpu_manager_test.go     |  158 +++
 b/pkg/kubelet/cm/cpumanager/policy_smtaware_test.go |   44 -
 pkg/kubelet/cm/cpumanager/cpu_manager_test.go       |    5 

Contributor

@klueska klueska left a comment


Left a few comments. If you follow my suggestions, I think the surface area of the change could actually be reduced quite a bit.

That said -- how do you plan to expose this policy to the end user?
As a new value to the --cpu-manager-policy kubelet flag?
None of those changes seem to be part of this PR.

I'm asking because...

This is still technically a static cpu-manager-policy. This name is meant to represent the fact that the cpusets granted to pods in the guaranteed QoS class are static, i.e. never change. Your new policy still meets that criterion, but just tweaks it slightly to force CPUs to be requested in whole-core multiples.

As such, does it maybe make sense to introduce a new flag for something like this?
At NVIDIA, we also have a small tweak to the static policy on our internal fork, which I never proposed upstream because I didn't know how best to fit it into the existing framework.

If we had a new flag (or a way of attaching attributes to the existing flag), I think that would work a lot better. It would also mean that we wouldn't need to define a whole new policy, but instead pass an attributes struct to the existing static policy with knobs for how to customize it.

Comment on lines +478 to +497
func (m *manager) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
klog.InfoS("CPUManager Admit Handler")
pod := attrs.Pod
return m.policy.Admit(pod)
}

func smtAlignmentError(requestedCPUs, cpusPerCore int) lifecycle.PodAdmitResult {
errorMessage := fmt.Sprintf("Number of CPUs requested should be a multiple of number of CPUs on a core = %d on this system. Requested CPU count = %d", cpusPerCore, requestedCPUs)
return lifecycle.PodAdmitResult{
Message: errorMessage,
Reason: "SMTAlignmentError",
Admit: false,
}
}

func admitPod() lifecycle.PodAdmitResult {
return lifecycle.PodAdmitResult{
Admit: true,
}
}
Contributor


I don't think we want a new Admit handler here.

We can consider reworking the way Allocate() propagates errors to the existing Admit handler, but I'm not in favor of adding a new one like this.

Comment on lines +69 to +83
cpusPerCore := p.topology.CPUsPerCore()
for _, container := range append(pod.Spec.InitContainers, pod.Spec.Containers...) {
if requestedCPUs := p.guaranteedCPUs(pod, &container); requestedCPUs > 0 && (requestedCPUs%cpusPerCore) != 0 {
// Since the CPU Manager has been enabled with the `smtaware` policy, a guaranteed pod can only be admitted
// if the CPU request is a multiple of the number of virtual cpus (hardware threads) per physical core.
// If the CPU request is not such a multiple, the Pod is put in the Failed state with SMTAlignmentError
// as the reason. Since the allocation happens in terms of physical cores, and the scheduler is responsible
// for ensuring that the workload goes to a node that has enough CPUs, the pod will be placed on a node
// where there are enough physical cores available to be allocated.
// Just like with the static policy, takeByTopology will first try to allocate CPUs from the same socket;
// only if the request cannot be satisfied on a single socket does the allocation spread out, and the workload
// always occupies all CPUs on each physical core it uses. Allocation of individual threads never has to occur.
return smtAlignmentError(requestedCPUs, cpusPerCore)
}
}
Contributor


I think this logic is better served as an error in the call to Allocate() (which then triggers an admission error since no allocation can be made).

Contributor Author

@swatisehgal swatisehgal Apr 26, 2021


Our initial approach was indeed to perform this check in the call to Allocate, but we pivoted to doing it at the admission stage because the Allocate approach implicitly meant that we would be admitting pods that could eventually fail with an SMTAlignment error. Having said that, we have no issues moving back to performing the check in Allocate, but wanted to explain why we thought performing the check in the Admit handler was better.

Contributor


It seems the only advantage an explicit Admit() handler gives us is the ability to more precisely set the “Reason”, which I think could be fixed by updating the way Allocate() propagates its errors back.

I’d also be fine with only having an Admit() handler and not having failures in Allocate() trigger an admission error (but then I’d want that to be consistent across all of the components that currently rely on Allocate() for this).

It just seems weird to have an explicit Admit() handler for SMTAlignment, and then leave all other admission errors to the Allocate() call.

Contributor Author


Got it! Thanks for the clarification.
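
As a hedged sketch of the direction being discussed here (the names `SMTAlignmentError` and `GetPodAdmitResult` are illustrative, not necessarily the final upstream API): Allocate() returns a typed error, and a shared helper translates it into a `PodAdmitResult`, preserving a precise Reason without needing a dedicated Admit() handler.

```go
package admission

import (
	"fmt"

	"k8s.io/kubernetes/pkg/kubelet/lifecycle"
)

// SMTAlignmentError would be returned by Allocate() when a request cannot be
// satisfied with whole physical cores.
type SMTAlignmentError struct {
	RequestedCPUs int
	CpusPerCore   int
}

func (e SMTAlignmentError) Error() string {
	return fmt.Sprintf("SMT Alignment Error: requested %d cpus not multiple cpus per core = %d", e.RequestedCPUs, e.CpusPerCore)
}

// Type exposes a machine-readable reason for the admission result.
func (e SMTAlignmentError) Type() string {
	return "SMTAlignmentError"
}

// GetPodAdmitResult maps an allocation error onto an admission result,
// using the error's Type() as the Reason when it provides one.
func GetPodAdmitResult(err error) lifecycle.PodAdmitResult {
	if err == nil {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	reason := "UnexpectedAdmissionError"
	if typed, ok := err.(interface{ Type() string }); ok {
		reason = typed.Type()
	}
	return lifecycle.PodAdmitResult{
		Message: err.Error(),
		Reason:  reason,
		Admit:   false,
	}
}
```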

@ffromani
Contributor

Left a few comments. If you follow my suggestions, I think the surface area of the change could actually be reduced quite a bit.

Thanks Kevin for the quick feedback!

That said -- how do you plan to expose this policy to the end user?
As a new value to the --cpu-manager-policy kubelet flag?
None of those changes seem to be part of this PR.

I'm asking because...

This is still technically a static cpu-manager-policy. This name is meant to represent the fact that the cpusets granted to pods in the guaranteed QoS class are static, i.e. never change. Your new policy still meets that criterion, but just tweaks it slightly to force CPUs to be requested in whole-core multiples.

That's a very good point. Let me take a step back to fully answer.

When we first received feedback about this work, you raised the question of what the best way is to extend cpumanager. The majority of the feedback from the community pointed in the same direction: let's first check what the direction forward is for cpumanager (and, as it soon emerged, for extending the resource managers in general - memory and topology).

So we discussed this in the sig-node meeting on April 20, and we dedicated a good chunk of the discussion to sorting out the extension approach first.

While there was general consensus that allowing external policies is a nice property, various concerns emerged about (non-comprehensive list): state management, API surface (including the stability promises), how to properly delegate core resource management (cpu, memory) to external components, and the reliability of the aforementioned components (what if the "cpus" allocator enters a crashloop?).
It seems there is a consensus forming towards moving the resource management responsibilities into the runtime and towards NRI.

So from our perspective (as proponents of this change) it makes sense to narrow it down as much as possible to fit in the current framework, with focused, minimal changes, which is the motivation for the current PR.

Now, answering your question: we are indeed proposing a new policy which is meant to be a specialized version of the static policy. The best option seemed to be to encapsulate the changes in a new, as self-contained as possible, policy. But we have some new facts to consider now! (see below).

As such, does it maybe make sense to introduce a new flag for something like this?
At NVIDIA, we also have a small tweak to the static policy on our internal fork, which I never proposed upstream because I didn't know how best to fit it into the existing framework.

If we had a new flag (or a way of attaching attributes to the existing flag), I think that would work a lot better. It would also mean that we wouldn't need to define a whole new policy, but instead pass an attributes struct to the existing static policy with knobs for how to customize it.

All considered, this seems a very nice idea. Making minimal changes and adding attributes to the existing static policy to fine-tune its behaviour (which is exactly what we want to do here) seems a nice evolution of the existing framework, and it also plays nicely with the emerging future plans.

Let's talk about the UI though. For our use case, a new option like

--cpu-manager-policy-options=[flag1],[flag2],[...flagn]

seems sufficient. For this work it could look like

--cpu-manager-policy-options=smtaware

Do you believe this UI would be sufficient to cover the use case(s) you have in NVIDIA?

I'll comment the other points you raised shortly.

(cc @kad and @swatisehgal)

@klueska
Contributor

klueska commented Apr 26, 2021

--cpu-manager-policy-options=[flag1],[flag2],[...flagn]

seems sufficient. For this work it could look like

--cpu-manager-policy-options=smtaware

Do you believe this UI would be sufficient to cover the use case(s) you have in NVIDIA?

This would work for us. In fact, your change + ours could work together (which is why I think it makes more sense as a set of attributes than a new policy).

I'm OK with it being a new flag like this, though if we were to go back in time, I would likely add all CPUManager options as entries in a KubeletConfiguration (and not have any actual flags for them):

cpuManager:
    ReconcilePeriod: 
    Policy:
        Name:
        Options:

Maybe it's still an option to do something like this instead? I'm not that up-to-speed on what the best-practices around new kubelet flags / options are these days. Maybe someone else with more familiarity can comment.

@ffromani
Contributor

--cpu-manager-policy-options=[flag1],[flag2],[...flagn]

seems sufficient. For this work it could look like

--cpu-manager-policy-options=smtaware

Do you believe this UI would be sufficient to cover the use case(s) you have in NVIDIA?

This would work for us. In fact, your change + ours could work together (which is why I think it makes more sense as a set of attributes than a new policy).

I'm OK with it being a new flag like this, though if we were to go back in time, I would likely add all CPUManager options as entries in a KubeletConfiguration (and not have any actual flags for them):

cpuManager:
    ReconcilePeriod: 
    Policy:
        Name:
        Options:

Maybe it's still an option to do something like this instead? I'm not that up-to-speed on what the best-practices around new kubelet flags / options are these days. Maybe someone else with more familiarity can comment.

This is a fair point and something we can explore. However, I believe (and expect) that reworking existing options like this breaks backward compatibility, so our options are likely limited here. But let's check indeed.
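
For illustration, the shape this takes once policy options land in the kubelet configuration (the option key exercised by this PR's e2e tests is `full-pcpus-only`; the exact field layout below is a sketch, not authoritative) looks roughly like:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# map of policy-specific options; full-pcpus-only rejects workloads whose
# exclusive CPU request is not a multiple of the hardware threads per core
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
```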

@ehashman
Member

Putting this in "Waiting on Author" as tests are not passing.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 29, 2021
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@ffromani
Contributor

ffromani commented Jul 8, 2021

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@ffromani
Contributor

ffromani commented Jul 8, 2021

interesting:
[screenshot attached in the original comment]

@ffromani
Contributor

ffromani commented Jul 8, 2021

/test pull-kubernetes-integration

1 similar comment
@ffromani
Contributor

ffromani commented Jul 8, 2021

/test pull-kubernetes-integration

@ffromani
Contributor

ffromani commented Jul 8, 2021

the integration lane failures seem to be way less likely in the EU morning (GMT+0/1), aka the NA night. We had some success in the earlier runs today (around 9:00 GMT+1); now it is failing badly again.

@ffromani
Contributor

ffromani commented Jul 8, 2021

/retest

1 similar comment
@ffromani
Contributor

ffromani commented Jul 8, 2021

/retest

@ehashman
Member

ehashman commented Jul 8, 2021

Are we certain the integration test failures here are a flake? I haven't seen these sorts of repeated failures on other PRs.

@liggitt
Member

liggitt commented Jul 8, 2021

Are we certain the integration test failures here are a flake? I haven't seen these sorts of repeated failures on other PRs.

yes. see #103512 and https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=integration&test=TestMain

@ffromani
Contributor

ffromani commented Jul 8, 2021

Are we certain the integration test failures here are a flake? I haven't seen these sorts of repeated failures on other PRs.

In addition to what Jordan wrote, the last three pushes, which are failing pretty consistently, were rebases to address conflicts in pkg/features.

@derekwaynecarr
Member

@klueska just recording that I agree, and feel as I did earlier during the KEP review, that we did not need an additional flag at this time. Thanks for summarizing above and helping shepherd the review.

see:
https://github.com/kubernetes/enhancements/pull/2626/files#diff-ba3f042638632df15dfd028bd719d2a645d6f5681b1ccbae355320315e3a96a9R238

@ehashman
Member

ehashman commented Jul 8, 2021

This is fully approved before the deadline and blocked only on a failing test, so I don't think it will require an exception, but @fromanirh filed one just in case: https://groups.google.com/g/kubernetes-sig-node/c/19yfkAEAwWU/m/kFaaazlOAwAJ

/milestone v1.22

@ffromani
Contributor

ffromani commented Jul 8, 2021

/retest

swatisehgal and others added 6 commits July 8, 2021 23:14
In this patch we enhance the kubelet configuration to support
cpuManagerPolicyOptions.

In order to introduce SMT awareness in the CPU Manager, we introduce a
new Kubelet flag, `cpumanager-policy-options`, which allows the user to
modify the behaviour of the static policy to strictly guarantee the
allocation of whole cores.

Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Files generated after running `make generated_files`.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
The CPUManagerPolicyOptions received from the kubelet config/command line args
are propagated to the Container Manager.

We defer the consumption of the options to a later patch(set).

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Introduce a new `admission` subpackage to factor out the responsibility
of creating `PodAdmitResult` objects. This enables the resource managers
to report specific errors in Allocate() and to bubble them up in the
relevant fields of the `PodAdmitResult`.

To demonstrate the approach we refactor TopologyAffinityError as a
proper error.

Co-authored-by: Kevin Klues <kklues@nvidia.com>
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Consume the cpu manager policy options from the cpumanager instance in
the static policy.
In the none policy, validate that no options are given, and fail if any
are - this is almost surely a configuration mistake.

Add a new cpumanager.Options type to hold the options and translate from
user arguments to flags.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add e2e tests to cover the basic flows for the `full-pcpus-only` option:
negative flow to ensure rejection with proper error message, and
positive flow to verify the actual cpu allocation.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
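
To make the negative flow above concrete, a sketch (an illustrative pod, not the actual e2e fixture) of the kind of guaranteed pod that gets rejected on a node with 2 hardware threads per core when `full-pcpus-only` is enabled - the odd CPU request cannot be mapped onto whole physical cores, so admission fails with reason SMTAlignmentError:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: smt-unaligned
spec:
  containers:
  - name: app
    image: k8s.gcr.io/pause:3.5   # any image; what matters is the CPU request
    resources:
      requests:
        cpu: "3"        # not a multiple of 2 threads per core -> rejected
        memory: 64Mi
      limits:
        cpu: "3"
        memory: 64Mi
```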
@ffromani
Contributor

ffromani commented Jul 8, 2021

rebased to fix the conflict

@klueska
Contributor

klueska commented Jul 8, 2021

/lgtm

@ffromani
Contributor

ffromani commented Jul 8, 2021

thanks @klueska :)

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@ffromani
Copy link
Copy Markdown
Contributor

ffromani commented Jul 9, 2021

phew, the automated retry made it! It seems #103512 is load-related and less likely to happen overnight.


Labels

api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Status: API review completed, 1.22
Archived in project
