
node: cpumanager: add options to reject non SMT-aligned workload #101432

Merged
k8s-ci-robot merged 6 commits into kubernetes:master from swatisehgal:smtaware
Jul 9, 2021

Conversation

@swatisehgal
Contributor

@swatisehgal swatisehgal commented Apr 23, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

We propose a change in cpumanager to make the behaviour of latency-sensitive applications more predictable when running on SMT-enabled systems. The existing static policy is allowed to allocate single CPUs (aka virtual threads on SMT-enabled systems), which can cause physical core sharing; this sharing is a major source of noisy-neighbour effects and harms the behaviour of latency-sensitive applications.

We add extra constraints to the aforementioned policy, while preserving all its properties, to avoid physical core sharing.
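
For illustration (this is a sketch, not code from this PR): the constraint boils down to requiring that a guaranteed container's exclusive CPU request is a multiple of the number of hardware threads per physical core, so allocations always cover whole cores.

```go
package main

import "fmt"

// isSMTAligned reports whether an exclusive CPU request can be satisfied
// with whole physical cores only.
func isSMTAligned(requestedCPUs, threadsPerCore int) bool {
	return requestedCPUs%threadsPerCore == 0
}

func main() {
	// On a typical SMT-enabled x86 system there are 2 hardware threads per core.
	fmt.Println(isSMTAligned(3, 2)) // false -> such a pod would be rejected
	fmt.Println(isSMTAligned(4, 2)) // true  -> such a pod can be admitted
}
```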

Which issue(s) this PR fixes:

Fixes # N/A

Special notes for your reviewer:

This PR contains E2E tests as well.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet labels Apr 23, 2021
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 23, 2021
@swatisehgal
Contributor Author

/cc @fromanirh

@ffromani
Contributor

ffromani commented Apr 23, 2021

For some more context: at the April 20 sig-node meeting we presented this work, but also talked in general about how to extend cpumanager. We thought about demonstrating a prototype, but this change is pretty small and self-contained, so we're presenting the actual complete PR alongside the KEP (kubernetes/enhancements#2626).

To elaborate on the magnitude and scope of the changes: the size label (XXL) is a bit deceptive, because the code changes are actually small:

$ git log -6 -p | diffstat | grep -v test
 b/pkg/kubelet/cm/container_manager_linux.go         |   23 
 b/pkg/kubelet/cm/container_manager_stub.go          |    2 
 b/pkg/kubelet/cm/cpumanager/cpu_manager.go          |   90 +-
 b/pkg/kubelet/cm/cpumanager/fake_cpu_manager.go     |   10 
 b/pkg/kubelet/cm/cpumanager/policy.go               |    5 
 b/pkg/kubelet/cm/cpumanager/policy_none.go          |    9 
 b/pkg/kubelet/cm/cpumanager/policy_smtaware.go      |   85 ++
 b/pkg/kubelet/cm/cpumanager/policy_static.go        |    7 
 b/pkg/kubelet/cm/fake_container_manager.go          |    2 

The vast majority of the changes are about tests:

$ git log -6 -p | diffstat | grep test
 /pkg/kubelet/cm/cpumanager/policy_smtaware_test.go  |  840 ++++++++++++++++++++
 b/pkg/kubelet/cm/cpumanager/cpu_manager_test.go     |  158 +++
 b/pkg/kubelet/cm/cpumanager/policy_smtaware_test.go |   44 -
 pkg/kubelet/cm/cpumanager/cpu_manager_test.go       |    5 

Contributor

@klueska klueska left a comment


Left a few comments. If you follow my suggestions, I think the surface area of the change could actually be reduced quite a bit.

That said -- how do you plan to expose this policy to the end user?
As a new value to the --cpu-manager-policy kubelet flag?
None of those changes seem to be part of this PR.

I'm asking because...

This is still technically a static cpu-manager-policy. This name is meant to represent the fact that the cpusets granted to pods in the guaranteed QoS class are static, i.e. never change. Your new policy still meets that criterion, but just tweaks it slightly to force CPUs to be requested in whole-core multiples.

As such, does it maybe make sense to introduce a new flag for something like this?
At NVIDIA, we also have a small tweak to the static policy on our internal fork, which I never proposed upstream because I didn't know how best to fit it into the existing framework.

If we had a new flag (or a way of attaching attributes to the existing flag), I think that would work a lot better. It would also mean that we wouldn't need to define a whole new policy, but instead pass an attributes struct to the existing static policy with knobs for how to customize it.

Comment on lines +478 to +497
func (m *manager) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
klog.InfoS("CPUManager Admit Handler")
pod := attrs.Pod
return m.policy.Admit(pod)
}

func smtAlignmentError(requestedCPUs, cpusPerCore int) lifecycle.PodAdmitResult {
errorMessage := fmt.Sprintf("Number of CPUs requested should be a multiple of number of CPUs on a core = %d on this system. Requested CPU count = %d", cpusPerCore, requestedCPUs)
return lifecycle.PodAdmitResult{
Message: errorMessage,
Reason: "SMTAlignmentError",
Admit: false,
}
}

func admitPod() lifecycle.PodAdmitResult {
return lifecycle.PodAdmitResult{
Admit: true,
}
}
Contributor


I don't think we want a new Admit handler here.

We can consider reworking the way Allocate() propagates errors to the existing Admit handler, but I'm not in favor of adding a new one like this.

Comment on lines +69 to +83
cpusPerCore := p.topology.CPUsPerCore()
for _, container := range append(pod.Spec.InitContainers, pod.Spec.Containers...) {
if requestedCPUs := p.guaranteedCPUs(pod, &container); requestedCPUs > 0 && (requestedCPUs%cpusPerCore) != 0 {
// Since the CPU Manager has been enabled with the `smtaware` policy, a guaranteed pod can only be admitted
// if the CPU request is a multiple of the number of virtual cpus (hardware threads) per physical core.
// If the CPU request is not such a multiple, the Pod is put in the Failed state with SMTAlignmentError
// as the reason. Since the allocation happens in terms of physical cores, and the scheduler is responsible
// for ensuring that the workload goes to a node that has enough CPUs, the pod will be placed on a node
// where there are enough physical cores available to be allocated.
// Just like with the static policy, takeByTopology will first try to allocate CPUs from the same socket;
// only if the request cannot be satisfied on a single socket does the allocation spread out, and the workload
// always occupies all CPUs on each physical core it uses. Allocation of individual threads never has to occur.
return smtAlignmentError(requestedCPUs, cpusPerCore)
}
}
Contributor


I think this logic is better served as an error in the call to Allocate() (which then triggers an admission error since no allocation can be made).

Contributor Author

@swatisehgal swatisehgal Apr 26, 2021


Our initial approach was indeed to perform this check in the call to Allocate, but we pivoted to doing it at the admission stage because the Allocate approach implicitly meant that we would be admitting pods that could eventually fail with an SMTAlignment error. Having said that, we have no issues moving back to performing the check in Allocate, but wanted to explain why we thought performing the check in the Admit handler was better.

Contributor


It seems the only advantage an explicit Admit() handler gives us is the ability to more precisely set the “Reason”, which I think could be fixed by updating the way Allocate() propagates its errors back.

I’d also be fine with only having an Admit() handler and not having failures in Allocate() trigger an admission error (but then I’d want that to be consistent across all of the components that currently rely on Allocate() for this).

It just seems weird to have an explicit Admit() handler for SMTAlignment, and then leave all other admission errors to the Allocate() call.

Contributor Author


Got it! Thanks for the clarification.
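
As a hedged sketch of the direction being discussed here (the names `SMTAlignmentError` and `GetPodAdmitResult` are illustrative, not necessarily the final upstream API): Allocate() returns a typed error, and a shared helper translates it into a `PodAdmitResult`, preserving a precise Reason without needing a dedicated Admit() handler.

```go
package admission

import (
	"fmt"

	"k8s.io/kubernetes/pkg/kubelet/lifecycle"
)

// SMTAlignmentError would be returned by Allocate() when a request cannot be
// satisfied with whole physical cores.
type SMTAlignmentError struct {
	RequestedCPUs int
	CpusPerCore   int
}

func (e SMTAlignmentError) Error() string {
	return fmt.Sprintf("SMT Alignment Error: requested %d cpus not multiple cpus per core = %d", e.RequestedCPUs, e.CpusPerCore)
}

// Type exposes a machine-readable reason for the admission result.
func (e SMTAlignmentError) Type() string {
	return "SMTAlignmentError"
}

// GetPodAdmitResult maps an allocation error onto an admission result,
// using the error's Type() as the Reason when it provides one.
func GetPodAdmitResult(err error) lifecycle.PodAdmitResult {
	if err == nil {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	reason := "UnexpectedAdmissionError"
	if typed, ok := err.(interface{ Type() string }); ok {
		reason = typed.Type()
	}
	return lifecycle.PodAdmitResult{
		Message: err.Error(),
		Reason:  reason,
		Admit:   false,
	}
}
```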

@ffromani
Contributor

Left a few comments. If you follow my suggestions, I think the surface area of the change could actually be reduced quite a bit.

Thanks Kevin for the quick feedback!

That said -- how do you plan to expose this policy to the end user?
As a new value to the --cpu-manager-policy kubelet flag?
None of those changes seem to be part of this PR.

I'm asking because...

This is still technically a static cpu-manager-policy. This name is meant to represent the fact that the cpusets granted to pods in the guaranteed QoS class are static, i.e. never change. Your new policy still meets that criterion, but just tweaks it slightly to force CPUs to be requested in whole-core multiples.

That's a very good point. Let me take a step back to fully answer.

When we first received feedback about this work, you raised the question of what the best way is to extend cpumanager. The majority of the feedback from the community pointed in the same direction: let's first check what the direction forward is for cpumanager (and, as it soon emerged, for extending the resource managers in general - memory and topology).

So we discussed this in the sig-node meeting on April 20, and we dedicated a good chunk of the discussion to sorting out the extension approach first.

While there was general consensus that allowing external policies is a nice property, various concerns emerged about (non-comprehensive list): state management, API surface (including the stability promises), how to properly delegate core resource management (cpu, memory) to external components, and the reliability of the aforementioned components (what if the "cpus" allocator enters a crashloop?).
It seems there is a consensus forming towards moving the resource management responsibilities into the runtime and towards NRI.

So from our perspective (as proponents of this change) it makes sense to narrow it down as much as possible to fit in the current framework, with focused, minimal changes, which is the motivation for the current PR.

Now, answering your question: we are indeed proposing a new policy which is meant to be a specialized version of the static policy. The best option seemed to be to encapsulate the changes in a new, as self-contained as possible, policy. But we have some new facts to consider now! (see below).

As such, does it maybe make sense to introduce a new flag for something like this?
At NVIDIA, we also have a small tweak to the static policy on our internal fork, which I never proposed upstream because I didn't know how best to fit it into the existing framework.

If we had a new flag (or a way of attaching attributes to the existing flag), I think that would work a lot better. It would also mean that we wouldn't need to define a whole new policy, but instead pass an attributes struct to the existing static policy with knobs for how to customize it.

All considered, this seems a very nice idea. Making minimal changes and adding attributes to the existing static policy to fine-tune its behaviour (which is exactly what we want to do here) seems a nice evolution of the existing framework, and it also plays nicely with the emerging future plans.

Let's talk about the UI though. For our use case, a new option like

--cpu-manager-policy-options=[flag1],[flag2],[...flagn]

seems sufficient. For this work it could look like

--cpu-manager-policy-options=smtaware

Do you believe this UI would be sufficient to cover the use case(s) you have in NVIDIA?

I'll comment the other points you raised shortly.

(cc @kad and @swatisehgal)

@klueska
Contributor

klueska commented Apr 26, 2021

--cpu-manager-policy-options=[flag1],[flag2],[...flagn]

seems sufficient. For this work it could look like

--cpu-manager-policy-options=smtaware

Do you believe this UI would be sufficient to cover the use case(s) you have in NVIDIA?

This would work for us. In fact, your change + ours could work together (which is why I think it makes more sense as a set of attributes than a new policy).

I'm OK with it being a new flag like this, though if we were to go back in time, I would likely add all CPUManager options as entries in a KubeletConfiguration (and not have any actual flags for them):

cpuManager:
    ReconcilePeriod: 
    Policy:
        Name:
        Options:

Maybe it's still an option to do something like this instead? I'm not that up-to-speed on what the best-practices around new kubelet flags / options are these days. Maybe someone else with more familiarity can comment.

@ffromani
Contributor

--cpu-manager-policy-options=[flag1],[flag2],[...flagn]

seems sufficient. For this work it could look like

--cpu-manager-policy-options=smtaware

Do you believe this UI would be sufficient to cover the use case(s) you have in NVIDIA?

This would work for us. In fact, your change + ours could work together (which is why I think it makes more sense as a set of attributes than a new policy).

I'm OK with it being a new flag like this, though if we were to go back in time, I would likely add all CPUManager options as entries in a KubeletConfiguration (and not have any actual flags for them):

cpuManager:
    ReconcilePeriod: 
    Policy:
        Name:
        Options:

Maybe it's still an option to do something like this instead? I'm not that up-to-speed on what the best-practices around new kubelet flags / options are these days. Maybe someone else with more familiarity can comment.

This is a fair point and something we can explore. However, I believe (and expect) that reworking existing options like this breaks backward compatibility, so our options are likely limited here. But let's check indeed.
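
For illustration, the shape this takes once policy options land in the kubelet configuration (the option key exercised by this PR's e2e tests is `full-pcpus-only`; the exact field layout below is a sketch, not authoritative) looks roughly like:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# map of policy-specific options; full-pcpus-only rejects workloads whose
# exclusive CPU request is not a multiple of the hardware threads per core
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
```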

@ehashman
Member

Putting this in "Waiting on Author" as tests are not passing.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 29, 2021
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@ffromani
Contributor

ffromani commented Jul 8, 2021

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@ffromani
Contributor

ffromani commented Jul 8, 2021

interesting:
[screenshot attached in the original comment]

@ffromani
Contributor

ffromani commented Jul 8, 2021

/test pull-kubernetes-integration

1 similar comment
@ffromani
Contributor

ffromani commented Jul 8, 2021

/test pull-kubernetes-integration

@ffromani
Contributor

ffromani commented Jul 8, 2021

the integration lane failures seem to be way less likely in the EU morning (GMT+0/1), aka the NA night. We had some success in the earlier runs today (around 9:00 GMT+1); now it is failing badly again.

@ffromani
Contributor

ffromani commented Jul 8, 2021

/retest

1 similar comment
@ffromani
Contributor

ffromani commented Jul 8, 2021

/retest

@ehashman
Member

ehashman commented Jul 8, 2021

Are we certain the integration test failures here are a flake? I haven't seen these sorts of repeated failures on other PRs.

@liggitt
Member

liggitt commented Jul 8, 2021

Are we certain the integration test failures here are a flake? I haven't seen these sorts of repeated failures on other PRs.

yes. see #103512 and https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=integration&test=TestMain

@ffromani
Contributor

ffromani commented Jul 8, 2021

Are we certain the integration test failures here are a flake? I haven't seen these sorts of repeated failures on other PRs.

In addition to what Jordan wrote, the last three pushes, which are failing pretty consistently, were rebases to address conflicts in pkg/features.

@derekwaynecarr
Member

@klueska just recording that I agree, and feel as I did earlier during the KEP review, that we did not need an additional flag at this time. Thanks for summarizing above and helping shepherd the review.

see:
https://github.com/kubernetes/enhancements/pull/2626/files#diff-ba3f042638632df15dfd028bd719d2a645d6f5681b1ccbae355320315e3a96a9R238

@ehashman
Member

ehashman commented Jul 8, 2021

This is fully approved before the deadline and blocked only on a failing test, so I don't think it will require an exception, but @fromanirh filed one just in case: https://groups.google.com/g/kubernetes-sig-node/c/19yfkAEAwWU/m/kFaaazlOAwAJ

/milestone v1.22

@ffromani
Contributor

ffromani commented Jul 8, 2021

/retest

swatisehgal and others added 6 commits July 8, 2021 23:14
In this patch we enhance the kubelet configuration to support
cpuManagerPolicyOptions.

In order to introduce SMT awareness in the CPU Manager, we introduce a
new Kubelet flag, `cpumanager-policy-options`, which allows the user to
modify the behaviour of the static policy to strictly guarantee the
allocation of whole cores.

Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Files generated after running `make generated_files`.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
The CPUManagerPolicyOptions received from the kubelet config/command line args
are propagated to the Container Manager.

We defer the consumption of the options to a later patch(set).

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Introduce a new `admission` subpackage to factor out the responsibility
of creating `PodAdmitResult` objects. This enables the resource managers
to report specific errors in Allocate() and to bubble them up in the
relevant fields of the `PodAdmitResult`.

To demonstrate the approach we refactor TopologyAffinityError as a
proper error.

Co-authored-by: Kevin Klues <kklues@nvidia.com>
Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Consume the cpu manager policy options from the cpumanager instance in
the static policy.
In the none policy, validate that no options are given, and fail if any
are - this is almost surely a configuration mistake.

Add a new cpumanager.Options type to hold the options and translate from
user arguments to flags.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add e2e tests to cover the basic flows for the `full-pcpus-only` option:
negative flow to ensure rejection with proper error message, and
positive flow to verify the actual cpu allocation.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
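
To make the negative flow above concrete, a sketch (an illustrative pod, not the actual e2e fixture) of the kind of guaranteed pod that gets rejected on a node with 2 hardware threads per core when `full-pcpus-only` is enabled - the odd CPU request cannot be mapped onto whole physical cores, so admission fails with reason SMTAlignmentError:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: smt-unaligned
spec:
  containers:
  - name: app
    image: k8s.gcr.io/pause:3.5   # any image; what matters is the CPU request
    resources:
      requests:
        cpu: "3"        # not a multiple of 2 threads per core -> rejected
        memory: 64Mi
      limits:
        cpu: "3"
        memory: 64Mi
```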
@ffromani
Contributor

ffromani commented Jul 8, 2021

rebased to fix the conflict

@klueska
Contributor

klueska commented Jul 8, 2021

/lgtm

@ffromani
Contributor

ffromani commented Jul 8, 2021

thanks @klueska :)

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@ffromani
Copy link
Copy Markdown
Contributor

ffromani commented Jul 9, 2021

phew, the automated retry made it! It seems #103512 is load-related and less likely to happen overnight.


Labels

api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Status: API review completed, 1.22
Archived in project
