Enforce Node Allocatable via cgroups#41234
Conversation
ba4acf0 to
8c7160c
Compare
6158204 to
703fb50
Compare
|
@derekwaynecarr @dashpole @sjenning Node e2e (finally) passes for this PR. I'd appreciate a quick review since @dashpole intends to add evictions on top of this. |
derekwaynecarr
left a comment
There was a problem hiding this comment.
A number of comments and requests for changes throughout. I was confused on the getNodeAllocatable stuff, so if we can name or structure things a little differently, I will be less confused when we look again at the code.
| fs.Var(&s.SystemReserved, "system-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]") | ||
| fs.Var(&s.KubeReserved, "kube-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for kubernetes system components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]") | ||
|
|
||
| fs.StringSliceVar(&s.EnforceNodeAllocatable, "enforce-node-allocatable", s.EnforceNodeAllocatable, "A comma separated list of levels of node allocatable enforcement to be enforced by kubelet. Acceptible options are 'pods', 'system-reserved' & 'kube-reserved'. If the latter two options are specified, '--system-reserved-cgroup' & '--kube-reserved-cgroup' must also be set respectively. See https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md for more details. [default='']") |
There was a problem hiding this comment.
i thought default was going to be 'pods'
also does the flag library not pull and display the default for us, or is there something about StringSliceVar that I am missing? I would need to verify to check, but for most other things, the default is rendered based on defaults.go value.
There was a problem hiding this comment.
It should render the defaults, but I suspect the default string value is meant for automated docs.
| if !s.CgroupsPerQOS && len(s.EnforceNodeAllocatable) > 0 { | ||
| return fmt.Errorf("Node Allocatable enforcement is not supported unless Cgroups Per QOS feature is turned on") | ||
| } | ||
| if s.SystemCgroups != "" && s.CgroupRoot == "" { |
There was a problem hiding this comment.
out of curiosity, are you making use of system cgroups flag? wondering if we can just deprecate it in the future and have the operator take over the responsibility of parenting daemons correctly from the start.
There was a problem hiding this comment.
It is still being used in the old debian based distro that GKE uses. We can deprecate it once GKE stops using that distro.
There was a problem hiding this comment.
sgtm - not needed by this pr, but we should probably put a big warning around those flags that you should not use them if your init system already places your stuff in a cgroup. maybe even error out completely if they are used and you run systemd.
| NodeAllocatableConfig: cm.NodeAllocatableConfig{ | ||
| KubeReservedCgroupName: s.KubeReservedCgroup, | ||
| SystemReservedCgroupName: s.SystemReservedCgroup, | ||
| EnforceNodeAllocatable: sets.NewString(s.EnforceNodeAllocatable...), |
There was a problem hiding this comment.
its possible we validate this value later, but just recording here for reference that we don't validate the items in this list are actual valid options. i also wonder if we want to treat this is an opaque set of strings, or a more concrete type.
There was a problem hiding this comment.
ack that this is now validated.
| zeroDuration = metav1.Duration{} | ||
| // Refer to [Node Allocatable](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) doc for more information. | ||
| // TODO: Set the default to "pods" once cgroups per qos is turned on by default. | ||
| defaultNodeAllocatableEnforcement = []string{} |
There was a problem hiding this comment.
ah, i see! ignore my earlier comment ;-)
| nodeInfo *v1.Node | ||
| cgroupManager CgroupManager | ||
| capacity v1.ResourceList | ||
| cgroupRoot string |
There was a problem hiding this comment.
can you leave a comment just calling out that this is the literal cgroupfs value so its not forgotten?
| return &rc | ||
| } | ||
|
|
||
| func (cm *containerManagerImpl) getNodeAllocatableInternal(includeHardEviction bool) v1.ResourceList { |
There was a problem hiding this comment.
is there a test case for this? i need to pull and physically test your pr to make sure it does what i think it is doing.
There was a problem hiding this comment.
i thought we need to handle two cases:
- allocatable reported to scheduler which was capacity - reservations
- allocatable enforced via cgroup which was min(capacity - reservations + reservation, capacity)
There was a problem hiding this comment.
after more thought, i think this is right.
so for the 32Gi, 2 gi reserve, 100Mi threshold, this would given 29.9Gi allocatable, and 30Gi enforced. sorry, its late on a friday and this is complicated ;-)
i think the issue is its usage of includeHardEviction..
it would be clearer if called something like: getNodeAllocatableFor(scheduling | enforcement)
There was a problem hiding this comment.
Renaming sounds useful. I will make the changes. Unit tests do exist.
|
|
||
| // Enforce Node Allocatable (if required) | ||
| if err := enforceNodeAllocatableCgroups(cm.NodeConfig.NodeAllocatableConfig, | ||
| cm.getNodeAllocatableInternal(false), // ignore eviction thresholds. |
There was a problem hiding this comment.
i thought kubepods needs to have allocatable = capacity - reserved + thresholds enforced. right now, this appears to not give the buffer discussed.
| NodeConfig: nc, | ||
| capacity: tc.capacity, | ||
| } | ||
| for k, v := range cm.GetNodeAllocatable() { |
There was a problem hiding this comment.
this is confusing me. see my earlier comment on the internal version of the method, but this doesn't appear correct, and at minimum is not testing both patterns where we pass true or false to the operation.
There was a problem hiding this comment.
after further reflection, i think we just need a test that tests the enfored values and a test that verifies the reported to scheduler value so its clearer. i think this is correct for the as reported to scheduler case.
There was a problem hiding this comment.
Ack. I can split up the tests accordingly tomorrow.
| hardThreshold evictionapi.ThresholdValue | ||
| }{ | ||
| { | ||
| kubeReserved: getResourceList("100m", "100Mi"), |
There was a problem hiding this comment.
add a case for the 0 expected resources scenario when thresholds cause us to exceed capacity.
There was a problem hiding this comment.
Adding a check in code and a test case
| var _ = framework.KubeDescribe("Kubelet Container Manager [Serial]", func() { | ||
| f := framework.NewDefaultFramework("kubelet-container-manager") | ||
|
|
||
| Describe("Validate Node Allocatable", func() { |
There was a problem hiding this comment.
@sjenning -- this seems like a pattern you could follow for your work in e2e
dashpole
left a comment
There was a problem hiding this comment.
Took me a while, but I think I understand the changes. Other than @derekwaynecarr's comments about validation, I found the allocatable code a bit difficult to read. I think we need to make sure that everything is either a "threshold", "reservation", "limit", or "resource", since an "allocatable" doesn't have a concrete meaning.
| fs.Int64Var(&s.MaxOpenFiles, "max-open-files", s.MaxOpenFiles, "Number of files that can be opened by Kubelet process. [default=1000000]") | ||
| fs.Var(&s.SystemReserved, "system-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]") | ||
| fs.Var(&s.KubeReserved, "kube-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for kubernetes system components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]") | ||
|
|
| // Node Allocatable Flags | ||
| fs.Var(&s.SystemReserved, "system-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]") | ||
| fs.Var(&s.KubeReserved, "kube-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for kubernetes system components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]") | ||
|
|
| } | ||
|
|
||
| // GetNodeAllocatable returns amount of compute resource available for pods. | ||
| func (cm *containerManagerImpl) GetNodeAllocatable() v1.ResourceList { |
There was a problem hiding this comment.
Since this is an eviction threshold, I would like to see the word threshold in this function name
There was a problem hiding this comment.
Or on second thought, GetNodeAllocatableResources() would also work.
There was a problem hiding this comment.
This is the real NodeAllocatable from system and users standpoint.
The internal one is just meant for enforcing the real NodeAllocatable within kubelet.
If we were to rename, I'd rename the internal one.
| return &rc | ||
| } | ||
|
|
||
| func (cm *containerManagerImpl) getNodeAllocatableInternal(includeHardEviction bool) v1.ResourceList { |
There was a problem hiding this comment.
nit: I would love a comment explaining the purpose of this function.
| return nil | ||
| } | ||
|
|
||
| func enforceExistingCgroup(cgroupManager CgroupManager, cName string, rl v1.ResourceList) error { |
There was a problem hiding this comment.
nit: comment explaining function
|
Err. rebase :( |
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
Signed-off-by: Vishnu kannan <vishnuk@google.com>
Signed-off-by: Vishnu kannan <vishnuk@google.com>
Signed-off-by: Vishnu kannan <vishnuk@google.com>
|
Bumping the priority further up because there are about 4+ PRs that are built on top of this which needs to go into v1.6 |
|
adding LGTM label after rebasing. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED The following people have approved this PR: derekwaynecarr, k8s-merge-robot, vishh Needs approval from an approver in each of these OWNERS Files:
We suggest the following people: |
|
Automatic merge from submit-queue |
Automatic merge from submit-queue Critial pod test uses allocatable instead of capacity This solves #42239. When this test was first introduced, pods could request up to the capacity of the node. With the addition of allocatable introduced in #41234, this is no longer the case, and pods can only use up to allocatable. This should be included in 1.6, as it is a bug related to a 1.6 feature. cc @vish @yujuhong
Automatic merge from submit-queue Eviction Manager Enforces Allocatable Thresholds This PR modifies the eviction manager to enforce node allocatable thresholds for memory as described in kubernetes/community#348. This PR should be merged after #41234. cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests @vishh ** Why is this a bug/regression** Kubelet uses `oom_score_adj` to enforce QoS policies. But the `oom_score_adj` is based on overall memory requested, which means that a Burstable pod that requested a lot of memory can lead to OOM kills for Guaranteed pods, which violates QoS. Even worse, we have observed system daemons like kubelet or kube-proxy being killed by the OOM killer. Without this PR, v1.6 will have node stability issues and regressions in an existing GA feature `out of Resource` handling.
| // Setup event recorder if required. | ||
| makeEventRecorder(&s.KubeletConfiguration, kubeDeps, nodeName) |
There was a problem hiding this comment.
Hi @vishh , while reading the source code related EventRecorder in kubelet, I noticed that makeEventRecorder is called twice in the run function: one before ContainerManager initialization and once in the RunKubelet function. So i'm here to ask why makeEventRecorder is called before ContainerManager initialization. Thank you! 🙏
This PR enforces node allocatable across all pods using a top level cgroup as described in kubernetes/community#348
This PR also provides an option to enforce
kubeReservedandsystemReservedon user specified cgroups.This PR will by default make kubelet create top level cgroups even if
kubeReservedandsystemReservedis not specified and henceAllocatable = Capacity.cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests
TODO:
@dashpole is working on adding support for evictions for enforcing Node allocatable more gracefully. That work will show up in a subsequent PR for v1.6