Skip to content

Enforce Node Allocatable via cgroups#41234

Merged
k8s-github-robot merged 6 commits into
kubernetes:masterfrom
vishh:nodeaphase2
Feb 28, 2017
Merged

Enforce Node Allocatable via cgroups#41234
k8s-github-robot merged 6 commits into
kubernetes:masterfrom
vishh:nodeaphase2

Conversation

@vishh

@vishh vishh commented Feb 10, 2017

Copy link
Copy Markdown
Contributor

This PR enforces node allocatable across all pods using a top level cgroup as described in kubernetes/community#348

This PR also provides an option to enforce kubeReserved and systemReserved on user specified cgroups.

This PR will by default make kubelet create top level cgroups even if kubeReserved and systemReserved is not specified and hence Allocatable = Capacity.

New Kubelet flag `--enforce-node-allocatable` with a default value of `pods` is added which will make kubelet create a top level cgroup for all pods to enforce Node Allocatable. Optionally, `system-reserved` & `kube-reserved` values can also be specified separated by comma to enforce node allocatable on cgroups specified via `--system-reserved-cgroup` & `--kube-reserved-cgroup` respectively. Note the default value of the latter flags are "".
This feature requires a **Node Drain** prior to upgrade failing which pods will be restarted if possible or terminated if they have a `RestartNever` policy.

cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests

TODO:

  • Adjust effective Node Allocatable to subtract hard eviction thresholds
  • Add unit tests
  • Complete pending e2e tests
  • Manual testing
  • Get the proposal merged

@dashpole is working on adding support for evictions for enforcing Node allocatable more gracefully. That work will show up in a subsequent PR for v1.6

@vishh vishh requested a review from dchen1107 February 10, 2017 05:22
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 10, 2017
@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Feb 10, 2017
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 10, 2017
@k8s-reviewable

Copy link
Copy Markdown

This change is Reviewable

@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 13, 2017
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2017
@vishh vishh force-pushed the nodeaphase2 branch 3 times, most recently from ba4acf0 to 8c7160c Compare February 15, 2017 20:52
@k8s-github-robot k8s-github-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 15, 2017
@vishh vishh changed the title [WIP] Enforce Node Allocatable via cgroups Enforce Node Allocatable via cgroups Feb 17, 2017
@vishh vishh added this to the v1.6 milestone Feb 17, 2017
@vishh vishh force-pushed the nodeaphase2 branch 3 times, most recently from 6158204 to 703fb50 Compare February 17, 2017 21:20
@vishh

vishh commented Feb 17, 2017

Copy link
Copy Markdown
Contributor Author

@derekwaynecarr @dashpole @sjenning Node e2e (finally) passes for this PR. I'd appreciate a quick review since @dashpole intends to add evictions on top of this.

@derekwaynecarr derekwaynecarr left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A number of comments and requests for changes throughout. I was confused on the getNodeAllocatable stuff, so if we can name or structure things a little differently, I will be less confused when we look again at the code.

fs.Var(&s.SystemReserved, "system-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]")
fs.Var(&s.KubeReserved, "kube-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for kubernetes system components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]")

fs.StringSliceVar(&s.EnforceNodeAllocatable, "enforce-node-allocatable", s.EnforceNodeAllocatable, "A comma separated list of levels of node allocatable enforcement to be enforced by kubelet. Acceptible options are 'pods', 'system-reserved' & 'kube-reserved'. If the latter two options are specified, '--system-reserved-cgroup' & '--kube-reserved-cgroup' must also be set respectively. See https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md for more details. [default='']")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Acceptable

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought default was going to be 'pods'

also does the flag library not pull and display the default for us, or is there something about StringSliceVar that I am missing? I would need to verify to check, but for most other things, the default is rendered based on defaults.go value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should render the defaults, but I suspect the default string value is meant for automated docs.

Comment thread cmd/kubelet/app/server.go
if !s.CgroupsPerQOS && len(s.EnforceNodeAllocatable) > 0 {
return fmt.Errorf("Node Allocatable enforcement is not supported unless Cgroups Per QOS feature is turned on")
}
if s.SystemCgroups != "" && s.CgroupRoot == "" {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity, are you making use of system cgroups flag? wondering if we can just deprecate it in the future and have the operator take over the responsibility of parenting daemons correctly from the start.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still being used in the old debian based distro that GKE uses. We can deprecate it once GKE stops using that distro.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm - not needed by this pr, but we should probably put a big warning around those flags that you should not use them if your init system already places your stuff in a cgroup. maybe even error out completely if they are used and you run systemd.

Comment thread cmd/kubelet/app/server.go
NodeAllocatableConfig: cm.NodeAllocatableConfig{
KubeReservedCgroupName: s.KubeReservedCgroup,
SystemReservedCgroupName: s.SystemReservedCgroup,
EnforceNodeAllocatable: sets.NewString(s.EnforceNodeAllocatable...),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its possible we validate this value later, but just recording here for reference that we don't validate the items in this list are actual valid options. i also wonder if we want to treat this is an opaque set of strings, or a more concrete type.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack that this is now validated.

zeroDuration = metav1.Duration{}
// Refer to [Node Allocatable](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) doc for more information.
// TODO: Set the default to "pods" once cgroups per qos is turned on by default.
defaultNodeAllocatableEnforcement = []string{}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i see! ignore my earlier comment ;-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

nodeInfo *v1.Node
cgroupManager CgroupManager
capacity v1.ResourceList
cgroupRoot string

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you leave a comment just calling out that this is the literal cgroupfs value so its not forgotten?

Comment thread pkg/kubelet/cm/node_allocatable.go Outdated
return &rc
}

func (cm *containerManagerImpl) getNodeAllocatableInternal(includeHardEviction bool) v1.ResourceList {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a test case for this? i need to pull and physically test your pr to make sure it does what i think it is doing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought we need to handle two cases:

  1. allocatable reported to scheduler which was capacity - reservations
  2. allocatable enforced via cgroup which was min(capacity - reservations + reservation, capacity)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after more thought, i think this is right.

so for the 32Gi, 2 gi reserve, 100Mi threshold, this would given 29.9Gi allocatable, and 30Gi enforced. sorry, its late on a friday and this is complicated ;-)

i think the issue is its usage of includeHardEviction..

it would be clearer if called something like: getNodeAllocatableFor(scheduling | enforcement)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming sounds useful. I will make the changes. Unit tests do exist.


// Enforce Node Allocatable (if required)
if err := enforceNodeAllocatableCgroups(cm.NodeConfig.NodeAllocatableConfig,
cm.getNodeAllocatableInternal(false), // ignore eviction thresholds.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought kubepods needs to have allocatable = capacity - reserved + thresholds enforced. right now, this appears to not give the buffer discussed.

Comment thread pkg/kubelet/cm/node_allocatable_test.go Outdated
NodeConfig: nc,
capacity: tc.capacity,
}
for k, v := range cm.GetNodeAllocatable() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is confusing me. see my earlier comment on the internal version of the method, but this doesn't appear correct, and at minimum is not testing both patterns where we pass true or false to the operation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after further reflection, i think we just need a test that tests the enfored values and a test that verifies the reported to scheduler value so its clearer. i think this is correct for the as reported to scheduler case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. I can split up the tests accordingly tomorrow.

Comment thread pkg/kubelet/cm/node_allocatable_test.go Outdated
hardThreshold evictionapi.ThresholdValue
}{
{
kubeReserved: getResourceList("100m", "100Mi"),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a case for the 0 expected resources scenario when thresholds cause us to exceed capacity.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a check in code and a test case

Comment thread test/e2e_node/container_manager_test.go Outdated
var _ = framework.KubeDescribe("Kubelet Container Manager [Serial]", func() {
f := framework.NewDefaultFramework("kubelet-container-manager")

Describe("Validate Node Allocatable", func() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjenning -- this seems like a pattern you could follow for your work in e2e

@dashpole dashpole left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a while, but I think I understand the changes. Other than @derekwaynecarr's comments about validation, I found the allocatable code a bit difficult to read. I think we need to make sure that everything is either a "threshold", "reservation", "limit", or "resource", since an "allocatable" doesn't have a concrete meaning.

Comment thread cmd/kubelet/app/options/options.go Outdated
fs.Int64Var(&s.MaxOpenFiles, "max-open-files", s.MaxOpenFiles, "Number of files that can be opened by Kubelet process. [default=1000000]")
fs.Var(&s.SystemReserved, "system-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]")
fs.Var(&s.KubeReserved, "kube-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for kubernetes system components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove empty line

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

Comment thread cmd/kubelet/app/options/options.go Outdated
// Node Allocatable Flags
fs.Var(&s.SystemReserved, "system-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]")
fs.Var(&s.KubeReserved, "kube-reserved", "A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for kubernetes system components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove extra line

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

Comment thread pkg/kubelet/cm/node_allocatable.go Outdated
}

// GetNodeAllocatable returns amount of compute resource available for pods.
func (cm *containerManagerImpl) GetNodeAllocatable() v1.ResourceList {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an eviction threshold, I would like to see the word threshold in this function name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or on second thought, GetNodeAllocatableResources() would also work.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the real NodeAllocatable from system and users standpoint.
The internal one is just meant for enforcing the real NodeAllocatable within kubelet.
If we were to rename, I'd rename the internal one.

Comment thread pkg/kubelet/cm/node_allocatable.go Outdated
return &rc
}

func (cm *containerManagerImpl) getNodeAllocatableInternal(includeHardEviction bool) v1.ResourceList {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would love a comment explaining the purpose of this function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread pkg/kubelet/cm/node_allocatable.go Outdated
return nil
}

func enforceExistingCgroup(cgroupManager CgroupManager, cName string, rl v1.ResourceList) error {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment explaining function

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@vishh vishh added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2017
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 28, 2017
@vishh

vishh commented Feb 28, 2017

Copy link
Copy Markdown
Contributor Author

Err. rebase :(

vishh and others added 6 commits February 27, 2017 21:24
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
Signed-off-by: Vishnu kannan <vishnuk@google.com>
Signed-off-by: Vishnu kannan <vishnuk@google.com>
Signed-off-by: Vishnu kannan <vishnuk@google.com>
@vishh vishh removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 28, 2017
@vishh

vishh commented Feb 28, 2017

Copy link
Copy Markdown
Contributor Author

Bumping the priority further up because there are about 4+ PRs that are built on top of this which needs to go into v1.6

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2017
@vishh vishh added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2017
@vishh

vishh commented Feb 28, 2017

Copy link
Copy Markdown
Contributor Author

adding LGTM label after rebasing.

@k8s-github-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

The following people have approved this PR: derekwaynecarr, k8s-merge-robot, vishh

Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @lavalamp
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot

Copy link
Copy Markdown

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit fbe2d15 into kubernetes:master Feb 28, 2017
k8s-github-robot pushed a commit that referenced this pull request Mar 3, 2017
Automatic merge from submit-queue

Critial pod test uses allocatable instead of capacity

This solves #42239.

When this test was first introduced, pods could request up to the capacity of the node.
With the addition of allocatable introduced in #41234, this is no longer the case, and pods can only use up to allocatable.

This should be included in 1.6, as it is a bug related to a 1.6 feature.

cc @vish @yujuhong
k8s-github-robot pushed a commit that referenced this pull request Mar 4, 2017
Automatic merge from submit-queue

Eviction Manager Enforces Allocatable Thresholds

This PR modifies the eviction manager to enforce node allocatable thresholds for memory as described in kubernetes/community#348.
This PR should be merged after #41234. 

cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests @vishh 

** Why is this a bug/regression**

Kubelet uses `oom_score_adj` to enforce QoS policies. But the `oom_score_adj` is based on overall memory requested, which means that a Burstable pod that requested a lot of memory can lead to OOM kills for Guaranteed pods, which violates QoS. Even worse, we have observed system daemons like kubelet or kube-proxy being killed by the OOM killer.
Without this PR, v1.6 will have node stability issues and regressions in an existing GA feature `out of Resource` handling.
Comment thread cmd/kubelet/app/server.go
Comment on lines +504 to +505
// Setup event recorder if required.
makeEventRecorder(&s.KubeletConfiguration, kubeDeps, nodeName)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vishh , while reading the source code related EventRecorder in kubelet, I noticed that makeEventRecorder is called twice in the run function: one before ContainerManager initialization and once in the RunKubelet function. So i'm here to ask why makeEventRecorder is called before ContainerManager initialization. Thank you! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants