Eviction Manager Enforces Allocatable Thresholds by dashpole · Pull Request #42204 · kubernetes/kubernetes

dashpole · 2017-02-28T00:27:20Z

This PR modifies the eviction manager to enforce node allocatable thresholds for memory as described in kubernetes/community#348.
This PR should be merged after #41234.

cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests @vishh

**Why is this a bug/regression**

Kubelet uses oom_score_adj to enforce QoS policies. But the oom_score_adj is based on overall memory requested, which means that a Burstable pod that requested a lot of memory can lead to OOM kills for Guaranteed pods, which violates QoS. Even worse, we have observed system daemons like kubelet or kube-proxy being killed by the OOM killer.
Without this PR, v1.6 will have node stability issues and regressions in an existing GA feature, out-of-resource handling.

k8s-github-robot · 2017-02-28T00:27:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

The following people have approved this PR: dashpole

Needs approval from an approver in each of these OWNERS Files:

pkg/kubelet/OWNERS

We suggest the following people:
cc @timstclair
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

dashpole · 2017-02-28T00:40:55Z

@k8s-bot test this

dashpole · 2017-02-28T00:41:43Z

@k8s-bot test this

dashpole · 2017-02-28T00:49:08Z

@k8s-bot test this

vishh · 2017-02-28T00:56:05Z

it's not possible to have allocatable here because we currently need an initialized container manager module to get allocatable reservation. better to inject a node status provider into eviction manager to get allocatable by querying the latest v1.Node object

derekwaynecarr · 2017-02-28T04:30:37Z

I would like a chance to review this as well prior to merge

dashpole · 2017-02-28T16:58:12Z

/release-note-none

dashpole · 2017-02-28T19:22:39Z

This is an overview of this solution:
AllocatableMemory.Capacity = node.Status.Allocatable (which has already subtracted the eviction threshold)
AllocatableMemory.Available = AllocatableMemory.Capacity - sum_over_pods(pod.usage)
Threshold: AllocatableMemory.Available, LessThanOp, Resource(0)

dashpole · 2017-02-28T19:40:46Z

I think this is ready for you to take a look at. Builds appear to have failed because they could not pull kube-proxy.
I am not sure what the best way to add the "Zero available threshold" is, and the way I did it is a bit ugly. Feel free to make suggestions.

vishh · 2017-02-28T19:41:00Z

 	// SignalImageFsInodesFree is amount of inodes available on filesystem that container runtime uses for storing images and container writeable layers.
 	SignalImageFsInodesFree Signal = "imagefs.inodesFree"
+	// SignalAllocatableMemoryAvailable is amount of memory available for pod allocation (i.e. allocatable - workingSet (of pods), in bytes.
+	SignalAllocatableMemoryAvailable Signal = "allocatableMemory.available"


make this private

I would have to move it out of eviction api, since this is used within the eviction package.

is this a user-facing?

seeing that this is not user-facing, this should be private. agreed.

Just to clarify, in case I wasn't clear: this is now in a separate package from the rest of the eviction code, so I have two options:

1. I can move this signal (or all of them) to eviction/types.go and make it private. However, it would then be separated from the rest of the signals.

2. I can leave it public.

I am happy to do either, just let me know.

Existing code is fine by me as long as we don't add support for parsing it.

vishh · 2017-02-28T19:51:03Z

+func ParseThresholdConfig(includeAllocatableThreshold bool, evictionHard, evictionSoft, evictionSoftGracePeriod, evictionMinimumReclaim string) ([]evictionapi.Threshold, error) {
 	results := []evictionapi.Threshold{}
+	if includeAllocatableThreshold {
+		results = append(results, evictionapi.Threshold{


Don't turn this on by default. Instead tie allocatable limits to memory.available signal.

I am not sure I understand either point.
The threshold here is turned on or off based on includeAllocatableThreshold.
Why would I tie allocatable limits to available memory?

Redacted. Talked to @dashpole offline.

vishh · 2017-02-28T19:52:11Z

 	}

-	thresholds, err := eviction.ParseThresholdConfig(kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
+	addAllocatableEvictionThreshold := !kubeCfg.ExperimentalNodeAllocatableIgnoreEvictionThreshold && len(kubeCfg.EnforceNodeAllocatable) > 0


You need to check that kubeCfg.EnforceNodeAllocatable contains pods in its value.

Done. I reworked this by passing kubeCfg to the thresholds parser. It adds the threshold if "pods" (cm.NodeAllocatableEnforcementKey) is found.
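A minimal sketch of the check being asked for, assuming `EnforceNodeAllocatable` is a list of strings and that the enforcement key is the literal "pods". The helper name `enforcesPods` and the local constant are hypothetical, and the sketch does not model the ExperimentalNodeAllocatableIgnoreEvictionThreshold flag from the hunk above.

```go
package main

import "fmt"

// nodeAllocatableEnforcementKey stands in for cm.NodeAllocatableEnforcementKey,
// assumed here to be "pods".
const nodeAllocatableEnforcementKey = "pods"

// enforcesPods reports whether the enforcement list explicitly names "pods".
// A non-empty list is not sufficient: it may name only other keys.
func enforcesPods(enforceNodeAllocatable []string) bool {
	for _, key := range enforceNodeAllocatable {
		if key == nodeAllocatableEnforcementKey {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(enforcesPods([]string{"kube-reserved", "pods"}))
	fmt.Println(enforcesPods([]string{"system-reserved"}))
}
```

The second call shows why `len(...) > 0` alone is too loose: the list is non-empty, but allocatable is not enforced on pods, so the allocatable threshold should not be added.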

vishh · 2017-02-28T19:52:26Z

an e2e will be helpful

derekwaynecarr

i would like some more test cases for TestParseThresholdConfig. i also would like to see the documented behavior for memory.available here to have some text that says how memory.available has two meanings.

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/kubelet-eviction.md#eviction-signals

derekwaynecarr · 2017-02-28T22:43:14Z

 	// SignalImageFsInodesFree is amount of inodes available on filesystem that container runtime uses for storing images and container writeable layers.
 	SignalImageFsInodesFree Signal = "imagefs.inodesFree"
+	// SignalAllocatableMemoryAvailable is amount of memory available for pod allocation (i.e. allocatable - workingSet (of pods), in bytes.
+	SignalAllocatableMemoryAvailable Signal = "allocatableMemory.available"


is this a user-facing?

derekwaynecarr · 2017-02-28T22:55:13Z

 	// SignalImageFsInodesFree is amount of inodes available on filesystem that container runtime uses for storing images and container writeable layers.
 	SignalImageFsInodesFree Signal = "imagefs.inodesFree"
+	// SignalAllocatableMemoryAvailable is amount of memory available for pod allocation (i.e. allocatable - workingSet (of pods), in bytes.
+	SignalAllocatableMemoryAvailable Signal = "allocatableMemory.available"


seeing that this is not user-facing, this should be private. agreed.

derekwaynecarr · 2017-02-28T22:56:43Z

-func ParseThresholdConfig(evictionHard, evictionSoft, evictionSoftGracePeriod, evictionMinimumReclaim string) ([]evictionapi.Threshold, error) {
+func ParseThresholdConfig(allocatableConfig []string, evictionHard, evictionSoft, evictionSoftGracePeriod, evictionMinimumReclaim string) ([]evictionapi.Threshold, error) {
 	results := []evictionapi.Threshold{}
+	allocatableThresholds := getAllocatableThreshold(allocatableConfig)


i would like a test case for this use case.

derekwaynecarr · 2017-02-28T22:57:40Z

 	}
 	for testName, testCase := range testCases {
-		thresholds, err := ParseThresholdConfig(testCase.evictionHard, testCase.evictionSoft, testCase.evictionSoftGracePeriod, testCase.evictionMinReclaim)
+		thresholds, err := ParseThresholdConfig([]string{}, testCase.evictionHard, testCase.evictionSoft, testCase.evictionSoftGracePeriod, testCase.evictionMinReclaim)


we need tests that exercise not passing [] into the Parse call.

dashpole · 2017-03-02T03:31:23Z

@k8s-bot gci gke e2e test this

dashpole · 2017-03-02T05:35:38Z

@k8s-bot gce etcd3 e2e test this
@k8s-bot cvm gke e2e test this
@k8s-bot gci gke e2e test this

dashpole · 2017-03-02T15:37:09Z

ingress failures seem unrelated, but I rebased to be sure, since the testgrid for that test is green.

dashpole · 2017-03-02T16:32:14Z

Mar 2 08:06:36.589: INFO: Error creating firewall-rules, output: ERROR: (gcloud.compute.firewall-rules.create) Some requests did not succeed:
- The resource 'projects/k8s-jkns-pr-gke/global/networks/jenkins-e2e' was not found

This seems unrelated...

dashpole · 2017-03-02T16:52:28Z

this issue appears to plague other PRs as well: #42362, so I think it is safe to ignore.

vishh · 2017-03-02T18:20:31Z

@k8s-bot cvm gke e2e test this

vishh · 2017-03-02T18:20:40Z

@k8s-bot gci gke e2e test this

ethernetdan · 2017-03-02T18:41:01Z

@vishh could an exception request please be put in for this PR? unless this targets 1.7

vishh · 2017-03-02T19:00:29Z

@ethernetdan an exception request has already been posted - https://groups.google.com/forum/#!topic/kubernetes-milestone-burndown/NgpKrBm2gcA/discussion

vishh · 2017-03-02T19:01:18Z

@calebamiles As per https://kubernetes.slack.com/archives/sig-node/p1488417902001532 can I go ahead and unblock this PR?

ethernetdan · 2017-03-02T19:05:36Z

Missed that, thanks!

ethernetdan · 2017-03-02T19:06:16Z

Please wait until a decision is made at the burndown tomorrow.

vishh · 2017-03-02T19:11:27Z

@ethernetdan we already had a long chat with @calebamiles about this and a couple other PRs in that exception request yesterday. Can you sync up with him?

vishh · 2017-03-02T20:05:53Z

Given that this PR is a bug fix (as described above), I'm lifting the merge blocker.

k8s-github-robot · 2017-03-04T04:20:12Z

Automatic merge from submit-queue

The kubelet will terminate end-user pods when the worker node has 'MemoryPressure' according to [1]. Confusingly, there exist two reasons for pods being evicted:
- the whole machine's free memory is too low;
- the value calculated by k8s itself [2], i.e. memory.available [3], is too low.

To resolve this confusion for k8s users, collect and show the k8s global working-set memory to distinguish between these two causes.

Note:
1. Collecting only the k8s global memory stats is enough, because cgroupfs stats are propagated from child to parent, so the parent always notices the change and updates. From k8s v1.6 [4], allocatable (/sys/fs/cgroup/memory/kubepods/) is more convincing than capacity (/sys/fs/cgroup/memory/).
2. There are two cgroup drivers or managers to control resources: cgroupfs and systemd [5]. We should take both into account. (With the 'systemd' cgroup driver, names always end with '.slice'.)
3. cgroup v1 and cgroup v2 differ: the memory.stat file uses different field names, and the current usage is stored in different files (cgv1's memory.usage_in_bytes vs. cgv2's memory.current).

[1] https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior
[2] kubernetes/kubernetes#43916
[3] memory.available = memory.allocatable/capacity - memory.workingSet, memory.workingSet = memory.currentUsage - memory.inactivefile
[4] kubernetes/kubernetes#42204 kubernetes/community#348
[5] https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/

Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
Reported-by: Teng Hu <huteng.ht@bytedance.com>

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 28, 2017

k8s-github-robot assigned mtaufen Feb 28, 2017

k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Feb 28, 2017

dashpole force-pushed the allocatable_eviction branch from d2e5597 to e4147f7 Compare February 28, 2017 00:39

vishh reviewed Feb 28, 2017

vishh assigned vishh and unassigned mtaufen Feb 28, 2017

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Feb 28, 2017

k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 28, 2017

dashpole force-pushed the allocatable_eviction branch from da9ef1d to ed7d67d Compare February 28, 2017 19:02

k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 28, 2017

vishh reviewed Feb 28, 2017

vishh added this to the v1.6 milestone Feb 28, 2017

vishh added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 28, 2017

derekwaynecarr suggested changes Feb 28, 2017

k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 2, 2017

eviction manager changes for allocatable

ac612ea

dashpole force-pushed the allocatable_eviction branch from e6e446b to ac612ea Compare March 2, 2017 15:36

derekwaynecarr added the kind/bug Categorizes issue or PR as related to a bug. label Mar 2, 2017

derekwaynecarr added this to the v1.6 milestone Mar 2, 2017

derekwaynecarr approved these changes Mar 2, 2017

vishh added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. labels Mar 2, 2017

k8s-github-robot merged commit 2d319bd into kubernetes:master Mar 4, 2017

dashpole deleted the allocatable_eviction branch March 14, 2017 16:45
