Eviction Manager Enforces Allocatable Thresholds#42204
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED The following people have approved this PR: dashpole Needs approval from an approver in each of these OWNERS Files: We suggest the following people: |
d2e5597 to
e4147f7
Compare
|
@k8s-bot test this |
There was a problem hiding this comment.
it's not possible to have allocatable here because we currently need an initialized container manager module to get allocatable reservation. better to inject a node status provider into eviction manager to get allocatable by querying the latest v1.Node object
|
I would like a chance to review this as well prior to merge |
|
/release-note-none |
da9ef1d to
ed7d67d
Compare
|
This is an overview of this solution: |
|
I think this is ready for you to take a look at. Builds appear to have failed because it failed to pull kube-proxy |
| // SignalImageFsInodesFree is amount of inodes available on filesystem that container runtime uses for storing images and container writeable layers. | ||
| SignalImageFsInodesFree Signal = "imagefs.inodesFree" | ||
| // SignalAllocatableMemoryAvailable is amount of memory available for pod allocation (i.e. allocatable - workingSet (of pods), in bytes. | ||
| SignalAllocatableMemoryAvailable Signal = "allocatableMemory.available" |
There was a problem hiding this comment.
I would have to move it out of eviction api, since this is used within the eviction package.
There was a problem hiding this comment.
seeing that this is not user-facing, this should be private. agreed.
There was a problem hiding this comment.
Just to clarify, in-case I wasnt clear: This is now in a separate package from the rest of the eviction code, so I have 2 options:
- I can move this signal (or all of them) to eviction/types.go, and make it private. However, then it is separated from the rest of the signals.
- I can leave it public.
I am happy to do either, just let me know.
There was a problem hiding this comment.
Existing code is fine by me as long as we don't add support for parsing it.
| func ParseThresholdConfig(includeAllocatableThreshold bool, evictionHard, evictionSoft, evictionSoftGracePeriod, evictionMinimumReclaim string) ([]evictionapi.Threshold, error) { | ||
| results := []evictionapi.Threshold{} | ||
| if includeAllocatableThreshold { | ||
| results = append(results, evictionapi.Threshold{ |
There was a problem hiding this comment.
Don't turn this on by default. Instead tie allocatable limits to memory.available signal.
There was a problem hiding this comment.
I am not sure I understand either point.
The threshold here on or off based on includeAllocatableThreshold.
Why would I tie allocatable limits to available memory?
| } | ||
|
|
||
| thresholds, err := eviction.ParseThresholdConfig(kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim) | ||
| addAllocatableEvictionThreshold := !kubeCfg.ExperimentalNodeAllocatableIgnoreEvictionThreshold && len(kubeCfg.EnforceNodeAllocatable) > 0 |
There was a problem hiding this comment.
You need to check that kubeCfg.EnforceNodeAllocatable contains pods in its value.
There was a problem hiding this comment.
Done. I reworked this by passing kubeCfg to the thresholds parser. It adds the threshold if "pods" (cm.NodeAllocatableEnforcementKey) is found.
|
an e2e will be helpful |
derekwaynecarr
left a comment
There was a problem hiding this comment.
i would like some more test cases for TestParseThresholdConfig. i also would like to see the documented behavior for memory.available here to have some text that says how memory.available has two meanings.
| // SignalImageFsInodesFree is amount of inodes available on filesystem that container runtime uses for storing images and container writeable layers. | ||
| SignalImageFsInodesFree Signal = "imagefs.inodesFree" | ||
| // SignalAllocatableMemoryAvailable is amount of memory available for pod allocation (i.e. allocatable - workingSet (of pods), in bytes. | ||
| SignalAllocatableMemoryAvailable Signal = "allocatableMemory.available" |
| // SignalImageFsInodesFree is amount of inodes available on filesystem that container runtime uses for storing images and container writeable layers. | ||
| SignalImageFsInodesFree Signal = "imagefs.inodesFree" | ||
| // SignalAllocatableMemoryAvailable is amount of memory available for pod allocation (i.e. allocatable - workingSet (of pods), in bytes. | ||
| SignalAllocatableMemoryAvailable Signal = "allocatableMemory.available" |
There was a problem hiding this comment.
seeing that this is not user-facing, this should be private. agreed.
| func ParseThresholdConfig(evictionHard, evictionSoft, evictionSoftGracePeriod, evictionMinimumReclaim string) ([]evictionapi.Threshold, error) { | ||
| func ParseThresholdConfig(allocatableConfig []string, evictionHard, evictionSoft, evictionSoftGracePeriod, evictionMinimumReclaim string) ([]evictionapi.Threshold, error) { | ||
| results := []evictionapi.Threshold{} | ||
| allocatableThresholds := getAllocatableThreshold(allocatableConfig) |
There was a problem hiding this comment.
i would like a test case for this use case.
| } | ||
| for testName, testCase := range testCases { | ||
| thresholds, err := ParseThresholdConfig(testCase.evictionHard, testCase.evictionSoft, testCase.evictionSoftGracePeriod, testCase.evictionMinReclaim) | ||
| thresholds, err := ParseThresholdConfig([]string{}, testCase.evictionHard, testCase.evictionSoft, testCase.evictionSoftGracePeriod, testCase.evictionMinReclaim) |
There was a problem hiding this comment.
we need tests that exercise not passing [] into the Parse call..
|
@k8s-bot gci gke e2e test this |
e6e446b to
ac612ea
Compare
|
ingress failures seem unrelated, but I rebased to be sure, since the testgrid for that test is green. |
|
This seems unrelated... |
|
this issue appears plauge other PRs as well: #42362, so I think it is safe to ignore. |
|
@k8s-bot cvm gke e2e test this |
|
@k8s-bot gci gke e2e test this |
|
@vishh could an exception request please be put in for this PR? unless this targets 1.7 |
|
@ethernetdan an exception request has already been posted - https://groups.google.com/forum/#!topic/kubernetes-milestone-burndown/NgpKrBm2gcA/discussion |
|
@calebamiles As per https://kubernetes.slack.com/archives/sig-node/p1488417902001532 can I go ahead and un block this PR? |
|
Missed that, thanks!
…On Thu, Mar 2, 2017, 11:01 AM Vish Kannan ***@***.***> wrote:
@calebamiles <https://github.com/calebamiles> As per
https://kubernetes.slack.com/archives/sig-node/p1488417902001532 can I go
ahead and un block this PR?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#42204 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AD1BORN1BsE2Ck59e5UobQWmNE6TbawGks5rhxIagaJpZM4MNz5y>
.
|
|
Please wait until a decision is made at the burndown tomorrow
On Thu, Mar 2, 2017, 11:05 AM Dan Gillespie <dan.gillespie@coreos.com>
wrote:
… Missed that, thanks!
On Thu, Mar 2, 2017, 11:01 AM Vish Kannan ***@***.***>
wrote:
@calebamiles <https://github.com/calebamiles> As per
https://kubernetes.slack.com/archives/sig-node/p1488417902001532 can I go
ahead and un block this PR?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#42204 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AD1BORN1BsE2Ck59e5UobQWmNE6TbawGks5rhxIagaJpZM4MNz5y>
.
|
|
@ethernetdan we already had a long chat with @calebamiles about this and a couple other PRs in that exception request yesterday. Can you sync up with him? |
|
Given that this PR is a bug fix (as described above), I'm lifting the merge blocker. |
|
Automatic merge from submit-queue |
The kubelet will terminate end-user pods when the worker node has 'MemoryPressure' according to [1]. But confusingly, there exits two reasons for pods being evicted: - one is the whole machine's free memory is too low, - the other is k8s itself calculation[2], e.i. memory.available[3] is too low. To resolve such confusion for k8s users, collect and show k8s global workingset memory to distinguish between these two causes. Note: 1. Only collect k8s global memory stats is enough, this is because cgroupfs stats are propagated from child to parent. Thus the parent can always notice the change and then updates. And From v1.6 k8s[4], allocatable(/sys/fs/cgroup/memory/kubepods/) is more convincing than capacity(/sys/fs/cgroup/memory/). 2. There are two cgroup drivers or managers to control resources: cgroupfs and systemd[5]. We should take both into account. (The 'systemd' cgroup driver always ends with '.slice') 3. The difference between cgroupv1 and cgroupv2: different field names for memory.stat file, and memory.currentUsage storing in different files (cgv1's memory.usage_in_bytes v.s. cgv2's memory.current). [1]https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior [2]kubernetes/kubernetes#43916 [3]memory.available = memory.allocatable/capacity - memory.workingSet, memory.workingSet = memory.currentUsage - memory.inactivefile [4]kubernetes/kubernetes#42204 kubernetes/community#348 [5]https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/ Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Reported-by: Teng Hu <huteng.ht@bytedance.com>
The kubelet will terminate end-user pods when the worker node has 'MemoryPressure' according to [1]. But confusingly, there exits two reasons for pods being evicted: - one is the whole machine's free memory is too low, - the other is k8s itself calculation[2], e.i. memory.available[3] is too low. To resolve such confusion for k8s users, collect and show k8s global workingset memory to distinguish between these two causes. Note: 1. Only collect k8s global memory stats is enough, this is because cgroupfs stats are propagated from child to parent. Thus the parent can always notice the change and then updates. And From v1.6 k8s[4], allocatable(/sys/fs/cgroup/memory/kubepods/) is more convincing than capacity(/sys/fs/cgroup/memory/). 2. There are two cgroup drivers or managers to control resources: cgroupfs and systemd[5]. We should take both into account. (The 'systemd' cgroup driver always ends with '.slice') 3. The difference between cgroupv1 and cgroupv2: different field names for memory.stat file, and memory.currentUsage storing in different files (cgv1's memory.usage_in_bytes v.s. cgv2's memory.current). [1]https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior [2]kubernetes/kubernetes#43916 [3]memory.available = memory.allocatable/capacity - memory.workingSet, memory.workingSet = memory.currentUsage - memory.inactivefile [4]kubernetes/kubernetes#42204 kubernetes/community#348 [5]https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/ Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Reported-by: Teng Hu <huteng.ht@bytedance.com>
The kubelet will terminate end-user pods when the worker node has 'MemoryPressure' according to [1]. But confusingly, there exits two reasons for pods being evicted: - one is the whole machine's free memory is too low, - the other is k8s itself calculation[2], e.i. memory.available[3] is too low. To resolve such confusion for k8s users, collect and show k8s global workingset memory to distinguish between these two causes. Note: 1. Only collect k8s global memory stats is enough, this is because cgroupfs stats are propagated from child to parent. Thus the parent can always notice the change and then updates. And From v1.6 k8s[4], allocatable(/sys/fs/cgroup/memory/kubepods/) is more convincing than capacity(/sys/fs/cgroup/memory/). 2. There are two cgroup drivers or managers to control resources: cgroupfs and systemd[5]. We should take both into account. (The 'systemd' cgroup driver always ends with '.slice') 3. The difference between cgroupv1 and cgroupv2: different field names for memory.stat file, and memory.currentUsage storing in different files (cgv1's memory.usage_in_bytes v.s. cgv2's memory.current). [1]https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior [2]kubernetes/kubernetes#43916 [3]memory.available = memory.allocatable/capacity - memory.workingSet, memory.workingSet = memory.currentUsage - memory.inactivefile [4]kubernetes/kubernetes#42204 kubernetes/community#348 [5]https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/ Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Reported-by: Teng Hu <huteng.ht@bytedance.com>
The kubelet will terminate end-user pods when the worker node has 'MemoryPressure' according to [1]. But confusingly, there exits two reasons for pods being evicted: - one is the whole machine's free memory is too low, - the other is k8s itself calculation[2], e.i. memory.available[3] is too low. To resolve such confusion for k8s users, collect and show k8s global workingset memory to distinguish between these two causes. Note: 1. Only collect k8s global memory stats is enough, this is because cgroupfs stats are propagated from child to parent. Thus the parent can always notice the change and then updates. And From v1.6 k8s[4], allocatable(/sys/fs/cgroup/memory/kubepods/) is more convincing than capacity(/sys/fs/cgroup/memory/). 2. There are two cgroup drivers or managers to control resources: cgroupfs and systemd[5]. We should take both into account. (The 'systemd' cgroup driver always ends with '.slice') 3. The difference between cgroupv1 and cgroupv2: different field names for memory.stat file, and memory.currentUsage storing in different files (cgv1's memory.usage_in_bytes v.s. cgv2's memory.current). [1]https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior [2]kubernetes/kubernetes#43916 [3]memory.available = memory.allocatable/capacity - memory.workingSet, memory.workingSet = memory.currentUsage - memory.inactivefile [4]kubernetes/kubernetes#42204 kubernetes/community#348 [5]https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/ Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Reported-by: Teng Hu <huteng.ht@bytedance.com>
The kubelet will terminate end-user pods when the worker node has 'MemoryPressure' according to [1]. But confusingly, there exits two reasons for pods being evicted: - one is the whole machine's free memory is too low, - the other is k8s itself calculation[2], e.i. memory.available[3] is too low. To resolve such confusion for k8s users, collect and show k8s global workingset memory to distinguish between these two causes. Note: 1. Only collect k8s global memory stats is enough, this is because cgroupfs stats are propagated from child to parent. Thus the parent can always notice the change and then updates. And From v1.6 k8s[4], allocatable(/sys/fs/cgroup/memory/kubepods/) is more convincing than capacity(/sys/fs/cgroup/memory/). 2. There are two cgroup drivers or managers to control resources: cgroupfs and systemd[5]. We should take both into account. (The 'systemd' cgroup driver always ends with '.slice') 3. The difference between cgroupv1 and cgroupv2: different field names for memory.stat file, and memory.currentUsage storing in different files (cgv1's memory.usage_in_bytes v.s. cgv2's memory.current). [1]https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior [2]kubernetes/kubernetes#43916 [3]memory.available = memory.allocatable/capacity - memory.workingSet, memory.workingSet = memory.currentUsage - memory.inactivefile [4]kubernetes/kubernetes#42204 kubernetes/community#348 [5]https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/ Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Reported-by: Teng Hu <huteng.ht@bytedance.com>
The kubelet will terminate end-user pods when the worker node has 'MemoryPressure' according to [1]. But confusingly, there exits two reasons for pods being evicted: - one is the whole machine's free memory is too low, - the other is k8s itself calculation[2], e.i. memory.available[3] is too low. To resolve such confusion for k8s users, collect and show k8s global workingset memory to distinguish between these two causes. Note: 1. Only collect k8s global memory stats is enough, this is because cgroupfs stats are propagated from child to parent. Thus the parent can always notice the change and then updates. And From v1.6 k8s[4], allocatable(/sys/fs/cgroup/memory/kubepods/) is more convincing than capacity(/sys/fs/cgroup/memory/). 2. There are two cgroup drivers or managers to control resources: cgroupfs and systemd[5]. We should take both into account. (The 'systemd' cgroup driver always ends with '.slice') 3. The difference between cgroupv1 and cgroupv2: different field names for memory.stat file, and memory.currentUsage storing in different files (cgv1's memory.usage_in_bytes v.s. cgv2's memory.current). [1]https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior [2]kubernetes/kubernetes#43916 [3]memory.available = memory.allocatable/capacity - memory.workingSet, memory.workingSet = memory.currentUsage - memory.inactivefile [4]kubernetes/kubernetes#42204 kubernetes/community#348 [5]https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/ Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Reported-by: Teng Hu <huteng.ht@bytedance.com>
This PR modifies the eviction manager to enforce node allocatable thresholds for memory as described in kubernetes/community#348.
This PR should be merged after #41234.
cc @kubernetes/sig-node-pr-reviews @kubernetes/sig-node-feature-requests @vishh
** Why is this a bug/regression**
Kubelet uses
oom_score_adjto enforce QoS policies. But theoom_score_adjis based on overall memory requested, which means that a Burstable pod that requested a lot of memory can lead to OOM kills for Guaranteed pods, which violates QoS. Even worse, we have observed system daemons like kubelet or kube-proxy being killed by the OOM killer.Without this PR, v1.6 will have node stability issues and regressions in an existing GA feature
out of Resourcehandling.