Kubelet OOM killing in 'g1-small' node during huge-cluster perf test #47865
Closed
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/network: Categorizes an issue or PR as relevant to SIG Network.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.
Description
While running scalability tests today (as part of #47344) on a 4000-node GCE cluster, the failure below occurred during density test termination. The load test is currently running.
The density test failed because a density pod's condition was not updated. Digging a bit further, it turned out that a couple of kubelets had crashed, including the one on the node where that pod was running:
I0621 08:08:26.374] Jun 21 08:08:26.372: INFO: Waiting up to 3m0s for all (but 50) nodes to be ready
I0621 08:08:27.435] Jun 21 08:08:27.435: INFO: Condition Ready of node e2e-enormous-cluster-minion-group-1-xdwx is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
I0621 08:08:27.437] Jun 21 08:08:27.437: INFO: Condition Ready of node e2e-enormous-cluster-minion-group-nxl2 is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
..... repeats
From the kernel logs:
Jun 21 14:52:07.298991 e2e-enormous-cluster-minion-group-nxl2 kernel: Out of memory: Kill process 13774 (event-exporter) score 1684 or sacrifice child
Jun 21 14:52:07.312821 e2e-enormous-cluster-minion-group-nxl2 kernel: Killed process 13774 (event-exporter) total-vm:1268972kB, anon-rss:1193588kB, file-rss:0kB
Jun 21 15:09:02.204129 e2e-enormous-cluster-minion-group-nxl2 kernel: fluentd invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=883
Jun 21 15:09:02.298774 e2e-enormous-cluster-minion-group-nxl2 kernel: fluentd cpuset=1e88c29d9ecdec0d6d2e380aa6cf9c7b11db5a60b0bda35ac2b3694a58232b47 mems_allowed=0
..
..
Jun 21 17:06:42.055581 e2e-enormous-cluster-minion-group-nxl2 kernel: Memory cgroup out of memory: Kill process 16497 (ip-masq-agent) score 1463 or sacrifice child
Jun 21 17:06:42.055604 e2e-enormous-cluster-minion-group-nxl2 kernel: Killed process 22398 (iptables-restor) total-vm:25744kB, anon-rss:3660kB, file-rss:0kB
Jun 21 17:07:46.960055 e2e-enormous-cluster-minion-group-nxl2 kernel: iptables-restor invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=996
Jun 21 17:07:46.960183 e2e-enormous-cluster-minion-group-nxl2 kernel: iptables-restor cpuset=8f72983ec1d83e25928f29a8b1ad953265489b4bf721e922db68bd70b11f2f31 mems_allowed=0
..
..
Jun 21 17:08:39.596296 e2e-enormous-cluster-minion-group-nxl2 kernel: Memory cgroup out of memory: Kill process 23412 (iptables) score 1866 or sacrifice child
Jun 21 17:08:39.596318 e2e-enormous-cluster-minion-group-nxl2 kernel: Killed process 23412 (iptables) total-vm:23460kB, anon-rss:7460kB, file-rss:4kB
Jun 21 17:08:43.074466 e2e-enormous-cluster-minion-group-nxl2 kernel: iptables invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=996
Jun 21 17:08:43.074553 e2e-enormous-cluster-minion-group-nxl2 kernel: iptables cpuset=db1fd653a6e6cd30ed40ddf44828ad5159f83e75a02bee69bf8c648335b75e7e mems_allowed=0
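To see at a glance which processes were killed and how much memory each held, the "Killed process" lines above can be summarized with a small script. This is only a minimal sketch: the regular expression assumes the exact line format shown in this excerpt (other kernel versions print additional fields), and the names `KILLED_RE`, `parse_oom_kills`, and `SAMPLE` are hypothetical helpers introduced for illustration, not part of any Kubernetes tooling.

```python
import re

# Assumption: matches "Killed process" lines in the format seen in the
# excerpt above (total-vm, anon-rss, file-rss, all in kB).
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, "
    r"anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_oom_kills(log_text):
    """Return one dict per 'Killed process' line found in log_text."""
    kills = []
    for line in log_text.splitlines():
        match = KILLED_RE.search(line)
        if match is None:
            continue  # not a kill record (e.g. an "invoked oom-killer" line)
        kills.append({
            "pid": int(match.group("pid")),
            "name": match.group("name"),
            "total_vm_kb": int(match.group("total_vm")),
            "anon_rss_kb": int(match.group("anon_rss")),
            "file_rss_kb": int(match.group("file_rss")),
        })
    return kills


# The three kill records copied from the kernel log above.
SAMPLE = """\
Jun 21 14:52:07.312821 e2e-enormous-cluster-minion-group-nxl2 kernel: Killed process 13774 (event-exporter) total-vm:1268972kB, anon-rss:1193588kB, file-rss:0kB
Jun 21 17:06:42.055604 e2e-enormous-cluster-minion-group-nxl2 kernel: Killed process 22398 (iptables-restor) total-vm:25744kB, anon-rss:3660kB, file-rss:0kB
Jun 21 17:08:39.596318 e2e-enormous-cluster-minion-group-nxl2 kernel: Killed process 23412 (iptables) total-vm:23460kB, anon-rss:7460kB, file-rss:4kB
"""

# Print the victims, largest anonymous RSS first.
for kill in sorted(parse_oom_kills(SAMPLE), key=lambda k: k["anon_rss_kb"], reverse=True):
    print(kill["name"], kill["pid"], kill["anon_rss_kb"], "kB anon-rss")
```

In practice the input would be the node's full kernel log rather than `SAMPLE`; sorting by anon-rss makes it easy to tell large consumers apart from small processes that were killed only because their memory cgroup hit its limit.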
The cluster is still running. To reach the node:
gcloud compute ssh e2e-enormous-cluster-minion-group-nxl2 --project kubernetes-scale --zone us-east1-a
cc @kubernetes/sig-node-bugs @kubernetes/sig-scalability-misc @dchen1107 @yujuhong @gmarek