set KillMode for kubelet to process, fix for #13511 #23491
j3ffml merged 1 commit into kubernetes:master
Conversation
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy.
Labelling this PR as size/XS

cc @dchen1107 @vishh
    --reconcile-cidr=false
    Restart=always
    RestartSec=10
    KillMode=process
I've added it for consistency.

The side effect is that we may leak any master-spawned processes when systemd restarts the master.

Hm... what is the difference between kubelet on the master node and on a compute node? If there is a risk of leaking something, it is equally dangerous on all nodes; if there is no risk, then it is okay on all nodes as well. At least this is my, maybe naive, understanding of kubelet functionality and Kubernetes pod management.

On the master node, all components except the kubelet are expected to run in containers, so this change should be fine.
Kubelet execs other processes. Does it make sense to run kubelet in a separate cgroup?
Running kubelet in a different cgroup might not help, since systemd tracks the cgroup kubelet is in and kills everything in it.
It's still not clear why cleaning up the kubelet service's cgroup is an issue. Am I missing some detail from the systemd perspective?
How is the lifecycle of the mount daemon managed by kubelet?
When a pod requests e.g. a glusterfs volume, the kubelet mounter eventually invokes the glusterfs mount daemon. The daemon stays until the volume is unmounted. If systemd stops kubelet with KillMode=control-group (the default), both kubelet and the glusterfs daemon are killed, while the container stays alive with a broken bind mount. When systemd starts kubelet again, even though kubelet is able to re-mount the volume, the broken bind mount in the container cannot be repaired. This is what happened in #13511. The proposed fix is to tell systemd to kill only kubelet and leave other processes alive by setting KillMode=process; the glusterfs daemon then stays with the container when kubelet is stopped.
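To make the fix concrete, the relevant fragment of a kubelet unit would look roughly like the sketch below; the binary path and flags here are illustrative assumptions, not the exact contents of the unit files in this tree:

    [Service]
    # Path and flags are placeholders for illustration only
    ExecStart=/usr/bin/kubelet --reconcile-cidr=false
    Restart=always
    RestartSec=10
    # Kill only the main kubelet process on stop/restart; processes it
    # spawned (e.g. a glusterfs FUSE daemon) keep running
    KillMode=process

With the default KillMode=control-group, systemd instead signals every process remaining in the unit's cgroup, which is exactly what takes the mount daemon down.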
cc @kubernetes/sig-node
Ideally, the FUSE daemon should run in the pod's scope, not kubelet's scope. However, that might require some redesign of these volume plugins. cc @thockin
cc @kubernetes/rh-cluster-infra @smarterclayton
Longer term, I would vote for moving the FUSE daemon into the pod's scope, as that also helps resource accounting. Between now and then, the proposed fix gives us the correct mount behavior we need during a kubelet restart.
@eparis - are there additional unit files for kubelet that would need to be covered by this change? @vishh @dchen1107 - this means we need to account for the resource consumption of storage driver daemons as part of kube-reserved for now... I will make a point of noting that in the systemd node spec for 1.3.
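For reference, assuming the daemons are accounted for via the kubelet's --kube-reserved flag, the reservation would look something like this sketch; the values are made-up placeholders to be sized per cluster:

    # Reserve node resources for kubelet plus the daemons it spawns
    # (placeholder values, not a recommendation)
    kubelet --kube-reserved=cpu=200m,memory=512Mi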
A good reason for us to finally implement pod-level cgroups...

Pod-level cgroups will have their own set of issues. But yes, that's the direction we need to move towards. Fixing the underlying problem will require refactoring the volume plugins, though. I'd prefer managing the FUSE daemon as a separate systemd service, if at all possible.
@derekwaynecarr over in https://github.com/kubernetes/contrib/blob/master/init/systemd/kubelet.service /me is sad that we moved the systemd units out of the tree and then duplicated them...
Back to this PR: moving daemons (FUSE, or potentially network plugins) into their own cgroups doesn't conflict with setting systemd KillMode=process. For now, killing kubelet and leaving the daemons alive preserves the pod's mount. Once the daemons live in their own cgroups, kubelet is the only process left in the service's cgroup, and KillMode=process has essentially the same effect as KillMode=control-group.
I guess this PR will work for now, without (hopefully) any process leaks. LGTM.
Removing LGTM because the release note process has not been followed.
GCE e2e build/test passed for commit 0bfc496.
@k8s-bot test this. Tests are more than 48 hours old. Re-running tests.
GCE e2e build/test passed for commit 0bfc496.
This appears to be leaking journalctl processes originating here: https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/google/cadvisor/utils/oomparser/oomparser.go#L169
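A quick, generic way to spot such leaked processes on a node (nothing here is specific to this PR):

    # List journalctl processes with parent PID and age; long-lived entries
    # reparented to PID 1 after a kubelet restart are likely leaks
    ps -o pid,ppid,etime,cmd -C journalctl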
@vishh awesome, thank you
Fixed the regression caused by kubernetes#23491, which fixed the glusterfs umount issue.
Bug 1732193: UPSTREAM: 80518: Fix detachment of deleted volumes Origin-commit: b793b93e81b28e3a30f4b3ad722267830767827d
Restart the kubelet process, not its control group; for more details see the RHEL administrator manual and #13511.
The new Ubuntu 16.04 LTS will have systemd by default, which may increase the number of complaints and bugs like the one we had. I propose upstreaming the configuration.