
set KillMode for kubelet to process, fix for #13511 #23491

Merged

j3ffml merged 1 commit into kubernetes:master from onorua:master on Apr 8, 2016

Conversation

@onorua (Contributor) commented Mar 25, 2016

Restart the kubelet process only, not its whole resource group; for details see the RHEL admin manual and #13511.
The new Ubuntu LTS 16.04 will ship with systemd by default, which may increase the number of complaints and bugs like the one we had. I propose upstreaming this configuration.
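For reference, the change amounts to adding a KillMode directive to the kubelet's systemd unit. A minimal sketch of the relevant [Service] section (the ExecStart path and the restart values here are illustrative, not the exact file touched by this PR):

```ini
# kubelet.service (excerpt) -- sketch only
[Service]
ExecStart=/usr/bin/kubelet ...
Restart=always
RestartSec=10
# Kill only the main kubelet process on stop/restart; child processes
# (e.g. FUSE mount daemons exec'ed by the kubelet) are left running.
# The default, KillMode=control-group, kills everything in the unit's cgroup.
KillMode=process
```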

@k8s-bot commented Mar 25, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

2 similar comments

@k8s-github-robot k8s-github-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 25, 2016
@k8s-github-robot: Labelling this PR as size/XS

@davidopp davidopp assigned mikedanese and unassigned davidopp Mar 26, 2016
@mikedanese (Member):

cc @dchen1107 @vishh

--reconcile-cidr=false
Restart=always
RestartSec=10
KillMode=process
do we need this for master too?

(Contributor Author) I've added it for consistency.

The side effect is that we probably leak master-spawned processes (if any) when systemd restarts the master.

(Contributor Author) hm... what is the difference between the kubelet on a master node and on a compute node? I mean, if there is a risk of leaking something, it is dangerous in the same way on all nodes. If there is no risk, then it is okay on all nodes as well. At least this is my, maybe naive, understanding of kubelet functionality and Kubernetes pod management.

Works for me. @vishh ?

(Contributor) On the master node, except for the kubelet, all other components are expected to run in containers. So this change should be fine.

@vishh (Contributor) commented Apr 4, 2016

The kubelet execs other processes. Does it make sense to run the kubelet in a separate cgroup?

@rootfs commented Apr 4, 2016

Running the kubelet in a different cgroup might not help, since systemd tracks the group the kubelet is in and kills all the processes in it.

@vishh (Contributor) commented Apr 4, 2016

It's still not clear why cleaning up the kubelet service's cgroup is an issue. Am I missing some detail from the systemd perspective?

@rootfs commented Apr 4, 2016

@vishh systemd cleaning up the kubelet and its control group has the side effect of killing the glusterfs (or other FUSE) mount, as identified in @onorua's environment, because the mount daemon is exec'ed by the kubelet.

@vishh (Contributor) commented Apr 5, 2016

How is the lifecycle of the mount daemon managed by the kubelet?

@rootfs commented Apr 5, 2016

When a pod requests, e.g., a glusterfs volume, the kubelet mounter will in the end invoke the glusterfs mount daemon. The daemon stays until the volume is unmounted. If systemd stops the kubelet with KillMode=control-group (the default), both the kubelet and the glusterfs daemon are killed, while the container stays alive with a broken bind mount. When systemd starts the kubelet again, even though the kubelet is able to re-mount the volume, the broken bind mount in the container cannot be repaired. This is what happened in #13511.

The proposed fix is to tell systemd to kill just the kubelet and leave the other processes alive by setting KillMode=process; the glusterfs daemon then stays with the container when the kubelet is stopped.
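On systems where the packaged unit file cannot be edited directly, the same behavior can be sketched as a systemd drop-in (the drop-in path follows the standard systemd override convention; this file is illustrative and not part of the PR):

```ini
# /etc/systemd/system/kubelet.service.d/10-killmode.conf
[Service]
# The default KillMode=control-group kills every process in the unit's
# cgroup on stop/restart: the kubelet plus any mount daemons it exec'ed.
# KillMode=process kills only the main kubelet process.
KillMode=process
```

followed by `systemctl daemon-reload` and `systemctl restart kubelet` to apply it.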

@vishh (Contributor) commented Apr 5, 2016

cc @kubernetes/sig-node

@vishh (Contributor) commented Apr 5, 2016

Ideally, the FUSE daemon should run in the pod's scope, not the kubelet's scope. However, this might need some re-design of these volumes. cc @thockin

@ncdc (Member) commented Apr 5, 2016

cc @kubernetes/rh-cluster-infra @smarterclayton

@rootfs commented Apr 5, 2016

Longer term, I would vote for moving the FUSE daemon to the pod's scope, as it also helps resource accounting.

Between now and then, the proposed fix gives us the correct mount behavior we need during kubelet restarts.

@derekwaynecarr (Member):

@eparis - are there additional unit files for the kubelet that would need to be covered by this change?

@vishh @dchen1107 - this means we need to account for the resource consumption of storage driver daemons as part of kube-reserved for now... I will make a point of noting that in the systemd node spec for 1.3.

@derekwaynecarr (Member) commented Apr 5, 2016 via email

A good reason for us to finally implement pod-level cgroups... do we end up running a FUSE daemon per pod? How does it work in the current model if I have multiple pods on a node that want to use FUSE? Do they share a daemon or does each get its own instance?

@vishh (Contributor) commented Apr 5, 2016 via email

@eparis (Contributor) commented Apr 6, 2016

@derekwaynecarr over in https://github.com/kubernetes/contrib/blob/master/init/systemd/kubelet.service

/me is sad that we moved the system units out of the tree and then duplicated them....

@rootfs commented Apr 6, 2016

Back to this PR: moving the daemons (FUSE, or possibly network plugins) into their own cgroup doesn't conflict with setting systemd KillMode=process. For now, killing the kubelet and leaving the daemons alive keeps the pod's mount intact. Once the daemons live in their own cgroup, the kubelet is the only process in the group, so KillMode=process essentially has the same effect as KillMode=control-group.

@mikedanese mikedanese removed their assignment Apr 6, 2016
@vishh (Contributor) commented Apr 6, 2016

I guess this PR will work for now, without (hopefully) any process leaks. LGTM.

@vishh vishh added lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-merge labels Apr 6, 2016
@k8s-github-robot:

@k8s-bot ok to test
@k8s-bot test this

PR builder appears to be missing; activating due to the 'lgtm' label.

@k8s-github-robot:

Removing LGTM because the release note process has not been followed.
One of the following labels is required: "release-note", "release-note-none", or "release-note-action-required".
Please see: https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/release-notes.md

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2016
@vishh vishh added release-note-none Denotes a PR that doesn't merit a release note. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed release-note-label-needed labels Apr 6, 2016
@k8s-bot commented Apr 6, 2016

GCE e2e build/test passed for commit 0bfc496.

@k8s-github-robot:

@k8s-bot test this

Tests are more than 48 hours old. Re-running tests.

@k8s-bot commented Apr 8, 2016

GCE e2e build/test passed for commit 0bfc496.

@j3ffml j3ffml merged commit e17213a into kubernetes:master Apr 8, 2016
@wwwtyro (Contributor) commented Oct 5, 2016

This appears to be leaking journalctl processes originating here: https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/google/cadvisor/utils/oomparser/oomparser.go#L169

@vishh (Contributor) commented Oct 17, 2016

@wwwtyro Thanks for reporting the regression. I filed #34965 to fix it.

@wwwtyro (Contributor) commented Oct 19, 2016

@vishh awesome, thank you

dchen1107 added a commit to dchen1107/kubernetes-1 that referenced this pull request Oct 25, 2016
Fixed the regression caused by kubernetes#23491 which fix gcluster umount issue.
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Jul 31, 2019
Bug 1732193: UPSTREAM: 80518: Fix detachment of deleted volumes

Origin-commit: b793b93e81b28e3a30f4b3ad722267830767827d

Labels: lgtm, release-note-none, size/XS
