
Conversation

@lifubang
Contributor

For issue moby/moby#38978:
We should let KillAll run when pid namespace path is not empty.

Signed-off-by: Lifubang <lifubang@acmcoder.com>
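For readers following along, here is a minimal sketch of the intended check. This is not the literal diff from this PR; the package name and the loadBundleSpec helper are assumptions for illustration, and the spec types come from github.com/opencontainers/runtime-spec/specs-go.

// A hedged sketch: shouldKillAllOnExit should only return false when the
// bundle's spec asks for a genuinely new pid namespace, i.e. a pid namespace
// entry with an empty Path. A non-empty Path means the container joins
// another container's pid namespace, so KillAll must still run.
package shimutil // hypothetical package name for illustration

import (
	"encoding/json"
	"os"
	"path/filepath"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// loadBundleSpec is a hypothetical helper that reads <bundle>/config.json.
func loadBundleSpec(bundlePath string) (*specs.Spec, error) {
	data, err := os.ReadFile(filepath.Join(bundlePath, "config.json"))
	if err != nil {
		return nil, err
	}
	var spec specs.Spec
	if err := json.Unmarshal(data, &spec); err != nil {
		return nil, err
	}
	return &spec, nil
}

func shouldKillAllOnExit(bundlePath string) (bool, error) {
	spec, err := loadBundleSpec(bundlePath)
	if err != nil {
		return false, err
	}
	if spec.Linux != nil {
		for _, ns := range spec.Linux.Namespaces {
			// Only a pid namespace with an empty Path is a fresh namespace;
			// the kernel reaps its children when init dies, so no KillAll.
			if ns.Type == specs.PIDNamespace && ns.Path == "" {
				return false, nil
			}
		}
	}
	// Host pid namespace, or a pid namespace joined via a non-empty path:
	// the shim has to run KillAll itself.
	return true, nil
}
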
@codecov-io

codecov-io commented Mar 30, 2019

Codecov Report

Merging #3149 into master will decrease coverage by 4.08%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #3149      +/-   ##
==========================================
- Coverage   49.25%   45.16%   -4.09%     
==========================================
  Files         100      111      +11     
  Lines        9415    11962    +2547     
==========================================
+ Hits         4637     5403     +766     
- Misses       3955     5727    +1772     
- Partials      823      832       +9
Flag Coverage Δ
#linux 49.25% <ø> (ø) ⬆️
#windows 40.49% <ø> (?)
Impacted Files Coverage Δ
snapshots/native/native.go 43.04% <0%> (-9.99%) ⬇️
metadata/snapshot.go 45.8% <0%> (-8.96%) ⬇️
archive/tar.go 43.79% <0%> (-7.07%) ⬇️
metadata/containers.go 47.97% <0%> (-6.62%) ⬇️
content/local/writer.go 57.84% <0%> (-6.36%) ⬇️
content/local/store.go 48.51% <0%> (-5.03%) ⬇️
metadata/images.go 57.57% <0%> (-4.99%) ⬇️
archive/tar_opts.go 28.57% <0%> (-4.77%) ⬇️
archive/compression/compression.go 58.69% <0%> (-4.7%) ⬇️
metadata/buckets.go 56.33% <0%> (-4.6%) ⬇️
... and 61 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d0a06d...8722966. Read the comment docs.

Member

@fuweid fuweid left a comment

LGTM

And @lifubang, could you please add a test case for this? Thanks!
I think the original tests don't cover this case:

func TestContainerKillInitPidHost(t *testing.T) {
initContainerAndCheckChildrenDieOnKill(t, oci.WithHostNamespace(specs.PIDNamespace))
}
func TestContainerKillInitKillsChildWhenNotHostPid(t *testing.T) {
initContainerAndCheckChildrenDieOnKill(t)
}
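
A hedged sketch of what such a test option could look like, assuming the initContainerAndCheckChildrenDieOnKill helper accepts variadic oci.SpecOpts like the tests above, and using oci.WithLinuxNamespace from containerd's oci package to join an existing pid namespace. The helper name withSharedPidNamespace and the pid value are placeholders; a real test would start a first container and use its task pid.

// Sketch only: join an already running container's pid namespace instead of
// creating a new one.
func withSharedPidNamespace(pid int) oci.SpecOpts {
	return oci.WithLinuxNamespace(specs.LinuxNamespace{
		Type: specs.PIDNamespace,
		Path: fmt.Sprintf("/proc/%d/ns/pid", pid),
	})
}

func TestContainerKillInitKillsChildWhenSharedPid(t *testing.T) {
	// Placeholder pid; a real test would look up the pid of a previously
	// started container's task and pass it in here.
	initContainerAndCheckChildrenDieOnKill(t, withSharedPidNamespace(12345))
}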

@fuweid
Member

fuweid commented Mar 30, 2019

I think it should be cherry-picked into the v1.2 release.

cc @Random-Liu

@fuweid
Member

fuweid commented Mar 30, 2019

Wait. Hmm. I checked https://github.com/containerd/containerd/pull/2597/files and found it is the shared-cgroup case, not the pid namespace case.

When I remove the namespace check locally

func (s *Service) checkProcesses(e runc.Exit) {
-       shouldKillAll, err := shouldKillAllOnExit(s.bundle)
-       if err != nil {
-               log.G(s.context).WithError(err).Error("failed to check shouldKillAll")
-       }
-
        for _, p := range s.allProcesses() {
                if p.Pid() == e.Pid {

-                       if shouldKillAll {
-                               if ip, ok := p.(*proc.Init); ok {
-                                       // Ensure all children are killed
-                                       if err := ip.KillAll(s.context); err != nil {
-                                               log.G(s.context).WithError(err).WithField("id", ip.ID()).
-                                                       Error("failed to kill init's children")
-                                       }
+                       if ip, ok := p.(*proc.Init); ok {
+                               // Ensure all children are killed
+                               if err := ip.KillAll(s.context); err != nil {
+                                       log.G(s.context).WithError(err).WithField("id", ip.ID()).
+                                               Error("failed to kill init's children")
                                }
                        }
                        p.SetExited(e.Status)

And use the following script to test:

# create pause container
➜  /tmp ctr run -d docker.io/library/busybox:1.25 pause sh -c "sleep 1000"

# check pid
➜  /tmp ctr t ls
TASK         PID      STATUS
pause        19934    RUNNING

# run other pause container with namespace
➜  /tmp ctr run -d --with-ns 'pid:/proc/19934/ns/pid' docker.io/library/busybox:1.25 pause-too sh -c "sleep 1000"
➜  /tmp ctr t ls
TASK         PID      STATUS
pause        19934    RUNNING
pause-too    20116    RUNNING
➜  /tmp cat /run/containerd/io.containerd.runtime.v1.linux/default/pause-too/config.json  | jq '.linux.namespaces[0]'
{
  "type": "pid",
  "path": "/proc/19934/ns/pid"
}

# kill the pause-too and pause container is still living
➜  /tmp ctr t kill -s SIGKILL -a pause-too
➜  /tmp ctr t ls
TASK         PID      STATUS
pause        19934    RUNNING
pause-too    20116    STOPPED

So I think we should remove the pid namespace check, @lifubang.

Here is the killAll in runc: https://github.com/opencontainers/runc/blob/master/libcontainer/init_linux.go#L474

// signalAllProcesses freezes then iterates over all the processes inside the
// manager's cgroups sending the signal s to them.
// If s is SIGKILL then it will wait for each process to exit.
// For all other signals it will check if the process is ready to report its
// exit status and only if it is will a wait be performed.
func signalAllProcesses(m cgroups.Manager, s os.Signal) error {

@thaJeztah
Member

Thanks! Yes, this should be cherry-picked into the relevant release branches.

@Random-Liu
Member

Random-Liu commented Mar 30, 2019

I remember we fixed this in the cri plugin.

We didn't fix this in the shim because that check was just added for the shared pod namespace case. See #2558.

We may want to fix this in docker instead if we think the issue above is a valid case we would like to support.

@lifubang
Contributor Author

lifubang commented Mar 31, 2019

We may want to fix this in docker instead if we think the issue above is a valid case we would like to support.

I think this patch may be an easier way to fix this situation.
And this patch also supports #2558, because for the shared pod namespace case, the pid namespace path value is "".

Are there any other situations I have ignored?

@lifubang
Contributor Author

@fuweid I think we can't remove the pid namespace check before calling killAll, because we should support containers with the same cgroup path.
So I fixed it in moby instead. Please see moby/moby#38980.
Does that make sense? If yes, this PR can be closed.

@fuweid
Member

fuweid commented Mar 31, 2019

If my understanding is correct, sharing the same cgroup path is not related to the pid namespace.
If we don't remove the pid check and the init process dies, the processes created by the init process will not be terminated, right?

@lifubang
Contributor Author

If we don't remove the pid check and the init process dies, the processes created by the init process will not be terminated, right?

  1. If the host pid namespace is used, even without killAll, the child processes in the cgroup will be killed by runc after the init process dies.
  2. And killAll in runc kills all processes in the cgroup.
    For these two reasons, we should not remove the pid namespace check.

@fuweid
Member

fuweid commented Mar 31, 2019

@lifubang thanks for the explanation. I think we can close it because it supports the shared cgroup case.

@fuweid
Member

fuweid commented Mar 31, 2019

But I think we still need to consider whether the shared cgroup case is what we need here, because there might be process leakage. When container A joins container B's pid namespace and the init process in A exits, the child processes are still alive until container B is dead. It might be a problem here.

@Random-Liu
Member

Random-Liu commented Mar 31, 2019 via email

Why share cgroup? Why not just put the 2 containers into the same parent cgroup?

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

Why share cgroup? Why not just put the 2 containers into the same parent cgroup?

My English is not very good. @Random-Liu, I think @fuweid means only sharing the pid namespace, without entering the same cgroup path. I think this is different from sharing a cgroup.

Now, there are 3 situations:

1. Share cgroup: two containers have the same cgroup path

This is the case in issue #2558.
This situation needs shouldKillAllOnExit to check whether we use a new pid namespace or not.

2. Share pid namespace: Container A joins Container B's pid namespace

This is the case in issue moby/moby#38978.
I think this situation doesn't need the shouldKillAllOnExit check. It should run killAll anyway.

3. Use the same parent cgroup: like a k8s pod

I think this situation also doesn't need the shouldKillAllOnExit check. It should run killAll anyway.

I think cases (2) and (3) are used more widely than case (1).
So, as @fuweid mentioned in #3149 (comment), can we remove shouldKillAllOnExit?

If I misunderstood some cases, please let me know. Thanks.

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

And if we can't remove shouldKillAllOnExit, then consider case (2) in my last comment, where Container A joins Container B's pid namespace:
I think Container B is just like a host, and Container A just uses B's pid namespace, not a new pid namespace. So we should treat this situation as not using a new pid namespace. That is why I submitted this PR to fix shouldKillAllOnExit. Then all 3 cases in #3149 (comment) are supported very well.

@fuweid
Member

fuweid commented Apr 1, 2019

hi @Random-Liu

Why share cgroup? Why not just put the 2 containers into the same parent cgroup?

Yes. I was thinking that issue #2558 is not a common case, and the shouldKillAllOnExit function confuses me. When you use the same parent cgroup, we don't need shouldKillAllOnExit in the reaped function.

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

I think Container B is just like a host, and Container A just uses B's pid namespace, not a new pid namespace. So we should treat this situation as not using a new pid namespace.

Oh, this is right. I have tested it with ctr, and it causes the shim service to get stuck:

# we use busybox rootfs and /test1 to start container
root@demo:/opt/busybox.test# cat rootfs/test1
#!/bin/sh
sleep 100000&
while true; do
  wait || true
done

# start container test
root@demo:/opt/busybox.test# ctr run -d --rootfs ./rootfs test /test1
root@demo:/opt/busybox.test# ctr t ls
TASK    PID      STATUS    
test    17974    RUNNING
root@demo:/opt/busybox.test# ctr t ps test
PID      INFO
17974    &ProcessDetails{ExecID:test,}
17995    -

# start container test-pid with pid ns of container test
root@demo:/opt/busybox.test# ctr run -d --with-ns "pid:/proc/17974/ns/pid" --rootfs ./rootfs test-pid /test1
root@demo:/opt/busybox.test# ctr t ls
TASK        PID      STATUS    
test        17974    RUNNING
test-pid    18067    RUNNING
root@demo:/opt/busybox.test# ctr t ps test-pid
PID      INFO
18067    &ProcessDetails{ExecID:test-pid,}
18088    -

# container test-pid's init process is dead
root@demo:/opt/busybox.test# ctr t kill -s 9 test-pid
root@demo:/opt/busybox.test# ctr t ps test-pid
PID      INFO
18088    -
root@demo:/opt/busybox.test# ctr t ls
TASK        PID      STATUS    
test        17974    RUNNING
test-pid    18067    STOPPED

# try to delete the task test-pid; it causes the shim service to get stuck
root@demo:/opt/busybox.test# ctr t delete test-pid
^C
root@demo:/opt/busybox.test# ctr t ls
^C
root@demo:/opt/busybox.test# ctr t ps test
PID      INFO
17974    &ProcessDetails{ExecID:test,}
17995    -
root@demo:/opt/busybox.test# ctr t ps test-pid
^C
root@demo:/opt/busybox.test#

And I have tested it in runc; it has the same result.
So if we don't want to delete the shouldKillAllOnExit check, we should fix it by checking the path's value.

@crosbymichael @Random-Liu @fuweid PTAL.

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

Hi, everyone. I don't know why we would need to use the same cgroup path in two containers. What is it used for?
If it is useless, I have a draft:

  1. Disallow two containers using the same cgroup path;
  2. Then remove the shouldKillAllOnExit func.

@fuweid
Member

fuweid commented Apr 1, 2019

+1 to remove the shouldKillAllOnExit func

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

+1 to remove the shouldKillAllOnExit func

Yes, we should make a decision: either remove it, or use this patch to support everything.

@Random-Liu
Member

Random-Liu commented Apr 1, 2019

Why share cgroup? Why not just put the 2 containers into the same parent cgroup?

NVM. I was on my phone during the weekend, so didn't get a chance to read things through. #2558 does require sharing cgroups as you both mentioned :), which I didn't pay attention to.

At least for both the Moby and CRI use cases, we need kill-all for shared pid namespace containers.
I don't quite understand the use case in #2558. Why not just put the original container into a parent cgroup, and create the sidecar in the same parent cgroup? @BooleanCat

@fuweid
Member

fuweid commented Apr 2, 2019

also cc @georgethebeatle @ostenbom

@lifubang
Contributor Author

lifubang commented Apr 5, 2019

How about this one? I think if we can't make a decision on whether to delete the shouldKillAllOnExit function or not, we should use this patch to fix the shim service getting stuck when joining another pid namespace.

We can open a new PR after we decide to delete it.

@fuweid
Member

fuweid commented Apr 23, 2019

ping @georgethebeatle ~

@fuweid
Member

fuweid commented May 7, 2019

ping @georgethebeatle @danail-branekov again~ we need to know the use case here. Thanks

@danail-branekov
Contributor

danail-branekov commented May 7, 2019

Hi all,

So issue #2558 was about the sandbox container being killed when the sidecar stops. That immediate issue was fixed with PR #2597. However, we overlooked that the PR would result in not killing the sidecar container, which seems to be fixed by the current PR. Therefore we think that this PR is fine; maybe you should just add a test to make sure that the sidecar container process does indeed get killed.

Why not just put the original container into a parent cgroup, and create the sidecar in the same parent cgroup?

In CF Garden there is a use case where the sidecar container may have different limits and therefore should be created in a dedicated cgroup.

cc @georgethebeatle

@fuweid
Member

fuweid commented May 9, 2019

Thanks @danail-branekov for the use case.

With this PR change, we have several cases here:

Case 1: share the same cgroup path, but not the same pid namespace

Neither container will kill all processes when it quits. The kernel will kill all processes in the pid namespace when the init process dies. No worry about leaking processes. I think this is the CF Garden use case.

Case 2: share the same cgroup path and the same pid namespace

containerd-shim kills all processes in the same cgroup when a container quits. If the non-host container quits, all the processes in the same pid namespace will be killed, because the host container's init process is killed by runC. I think the result is reasonable.

Case 3: do not share the same cgroup path, but share the same pid namespace

containerd-shim kills all processes in the container, but it will not impact the host container.

Case 4: share nothing

Each container has its own pid namespace, and the kernel will kill all processes in the pid namespace when the init process dies. No worry about leaking processes.

The change LGTM.

cc @Random-Liu @georgethebeatle @lifubang

@Random-Liu
Member

Random-Liu commented May 9, 2019

I like the fix! :D And I'm happy that it works for Garden!

Regardless of sharing cgroups or not: if a container has its own pid namespace and its init process dies, no kill-all is needed. Very straightforward.

Member

@fuweid fuweid left a comment

LGTM

Member

@estesp estesp left a comment

LGTM

Would be good to get @crosbymichael and someone from CF Garden to also approve just to confirm.

@danail-branekov
Contributor

@estesp The PR looks good from the CF Garden point of view; all our tests are fine (@georgethebeatle and I are the Garden representatives).

@crosbymichael
Member

LGTM

@tedyu
Contributor

tedyu commented Feb 26, 2020

@thaJeztah @crosbymichael
It seems runtime/v2/runc/v1/service.go doesn't have this change.

Should the same change be made there?
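
For reference, a hedged sketch of how the same guard might look in the v2 shim's exit handling, modeled directly on the v1 checkProcesses code quoted earlier in this thread; the field and package names below (s.bundle, s.context, proc.Init) mirror that v1 snippet and are assumptions for runtime/v2/runc/v1/service.go.

// Sketch only: guard KillAll with the shouldKillAllOnExit check before
// killing init's children, the same way the v1 shim does above.
func (s *service) checkProcesses(e runc.Exit) {
	shouldKillAll, err := shouldKillAllOnExit(s.bundle)
	if err != nil {
		log.G(s.context).WithError(err).Error("failed to check shouldKillAll")
	}
	for _, p := range s.allProcesses() {
		if p.Pid() != e.Pid {
			continue
		}
		if shouldKillAll {
			if ip, ok := p.(*proc.Init); ok {
				// Ensure all children are killed when init exits.
				if err := ip.KillAll(s.context); err != nil {
					log.G(s.context).WithError(err).WithField("id", ip.ID()).
						Error("failed to kill init's children")
				}
			}
		}
		p.SetExited(e.Status)
	}
}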

@fuweid
Member

fuweid commented Feb 27, 2020

It seems runtime/v2/runc/v1/service.go doesn't have this change.

Should the same change be made there?

@tedyu we are missing that part. Could you help to file a PR to handle this? Thanks!

@thaJeztah
Member

^^ PR was opened here: #4063 (and merged already); thanks @tedyu!
