
Conversation

@lifubang
Contributor

For issue moby/moby#38978:
We should let KillAll run when pid namespace path is not empty.

Signed-off-by: Lifubang <lifubang@acmcoder.com>
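For readers following along, here is a minimal sketch of the intended check. This is not the literal diff from this PR; the package name and the loadBundleSpec helper are assumptions for illustration, and the spec types come from github.com/opencontainers/runtime-spec/specs-go.

// A hedged sketch: shouldKillAllOnExit should only return false when the
// bundle's spec asks for a genuinely new pid namespace, i.e. a pid namespace
// entry with an empty Path. A non-empty Path means the container joins
// another container's pid namespace, so KillAll must still run.
package shimutil // hypothetical package name for illustration

import (
	"encoding/json"
	"os"
	"path/filepath"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// loadBundleSpec is a hypothetical helper that reads <bundle>/config.json.
func loadBundleSpec(bundlePath string) (*specs.Spec, error) {
	data, err := os.ReadFile(filepath.Join(bundlePath, "config.json"))
	if err != nil {
		return nil, err
	}
	var spec specs.Spec
	if err := json.Unmarshal(data, &spec); err != nil {
		return nil, err
	}
	return &spec, nil
}

func shouldKillAllOnExit(bundlePath string) (bool, error) {
	spec, err := loadBundleSpec(bundlePath)
	if err != nil {
		return false, err
	}
	if spec.Linux != nil {
		for _, ns := range spec.Linux.Namespaces {
			// Only a pid namespace with an empty Path is a fresh namespace;
			// the kernel reaps its children when init dies, so no KillAll.
			if ns.Type == specs.PIDNamespace && ns.Path == "" {
				return false, nil
			}
		}
	}
	// Host pid namespace, or a pid namespace joined via a non-empty path:
	// the shim has to run KillAll itself.
	return true, nil
}
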
@codecov-io

codecov-io commented Mar 30, 2019

Codecov Report

Merging #3149 into master will decrease coverage by 4.08%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #3149      +/-   ##
==========================================
- Coverage   49.25%   45.16%   -4.09%     
==========================================
  Files         100      111      +11     
  Lines        9415    11962    +2547     
==========================================
+ Hits         4637     5403     +766     
- Misses       3955     5727    +1772     
- Partials      823      832       +9
Flag Coverage Δ
#linux 49.25% <ø> (ø) ⬆️
#windows 40.49% <ø> (?)
Impacted Files Coverage Δ
snapshots/native/native.go 43.04% <0%> (-9.99%) ⬇️
metadata/snapshot.go 45.8% <0%> (-8.96%) ⬇️
archive/tar.go 43.79% <0%> (-7.07%) ⬇️
metadata/containers.go 47.97% <0%> (-6.62%) ⬇️
content/local/writer.go 57.84% <0%> (-6.36%) ⬇️
content/local/store.go 48.51% <0%> (-5.03%) ⬇️
metadata/images.go 57.57% <0%> (-4.99%) ⬇️
archive/tar_opts.go 28.57% <0%> (-4.77%) ⬇️
archive/compression/compression.go 58.69% <0%> (-4.7%) ⬇️
metadata/buckets.go 56.33% <0%> (-4.6%) ⬇️
... and 61 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d0a06d...8722966. Read the comment docs.

Member

@fuweid fuweid left a comment

LGTM

And @lifubang, could you please add a test case for this? Thanks!
I think the original tests don't cover this case:

func TestContainerKillInitPidHost(t *testing.T) {
initContainerAndCheckChildrenDieOnKill(t, oci.WithHostNamespace(specs.PIDNamespace))
}
func TestContainerKillInitKillsChildWhenNotHostPid(t *testing.T) {
initContainerAndCheckChildrenDieOnKill(t)
}
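
A hedged sketch of what such a test option could look like, assuming the initContainerAndCheckChildrenDieOnKill helper accepts variadic oci.SpecOpts like the tests above, and using oci.WithLinuxNamespace from containerd's oci package to join an existing pid namespace. The helper name withSharedPidNamespace and the pid value are placeholders; a real test would start a first container and use its task pid.

// Sketch only: join an already running container's pid namespace instead of
// creating a new one.
func withSharedPidNamespace(pid int) oci.SpecOpts {
	return oci.WithLinuxNamespace(specs.LinuxNamespace{
		Type: specs.PIDNamespace,
		Path: fmt.Sprintf("/proc/%d/ns/pid", pid),
	})
}

func TestContainerKillInitKillsChildWhenSharedPid(t *testing.T) {
	// Placeholder pid; a real test would look up the pid of a previously
	// started container's task and pass it in here.
	initContainerAndCheckChildrenDieOnKill(t, withSharedPidNamespace(12345))
}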

@fuweid
Member

fuweid commented Mar 30, 2019

I think it should be cherry-picked into the v1.2 release.

cc @Random-Liu

@fuweid
Member

fuweid commented Mar 30, 2019

Wait. Hmm. I checked https://github.com/containerd/containerd/pull/2597/files and found it is the shared-cgroup case, not the pid namespace case.

When I remove the namespace check locally

func (s *Service) checkProcesses(e runc.Exit) {
-       shouldKillAll, err := shouldKillAllOnExit(s.bundle)
-       if err != nil {
-               log.G(s.context).WithError(err).Error("failed to check shouldKillAll")
-       }
-
        for _, p := range s.allProcesses() {
                if p.Pid() == e.Pid {

-                       if shouldKillAll {
-                               if ip, ok := p.(*proc.Init); ok {
-                                       // Ensure all children are killed
-                                       if err := ip.KillAll(s.context); err != nil {
-                                               log.G(s.context).WithError(err).WithField("id", ip.ID()).
-                                                       Error("failed to kill init's children")
-                                       }
+                       if ip, ok := p.(*proc.Init); ok {
+                               // Ensure all children are killed
+                               if err := ip.KillAll(s.context); err != nil {
+                                       log.G(s.context).WithError(err).WithField("id", ip.ID()).
+                                               Error("failed to kill init's children")
                                }
                        }
                        p.SetExited(e.Status)

And use the following script to test:

# create pause container
➜  /tmp ctr run -d docker.io/library/busybox:1.25 pause sh -c "sleep 1000"

# check pid
➜  /tmp ctr t ls
TASK         PID      STATUS
pause        19934    RUNNING

# run other pause container with namespace
➜  /tmp ctr run -d --with-ns 'pid:/proc/19934/ns/pid' docker.io/library/busybox:1.25 pause-too sh -c "sleep 1000"
➜  /tmp ctr t ls
TASK         PID      STATUS
pause        19934    RUNNING
pause-too    20116    RUNNING
➜  /tmp cat /run/containerd/io.containerd.runtime.v1.linux/default/pause-too/config.json  | jq '.linux.namespaces[0]'
{
  "type": "pid",
  "path": "/proc/19934/ns/pid"
}

# kill the pause-too and pause container is still living
➜  /tmp ctr t kill -s SIGKILL -a pause-too
➜  /tmp ctr t ls
TASK         PID      STATUS
pause        19934    RUNNING
pause-too    20116    STOPPED

So I think we should remove the pid namespace check, @lifubang.

Here is the killAll in runc: https://github.com/opencontainers/runc/blob/master/libcontainer/init_linux.go#L474

// signalAllProcesses freezes then iterates over all the processes inside the
// manager's cgroups sending the signal s to them.
// If s is SIGKILL then it will wait for each process to exit.
// For all other signals it will check if the process is ready to report its
// exit status and only if it is will a wait be performed.
func signalAllProcesses(m cgroups.Manager, s os.Signal) error {

@thaJeztah
Member

Thanks! Yes, this should be cherry-picked into the relevant release branches.

@Random-Liu
Member

Random-Liu commented Mar 30, 2019

I remember we fixed this in the cri plugin.

We didn't fix this in the shim because that check was just added for the shared pod namespace case. See #2558.

We may want to fix this in docker instead if we think the issue above is a valid case we would like to support.

@lifubang
Contributor Author

lifubang commented Mar 31, 2019

We may want to fix this in docker instead if we think the issue above is a valid case we would like to support.

I think this patch may be an easier way to fix this situation.
And this patch also supports #2558, because for the shared pod namespace case, the pid namespace path value is "".

Are there any other situations I have ignored?

@lifubang
Contributor Author

@fuweid I think we can't remove the pid namespace check before calling killAll, because we should support containers with the same cgroup path.
So I fixed it in moby instead. Please see moby/moby#38980.
Does that make sense? If yes, this PR can be closed.

@fuweid
Member

fuweid commented Mar 31, 2019

If my understanding is correct, sharing the same cgroup path is not related to the pid namespace.
If we don't remove the pid check and the init process dies, the processes created by the init process will not be terminated, right?

@lifubang
Contributor Author

If we don't remove the pid check and the init process dies, the processes created by the init process will not be terminated, right?

  1. If the host pid namespace is used, even without killAll, the child processes in the cgroup will be killed by runc after the init process dies.
  2. And killAll in runc kills all processes in the cgroup.
    For these two reasons, we should not remove the pid namespace check.

@fuweid
Member

fuweid commented Mar 31, 2019

@lifubang thanks for the explanation. I think we can close it because it supports the shared cgroup case.

@fuweid
Member

fuweid commented Mar 31, 2019

But I think we still need to consider whether the shared cgroup case is what we need here, because there might be process leakage. When container A joins container B's pid namespace and the init process in A exits, the child processes are still alive until container B is dead. It might be a problem here.

@Random-Liu
Member

Random-Liu commented Mar 31, 2019 via email

Why share cgroup? Why not just put the 2 containers into the same parent cgroup?

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

Why share cgroup? Why not just put the 2 containers into the same parent cgroup?

My English is not very good. @Random-Liu, I think @fuweid means only sharing the pid namespace, without entering the same cgroup path. I think this is different from sharing a cgroup.

Now, there are 3 situations:

1. Share cgroup: two containers have the same cgroup path

This is the case in issue #2558.
This situation needs shouldKillAllOnExit to check whether we use a new pid namespace or not.

2. Share pid namespace: Container A joins Container B's pid namespace

This is the case in issue moby/moby#38978.
I think this situation doesn't need the shouldKillAllOnExit check. It should run killAll anyway.

3. Use the same parent cgroup: like a k8s pod

I think this situation also doesn't need the shouldKillAllOnExit check. It should run killAll anyway.

I think cases (2) and (3) are used more widely than case (1).
So, as @fuweid mentioned in #3149 (comment), can we remove shouldKillAllOnExit?

If I misunderstood some cases, please let me know. Thanks.

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

And if we can't remove shouldKillAllOnExit, then consider case (2) in my last comment, where Container A joins Container B's pid namespace:
I think Container B is just like a host, and Container A just uses B's pid namespace, not a new pid namespace. So we should treat this situation as not using a new pid namespace. That is why I submitted this PR to fix shouldKillAllOnExit. Then all 3 cases in #3149 (comment) are supported very well.

@fuweid
Member

fuweid commented Apr 1, 2019

hi @Random-Liu

Why share cgroup? Why not just put the 2 containers into the same parent cgroup?

Yes. I was thinking that issue #2558 is not a common case, and the shouldKillAllOnExit function confuses me. When you use the same parent cgroup, we don't need shouldKillAllOnExit in the reaped function.

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

I think Container B is just like a host, and Container A just uses B's pid namespace, not a new pid namespace. So we should treat this situation as not using a new pid namespace.

Oh, this is right. I have tested it with ctr, and it causes the shim service to get stuck:

# we use busybox rootfs and /test1 to start container
root@demo:/opt/busybox.test# cat rootfs/test1
#!/bin/sh
sleep 100000&
while true; do
  wait || true
done

# start container test
root@demo:/opt/busybox.test# ctr run -d --rootfs ./rootfs test /test1
root@demo:/opt/busybox.test# ctr t ls
TASK    PID      STATUS    
test    17974    RUNNING
root@demo:/opt/busybox.test# ctr t ps test
PID      INFO
17974    &ProcessDetails{ExecID:test,}
17995    -

# start container test-pid with pid ns of container test
root@demo:/opt/busybox.test# ctr run -d --with-ns "pid:/proc/17974/ns/pid" --rootfs ./rootfs test-pid /test1
root@demo:/opt/busybox.test# ctr t ls
TASK        PID      STATUS    
test        17974    RUNNING
test-pid    18067    RUNNING
root@demo:/opt/busybox.test# ctr t ps test-pid
PID      INFO
18067    &ProcessDetails{ExecID:test-pid,}
18088    -

# container test-pid's init process is dead
root@demo:/opt/busybox.test# ctr t kill -s 9 test-pid
root@demo:/opt/busybox.test# ctr t ps test-pid
PID      INFO
18088    -
root@demo:/opt/busybox.test# ctr t ls
TASK        PID      STATUS    
test        17974    RUNNING
test-pid    18067    STOPPED

# try to delete the task test-pid; it causes the shim service to get stuck
root@demo:/opt/busybox.test# ctr t delete test-pid
^C
root@demo:/opt/busybox.test# ctr t ls
^C
root@demo:/opt/busybox.test# ctr t ps test
PID      INFO
17974    &ProcessDetails{ExecID:test,}
17995    -
root@demo:/opt/busybox.test# ctr t ps test-pid
^C
root@demo:/opt/busybox.test#

And I have tested it in runc; it has the same result.
So if we don't want to delete the shouldKillAllOnExit check, we should fix it by checking the path's value.

@crosbymichael @Random-Liu @fuweid PTAL.

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

Hi, everyone. I don't know why we would need to use the same cgroup path in two containers. What is it used for?
If it is useless, I have a draft:

  1. Disallow two containers using the same cgroup path;
  2. Then remove the shouldKillAllOnExit func.

@fuweid
Member

fuweid commented Apr 1, 2019

+1 to remove the shouldKillAllOnExit func

@lifubang
Contributor Author

lifubang commented Apr 1, 2019

+1 to remove the shouldKillAllOnExit func

Yes, we should make a decision: either remove it, or use this patch to support everything.

@Random-Liu
Member

Random-Liu commented Apr 1, 2019

Why share cgroup? Why not just put the 2 containers into the same parent cgroup?

NVM. I was on my phone during the weekend, so didn't get a chance to read things through. #2558 does require sharing cgroups as you both mentioned :), which I didn't pay attention to.

At least for both the Moby and CRI use cases, we need kill-all for shared pid namespace containers.
I don't quite understand the use case in #2558. Why not just put the original container into a parent cgroup, and create the sidecar in the same parent cgroup? @BooleanCat

@fuweid
Member

fuweid commented Apr 2, 2019

also cc @georgethebeatle @ostenbom

@lifubang
Contributor Author

lifubang commented Apr 5, 2019

How about this one? I think if we can't make a decision on whether to delete the shouldKillAllOnExit function or not, we should use this patch to fix the shim service getting stuck when joining another pid namespace.

We can open a new PR after we decide to delete it.

@fuweid
Member

fuweid commented Apr 23, 2019

ping @georgethebeatle ~

@fuweid
Member

fuweid commented May 7, 2019

ping @georgethebeatle @danail-branekov again~ we need to know the use case here. Thanks

@danail-branekov
Contributor

danail-branekov commented May 7, 2019

Hi all,

So issue #2558 was about the sandbox container being killed when the sidecar stops. That immediate issue was fixed with PR #2597. However, we overlooked that the PR would result in not killing the sidecar container, which seems to be fixed by the current PR. Therefore we think that this PR is fine; maybe you should just add a test to make sure that the sidecar container process does indeed get killed.

Why not just put the original container into a parent cgroup, and create the sidecar in the same parent cgroup?

In CF Garden there is a use case where the sidecar container may have different limits and therefore should be created in a dedicated cgroup.

cc @georgethebeatle

@fuweid
Member

fuweid commented May 9, 2019

Thanks @danail-branekov for the use case.

With this PR change, we have several cases here:

Case 1: share the same cgroup path, but not the same pid namespace

Neither container will kill all processes when it quits. The kernel will kill all processes in the pid namespace when the init process dies. No worry about leaking processes. I think this is the CF Garden use case.

Case 2: share the same cgroup path and the same pid namespace

containerd-shim kills all processes in the same cgroup when a container quits. If the non-host container quits, all the processes in the same pid namespace will be killed, because the host container's init process is killed by runC. I think the result is reasonable.

Case 3: do not share the same cgroup path, but share the same pid namespace

containerd-shim kills all processes in the container, but it will not impact the host container.

Case 4: share nothing

Each container has its own pid namespace, and the kernel will kill all processes in the pid namespace when the init process dies. No worry about leaking processes.

The change LGTM.

cc @Random-Liu @georgethebeatle @lifubang

@Random-Liu
Member

Random-Liu commented May 9, 2019

I like the fix! :D And I'm happy that it works for Garden!

Regardless of sharing cgroups or not: if a container has its own pid namespace and its init process dies, no kill-all is needed. Very straightforward.

Member

@fuweid fuweid left a comment

LGTM

Member

@estesp estesp left a comment

LGTM

Would be good to get @crosbymichael and someone from CF Garden to also approve just to confirm.

@danail-branekov
Contributor

@estesp The PR looks good from the CF Garden point of view; all our tests are fine (@georgethebeatle and I are the Garden representatives).

@crosbymichael
Member

LGTM

@tedyu
Contributor

tedyu commented Feb 26, 2020

@thaJeztah @crosbymichael
It seems runtime/v2/runc/v1/service.go doesn't have this change.

Should the same change be made there?
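
For reference, a hedged sketch of how the same guard might look in the v2 shim's exit handling, modeled directly on the v1 checkProcesses code quoted earlier in this thread; the field and package names below (s.bundle, s.context, proc.Init) mirror that v1 snippet and are assumptions for runtime/v2/runc/v1/service.go.

// Sketch only: guard KillAll with the shouldKillAllOnExit check before
// killing init's children, the same way the v1 shim does above.
func (s *service) checkProcesses(e runc.Exit) {
	shouldKillAll, err := shouldKillAllOnExit(s.bundle)
	if err != nil {
		log.G(s.context).WithError(err).Error("failed to check shouldKillAll")
	}
	for _, p := range s.allProcesses() {
		if p.Pid() != e.Pid {
			continue
		}
		if shouldKillAll {
			if ip, ok := p.(*proc.Init); ok {
				// Ensure all children are killed when init exits.
				if err := ip.KillAll(s.context); err != nil {
					log.G(s.context).WithError(err).WithField("id", ip.ID()).
						Error("failed to kill init's children")
				}
			}
		}
		p.SetExited(e.Status)
	}
}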

@fuweid
Member

fuweid commented Feb 27, 2020

It seems runtime/v2/runc/v1/service.go doesn't have this change.

Should the same change be made there?

@tedyu we are missing that part. Could you help to file a PR to handle this? Thanks!

@thaJeztah
Member

^^ PR was opened here: #4063 (and merged already); thanks @tedyu!
