Conversation

@laurazard
Member

@laurazard laurazard commented Aug 18, 2024

This PR moves the init exit processing in s.handleStarted out of the if exited block – regardless of whether the exec was an early exit or not, if it was the last pending exec we should process its exit. We've already removed c from s.pendingExecs, and this being the last pending exec means that after this we'll never process the exit if it doesn't happen here.

This is important to ensure correctness in the shared-pid namespace scenario (together with the changes from #10634) – with these changes we:

  • prevent new execs from being started after the init process of the container has been reaped, and
  • ensure that even if the last exec succeeds (as it may in the shared-pid namespace scenario) we don't drop the init exit (see the sketch below).
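To make the intended control flow concrete, here is a minimal, self-contained sketch of the idea; the names (containerState, handleStartedExec, deferredInitExit) are hypothetical stand-ins rather than the shim's actual types. The point is only that a deferred init exit gets emitted once the last pending exec has been handled, whether or not that exec exited early.

```go
package main

import "fmt"

type processExit struct {
	pid    int
	status int
}

type containerState struct {
	pendingExecs     int          // execs started but not yet fully handled
	deferredInitExit *processExit // init exit held back while execs were pending
}

// handleStartedExec models the relevant part of handling a started exec.
func handleStartedExec(c *containerState, execExitedEarly bool) {
	c.pendingExecs--

	if execExitedEarly {
		// ... handle the exec's own early exit here ...
		fmt.Println("exec exited early; handling its exit")
	}

	// Previously this ran only in the early-exit branch above, so a successful
	// last exec (possible with a shared PID namespace) dropped the init exit.
	// Here it runs unconditionally once no execs remain pending.
	if c.pendingExecs == 0 && c.deferredInitExit != nil {
		fmt.Printf("emitting deferred init exit: %+v\n", *c.deferredInitExit)
		c.deferredInitExit = nil
	}
}

func main() {
	c := &containerState{
		pendingExecs:     1,
		deferredInitExit: &processExit{pid: 42, status: 137},
	}
	// Even though the last pending exec did not exit early,
	// the container's init exit still gets emitted.
	handleStartedExec(c, false)
}
```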

@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ningmingxiao
Contributor

ningmingxiao commented Aug 18, 2024

An exec process may have child processes. If the parent process exits, PID 1 becomes the parent of those children, and they may not be killed.
Also, with this change in behavior, we can't exec into a container once the init process has exited.

@laurazard
Member Author

Sorry, I don't quite understand what you mean. This PR doesn't change anything about which processes are killed, just when. We have a check – runc.ShouldKillAllOnExit, which checks whether the container has a private PID namespace. This is the same as before.

However, this is only called when we're about to send the exit event for the init process – which is now delayed if there are pending execs. We shouldn't delay this – we can run this logic as soon as we receive the exit for the init process, regardless of whether we're going to delay sending that exit or not.
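As a rough sketch of that ordering (again with hypothetical stand-ins, initProcess/handleInitExit, not the shim's real code): the kill-all decision runs as soon as the init exit is received, and only the exit event itself may be deferred while execs are pending.

```go
package main

import "fmt"

// initProcess and killAll are placeholders for the shim's init process handle
// and its KillAll call; privatePidNamespace stands in for the check that
// runc.ShouldKillAllOnExit performs.
type initProcess struct{ pid int }

func (p *initProcess) killAll() {
	fmt.Printf("killing remaining processes in pid %d's namespace\n", p.pid)
}

// handleInitExit runs the kill-all logic immediately on receiving the init
// exit; only the exit *event* is deferred while execs are still pending.
func handleInitExit(p *initProcess, privatePidNamespace bool, pendingExecs int) (deferEvent bool) {
	if privatePidNamespace {
		p.killAll()
	}
	return pendingExecs > 0
}

func main() {
	p := &initProcess{pid: 42}
	if handleInitExit(p, true, 2) {
		fmt.Println("init exit event deferred until pending execs are handled")
	}
}
```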

@laurazard laurazard changed the base branch from release/1.7 to main August 18, 2024 21:33
@laurazard laurazard force-pushed the runc-shim-missing-init-exit-fix branch from 8db9224 to 1c4fd15 Compare August 18, 2024 21:36
@laurazard laurazard force-pushed the runc-shim-missing-init-exit-fix branch from 1c4fd15 to 47f0829 Compare August 18, 2024 22:07
@ningmingxiao
Contributor

ningmingxiao commented Aug 19, 2024

> Sorry, I don't quite understand what you mean. This PR doesn't change anything about which processes are killed, just when. We have a check – runc.ShouldKillAllOnExit, which checks whether the container has a private PID namespace. This is the same as before.
>
> However, this is only called when we're about to send the exit event for the init process – which is now delayed if there are pending execs. We shouldn't delay this – we can run this logic as soon as we receive the exit for the init process, regardless of whether we're going to delay sending that exit or not.

crictl exec -it containerdA sh
sleep 1000&

sleep will not be killed by ip.KillAll

@laurazard
Member Author

laurazard commented Aug 19, 2024

That might be true, but that's not behavior introduced by this PR.

Also, wouldn't sleep get reparented in this case?

@laurazard laurazard force-pushed the runc-shim-missing-init-exit-fix branch from 47f0829 to 1c8283a Compare August 19, 2024 13:03
@ningmingxiao
Contributor

ningmingxiao commented Aug 19, 2024

I tried this PR, and the issue still exists.

[root@LIN-FB738BFD367 pods]# cat container.json 
{
  "metadata": {
    "name": "gpu"
  },
  "image": {
    "image": "docker.io/library/busybox:1.28"
  },
  "command": [
    "top"
  ],
  "tty": true,
  "linux": {
  }
}
[root@LIN-FB738BFD367 pods]# cat pod.json 
{
  "metadata": {
    "name": "sandbox-01027951909",
    "namespace": "default",
    "attempt": 1,
    "uid": "hdishd83djaidwnduwk28bcsb"
  },
  "log_directory": "/tmp",
  "linux": {
  }
}

main.go:

package main

import (
        "fmt"
        "os/exec"
        "sync"
        "time"
)

func main() {
        go func() {
                time.Sleep(time.Second*1)
                runCmd("kill -9 `pidof top`")
        }()
        test()
        fmt.Println("done")
}

func runCmd(command string) {
        cmd := exec.Command("bash", "-c", command)
        err := cmd.Run()
        if err != nil {
                fmt.Println(err)
        }
}

func test() {
        var wg sync.WaitGroup
        const numContainers = 100
        for i := 1; i <= numContainers; i++ {
                wg.Add(1)
                go func(i int) {
                        defer wg.Done()
                        runCmd("crictl exec 7d307  date")
                }(i)
        }
        wg.Wait()
}
[root@LIN-FB738BFD367 pods]# crictl  run --no-pull container.json  pod.json
7d307a9dea00e40dc726aaeac172436ab0aeb8652bd407f804948713df296e35
[root@LIN-FB738BFD367 pods]# go run main.go
[root@LIN-FB738BFD367 pods]# crictl  ps
CONTAINER           IMAGE                            CREATED             STATE               NAME                ATTEMPT             POD ID              POD
7d307a9dea00e       docker.io/library/busybox:1.28   3 minutes ago       Running             gpu                 0                   d4c701491262a

top has exited:
root     1322712  0.1  0.0 1241848 16184 pts/0   Sl   21:36   0:00 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id d4c701491262aa245be999b
root     1322744  0.0  0.0   1024     4 ?        Ss   21:36   0:00  \_ /pause

For k8s, just set shareProcessNamespace: true (share the pause pid ns) in the pod yaml (if we don't use crictl to create pods),
and running main.go can easily reproduce it.

@laurazard
Member Author

As I mentioned before, I unfortunately can't repro – could you share the containerd logs you get when you reproduce the issue? If possible, with some extra logging such as in 511ed30.

@laurazard
Member Author

laurazard commented Aug 19, 2024

> For k8s, just set shareProcessNamespace: true (share the pause pid ns) in the pod yaml (if we don't use crictl to create pods),
> and running main.go can easily reproduce it.

If this issue occurs when sharing the PID namespace, then we have no guarantee that by the time we receive the init's exit, we have received (or will soon receive) the execs' exits. In the meantime, I think (need to confirm) that other execs could "start" execution/increment s.pendingExecs, meaning we would continue delaying emitting the init exit up until a point where none of the current pending exec exit subscribers still contain the init exit to emit.

I don't think we guaranteed ordering when sharing a PID namespace before all the recent changes. Maybe we shouldn't skip processing the init exit in that case.
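A toy illustration of that concern, with hypothetical counters rather than the shim's real bookkeeping: if new execs keep being admitted after the init exit arrives, a rule that only emits the deferred init exit when the pending-exec count reaches zero may never fire.

```go
package main

import "fmt"

func main() {
	pendingExecs := 1      // stand-in for s.pendingExecs
	initExitPending := true // init exit received but not yet emitted

	for round := 0; round < 3; round++ {
		pendingExecs++ // a new exec starts before the previous one is handled
		pendingExecs-- // ...and an exec exit is handled
		if initExitPending && pendingExecs == 0 {
			fmt.Println("deferred init exit emitted")
			initExitPending = false
		}
	}
	if initExitPending {
		fmt.Println("init exit still deferred after all rounds")
	}
}
```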

@laurazard laurazard closed this Aug 23, 2024
@laurazard laurazard reopened this Aug 23, 2024
@laurazard laurazard force-pushed the runc-shim-missing-init-exit-fix branch from a1751c7 to 1c8283a Compare August 23, 2024 21:46
@laurazard laurazard marked this pull request as ready for review August 23, 2024 21:46
@dosubot dosubot bot added the area/runtime Runtime label Aug 23, 2024
If a container isn't running in its own pid namespace/its init process
is not the init process of the pid namespace, it's possible for an
exec started just before the init process exited/was reaped to start
successfully/not exit early.

If we previously skipped sending an init process's exit due to pending
execs, we need to emit it regardless of whether the pending exec was an
early exit or not, otherwise this exit won't get emitted at all.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
@laurazard laurazard force-pushed the runc-shim-missing-init-exit-fix branch from 1c8283a to 8543b69 Compare August 23, 2024 21:51
@laurazard laurazard self-assigned this Aug 23, 2024
@laurazard laurazard requested a review from dmcgowan August 23, 2024 21:53
@laurazard
Member Author

Superseded by #10647

@laurazard laurazard closed this Aug 27, 2024