
runc-shim: only defer init process exits #9999

Merged
estesp merged 1 commit into containerd:main from laurazard:fix-exec-concurrent-shim
Apr 5, 2024
Conversation

@laurazard (Member)

What this fixes:

In order to make sure that we don't publish task exit events for init processes before we do for execs in that container, we added logic to processExits in 892dc54 to skip these and let the pending exec's handleStarted closure process them.

However, the conditional logic added to processExits was faulty: we should only defer processing of exit events for init processes, not for other execs. Due to this missing condition,
892dc54 introduced a bug where, if there are many concurrent execs for the same container/init PID, exec exits are skipped and never published, resulting in hanging clients.

repro + details
ctrid=$(docker run -d busybox top)
docker exec $ctrid cat /proc/1/cgroup
docker exec $ctrid cat /proc/self/cgroup &
docker exec $ctrid cat /proc/self/cgroup &
docker exec $ctrid cat /proc/self/cgroup &
docker exec $ctrid cat /proc/self/cgroup &
docker exec $ctrid cat /proc/self/cgroup &
docker exec $ctrid cat /proc/self/cgroup &
wait

This hangs without this fix, but works before 892dc54.

Instrumenting with some logs, we see this issue:

diff --git a/cmd/containerd-shim-runc-v2/task/service.go b/cmd/containerd-shim-runc-v2/task/service.go
index 666e03603..506cbfe1a 100644
--- a/cmd/containerd-shim-runc-v2/task/service.go
+++ b/cmd/containerd-shim-runc-v2/task/service.go
@@ -210,11 +210,13 @@ func (s *service) preStart(c *runc.Container) (handleStarted func(*runc.Containe
 				}
 			}
 		}
+		log.G(s.context).Info("handleStarted ", pid)
 
 		ees, exited := exits[pid]
 		delete(s.exitSubscribers, &exits)
 		exits = nil
 		if pid == 0 || exited {
+			log.G(s.context).Info("early exit ", pid)
 			s.lifecycleMu.Unlock()
 			for _, ee := range ees {
 				s.handleProcessExit(ee, c, p)
@@ -225,6 +227,7 @@ func (s *service) preStart(c *runc.Container) (handleStarted func(*runc.Containe
 				}
 			}
 		} else {
+			log.G(s.context).Info("running ", pid)
 			// Process start was successful, add to `s.running`.
 			s.running[pid] = append(s.running[pid], containerProcess{
 				Container: c,
@@ -683,6 +686,7 @@ func (s *service) processExits() {
 				// order to ensure order between execs and the init process for a given
 				// container, skip processing this exit here and let the `handleStarted`
 				// closure for the pending exec publish it.
+				log.G(s.context).Info("skipping ", e.Pid)
 				skipped = append(skipped, cp)
 			} else {
 				cps = append(cps, cp)
TRAC[2024-03-26T13:27:36.805958001Z] event forwarded                               ns=moby topic=/tasks/exec-added type=containerd.events.TaskExecAdded
TRAC[2024-03-26T13:27:36.849531825Z] event forwarded                               ns=moby topic=/tasks/exec-started type=containerd.events.TaskExecStarted
time="2024-03-26T13:27:36.848788403Z" level=info msg="taskExecStarted 625d077662a3fac929e0ecb715c2b8e69468639ee46752fc043de071076a62ba pid 391509" runtime=io.containerd.runc.v2
time="2024-03-26T13:27:36.849284740Z" level=info msg="handleStarted 391509" runtime=io.containerd.runc.v2
time="2024-03-26T13:27:36.849291532Z" level=info msg="running 391509" runtime=io.containerd.runc.v2
time="2024-03-26T13:27:36.850655960Z" level=info msg="skipping 391509" runtime=io.containerd.runc.v2

What I did:

This commit adds the missing logic to processExits.

In order to make sure that we don't publish task exit events for init
processes before we do for execs in that container, we added logic to
`processExits` in 892dc54 to skip these
and let the pending exec's `handleStarted` closure process them.

However, the conditional logic in `processExits` added was faulty - we
should only defer processing of exit events related to init processes,
not other execs. Due to this missing condition,
892dc54 introduced a bug where, if
there are many concurrent execs for the same container/init pid, exec
exits are skipped and then never published, resulting in hanging
clients.

This commit adds the missing logic to `processExits`.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
@laurazard (Member, Author)

I'll add some tests too 😅 sad we didn't catch this here until we did in moby.

@dmcgowan added the cherry-pick/1.6.x and cherry-pick/1.7.x labels Apr 2, 2024
@rtheis commented Apr 5, 2024

Hi folks, thanks for the fix. Please note that the problem being fixed is very serious in that it makes a big mess of Kubernetes exec probes and kubectl exec command calls. The problem is very easy to recreate by running concurrent kubectl exec commands against the same pod. When I searched, I didn't find any issue associated with this bug. Would you like me to create one? Also, some type of warning about using containerd version 1.7.14 with Kubernetes would be good too. Thanks.

@laurazard (Member, Author)

Hi @rtheis, thanks for letting us know. Feel free to create an issue – in the meantime, I'll ping people to get this reviewed+merged and open backports.

I'll leave tests out for another PR so we can get this merged quicker.

@laurazard laurazard requested a review from a team April 5, 2024 11:30
@laurazard (Member, Author)

cc @dmcgowan

@rtheis commented Apr 5, 2024

#10036

 for _, cp := range s.running[e.Pid] {
-	if s.pendingExecs[cp.Container] != 0 {
+	_, init := cp.Process.(*process.Init)
+	if init && s.pendingExecs[cp.Container] != 0 {
Member

Can we have a test?

Member Author

Yes, I have a patch locally, I'll clean it up and open a PR with the tests.

@AkihiroSuda AkihiroSuda added this pull request to the merge queue Apr 5, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 5, 2024
@estesp estesp added this pull request to the merge queue Apr 5, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 5, 2024
@estesp estesp merged commit ac8f769 into containerd:main Apr 5, 2024
@estesp added the cherry-picked/1.6.x and cherry-picked/1.7.x labels and removed the cherry-pick/1.6.x and cherry-pick/1.7.x labels Apr 5, 2024
@dims (Member) commented Apr 5, 2024

xref: bottlerocket-os/bottlerocket#3867


Labels

area/runtime · cherry-picked/1.6.x · cherry-picked/1.7.x · kind/bug · size/XS

8 participants