Skip to content

Conversation

@laurazard
Copy link
Member

@laurazard laurazard commented Aug 28, 2024

Based off of #10647 and #10634 (and with inspiration from #10650), replaced s.pendingExecs with s.runningExecs. The logic is changed to keep track of running execs instead of pending execs, and handling a pending/prestarted exec the same as a running exec, which significantly simplifies the shim exit processing logic.

closes #10589


This PR rewrites and simplifies a lot of the shim exit processing logic to reduce it's complexity, and also handle the case where the container doesn't have it's own pid-namespace, which means that we're not guaranteed to receive the init exit last.

This is achieved by replacing s.pendingExecs with s.runningExecs, for which both (previously) pending and de facto running execs are considered.

The new exit handling logic can be summed up by:

  • when we receive an init exit, stash it it in s.containerInitExit, and if a container's init process has exited, refuse new execs.
  • (if the container does not have it's own pidns) kill all running processes (if the container has a private pid-namespace, then all processes will be dead already).
  • wait for the container's running exec count (which includes execs which have been started but might still early exit) to get to 0.
  • publish the stashed away init exit.

The issue with init exits being dropped if new execs keep coming in is resolved by refusing new execs after the container's init process has exited.

Big thanks to both @corhere for the guidance and discussion and @samuelkarp for the input, discussion and repro in #10649.

Copy link
Member

@corhere corhere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it!

@laurazard laurazard force-pushed the shim-refactor-without-pending branch from 9ab1a15 to 7e07ecb Compare August 28, 2024 16:55
@laurazard laurazard force-pushed the shim-refactor-without-pending branch 2 times, most recently from 2c290ed to 655cb60 Compare August 28, 2024 17:09
@laurazard laurazard changed the title WIP: refactor runc-shim to remove s.pendingExecs runc-shim: fix races/prevent init exits from being dropped Aug 28, 2024
@laurazard laurazard force-pushed the shim-refactor-without-pending branch from 655cb60 to d3629b0 Compare August 29, 2024 09:25
@laurazard
Copy link
Member Author

/cc @samuelkarp @dmcgowan

It's not true that `s.mu` needs to be held when calling
`handleProcessExit`, and indeed hasn't been the case for a
while – see 892dc54.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
The runc task state machine prevents execs from being created after the
init process has exited, but there are no guards against starting a
created exec after the init process has exited. That leaves a small
window for starting an exec to race our handling of the init process
exiting. Normally this is not an issue in practice: the kernel will
atomically kill all processes in a PID namespace when its "init" process
terminates, and will not allow new processes to fork(2) into the PID
namespace afterwards. Therefore the racing exec is guaranteed by the
kernel to not be running after the init process terminates. On the other
hand, when the container does not have a private PID namespace (i.e. the
container's init process is not the "init" process of the container's
PID namespace), the kernel does not automatically kill other container
processes on init exit and will happily allow runc to start an exec
process at any time. It is the runc shim's responsibility to clean up
the container when the init process exits in this situation by killing
all the container's remaining processes. Block execs from being started
after the container's init process has exited to prevent the processes
from leaking, and to avoid violating the task service's assumption that
an exec can be running iff the init process is also running.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@laurazard laurazard force-pushed the shim-refactor-without-pending branch 2 times, most recently from 74a7288 to 957a2b7 Compare September 2, 2024 10:27
@dims
Copy link
Member

dims commented Sep 4, 2024

LGTM, looks like @cpuguy83 has a few more comments

@ningmingxiao
Copy link
Contributor

ningmingxiao commented Sep 5, 2024

func (e *execProcess) start(ctx context.Context) (err error) {
	// The reaper may receive exit signal right after
	// the container is started, before the e.pid is updated.
	// In that case, we want to block the signal handler to
	// access e.pid until it is updated.
	e.pid.Lock()
	defer e.pid.Unlock()

	var (
		socket  *runc.Socket
		pio     *processIO
		pidFile = newExecPidFile(e.path, e.id)
	)
	if e.stdio.Terminal {
		if socket, err = runc.NewTempConsoleSocket(); err != nil {
			return fmt.Errorf("failed to create runc console socket: %w", err)
		}
		defer socket.Close()
	} else {
		if pio, err = createIO(ctx, e.id, e.parent.IoUID, e.parent.IoGID, e.stdio); err != nil {
			return fmt.Errorf("failed to create init process I/O: %w", err)
		}
		e.io = pio
	}
	opts := &runc.ExecOpts{
		PidFile: pidFile.Path(),
		Detach:  true,
	}
	if pio != nil {
		opts.IO = pio.IO()
	}
	if socket != nil {
		opts.ConsoleSocket = socket
	}
	if err := e.parent.runtime.Exec(ctx, e.parent.id, e.spec, opts); err != nil {
		close(e.waitBlock)
		return e.parent.runtimeError(err, "OCI runtime exec failed")
	}
	if e.stdio.Stdin != "" {
		if err := e.openStdin(e.stdio.Stdin); err != nil {
			return err
		}
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	if socket != nil {
		console, err := socket.ReceiveMaster()
		if err != nil {
			return fmt.Errorf("failed to retrieve console master: %w", err)
		}
		if e.console, err = e.parent.Platform.CopyConsole(ctx, console, e.id, e.stdio.Stdin, e.stdio.Stdout, e.stdio.Stderr, &e.wg); err != nil {
			return fmt.Errorf("failed to start console copy: %w", err)
		}
	} else {
		if err := pio.Copy(ctx, &e.wg); err != nil {
			return fmt.Errorf("failed to start io pipe copy: %w", err)
		}
	}
	pid, err := pidFile.Read()
	if err != nil {
		return fmt.Errorf("failed to retrieve OCI runtime exec pi: %wd", err)
	}
	e.pid.pid = pid
	return nil
}

line 297 s.runningExecs[container]--
after e.parent.runtime.Exec is called (if e.parent.runtime.Exec is ok, runc exec ok) but after "e.parent.runtime.Exec" may generate other error, s.runningExecs[container] may reduce to wrong value. maybe we should record exec process pid.

Copy link
Member

@dims dims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@samuelkarp samuelkarp added this pull request to the merge queue Sep 5, 2024
@samuelkarp
Copy link
Member

/cherrypick release/1.7

@k8s-infra-cherrypick-robot

@samuelkarp: once the present PR merges, I will cherry-pick it on top of release/1.7 in a new PR and assign it to you.

Details

In response to this:

/cherrypick release/1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 5, 2024
@samuelkarp samuelkarp added this pull request to the merge queue Sep 5, 2024
@dims
Copy link
Member

dims commented Sep 5, 2024

xref: #10603

Merged via the queue into containerd:main with commit c8b095f Sep 5, 2024
@k8s-infra-cherrypick-robot

@samuelkarp: #10651 failed to apply on top of branch "release/1.7":

Applying: runc-shim: remove misleading comment
Using index info to reconstruct a base tree...
A	cmd/containerd-shim-runc-v2/task/service.go
Falling back to patching base and 3-way merge...
Auto-merging runtime/v2/runc/task/service.go
Applying: runc-shim: refuse to start execs after init exits
Using index info to reconstruct a base tree...
A	cmd/containerd-shim-runc-v2/task/service.go
Falling back to patching base and 3-way merge...
Auto-merging runtime/v2/runc/task/service.go
Applying: runc-shim: handle pending execs as running
Using index info to reconstruct a base tree...
A	cmd/containerd-shim-runc-v2/task/service.go
Falling back to patching base and 3-way merge...
Auto-merging runtime/v2/runc/task/service.go
CONFLICT (content): Merge conflict in runtime/v2/runc/task/service.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0003 runc-shim: handle pending execs as running
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick release/1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@laurazard
Copy link
Member Author

Opened backports: #10676 and #10675.

@samuelkarp samuelkarp added cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch and removed cherry-pick/1.6.x cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Sep 6, 2024
laurazard added a commit that referenced this pull request Mar 24, 2025
These changes match the changes in
#10651.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/runtime Runtime cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch kind/bug ok-to-test size/L

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

supposedly running container actually dead

9 participants