Fix netns leak on container creation and exit code 1 on SIGTERM. by Luap99 · Pull Request #24082 · containers/podman

Luap99 · 2024-09-26T14:28:38Z

libpod: ensure we are not killed during netns creation

When we are killed during netns setup it will leak the netns path as it
was not committed in the db. This is rather common if you run systemctl
stop on a podman systemd unit. Of course we cannot protect against
SIGKILL but in the systemd case we get SIGTERM and we really should not
exit in a critical section like this.

Fixes #24044
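
To illustrate the idea behind this commit, here is a minimal, self-contained sketch of how a critical section can hold off a signal-triggered exit. It is not the libpod code: the `inhibitor` type, `waitForCriticalSections`, and `simulate` are hypothetical names, and the read/write lock arrangement is an assumption modelled on the `shutdown.Inhibit()` call and the `shutdownInhibit` lock mentioned in the review below.

```go
package main

import (
	"fmt"
	"sync"
)

// inhibitor is a hypothetical stand-in for the shutdown package's state.
type inhibitor struct {
	mu sync.RWMutex
}

// Inhibit marks the start of a critical section. Several sections may
// hold the read lock at the same time.
func (i *inhibitor) Inhibit() { i.mu.RLock() }

// Uninhibit marks the end of a critical section.
func (i *inhibitor) Uninhibit() { i.mu.RUnlock() }

// waitForCriticalSections is what a signal handler would call before
// exiting: the write lock is only granted once no section is running.
func (i *inhibitor) waitForCriticalSections() {
	i.mu.Lock()
	defer i.mu.Unlock()
}

// simulate runs a fake netns setup concurrently with a fake SIGTERM and
// returns the order in which the steps completed.
func simulate() []string {
	var (
		inh    inhibitor
		logMu  sync.Mutex
		events []string
	)
	record := func(e string) {
		logMu.Lock()
		defer logMu.Unlock()
		events = append(events, e)
	}

	inSection := make(chan struct{})
	go func() {
		inh.Inhibit()
		defer inh.Uninhibit()
		close(inSection) // the simulated signal may arrive from here on
		record("netns created")
		record("netns path committed to db")
	}()

	<-inSection                   // simulated SIGTERM arrives mid-setup
	inh.waitForCriticalSections() // blocks until Uninhibit has run
	record("exit")
	return events
}

func main() {
	for _, e := range simulate() {
		fmt.Println(e)
	}
}
```

In this sketch the exit step can only be recorded after the netns path has been committed, which is the property the fix relies on. SIGKILL cannot be intercepted, so nothing of this kind helps there.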

libpod: rework shutdown handler flow

Currently podman run -d can exit 0 if we send SIGTERM during startup
even though the container was never started. That just doesn't make any
sense and is horribly confusing for an external job manager like systemd.

The original motivation was to exit 0 for the podman.service in commit
ca7376b. That does make sense but it should only do so for the
service and only if the server did indeed shut down gracefully.

So we rework how the exit logic works: the handler no longer performs
the exit. Instead the shutdown package exits after all handlers have
run, which solves the ordering issue. We then default to exit code 1
like we did before and allow the service exit handler to overwrite the
exit code with 0 in case of a graceful shutdown.
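
The reworked flow described above can be sketched roughly as follows. This is a simplified model, not the libpod API: `shutdownState`, `Register`, `SetExitCode`, and `runHandlers` are hypothetical names with simplified signatures, handlers run in registration order purely for illustration, and the function returns the exit code instead of calling `os.Exit` so that the sketch stays testable.

```go
package main

import "fmt"

// shutdownState is a hypothetical, simplified model of the shutdown package.
type shutdownState struct {
	order    []string
	handlers map[string]func() error
	exitCode int
}

// newShutdownState starts with exit code 1, the default after a signal.
func newShutdownState() *shutdownState {
	return &shutdownState{handlers: map[string]func() error{}, exitCode: 1}
}

// Register adds a named handler; a duplicate name is ignored.
func (s *shutdownState) Register(name string, h func() error) {
	if _, ok := s.handlers[name]; ok {
		return
	}
	s.order = append(s.order, name)
	s.handlers[name] = h
}

// SetExitCode lets a handler override the default exit code.
func (s *shutdownState) SetExitCode(code int) { s.exitCode = code }

// runHandlers runs every handler and only then returns the exit code.
// No handler exits on its own, so all of them are guaranteed to run.
func (s *shutdownState) runHandlers() int {
	for _, name := range s.order {
		if err := s.handlers[name](); err != nil {
			fmt.Printf("handler %s failed: %v\n", name, err)
		}
	}
	return s.exitCode
}

func main() {
	// A plain command: no handler touches the exit code.
	run := newShutdownState()
	run.Register("cleanup", func() error { return nil })
	fmt.Println("podman run -d exit code:", run.runHandlers())

	// The service: its handler reports a graceful shutdown.
	svc := newShutdownState()
	svc.Register("service", func() error {
		svc.SetExitCode(0)
		return nil
	})
	fmt.Println("service exit code:", svc.runHandlers())
}
```

Keeping the exit in one place means a handler that wants a different exit code only has to record it; it no longer needs to be the last handler to run.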

libpod: remove shutdown.Unregister()

It is never used or needed so let's just remove some dead code.

Does this PR introduce a user-facing change?

Podman no longer exits 0 on SIGTERM by default.
Fixed a race that could cause podman to leak netns files when it was interrupted during the netns creation.

Luap99 · 2024-09-26T14:28:51Z

@mheon @edsantiago PTAL

openshift-ci · 2024-09-26T14:29:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

Needs approval from an approver in each of these files:

~~OWNERS~~ [Luap99]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

packit-as-a-service · 2024-09-26T14:31:57Z

Ephemeral COPR build failed. @containers/packit-build please check.

edsantiago · 2024-09-26T15:14:10Z

LGTM, and I can no longer reproduce the netns leak

mheon · 2024-09-26T17:26:59Z

libpod/shutdown/handler.go

 			}
 			handlerLock.Unlock()
 			shutdownInhibit.Unlock()
+			os.Exit(exitCode)


Wouldn't this prevent a lot of defer functions from running? I know I avoided it deliberately when I wrote this.

Luap99 · 2024-09-26

I mean yes, but if we exit, what do we want to defer here? And most importantly, the current handler also just exited, so there should not be any functional difference, I would say.

mheon · 2024-09-26

IIRC we have a bunch of things running in defer that do things like removing files to clean up after ourselves.

Luap99 · 2024-09-26

Sure, but this PR doesn't change the behavior here. Critical sections where we cannot leak need to use shutdown.Inhibit(), which is how I fixed the issue with the netns leak.

baude · 2024-09-30T15:48:58Z

LGTM, but deferring the merge to @mheon given his questions ...

edsantiago · 2024-10-01T20:04:54Z

Saw the netns leak flake today, almost wept in despair, then remembered that this PR hasn't merged yet.

mheon · 2024-10-01T20:58:04Z

/lgtm
I am hesitant but I cannot remember why we didn't want to use exit so I'll merge and hopefully we don't find out the hard way later

Luap99 · 2024-10-02T08:30:44Z

> I am hesitant but I cannot remember why we didn't want to use exit so I'll merge and hopefully we don't find out the hard way later

The old code did use exit(), so there is no functional change in that regard.

mheon · 2024-10-02T12:10:30Z

LGTM

Luap99 added 3 commits September 26, 2024 15:39

libpod: remove shutdown.Unregister()

5de7b7c

It is never used or needed so let's just remove some dead code. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

openshift-ci bot added the release-note label Sep 26, 2024

github-actions bot added the kind/api-change Change to remote API; merits scrutiny label Sep 26, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2024

Luap99 mentioned this pull request Sep 26, 2024

CI: system tests: netns leak #24044

Closed

Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Sep 26, 2024

edsantiago changed the title ~~Fix netns leak on contianer creation and exit code 0 on SIGTERM.~~ Fix netns leak on container creation and exit code 0 on SIGTERM. Sep 26, 2024

mheon reviewed Sep 26, 2024

Luap99 changed the title ~~Fix netns leak on container creation and exit code 0 on SIGTERM.~~ Fix netns leak on container creation and exit code 1 on SIGTERM. Sep 26, 2024

openshift-ci bot assigned mheon Oct 1, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2024

openshift-merge-bot bot merged commit 857a47d into containers:main Oct 1, 2024

Luap99 deleted the netns-leak branch October 2, 2024 08:29

stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jan 1, 2025

stale-locking-app bot locked as resolved and limited conversation to collaborators Jan 1, 2025
