fix network cleanup flake in play kube #23457

Merged
mheon merged 1 commit into containers:main from Luap99:play-kube-cleanup
Jul 31, 2024
Conversation

@Luap99
Member

@Luap99 Luap99 commented Jul 31, 2024

When using service containers and play kube we create a complicated set of dependencies.

First, in a pod all conmon/container cgroups are part of one slice; that slice will be removed when the entire pod is stopped, resulting in systemd killing all processes that were part of it.

Now the issue here is around the workings of stopPodIfNeeded() and stopIfOnlyInfraRemains(): once a container is cleaned up, these check whether the pod should be stopped, depending on the pod ExitPolicy. If so, they stop all containers in that pod. However, in our flaky test we called podman pod kill, which logically killed all containers already. The cleanup logic therefore thinks it must stop the pod and calls into pod.stopWithTimeout(). There we try to stop each container, but because all containers are already stopped it just throws errors and never gets to the point where it would call Cleanup(). So the code does not do the cleanup and eventually calls removePodCgroup(), which causes all conmon and other podman cleanup processes of this pod to be killed.
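
The failure mode above can be sketched in a few lines of Go. This is a deliberately simplified, hypothetical model (the types and function names are illustrative, not podman's real API): once every container is already stopped, the stop loop errors out before cleanup is ever reached.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyStopped models the error returned when stopping a
// container that `podman pod kill` already terminated.
var errAlreadyStopped = errors.New("container is already stopped")

type ctr struct {
	running   bool
	cleanedUp bool
}

func (c *ctr) stop() error {
	if !c.running {
		return errAlreadyStopped
	}
	c.running = false
	return nil
}

func (c *ctr) cleanup() { c.cleanedUp = true }

// stopPod models the buggy sequence: a stop error aborts the loop
// before cleanup() is reached, so cleanedUp stays false for every
// container, yet the caller later removes the pod cgroup anyway.
func stopPod(ctrs []*ctr) error {
	for _, c := range ctrs {
		if err := c.stop(); err != nil {
			return fmt.Errorf("stopping container: %w", err)
		}
		c.cleanup()
	}
	return nil
}

func main() {
	// The pod was already killed: no container is running.
	pod := []*ctr{{running: false}, {running: false}}
	err := stopPod(pod)
	fmt.Println("stop error:", err != nil)
	fmt.Println("cleanup ran:", pod[0].cleanedUp)
}
```

Running this prints `stop error: true` and `cleanup ran: false`, which is the partial-cleanup state the description refers to.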

Thus the podman container cleanup process was likely killed while actually trying to do the proper cleanup, which leaves us in a bad state.

Subsequent commands such as podman pod rm will try the cleanup again, as they see it was not completed, but then fail as they are unable to recover from the partial cleanup state.

Long term, network cleanup needs to be more robust and ideally should be idempotent, to handle cases where cleanup was killed in the middle.
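
A minimal sketch of what "idempotent" could mean here, assuming a hypothetical ref-counted network namespace (the `netnsRef` type and its fields are invented for illustration; see also the linked issue about the rootless netns counter dropping to -1): releasing never drives the counter negative, and a retried release after a killed cleanup process is harmless.

```go
package main

import "fmt"

// netnsRef is a hypothetical ref-counted network namespace handle.
type netnsRef struct {
	count    int
	tornDown bool
}

// release decrements the counter but clamps it at zero, and tears the
// network down exactly once. Calling it again (e.g. a retry after an
// interrupted cleanup) is a safe no-op.
func (n *netnsRef) release() {
	if n.count > 0 {
		n.count--
	}
	if n.count == 0 && !n.tornDown {
		n.tornDown = true
	}
}

func main() {
	ns := &netnsRef{count: 1}
	ns.release()
	ns.release() // retry after a killed cleanup process: still safe
	fmt.Println(ns.count, ns.tornDown)
}
```

This prints `0 true`: the counter never goes negative and teardown runs exactly once, regardless of how many times cleanup is retried.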

Fixes #21569

Does this PR introduce a user-facing change?

Fixed a race which could lead to improper network cleanup when using podman play kube.

When using service containers and play kube we create a complicated set
of dependencies.

First, in a pod all conmon/container cgroups are part of one slice;
that slice will be removed when the entire pod is stopped, resulting in
systemd killing all processes that were part of it.

Now the issue here is around the workings of stopPodIfNeeded() and
stopIfOnlyInfraRemains(): once a container is cleaned up, these check
whether the pod should be stopped, depending on the pod ExitPolicy. If
so, they stop all containers in that pod. However, in our flaky test we
called podman pod kill, which logically killed all containers already.
The cleanup logic therefore thinks it must stop the pod and calls into
pod.stopWithTimeout(). There we try to stop each container, but because
all containers are already stopped it just throws errors and never gets
to the point where it would call Cleanup(). So the code does not do the
cleanup and eventually calls removePodCgroup(), which causes all conmon
and other podman cleanup processes of this pod to be killed.

Thus the podman container cleanup process was likely killed while
actually trying to do the proper cleanup, which leaves us in a bad
state.

Subsequent commands such as podman pod rm will try the cleanup again,
as they see it was not completed, but then fail as they are unable to
recover from the partial cleanup state.

Long term, network cleanup needs to be more robust and ideally should
be idempotent, to handle cases where cleanup was killed in the middle.

Fixes containers#21569

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99 Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Jul 31, 2024
@openshift-ci
Contributor

openshift-ci bot commented Jul 31, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2024
@mheon
Member

mheon commented Jul 31, 2024

LGTM

@edsantiago
Member

LGTM, and seems to be passing in CI (#17831, #23275)

@edsantiago edsantiago added the 5.2 label Jul 31, 2024
@edsantiago
Member

Maybe a stupid question, but is it possible that this PR makes the int-test flake worse? My (unscientific) perception is that 17831 is seeing the "unable to clean network" flake a lot more: debian, f40, and rawhide

@Luap99
Member Author

Luap99 commented Jul 31, 2024

Maybe a stupid question, but is it possible that this PR makes the int-test flake worse? My (unscientific) perception is that 17831 is seeing the "unable to clean network" flake a lot more: debian, f40, and rawhide

I don't know. It is possible that callers that previously didn't call cleanup() due to the failing stop call now correctly call cleanup(), so the error might appear in more places. But that should only happen on code paths that use pods; if no pods are used this code should never be called, and your logs show normal containers without any pods failing too, so I don't see how this would be related.

@mheon
Member

mheon commented Jul 31, 2024

I'm going to merge to get this ready for the release.
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 31, 2024
@mheon
Member

mheon commented Jul 31, 2024

The mergebot appears to be broken?

@mheon mheon merged commit 96eb637 into containers:main Jul 31, 2024
@Luap99 Luap99 deleted the play-kube-cleanup branch July 31, 2024 18:37
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Oct 30, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Oct 30, 2024

Labels

5.2 approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. No New Tests Allow PR to proceed without adding regression tests release-note


Development

Successfully merging this pull request may close these issues.

podman stop: rootless netns ref counter out of sync, counter is at -1, resetting it back to 0

3 participants