-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Fix pidfd leak in UnshareAfterEnterUserns #12167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @jfernandez. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
We tracked this down after noticing that open file descriptors by containerd (via the |
|
/ok-to-test |
|
/retest |
UnshareAfterEnterUserns() creates a pidfd via os.StartProcess() with CLONE_PIDFD but fails to close the file descriptor in any code path, resulting in a file descriptor leak for every container that uses user namespace isolation. The leak occurs because: - The pidfd is created when PidFD field is set in SysProcAttr - The original defer block only calls PidfdSendSignal() and pidfdWaitid() - No code path calls unix.Close(pidfd) to release the file descriptor This causes one pidfd leak per container launch when user namespace isolation is enabled (e.g., Kubernetes pods with hostUsers: false). In production environments with high container churn, this can exhaust the system's file descriptor limit. Fix the leak by adding a defer statement immediately after process creation that ensures unix.Close(pidfd) is always called, regardless of which code path is taken. This guarantees cleanup even if the function returns early due to errors or lack of pidfd support. This follows the same cleanup pattern already established in core/mount/mount_idmapped_utils_linux.go:getUsernsFD() which properly closes its pidfd. Closes: containerd#12166 Signed-off-by: Jose Fernandez <josef@netflix.com>
|
Hi @jfernandez thanks for the patch. The change looks good. Just have quick question: did you see |
fuweid
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
We do not see that error message in our logs. We also confirmed that our kernel supports pidfd. |
|
/retest |
dcantah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM
|
Shall we cherry-pick this to 2.1 and 2.0? |
|
/cherry-pick release/2.0 |
|
@fuweid: new pull request created: #12178 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@fuweid: new pull request created: #12179 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
UnshareAfterEnterUserns() creates a pidfd via os.StartProcess() with CLONE_PIDFD but fails to close the file descriptor in any code path, resulting in a file descriptor leak for every container that uses user namespace isolation.
The leak occurs because:
This causes one pidfd leak per container launch when user namespace isolation is enabled (e.g., Kubernetes pods with hostUsers: false). In production environments with high container churn, this can exhaust the system's file descriptor limit.
Fix the leak by adding a defer statement immediately after process creation that ensures unix.Close(pidfd) is always called, regardless of which code path is taken. This guarantees cleanup even if the function returns early due to errors or lack of pidfd support.
This follows the same cleanup pattern already established in core/mount/mount_idmapped_utils_linux.go:getUsernsFD(), which properly closes its pidfd.
Closes: #12166
Fixes: #10607