Skip to content

Conversation

@jfernandez
Copy link
Contributor

@jfernandez jfernandez commented Aug 5, 2025

UnshareAfterEnterUserns() creates a pidfd via os.StartProcess() with CLONE_PIDFD but fails to close the file descriptor in any code path, resulting in a file descriptor leak for every container that uses user namespace isolation.

The leak occurs because:

  • The pidfd is created when PidFD field is set in SysProcAttr
  • The defer block only calls PidfdSendSignal() and pidfdWaitid()
  • No code path calls unix.Close(pidfd) to release the file descriptor

This causes one pidfd leak per container launch when user namespace isolation is enabled (e.g., Kubernetes pods with hostUsers: false). In production environments with high container churn, this can exhaust the system's file descriptor limit.

Fix the leak by adding a defer statement immediately after process creation that ensures unix.Close(pidfd) is always called, regardless of which code path is taken. This guarantees cleanup even if the function returns early due to errors or lack of pidfd support.

This follows the same cleanup pattern already established in core/mount/mount_idmapped_utils_linux.go:getUsernsFD(), which properly closes its pidfd.

Closes: #12166
Fixes: #10607

@k8s-ci-robot
Copy link

Hi @jfernandez. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jfernandez
Copy link
Contributor Author

jfernandez commented Aug 5, 2025

We tracked this down after noticing that open file descriptors by containerd (via the process_open_fds metric) always incremented and were only reset during containerd restarts. We wrote a bpftrace script to show all clone syscalls from containerd processes that use the CLONE_PIDFD flag.

tracepoint:raw_syscalls:sys_enter
  /comm == "containerd" && args->id == 56/  // clone syscall
  {
      if (args->args[0] & 0x1000) {  // CLONE_PIDFD flag
          @clone_tid[tid] = 1;
          @clone_flags[tid] = args->args[0];
      }
  }

tracepoint:raw_syscalls:sys_exit
/comm == "containerd" && args->id == 56 && @clone_tid[tid]/
  {
      if (args->ret > 0) {
          time("%H:%M:%S ");
          printf("| CLONE_PIDFD | child_pid=%d flags=0x%x\n", args->ret, @clone_flags[tid]);
          printf("         | STACK | \n");
          print(ustack);
          printf("\n");
      }
      delete(@clone_tid[tid]);
      delete(@clone_flags[tid]);
  }

  END
  {
      clear(@clone_tid);
      clear(@clone_flags);
  }
20:26:42 | CLONE_PIDFD | child_pid=105077 flags=0x10001011
         | STACK | 
        syscall.rawVforkSyscall.abi0+45
        syscall.forkAndExecInChild+85
        syscall.forkExec+844
        os.startProcess+754
        os.StartProcess+84
        github.com/containerd/containerd/v2/pkg/sys.UnshareAfterEnterUserns+520
        github.com/containerd/containerd/v2/internal/cri/server.(*criService).setupNetnsWithinUserns+649
        github.com/containerd/containerd/v2/internal/cri/server.(*criService).RunPodSandbox+10220
        github.com/containerd/containerd/v2/plugins/cri.(*criGRPCServer).RunPodSandbox+49
        github.com/containerd/containerd/v2/internal/cri/instrument.(*instrumentedService).RunPodSandbox+451
        k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_RunPodSandbox_Handler.func1+206
        github.com/containerd/containerd/v2/cmd/containerd/server.New.(*ServerMetrics).UnaryServerInterceptor.UnaryServerInterceptor.func11+650
        google.golang.org/grpc.getChainUnaryHandler.func1+178
        github.com/containerd/containerd/v2/cmd/containerd/server.unaryNamespaceInterceptor+101
        google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1+133
        k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_RunPodSandbox_Handler+323
        google.golang.org/grpc.(*Server).processUnaryRPC+3627
        google.golang.org/grpc.(*Server).handleStream+3723
        google.golang.org/grpc.(*Server).serveStreams.func2.1+127
        runtime.goexit.abi0+1

@austinvazquez
Copy link
Member

/ok-to-test

@jfernandez
Copy link
Contributor Author

/retest

UnshareAfterEnterUserns() creates a pidfd via os.StartProcess() with
CLONE_PIDFD but fails to close the file descriptor in any code path,
resulting in a file descriptor leak for every container that uses user
namespace isolation.

The leak occurs because:
- The pidfd is created when PidFD field is set in SysProcAttr
- The original defer block only calls PidfdSendSignal() and
  pidfdWaitid()
- No code path calls unix.Close(pidfd) to release the file descriptor

This causes one pidfd leak per container launch when user namespace
isolation is enabled (e.g., Kubernetes pods with hostUsers: false). In
production environments with high container churn, this can exhaust the
system's file descriptor limit.

Fix the leak by adding a defer statement immediately after process
creation that ensures unix.Close(pidfd) is always called, regardless of
which code path is taken. This guarantees cleanup even if the function
returns early due to errors or lack of pidfd support.

This follows the same cleanup pattern already established in
core/mount/mount_idmapped_utils_linux.go:getUsernsFD() which properly
closes its pidfd.

Closes: containerd#12166
Signed-off-by: Jose Fernandez <josef@netflix.com>
@fuweid
Copy link
Member

fuweid commented Aug 7, 2025

Hi @jfernandez thanks for the patch. The change looks good.

Just have quick question: did you see failed to ensure the kernel supports pidfd in the log?

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fuweid fuweid added cherry-pick/2.0.x Change to be cherry picked to release/2.0 branch cherry-pick/2.1.x Change to be cherry picked to release/2.1 branch impact/changelog labels Aug 7, 2025
@fuweid fuweid enabled auto-merge August 7, 2025 03:38
@jfernandez
Copy link
Contributor Author

jfernandez commented Aug 7, 2025

Just have quick question: did you see failed to ensure the kernel supports pidfd in the log?

We do not see that error message in our logs. We also confirmed that our kernel supports pidfd.

@jfernandez
Copy link
Contributor Author

/retest

Copy link
Member

@dcantah dcantah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM

@fuweid fuweid added this pull request to the merge queue Aug 7, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to Review In Progress in Pull Request Review Aug 7, 2025
Merged via the queue into containerd:main with commit ef68642 Aug 7, 2025
120 of 122 checks passed
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in Pull Request Review Aug 7, 2025
@rata
Copy link
Contributor

rata commented Aug 7, 2025

Shall we cherry-pick this to 2.1 and 2.0?

@fuweid
Copy link
Member

fuweid commented Aug 7, 2025

/cherry-pick release/2.0
/cherry-pick release/2.1

@k8s-infra-cherrypick-robot

@fuweid: new pull request created: #12178

Details

In response to this:

/cherry-pick release/2.0
/cherry-pick release/2.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-infra-cherrypick-robot

@fuweid: new pull request created: #12179

Details

In response to this:

/cherry-pick release/2.0
/cherry-pick release/2.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@fuweid fuweid added cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch and removed cherry-pick/2.0.x Change to be cherry picked to release/2.0 branch cherry-pick/2.1.x Change to be cherry picked to release/2.1 branch labels Aug 7, 2025
@dmcgowan dmcgowan changed the title sys: fix pidfd leak in UnshareAfterEnterUserns Fix pidfd leak in UnshareAfterEnterUserns Sep 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/runtime Runtime cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch impact/changelog kind/bug ok-to-test size/XS

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

File descriptor (pidfd) leak with user namespaces

9 participants