Skip to content

Add check for CNI plugins before tearing down pod network#10744

Merged
mxpv merged 1 commit intocontainerd:mainfrom
sameersaeed:sandbox-cni-plugins
Sep 30, 2024
Merged

Add check for CNI plugins before tearing down pod network#10744
mxpv merged 1 commit intocontainerd:mainfrom
sameersaeed:sandbox-cni-plugins

Conversation

@sameersaeed
Copy link
Copy Markdown
Contributor

Currently, locking behavior occurs if a user creates a sandbox pod and then tries to remove / delete it without having any CNI plugins initialized on their system.

Sample flow

pod-config.json:, per crictl docs

{
  "metadata": {
    "name": "nginx-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "65fb5845de97d8a4"
  },
  "log_directory": "/tmp",
  "linux": {
  }
}

Running sandbox pod - receive CNI plugin error:

root@3cd5ffda91af:~# crictl runp pod-config.json 
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
E0922 01:07:22.941363     315 log.go:32] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef\": cni plugin not initialized"
FATA[0003] run pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef": cni plugin not initialized 

Pod shows up as NotReady:

root@3cd5ffda91af:~# crictl pods
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
c4c57d078fc28       270 years ago       NotReady            nginx-sandbox       default             1                   (default)

Locking behavior - cannot stop / remove pod due to CNI plugin error:

root@3cd5ffda91af:~# crictl stopp c4
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
E0922 01:07:43.024652     335 log.go:32] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to destroy network for sandbox \"c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef\": cni plugin not initialized" podSandboxID="c4"
FATA[0000] stopping the pod sandbox "c4": rpc error: code = Unknown desc = failed to destroy network for sandbox "c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef": cni plugin not initialized 
root@3cd5ffda91af:~# crictl rmp c4
WARN[0000] runtime connect using default endpoints: [unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
E0922 01:07:48.578086     345 log.go:32] "RemovePodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to forcibly stop sandbox \"c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef\": failed to destroy network for sandbox \"c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef\": cni plugin not initialized" podSandboxID="c4"
removing the pod sandbox "c4": rpc error: code = Unknown desc = failed to forcibly stop sandbox "c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef": failed to destroy network for sandbox "c4c57d078fc28ae83f4fd22167a23aa62e5ca77a09e9e1e62de72bc0135031ef": cni plugin not initialized

Changes made

Current code:

if err := c.teardownPodNetwork(ctx, sandbox); err != nil {
return fmt.Errorf("failed to destroy network for sandbox %q: %w", id, err)
}

Added a minor change to the above sandbox_stop.go code so that the pod network teardown will only be performed if CNIResult returns a valid value:

if sandbox.CNIResult != nil {
	if err := c.teardownPodNetwork(ctx, sandbox); err != nil {
		return fmt.Errorf("failed to destroy network for sandbox %q: %w", id, err)
	}	
}

This avoids the locking behavior, so the pods are able to be stopped and / or removed as expected:

root@03195a9aecb5:/src# crictl rmp c4
DEBU[0000] get runtime connection                       
Stopped sandbox c4
root@03195a9aecb5:/src# crictl rmp c4
DEBU[0000] get runtime connection                       
Removed sandbox c4
root@03195a9aecb5:/src# crictl pods
DEBU[0000] get runtime connection                       
DEBU[0000] ListPodSandboxRequest: &ListPodSandboxRequest{Filter:&PodSandboxFilter{Id:,State:nil,LabelSelector:map[string]string{},},} 
DEBU[0000] ListPodSandboxResponse: []                   
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
``

Signed-off-by: Sameer <sameer.saeed@live.ca>
@k8s-ci-robot
Copy link
Copy Markdown

Hi @sameersaeed. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dosubot dosubot bot added area/cri Container Runtime Interface (CRI) kind/bug labels Sep 27, 2024
@mxpv
Copy link
Copy Markdown
Member

mxpv commented Sep 27, 2024

/ok-to-test

Copy link
Copy Markdown
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mikebrow mikebrow added the cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch label Sep 30, 2024
@mikebrow
Copy link
Copy Markdown
Member

/test pull-containerd-k8s-e2e-ec2

@mikebrow
Copy link
Copy Markdown
Member

mikebrow commented Sep 30, 2024

Additionally, the following needs to be fixed in a follow up PR as the defer'd tear down will also fail and "block" the netns removal https://github.com/containerd/containerd/blob/release/1.6/pkg/cri/server/sandbox_run.go#L312. @fuweid FYI.. probably should not be deferring tear down before the network bring up, or if we do should ignore teardown failure when setup fails. Same for 1.6/7

@mikebrow mikebrow added this pull request to the merge queue Sep 30, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 30, 2024
@mxpv mxpv added this pull request to the merge queue Sep 30, 2024
Merged via the queue into containerd:main with commit 03db11c Sep 30, 2024
@austinvazquez austinvazquez added cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch and removed cherry-pick/1.6.x cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Nov 20, 2024
@MrHohn
Copy link
Copy Markdown

MrHohn commented Jul 21, 2025

Hi folks, posting here for more visibility - I just filed a bug on #12130 that is closely relevant to this change. Would appreciate your inputs there, thanks!

MrHohn added a commit to MrHohn/containerd that referenced this pull request Jul 21, 2025
This is a follow up of both containerd#10744 and containerd#10839, where teardownPodNetwork
may be skipped if setup networks failed in the first place.

In the situation illusrated on containerd#12130, for chained CNI pattern the
skipping logic may be problematic and leads to network resource
leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by
conditionally ignoring the teardown network when error is related to CNI
initialization, while allowing teardown retry to happen when it makes
sense.
MrHohn added a commit to MrHohn/containerd that referenced this pull request Jul 21, 2025
This is a follow up of both containerd#10744 and containerd#10839, where teardownPodNetwork
may be skipped if setup networks failed in the first place.

In the situation illustrated on containerd#12130, for chained CNI pattern the
skipping logic may be problematic and can lead to network resource
leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by
conditionally ignoring the teardown network when error is related to CNI
initialization, while allowing teardown retry to happen when it makes
sense.
MrHohn added a commit to MrHohn/containerd that referenced this pull request Jul 21, 2025
This is a follow up of both containerd#10744 and containerd#10839, where teardownPodNetwork
may be skipped if setup networks failed in the first place.

In the situation illustrated on containerd#12130, for chained CNI pattern the
skipping logic may be problematic and can lead to network resource
leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by
conditionally ignoring the teardown network when error is related to CNI
initialization, while allowing teardown retry to happen when it makes
sense.

Signed-off-by: Zihong Zheng <zihongz@google.com>
MrHohn added a commit to MrHohn/containerd that referenced this pull request Jul 25, 2025
This is a follow up of both containerd#10744 and containerd#10839, where teardownPodNetwork
may be skipped if setup networks failed in the first place.

In the situation illustrated on containerd#12130, for chained CNI pattern the
skipping logic may be problematic and can lead to network resource
leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by
conditionally ignoring the teardown network when error is related to CNI
initialization, while allowing teardown retry to happen when it makes
sense.

Signed-off-by: Zihong Zheng <zihongz@google.com>
MrHohn added a commit to MrHohn/containerd that referenced this pull request Aug 4, 2025
This is a follow up of both containerd#10744 and containerd#10839, where teardownPodNetwork
may be skipped if setup networks failed in the first place.

In the situation illustrated on containerd#12130, for chained CNI pattern the
skipping logic may be problematic and can lead to network resource
leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by
conditionally skipping the teardown network only when error is related
to CNI initialization, while allowing teardown retry to happen if setup
already allocated network resoruce partially.

Signed-off-by: Zihong Zheng <zihongz@google.com>
MrHohn added a commit to MrHohn/containerd that referenced this pull request Aug 4, 2025
This is a follow up of both containerd#10744 and containerd#10839, where teardownPodNetwork
may be skipped if setup networks failed in the first place.

In the situation illustrated on containerd#12130, for chained CNI pattern the
skipping logic may be problematic and can lead to network resource
leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by
conditionally skipping the teardown network only when error is related
to CNI initialization, while allowing teardown retry to happen if setup
already allocated network resoruce partially.

Signed-off-by: Zihong Zheng <zihongz@google.com>
mikebrow pushed a commit to MrHohn/containerd that referenced this pull request Jan 30, 2026
This is a follow up of both containerd#10744 and containerd#10839, where teardownPodNetwork
may be skipped if setup networks failed in the first place.

In the situation illustrated on containerd#12130, for chained CNI pattern the
skipping logic may be problematic and can lead to network resource
leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by
conditionally skipping the teardown network only when error is related
to CNI initialization, while allowing teardown retry to happen if setup
already allocated network resoruce partially.

Signed-off-by: Zihong Zheng <zihongz@google.com>
mikebrow pushed a commit to MrHohn/containerd that referenced this pull request Feb 3, 2026
This is a follow up of both containerd#10744 and containerd#10839, where teardownPodNetwork
may be skipped if setup networks failed in the first place.

In the situation illustrated on containerd#12130, for chained CNI pattern the
skipping logic may be problematic and can lead to network resource
leakage (e.g. Pod IP).

This PR attempts to make the sandbox network handling more robust by
conditionally skipping the teardown network only when error is related
to CNI initialization, while allowing teardown retry to happen if setup
already allocated network resoruce partially.

Signed-off-by: Zihong Zheng <zihongz@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cri Container Runtime Interface (CRI) cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch kind/bug ok-to-test size/XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants