
Handle teardown failure to avoid blocking cleanup#10800

Closed
matteopulcini6 wants to merge 0 commits into containerd:main from matteopulcini6:main

Conversation

@matteopulcini6

When CNI isn't configured, pod creation fails and leaves the network namespace uncleaned. The cleanup process shouldn't depend on CNI being configured in order to remove the pod properly.

Sample flow shared by @sameersaeed

pod-config.json, per the crictl docs:

{
  "metadata": {
    "name": "nginx-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "65fb5845de97d8a4"
  },
  "log_directory": "/tmp",
  "linux": {
  }
}

Current behaviour when running the sandbox pod: a CNI plugin error is returned, but the pod still comes up:

root@32748c76b2ba:/src# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
root@32748c76b2ba:/src# crictl runp pod-config.json
E1004 17:42:14.071632   31488 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"803615d3f284991954f660a4d85295c15839ba8fa574640d58dd02de6a248d80\": cni plugin not initialized"
FATA[0000] run pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "803615d3f284991954f660a4d85295c15839ba8fa574640d58dd02de6a248d80": cni plugin not initialized 
root@32748c76b2ba:/src# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
803615d3f2849       1 second ago        NotReady            nginx-sandbox       default             1                   (default)

Changes made

Current code (release 1.7):

// Teardown network if an error is returned.
if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)
}


The sandbox_run.go code above was changed to skip the pod network teardown error when no CNI plugins are initialized. The remaining deferred cleanup steps can then run, so the CNI plugin failure no longer blocks the pod from being fully cleaned up as expected:

// Teardown network if an error is returned.
if cleanupErr = c.teardownPodNetwork(deferCtx, sandbox); cleanupErr != nil {
	log.G(ctx).WithError(cleanupErr).Errorf("Failed to destroy network for sandbox %q", id)

	// ignoring failed to destroy networks when we failed to setup networks
	if sandbox.CNIResult == nil {
		cleanupErr = nil
	}
}

With this change, a pod that fails to start because no CNI plugins are initialized no longer lingers in the NotReady state:

root@32748c76b2ba:/src# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
root@32748c76b2ba:/src# crictl runp pod-config.json
E1004 17:40:26.373297   29655 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"bd9a6545dc2f9f1484b36a3b72b2c87d2e44b9f74fe5ffc90f53fa220d73462b\": cni plugin not initialized"
FATA[0000] run pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "bd9a6545dc2f9f1484b36a3b72b2c87d2e44b9f74fe5ffc90f53fa220d73462b": cni plugin not initialized 
root@32748c76b2ba:/src# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME

@k8s-ci-robot

Hi @matteopulcini6. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dmcgowan
Member

dmcgowan commented Oct 8, 2024

/ok-to-test

@matteopulcini6
Author

@dmcgowan does the below error need to be fixed before the PR is merged?

[FAIL] [k8s.io] Container OOM runtime should output OOMKilled reason [It] should terminate with exitCode 137 and reason OOMKilled

Link to that is here

Member

@mikebrow mikebrow left a comment


LGTM


Labels

area/cri Container Runtime Interface (CRI) kind/bug ok-to-test size/XS
