Skip to content

Conversation

@yylt
Copy link
Contributor

@yylt yylt commented Oct 23, 2025

Fix #12344

when executing NewTask in the podsandbox run, it may fail for various reasons, like timeout, ctx cancle .etc, and the shim process has already started.
During the defer cleanup, cleaning up the podsandbox.container only deletes the data in the store and cannot guarantee that the task is properly deleted.
Therefore, it is necessary to retain the podsandbox.container, And task can be actively cleaned up through shutdown api. But the invocation of the task depends on the container information, so keep podsandbox.container even if cleanup done.

There is no guarantee that the NewTask operation is atomic. A failure does not mean that resources have been completely cleaned up. Therefore, it is reasonable to delegate cleanup to the upper caller

Signed-off-by: yang yang <yang8518296@163.com>
@github-project-automation github-project-automation bot moved this to Needs Triage in Pull Request Review Oct 23, 2025
@dosubot dosubot bot added the area/cri Container Runtime Interface (CRI) label Oct 23, 2025
@yylt
Copy link
Contributor Author

yylt commented Oct 23, 2025

cc @fuweid @ningmingxiao

@ningmingxiao
Copy link
Contributor

can you explain how it works? @yylt

if retErr != nil && cleanupErr == nil {
deferCtx, deferCancel := ctrdutil.DeferContext()
defer deferCancel()
if cleanupErr = container.Delete(deferCtx, containerd.WithSnapshotCleanup); cleanupErr != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is only invoked when task has been deleted. so, it's probably not that case you mentioned.
Maybe we should keep that container only if it fails to delete container.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no... task delete will append defer function after NewTask success.

After setting to nil upon successful deletion, may encounter the leak you described in the issue #8981, regarding the "not found" error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add failpoint test case to verify it. We probably need to check task service as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cri Container Runtime Interface (CRI) size/XS

Projects

Status: Needs Triage

Development

Successfully merging this pull request may close these issues.

containerd-shim processes leak during high pod churn

4 participants