Skip to content

Conversation

@abel-von
Copy link
Contributor

@abel-von abel-von commented Jan 9, 2024

To make codes clean, CRI can store a Sandbox client in each sandbox in cache, and call the APIs in the client directly, so that we don't need to find the Controller everytime we need to call APIs of sandbox controller.

@k8s-ci-robot
Copy link

Hi @abel-von. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abel-von
Copy link
Contributor Author

abel-von commented Jan 9, 2024

needs rebase after #9463 is merged

@abel-von abel-von force-pushed the sandbox-plugin-0109 branch from 75994f9 to c6b25ae Compare January 9, 2024 02:28
@abel-von abel-von mentioned this pull request Jan 9, 2024
19 tasks
@abel-von
Copy link
Contributor Author

abel-von commented Jan 9, 2024

@fuweid @dmcgowan @mxpv

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see questions comments :-)

}
exitCh <- *containerd.NewExitStatus(exit.ExitStatus, exit.ExitedAt, nil)
}()
// Sandbox was in running state, but its task has been deleted,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready/notready/unknown could it be in unknown here after restart? should we always overwrite here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As sandbox client clearly returns NotFound, I think we can make sure that the sandbox does not exist anymore. I think we can set the state to not ready.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it was unknown before restart... does that mean unknown is not valid after restart.. or are we removing unknown...

I'm probably missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the status of sandbox is a little bit simpler than container, when we call ListPodSandbox, we can only get two states of whether Ready or NotReady. we only set the sandbox status to Unknown in cri cache temporary when we do RunPodSandbox and we do not store the state to db. So when containerd restart, it will check the status of sandbox by calling the Status or Wait APIs of sandbox controller.
We can also see that in codes of recover.go and restart.go that the status of sandbox is set to Ready and NotReady, without a third state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you saying this:
https://github.com/containerd/containerd/blob/main/pkg/cri/server/podsandbox/sandbox_run.go#L285-L288](https://github.com/containerd/containerd/blob/main/pkg/cri/store/sandbox/status.go#L76-L78)

only applies to cache.. hmm is that new and we just didn't update the state diagram?

Ok I missed the move from sandbox having a store that was actually stored to an in memory cache only..

hmm...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missed this change: #7401

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I be concerned here:
https://github.com/containerd/containerd/blob/main/pkg/cri/server/podsandbox/sandbox_run.go#L265-L288
about not using a lock on this cached map (writing directly to the map contents to update state from unknown to ready, and other changes, without doing some sort of lock first via an update api?)

Guessing my confusion is more to the in between status of the cri sandbox store and it's meta information and the moving of that to the new sandbox meta...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about not using a lock on this cached map (writing directly to the map contents to update state from unknown to ready, and other changes, without doing some sort of lock first via an update api?

I guess you are right, I reconstructed the code of PodSandbox about the status and add a commit in this PR, please take a look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please take a look again. @mikebrow

@abel-von abel-von force-pushed the sandbox-plugin-0109 branch from c6b25ae to c536aa8 Compare January 10, 2024 02:11
@abel-von abel-von requested a review from mikebrow January 10, 2024 03:05
@abel-von abel-von force-pushed the sandbox-plugin-0109 branch from c536aa8 to 1bd9df7 Compare January 12, 2024 03:57
@abel-von abel-von force-pushed the sandbox-plugin-0109 branch from 1bd9df7 to 70107d1 Compare January 12, 2024 03:59
@abel-von abel-von force-pushed the sandbox-plugin-0109 branch 2 times, most recently from c39a539 to 6558b99 Compare January 12, 2024 06:16
@dmcgowan
Copy link
Member

call the APIs in the client directly, so that we don't need to find the Controller everytime we need to call APIs of sandbox controller.

I'm hesitant on this one. We want to decouple the plugins from the client package rather than add more to client to cleanup plugins. It seems we are missing a service interface for controllers that should be in a plugin. The client container interface is not a good example to follow and also one we want to build a service interface around.

@abel-von
Copy link
Contributor Author

It seems we are missing a service interface for controllers that should be in a plugin.

Removed the client dependency and add some methods in the sandboxService in cri. and fixed the bug that PinnedImages has no struct tag. please take a look.

@abel-von
Copy link
Contributor Author

/cc @dmcgowan

@abel-von abel-von force-pushed the sandbox-plugin-0109 branch from a6d9316 to 721d1fb Compare February 4, 2024 07:43
@mxpv mxpv added the status/needs-discussion Needs discussion and decision from maintainers label Feb 5, 2024
@dims dims added the area/cri Container Runtime Interface (CRI) label Feb 7, 2024
@abel-von
Copy link
Contributor Author

abel-von commented Feb 7, 2024

@dmcgowan @mxpv Do we still have to discuss this PR? as I did't use Sandbox Client anymore.

@abel-von abel-von changed the title sandbox: use Sandbox client in CRI plugin instead of controller sandbox: use sandboxService in CRI plugin instead of calling controller API directly Feb 21, 2024
@Burning1020
Copy link
Member

@mxpv Does it still need disscussion? We've disscussed at community meeting on Jan 25 before adding this label: https://docs.google.com/document/d/1Q8KyVJd26oAQ3MafbnkVBgJCopPl2Bw45H9L9br9Vus/edit#heading=h.3f10hgwx9mc

func (c *criSandboxService) SandboxStatus(ctx context.Context, sandboxer string, sandboxID string, verbose bool) (sandbox.ControllerStatus, error) {
sbController, ok := c.sandboxers[sandboxer]
if !ok {
return sandbox.ControllerStatus{}, fmt.Errorf("failed to get sandbox controller by %s", sandboxer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest introducing a helper so you don't have to replicate same error message in every function:

func controller(name string) {
    sbController, ok := c.sandboxers[sandboxer]
 	if !ok {
 		return nil, fmt.Errorf("sandbox controller %q not found", sandboxer)
 	}

    return sbController
}
``

And then you could just return an error:

```go
controller, err := c.controller(sandboxer)
if err != nil {
    return nil, err
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already an exported method named SandboxController, I was hesitant on calling it because we still need to check the error after calling, It seems we didn't reduce the code lines or complexity, So I gave up calling it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just call the SandboxController.

func (c *criSandboxService) SandboxStatus(ctx context.Context, sandboxer string, sandboxID string, verbose bool) (sandbox.ControllerStatus, error) {
        ctrl, err := c.SandboxController(sandboxer)
        if err != nil {
              return sandbox.ControllerStatus{}, err
         }
        return ctrl.Status(ctx, sandboxID, verbose)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

@mxpv mxpv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor suggestion, but overall look good to me once the CI is green.

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave some comments. Basically, It looks good.

assert.Equal(t, s.State, sandboxstore.StateReady.String())

sb.Exit(*containerd.NewExitStatus(exitStatus, exitedAt, nil))
if err := sb.Exit(exitStatus, exitedAt); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest using assert.NoError(sb.Exit()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

func (c *criSandboxService) CreateSandbox(ctx context.Context, info sandbox.Sandbox, opts ...sandbox.CreateOpt) error {
sbController, ok := c.sandboxers[info.Sandboxer]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not related to this pull request. But just my two cents, I think Controller name is better than Sandboxer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

// Initialize pod sandbox controller
// TODO: Get this from options, NOT client
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider to remove this TODO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

func (c *criSandboxService) SandboxStatus(ctx context.Context, sandboxer string, sandboxID string, verbose bool) (sandbox.ControllerStatus, error) {
sbController, ok := c.sandboxers[sandboxer]
if !ok {
return sandbox.ControllerStatus{}, fmt.Errorf("failed to get sandbox controller by %s", sandboxer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just call the SandboxController.

func (c *criSandboxService) SandboxStatus(ctx context.Context, sandboxer string, sandboxID string, verbose bool) (sandbox.ControllerStatus, error) {
        ctrl, err := c.SandboxController(sandboxer)
        if err != nil {
              return sandbox.ControllerStatus{}, err
         }
        return ctrl.Status(ctx, sandboxID, verbose)
}

assert.Equal(t, p.Status.Get().State, sandbox.StateUnknown)
assert.Equal(t, p.ID, "test")
p.Metadata = sandbox.Metadata{ID: "test", NetNSPath: "/test"}
assert.Equal(t, p.Metadata.NetNSPath, "/test")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we don't need to verify the assignment here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return status, nil
})
if err != nil {
t.Fatalf("Update pod sandbox status failed %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires.NoError is equal to t.Fatalf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

assert.Equal(t, exitTime, exitAt)
}()
time.Sleep(time.Second)
if err := p.Exit(uint32(128), exitAt); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest sync and drain two goroutines after call p.Exit.
Otherwise, it's easy to be flaky and go-test might detect race condition and mark it failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

event := &eventtypes.TaskExit{ExitStatus: exitStatus, ExitedAt: protobuf.ToTimestamp(exitedAt)}
if cleanErr := handleSandboxTaskExit(dctx, p, event); cleanErr != nil {
if err := handleSandboxTaskExit(dctx, p, event); err != nil {
// TODO will backoff the event to the controller's own EventMonitor, but not cri's,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand your comment correctly: maybe we should let CRI plugin to retry all the failures? so that we don't need to implement other retry for shim-type sandbox controller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSandboxExit in CRI plugin will only update the sandbox status in its sandboxStore and should ignore the implementation details of sandbox(So it should not know that if there is a task or shim for the sandbox), I think it is the podsandbox controller's responsibility to do cleanup of the legacy sandbox. and that is what we do in handleSandboxTaskExit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the spliting of EventMonitor of podsandbox from CRI plugin is done in #9598, The retry of sandbox container task cleanup is done in that PR

so that we cri service don't have to get sandbox controller everytime it
needs to call sandbox controller api.

Signed-off-by: Abel Feng <fshb1988@gmail.com>
Signed-off-by: Abel Feng <fshb1988@gmail.com>
Signed-off-by: Abel Feng <fshb1988@gmail.com>
@abel-von abel-von force-pushed the sandbox-plugin-0109 branch from 20c511d to a60e52f Compare February 26, 2024 02:17
Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fuweid fuweid added this pull request to the merge queue Feb 28, 2024
@fuweid fuweid removed the status/needs-discussion Needs discussion and decision from maintainers label Feb 28, 2024
@fuweid fuweid removed this pull request from the merge queue due to a manual request Feb 28, 2024
@fuweid fuweid added this pull request to the merge queue Feb 28, 2024
Merged via the queue into containerd:main with commit 2cdf012 Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cri Container Runtime Interface (CRI) ok-to-test size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants