-
Notifications
You must be signed in to change notification settings - Fork 3.8k
sandbox: use sandboxService in CRI plugin instead of calling controller API directly #9617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @abel-von. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
needs rebase after #9463 is merged |
75994f9 to
c6b25ae
Compare
mikebrow
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see questions comments :-)
pkg/cri/server/restart.go
Outdated
| } | ||
| exitCh <- *containerd.NewExitStatus(exit.ExitStatus, exit.ExitedAt, nil) | ||
| }() | ||
| // Sandbox was in running state, but its task has been deleted, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ready/notready/unknown could it be in unknown here after restart? should we always overwrite here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As sandbox client clearly returns NotFound, I think we can make sure that the sandbox does not exist anymore. I think we can set the state to not ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it was unknown before restart... does that mean unknown is not valid after restart.. or are we removing unknown...
I'm probably missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the status of sandbox is a little bit simpler than container, when we call ListPodSandbox, we can only get two states of whether Ready or NotReady. we only set the sandbox status to Unknown in cri cache temporary when we do RunPodSandbox and we do not store the state to db. So when containerd restart, it will check the status of sandbox by calling the Status or Wait APIs of sandbox controller.
We can also see that in codes of recover.go and restart.go that the status of sandbox is set to Ready and NotReady, without a third state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only applies to cache.. hmm is that new and we just didn't update the state diagram?
Ok I missed the move from sandbox having a store that was actually stored to an in memory cache only..
hmm...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missed this change: #7401
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I be concerned here:
https://github.com/containerd/containerd/blob/main/pkg/cri/server/podsandbox/sandbox_run.go#L265-L288
about not using a lock on this cached map (writing directly to the map contents to update state from unknown to ready, and other changes, without doing some sort of lock first via an update api?)
Guessing my confusion is more to the in between status of the cri sandbox store and it's meta information and the moving of that to the new sandbox meta...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about not using a lock on this cached map (writing directly to the map contents to update state from unknown to ready, and other changes, without doing some sort of lock first via an update api?
I guess you are right, I reconstructed the code of PodSandbox about the status and add a commit in this PR, please take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please take a look again. @mikebrow
c6b25ae to
c536aa8
Compare
c536aa8 to
1bd9df7
Compare
1bd9df7 to
70107d1
Compare
c39a539 to
6558b99
Compare
I'm hesitant on this one. We want to decouple the plugins from the |
6558b99 to
e968bf6
Compare
e968bf6 to
a6d9316
Compare
Removed the client dependency and add some methods in the |
|
/cc @dmcgowan |
a6d9316 to
721d1fb
Compare
|
@mxpv Does it still need disscussion? We've disscussed at community meeting on Jan 25 before adding this label: https://docs.google.com/document/d/1Q8KyVJd26oAQ3MafbnkVBgJCopPl2Bw45H9L9br9Vus/edit#heading=h.3f10hgwx9mc |
| func (c *criSandboxService) SandboxStatus(ctx context.Context, sandboxer string, sandboxID string, verbose bool) (sandbox.ControllerStatus, error) { | ||
| sbController, ok := c.sandboxers[sandboxer] | ||
| if !ok { | ||
| return sandbox.ControllerStatus{}, fmt.Errorf("failed to get sandbox controller by %s", sandboxer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest introducing a helper so you don't have to replicate same error message in every function:
func controller(name string) {
sbController, ok := c.sandboxers[sandboxer]
if !ok {
return nil, fmt.Errorf("sandbox controller %q not found", sandboxer)
}
return sbController
}
``
And then you could just return an error:
```go
controller, err := c.controller(sandboxer)
if err != nil {
return nil, err
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already an exported method named SandboxController, I was hesitant on calling it because we still need to check the error after calling, It seems we didn't reduce the code lines or complexity, So I gave up calling it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just call the SandboxController.
func (c *criSandboxService) SandboxStatus(ctx context.Context, sandboxer string, sandboxID string, verbose bool) (sandbox.ControllerStatus, error) {
ctrl, err := c.SandboxController(sandboxer)
if err != nil {
return sandbox.ControllerStatus{}, err
}
return ctrl.Status(ctx, sandboxID, verbose)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
mxpv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor suggestion, but overall look good to me once the CI is green.
721d1fb to
a576130
Compare
a576130 to
20c511d
Compare
fuweid
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave some comments. Basically, It looks good.
| assert.Equal(t, s.State, sandboxstore.StateReady.String()) | ||
|
|
||
| sb.Exit(*containerd.NewExitStatus(exitStatus, exitedAt, nil)) | ||
| if err := sb.Exit(exitStatus, exitedAt); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest using assert.NoError(sb.Exit()).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| } | ||
|
|
||
| func (c *criSandboxService) CreateSandbox(ctx context.Context, info sandbox.Sandbox, opts ...sandbox.CreateOpt) error { | ||
| sbController, ok := c.sandboxers[info.Sandboxer] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not related to this pull request. But just my two cents, I think Controller name is better than Sandboxer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
internal/cri/server/service.go
Outdated
| } | ||
|
|
||
| // Initialize pod sandbox controller | ||
| // TODO: Get this from options, NOT client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider to remove this TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| func (c *criSandboxService) SandboxStatus(ctx context.Context, sandboxer string, sandboxID string, verbose bool) (sandbox.ControllerStatus, error) { | ||
| sbController, ok := c.sandboxers[sandboxer] | ||
| if !ok { | ||
| return sandbox.ControllerStatus{}, fmt.Errorf("failed to get sandbox controller by %s", sandboxer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just call the SandboxController.
func (c *criSandboxService) SandboxStatus(ctx context.Context, sandboxer string, sandboxID string, verbose bool) (sandbox.ControllerStatus, error) {
ctrl, err := c.SandboxController(sandboxer)
if err != nil {
return sandbox.ControllerStatus{}, err
}
return ctrl.Status(ctx, sandboxID, verbose)
}| assert.Equal(t, p.Status.Get().State, sandbox.StateUnknown) | ||
| assert.Equal(t, p.ID, "test") | ||
| p.Metadata = sandbox.Metadata{ID: "test", NetNSPath: "/test"} | ||
| assert.Equal(t, p.Metadata.NetNSPath, "/test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we don't need to verify the assignment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| return status, nil | ||
| }) | ||
| if err != nil { | ||
| t.Fatalf("Update pod sandbox status failed %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requires.NoError is equal to t.Fatalf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| assert.Equal(t, exitTime, exitAt) | ||
| }() | ||
| time.Sleep(time.Second) | ||
| if err := p.Exit(uint32(128), exitAt); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest sync and drain two goroutines after call p.Exit.
Otherwise, it's easy to be flaky and go-test might detect race condition and mark it failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| event := &eventtypes.TaskExit{ExitStatus: exitStatus, ExitedAt: protobuf.ToTimestamp(exitedAt)} | ||
| if cleanErr := handleSandboxTaskExit(dctx, p, event); cleanErr != nil { | ||
| if err := handleSandboxTaskExit(dctx, p, event); err != nil { | ||
| // TODO will backoff the event to the controller's own EventMonitor, but not cri's, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand your comment correctly: maybe we should let CRI plugin to retry all the failures? so that we don't need to implement other retry for shim-type sandbox controller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handleSandboxExit in CRI plugin will only update the sandbox status in its sandboxStore and should ignore the implementation details of sandbox(So it should not know that if there is a task or shim for the sandbox), I think it is the podsandbox controller's responsibility to do cleanup of the legacy sandbox. and that is what we do in handleSandboxTaskExit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the spliting of EventMonitor of podsandbox from CRI plugin is done in #9598, The retry of sandbox container task cleanup is done in that PR
so that we cri service don't have to get sandbox controller everytime it needs to call sandbox controller api. Signed-off-by: Abel Feng <fshb1988@gmail.com>
Signed-off-by: Abel Feng <fshb1988@gmail.com>
Signed-off-by: Abel Feng <fshb1988@gmail.com>
20c511d to
a60e52f
Compare
fuweid
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
To make codes clean, CRI can store a Sandbox client in each sandbox in cache, and call the APIs in the client directly, so that we don't need to find the Controller everytime we need to call APIs of sandbox controller.