Skip to content

Conversation

@abel-von
Copy link
Contributor

@abel-von abel-von commented Dec 4, 2023

For decoupling of podsandbox controller to cri service, podsandbox controller should create its own cache of PodSandbox object.

@k8s-ci-robot
Copy link

Hi @abel-von. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@akhilerm
Copy link
Member

akhilerm commented Dec 4, 2023

/ok-to-test

Stop(ctx context.Context, sandboxID string, opts ...StopOpt) error
// Wait blocks until sandbox process exits.
Wait(ctx context.Context, sandboxID string) (ExitStatus, error)
Wait(ctx context.Context, sandboxID string) (<-chan ExitStatus, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you change the definition here. You should change the comment as well.
Would you please add more comment about why it becomes async?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return nil
}

func (s *Store) Get(id string) (*types.PodSandbox, bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool seems duplicate information here, since that Save ensures the p is not nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, the existing is based on whether the return value is nil

ch := make(chan sandbox.ExitStatus, 1)
go func() {
// because we don't set timeout of Wait, so it will not return err
status, _ := podSandbox.Wait(ctrdutil.NamespacedContext())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use ctx here. The caller should ensure that context should not share the same scope with RunPodSandbox one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@abel-von abel-von force-pushed the sandbox-plugin-1205 branch from 6208f4d to b7bd664 Compare December 9, 2023 15:07
@abel-von
Copy link
Contributor Author

@fuweid Could you please take a look again? Could @mxpv take a look please?

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That change is too large. It seems it mixed bugfix and refactor.
If possible, please separate changes into several pull requests.

// Stop will stop sandbox instance
Stop(ctx context.Context, sandboxID string, opts ...StopOpt) error
// Wait blocks until sandbox process exits.
// Wait asynchronously waits for the sandbox to exit, and sends the exit code to the returned channel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment should be updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted the change.

})

if err != nil {
return sandbox.ExitStatus{}, fmt.Errorf("failed to wait sandbox %s: %w", sandboxID, errdefs.FromGRPC(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to change this line?

// In most cases, controller.Status() with verbose=true should have SandboxInfo in the return,
// but if controller.Status() returns a NotFound error,
// we should fallback to get SandboxInfo from cached sandbox itself.
func toCRISandboxInfo(sandbox sandboxstore.Sandbox) (map[string]string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please change this name to, like toDeletedCRISandboxInfo? toCRISandboxInfo looks weird because it's just used when sandbox is not found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

exitCh, err := sandboxClient.Wait(util.NamespacedContext())
if err != nil {
log.G(ctx).WithError(err).Error("failed to wait for sandbox exit")
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not correct. Please return err here. The named return error is retErr instead of err.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change this the next PR as I reverted commit of sandboxClient.

}

// Convert the old pause Container to the new Sandbox
info, _ := sb.Container.Info(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if err is not nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as the commit is reverted, will consider this in the next PR


for _, sb := range c.sandboxStore.List() {
sb := sb
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it bugfix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be considered as a bugfix, before this PR, the recover.go in podsandbox will handle the task exit and update the sandbox in the cri cache. Now as in this pr we removed the sandboxStore in podsandbox controller, it can not update the sandbox in cri cache anymore. so we have to call Wait of sandbox and startSandboxExitMonitor here.
I think it is also neccessary if we have more sandbox controllers other than podsandbox. we have to call Wait of sandbox and monitor it by ourselves.

if err != nil {
if !errdefs.IsNotFound(err) {
return fmt.Errorf("failed to get sandbox container: %w", err)
return fmt.Errorf("failed to get podSandbox container: %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/podSandbox/pod sandbox/g

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

sandboxID, err)
podSandbox := c.store.Get(sandboxID)
if podSandbox == nil {
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you consider return ErrNotFound here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, and let cri server log and ignore the ErrNotFound.

func (s *sandboxClient) Wait(ctx context.Context) (<-chan api.ExitStatus, error) {
ch := make(chan api.ExitStatus, 1)
go func() {
status, err := s.client.SandboxController(s.metadata.Sandboxer).Wait(ctx, s.ID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to close ch in defer and remain that UnknownExitStatus if Wait returns error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@abel-von abel-von force-pushed the sandbox-plugin-1205 branch from 7f4589d to 0daa015 Compare January 4, 2024 16:09
@abel-von abel-von force-pushed the sandbox-plugin-1205 branch 2 times, most recently from 8e06d18 to 2b18128 Compare January 4, 2024 16:48
@abel-von
Copy link
Contributor Author

abel-von commented Jan 4, 2024

If possible, please separate changes into several pull requests.

ok, I revert the pr of refator to use Sandbox Client. and will submit another PR with that refactor.

@abel-von abel-von requested a review from fuweid January 4, 2024 17:01
}
}

// toCRISandboxInfo converts cached sandbox to CRI sandbox status response info map.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align name with function name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

case <-sandbox.Stopped():
return nil
}
func (c *Controller) waitSandboxStop(ctx context.Context, sandbox *types.PodSandbox) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this function and call sandbox.Wait(ctx) directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// TODO: how to backoff
c.cri.BackOffEvent(id, e)
}
podSandbox.Exit(*containerd.NewExitStatus(exitStatus, exitedAt, err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's correct. Line 107 has backoff event. And then the code still goes into set Exit.
And the Exit has close the channel. I think it might cause leak issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove the codes of backoff event here because the startSandboxExitMonitor is responsible to handle backoff when sandbox exit event handle failed. So I removed the backoff logic here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to call podSandbox.Exit when the error is not a context error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to wait #9598 is merged, and then the backoff event can be handled again by the podsandbox controller.

podSandbox := types.NewPodSandbox(info.ID, sandboxstore.Status{State: sandboxstore.StateUnknown})
podSandbox.Metadata = metadata
podSandbox.Runtime = info.Runtime
_ = c.store.Save(podSandbox)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return c.store.Save(podSandbox)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}()
}

_ = c.store.Save(podSandbox)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it only return error when podSandbox is nil and as we know it is not nil, so it will not return error. anyway, I add the check error logic in case that Save function returns some other error in the future.

@abel-von abel-von force-pushed the sandbox-plugin-1205 branch 2 times, most recently from a8a35f1 to 7daa93e Compare January 5, 2024 09:20
@abel-von
Copy link
Contributor Author

abel-von commented Jan 8, 2024

/retest-required

@abel-von abel-von requested a review from fuweid January 8, 2024 07:35
@abel-von abel-von force-pushed the sandbox-plugin-1205 branch 2 times, most recently from 21741d6 to ec695af Compare January 8, 2024 11:33
dctx, dcancel := context.WithTimeout(dctx, handleEventTimeout)
defer dcancel()
event := &eventtypes.TaskExit{ExitStatus: exitStatus, ExitedAt: protobuf.ToTimestamp(exitedAt)}
if err := handleSandboxTaskExit(dctx, p, event); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it going to shadow named return err?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is for purpose because we don't want to return the handle task error here, anyway, I changed the name here from err to cleanErr, so to make this clear.

Signed-off-by: Abel Feng <fshb1988@gmail.com>
@abel-von abel-von force-pushed the sandbox-plugin-1205 branch from ec695af to 7dadd5f Compare January 8, 2024 16:19
@abel-von abel-von requested a review from fuweid January 9, 2024 01:31
Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

dctx, dcancel := context.WithTimeout(dctx, handleEventTimeout)
defer dcancel()
event := &eventtypes.TaskExit{ExitStatus: exitStatus, ExitedAt: protobuf.ToTimestamp(exitedAt)}
if cleanErr := handleSandboxTaskExit(dctx, p, event); cleanErr != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some test cases to see what if fail to cleanup task and more than 2 StopPodSandbox requests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-O ^^^^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely decoupled now... still would like to see extra test cases, this is one of the touchiest areas of the code base

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I will add some more tests here

@fuweid fuweid added this pull request to the merge queue Jan 9, 2024
Merged via the queue into containerd:main with commit 9f8b845 Jan 9, 2024
if taskStatus.Status == containerd.Running {
// Wait for the task for sandbox monitor.
// wait is a long running background request, no timeout needed.
status.State = sandboxstore.StateReady
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we setting ready if there is still a window for failure to wait for the container task (see return below)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abel-von ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply, I think you are right, I have to set the status to NotReady before return, I will submit another PR to do it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

7 participants