Skip to content

Conversation

@Einsfier
Copy link
Contributor

@Einsfier Einsfier commented Jun 24, 2021

This PR aims to fix #5597 , but also introduces some new utils.

I have separated it into several commits to make review easier.

  • 3f42de4
    • Fix ContextDialer implementation, now it's a real context-aware dialer, previously it was simply implemented using context deadline which has no response to context cancellation.
    • ContextDialer has the exactly same behavior as before, which means it tolerates ENOENT when context has a deadline.
    • Unify timeoutDialer implementation, now it uses a context with timeout to simulate timeout setting.
  • 467d717
    • Add a new util func ContextDialerFunc, allowing customization to dial error handling. The dialer will keep retrying if the error is tolerable. It's caller's responsibility to cancel or set a deadline for the dialer.
    • Do not tolerate ENOENT while reconnecting shim socket. This is to avoid stuck at containerd startup when shim sockets are unexpected missing.

This PR also fix #5142

@k8s-ci-robot
Copy link

Hi @povsister. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Einsfier
Copy link
Contributor Author

Oh, good. 🤣
Seems that I need to remove net.ErrClosed to make it work on prior Go 1.16 releases.

Einsfier added 2 commits June 24, 2021 15:47
Unify timeout dialer and context dialer implementation.
Fix ContextDialer to use a real context-aware dialer.

Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>
Containerd may stuck for hours at startup while reconnecting shim sockets which are unexpected missing.
This fix adds a new ContextDialerFunc to allow you customize error handler during the dialing.

Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>
@Einsfier Einsfier force-pushed the fix-shim-reconnect branch from bd35dcc to 467d717 Compare June 24, 2021 07:50
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jun 24, 2021

Build succeeded.

1 similar comment
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jun 24, 2021

Build succeeded.

@Einsfier
Copy link
Contributor Author

@fuweid All done, PTAL.

@thaJeztah
Copy link
Member

/ok-to-test

@Einsfier Einsfier force-pushed the fix-shim-reconnect branch 2 times, most recently from 91a873d to 7269405 Compare June 24, 2021 08:31
Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>
@Einsfier Einsfier force-pushed the fix-shim-reconnect branch from 7269405 to 4d19774 Compare June 24, 2021 08:44
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jun 24, 2021

Build succeeded.

return AnonDialer(address, timeout)
ctx, cancel := context.WithTimeout(context.TODO(), timeout)
defer cancel()
return dialer.ContextDialerFunc(ctx, address, nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use DialTimeout directly, there is no meaning to unify dialer. Also, the dialer implementation is different between *unix and windows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found that we can unify the AnonDialer and AnonReconnectDialer to call ContextDialer, then they can be removed from util_unix.go and util_windows.go
Shall I add another commit to unify the Dialer implementation on windwos ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to keep them separated. ContextDialer looks like for grpc call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to keep them separated. ContextDialer looks like for grpc call.

I mean we can use the ContextDialer to unify the AnonDialer and AnonReconnectDialer 's implementation in util on different platforms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I don't think we should align containerd client's dialer with shim's. DRY is good principle. But in this case, the ContextDialer looks like complicated than before and the AnonDialer should be simple.

I am not windows developer so I am not sure we can remove the AnonDialer from different platforms. Just my two cents.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the shim, there is no necessary to use ContextDialer which bring one more goroutines to handle that error.
The DialTimeout is more clear than ContextDialer in this case currently.

Since there are more issues reporting this case, I would like to fix that case and then doing that refactor(unify or something).

cc @containerd/reviewers @containerd/maintainers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also trying to figure out why we drop the given context to call go net.dialer with a timeout value.. which then uses a new background context

We should probably be calling net.DialContext instead but we'd have to go review the history of this to see why that didn't happen..?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also trying to figure out why we drop the given context to call go net.dialer with a timeout value.

The reason I guess is that DialContext func has a receiver and it can not be directly used by simply importing net package, maybe the maintainer didn't notice it and invented the wheels by himself.
Using DialContext could save us lots of effort, I will update this PR later, trying to adapt DialContext.

@fuweid
Copy link
Member

fuweid commented Jun 24, 2021

also fix #5142

I don't think it can handle this case. The 5142 needs to handle SIGTERM in runc-shim-v1/v2.

@Einsfier
Copy link
Contributor Author

I don't think it can handle this case. The 5142 needs to handle SIGTERM in runc-shim-v1/v2.

From the log provided in 5142, it is confirmed that the slow-start issues causes systemd to consider containerd service failed to start after 15min. I have a detailed explanation about how Type=notify affects service health check and how it leads to service startup failure in the #5142 (comment)

Now the shimv2 simply ignores SIGTERM, so it will keep running while containerd daemon is being restarted/stopped by systemd.

When host is rebooted, those shims finally got SIGKILLED, I guess that @joedborg also explicitly configured containerd state to a persistent disk, so containerd will also get the slow-cleanup for dead shims. This is what exactly this PR tries to fix.

@fuweid
Copy link
Member

fuweid commented Jun 26, 2021

In the terminating stage (during shutdown) , the systemd might send the TERM first. If there is any process which ignores the signal, it will take time to shutdown. The shimv2-runc is the one process.

And then, the dead shim lefts the bundle in the persist disk and the containerd will take more time to recover it, which can be solved by this pr.

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments/questions

)

// ContextDialer returns a GRPC net.Conn connected to the provided address.
// It tolerates ENOENT and keeps retrying if provided context has a deadline.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this commit this function now passes along the context it receives..

I don't understand why we would not tolerate ENOENT in the case where the context passed in does not have a deadline. Are we expecting that case to improve the speed in scenarios where we take an ENOENT? Or is the performance issue just the default dial timeout is too long causing the possibility of an ENOENT and thus a retry would reset the very long default again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we should always tolerate ENOENT otherwise we'll get spurious failures modes?

Copy link
Contributor Author

@Einsfier Einsfier Jul 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we would not tolerate ENOENT in the case where the context passed in does not have a deadline.

I explicitly add a hint saying "... if provided context has a deadline." because the original code does so, but the author didn't give any notification or warning about it.
Actually, To avoid any unexpected failure or inconsistent behavior after this patch, I just rewrite this function to make it "more context-aware" without introducing any observable change from caller's view.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok... not sure if the two additional lines of descriptive text for the function help or not

synC = make(chan *dialResult)
)
// timeoutDialer connects to the provided address with a timeout.
func timeoutDialer(address string, timeout time.Duration) (conn net.Conn, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dialer/timeoutDialer was deprecated in favor of forcing the caller to set context. In the containerd/containerd repo we only use dialer.Dialer a few times. Perhaps we should stop using Dialer in those and just go ahead and recode to use proper context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However.. it's exported and located under pkg, no one knows if there are any 3rd party depends on it. If any member from containerd determines to completely remove the deprecated Dialer, I am glad to help remove any existing usage in the project scope.

Copy link
Member

@crosbymichael crosbymichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to review this one to see if there is an actual issue with the handling. I don't support the statedir not being on tmpfs. That is an invalid configuration and we shouldn't jump though hoops in the code to support this.

@mupasak
Copy link

mupasak commented Aug 9, 2021

I have tried changing statedir to be on tmpfs and it seems to work as expected. containerd boot up quickly as soon as the node gets rebooted without the ongoing patchset, though I could not see anything clearly mentioned in any document that mandates us to use tmpfs as state dir. It would be better if we add it to the document .

@fuweid
Copy link
Member

fuweid commented Aug 14, 2021

I would like to fix the issue first and then refactor. any thoughts?

@mikebrow @dmcgowan @crosbymichael

@dmcgowan dmcgowan closed this in f7658e3 Oct 7, 2021
fahedouch pushed a commit to fahedouch/containerd that referenced this pull request Oct 15, 2021
In linux platform, the shim server always listens on the socket before
the containerd task manager dial it. It is unlikely that containerd task
manager should handle reconnect because the shim can't restart. For this
case, the containerd task manager should fail fast if there is ENOENT or
ECONNREFUSED error.

And if the socket file is deleted during cleanup the exited task, it
maybe cause that containerd task manager takes long time to reload the
dead shim. For that task.v2 manager, the race case is like:

```
TaskService.Delete
  TaskManager.Delete(runtime/v2/manager.go)
    shim.delete(runtime/v2/shim.go)
      shimv2api.Shutdown(runtime/v2/task/shim.pb.go)

      <- containerd has been killed or restarted somehow

      bundle.Delete
```

The shimv2api.Shutdown will cause that the shim deletes socket file
(containerd-shim-runc-v2 does). But the bundle is still there. During
reloading, the containerd will wait for the socket file appears again
in 100 seconds. It is not reasonable. The Reconnect should prevent this
case by fast fail.

Closes: containerd#5648.
Fixes: containerd#5597.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/containerd that referenced this pull request Jan 11, 2023
In linux platform, the shim server always listens on the socket before
the containerd task manager dial it. It is unlikely that containerd task
manager should handle reconnect because the shim can't restart. For this
case, the containerd task manager should fail fast if there is ENOENT or
ECONNREFUSED error.

And if the socket file is deleted during cleanup the exited task, it
maybe cause that containerd task manager takes long time to reload the
dead shim. For that task.v2 manager, the race case is like:

```
TaskService.Delete
  TaskManager.Delete(runtime/v2/manager.go)
    shim.delete(runtime/v2/shim.go)
      shimv2api.Shutdown(runtime/v2/task/shim.pb.go)

      <- containerd has been killed or restarted somehow

      bundle.Delete
```

The shimv2api.Shutdown will cause that the shim deletes socket file
(containerd-shim-runc-v2 does). But the bundle is still there. During
reloading, the containerd will wait for the socket file appears again
in 100 seconds. It is not reasonable. The Reconnect should prevent this
case by fast fail.

Closes: containerd#5648.
Fixes: containerd#5597.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
(cherry picked from commit f7658e3)
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Containerd takes hours to startup after node reboot Systemd unit fails to start when Type=notify and host is rebooted

7 participants