Do not tolerate ENOENT while reconnecting shim socket #5648

Einsfier · 2021-06-24T07:26:48Z

This PR aims to fix #5597 , but also introduces some new utils.

I have separated it into several commits to make review easier.

3f42de4
- Fix ContextDialer implementation, now it's a real context-aware dialer, previously it was simply implemented using context deadline which has no response to context cancellation.
- ContextDialer has the exactly same behavior as before, which means it tolerates ENOENT when context has a deadline.
- Unify timeoutDialer implementation, now it uses a context with timeout to simulate timeout setting.
467d717
- Add a new util func ContextDialerFunc, allowing customization to dial error handling. The dialer will keep retrying if the error is tolerable. It's caller's responsibility to cancel or set a deadline for the dialer.
- Do not tolerate ENOENT while reconnecting shim socket. This is to avoid stuck at containerd startup when shim sockets are unexpected missing.

This PR also fix #5142

k8s-ci-robot · 2021-06-24T07:26:56Z

Hi @povsister. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Einsfier · 2021-06-24T07:36:49Z

Oh, good. 🤣
Seems that I need to remove net.ErrClosed to make it work on prior Go 1.16 releases.

Unify timeout dialer and context dialer implementation. Fix ContextDialer to use a real context-aware dialer. Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>

Containerd may stuck for hours at startup while reconnecting shim sockets which are unexpected missing. This fix adds a new ContextDialerFunc to allow you customize error handler during the dialing. Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>

theopenlab-ci · 2021-06-24T07:51:23Z

Build succeeded.

containerd-build-arm64 : NODE_FAILURE in 0s (non-voting)
containerd-test-arm64 : NODE_FAILURE in 0s (non-voting)
containerd-integration-test-arm64 : NODE_FAILURE in 0s (non-voting)

theopenlab-ci · 2021-06-24T07:57:57Z

Build succeeded.

containerd-build-arm64 : NODE_FAILURE in 0s (non-voting)
containerd-test-arm64 : NODE_FAILURE in 0s (non-voting)
containerd-integration-test-arm64 : NODE_FAILURE in 0s (non-voting)

Einsfier · 2021-06-24T08:00:56Z

@fuweid All done, PTAL.

thaJeztah · 2021-06-24T08:01:23Z

/ok-to-test

Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>

theopenlab-ci · 2021-06-24T09:08:15Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 4m 52s (non-voting)
containerd-test-arm64 : SUCCESS in 5m 32s (non-voting)
containerd-integration-test-arm64 : FAILURE in 21m 40s (non-voting)

fuweid · 2021-06-24T14:22:10Z

runtime/v2/shim/util_unix.go

-	return AnonDialer(address, timeout)
+	ctx, cancel := context.WithTimeout(context.TODO(), timeout)
+	defer cancel()
+	return dialer.ContextDialerFunc(ctx, address, nil)


why not just use https://golang.org/pkg/net/#DialTimeout ?

If we use DialTimeout directly, there is no meaning to unify dialer. Also, the dialer implementation is different between *unix and windows.

I just found that we can unify the AnonDialer and AnonReconnectDialer to call ContextDialer, then they can be removed from util_unix.go and util_windows.go
Shall I add another commit to unify the Dialer implementation on windwos ?

I think it is fine to keep them separated. ContextDialer looks like for grpc call.

I think it is fine to keep them separated. ContextDialer looks like for grpc call.

I mean we can use the ContextDialer to unify the AnonDialer and AnonReconnectDialer 's implementation in util on different platforms.

Well, I don't think we should align containerd client's dialer with shim's. DRY is good principle. But in this case, the ContextDialer looks like complicated than before and the AnonDialer should be simple.

I am not windows developer so I am not sure we can remove the AnonDialer from different platforms. Just my two cents.

For the shim, there is no necessary to use ContextDialer which bring one more goroutines to handle that error.
The DialTimeout is more clear than ContextDialer in this case currently.

Since there are more issues reporting this case, I would like to fix that case and then doing that refactor(unify or something).

cc @containerd/reviewers @containerd/maintainers

I'm also trying to figure out why we drop the given context to call go net.dialer with a timeout value.. which then uses a new background context

We should probably be calling net.DialContext instead but we'd have to go review the history of this to see why that didn't happen..?

I'm also trying to figure out why we drop the given context to call go net.dialer with a timeout value.

The reason I guess is that DialContext func has a receiver and it can not be directly used by simply importing net package, maybe the maintainer didn't notice it and invented the wheels by himself.
Using DialContext could save us lots of effort, I will update this PR later, trying to adapt DialContext.

fuweid · 2021-06-24T14:23:12Z

also fix #5142

I don't think it can handle this case. The 5142 needs to handle SIGTERM in runc-shim-v1/v2.

Einsfier · 2021-06-24T14:39:23Z

I don't think it can handle this case. The 5142 needs to handle SIGTERM in runc-shim-v1/v2.

From the log provided in 5142, it is confirmed that the slow-start issues causes systemd to consider containerd service failed to start after 15min. I have a detailed explanation about how Type=notify affects service health check and how it leads to service startup failure in the #5142 (comment)

Now the shimv2 simply ignores SIGTERM, so it will keep running while containerd daemon is being restarted/stopped by systemd.

When host is rebooted, those shims finally got SIGKILLED, I guess that @joedborg also explicitly configured containerd state to a persistent disk, so containerd will also get the slow-cleanup for dead shims. This is what exactly this PR tries to fix.

fuweid · 2021-06-26T15:50:02Z

In the terminating stage (during shutdown) , the systemd might send the TERM first. If there is any process which ignores the signal, it will take time to shutdown. The shimv2-runc is the one process.

And then, the dead shim lefts the bundle in the persist disk and the containerd will take more time to recover it, which can be solved by this pr.

mikebrow

see comments/questions

mikebrow · 2021-07-20T17:28:03Z

pkg/dialer/dialer.go

+)
+
+// ContextDialer returns a GRPC net.Conn connected to the provided address.
+// It tolerates ENOENT and keeps retrying if provided context has a deadline.


with this commit this function now passes along the context it receives..

I don't understand why we would not tolerate ENOENT in the case where the context passed in does not have a deadline. Are we expecting that case to improve the speed in scenarios where we take an ENOENT? Or is the performance issue just the default dial timeout is too long causing the possibility of an ENOENT and thus a retry would reset the very long default again?

I suspect we should always tolerate ENOENT otherwise we'll get spurious failures modes?

I don't understand why we would not tolerate ENOENT in the case where the context passed in does not have a deadline.

I explicitly add a hint saying "... if provided context has a deadline." because the original code does so, but the author didn't give any notification or warning about it.
Actually, To avoid any unexpected failure or inconsistent behavior after this patch, I just rewrite this function to make it "more context-aware" without introducing any observable change from caller's view.

ok... not sure if the two additional lines of descriptive text for the function help or not

pkg/dialer/dialer.go

mikebrow · 2021-07-20T19:58:45Z

pkg/dialer/dialer.go

-		synC  = make(chan *dialResult)
-	)
+// timeoutDialer connects to the provided address with a timeout.
+func timeoutDialer(address string, timeout time.Duration) (conn net.Conn, err error) {


Dialer/timeoutDialer was deprecated in favor of forcing the caller to set context. In the containerd/containerd repo we only use dialer.Dialer a few times. Perhaps we should stop using Dialer in those and just go ahead and recode to use proper context?

However.. it's exported and located under pkg, no one knows if there are any 3rd party depends on it. If any member from containerd determines to completely remove the deprecated Dialer, I am glad to help remove any existing usage in the project scope.

crosbymichael

I want to review this one to see if there is an actual issue with the handling. I don't support the statedir not being on tmpfs. That is an invalid configuration and we shouldn't jump though hoops in the code to support this.

mupasak · 2021-08-09T07:08:19Z

I have tried changing statedir to be on tmpfs and it seems to work as expected. containerd boot up quickly as soon as the node gets rebooted without the ongoing patchset, though I could not see anything clearly mentioned in any document that mandates us to use tmpfs as state dir. It would be better if we add it to the document .

fuweid · 2021-08-14T04:14:50Z

I would like to fix the issue first and then refactor. any thoughts?

@mikebrow @dmcgowan @crosbymichael

In linux platform, the shim server always listens on the socket before the containerd task manager dial it. It is unlikely that containerd task manager should handle reconnect because the shim can't restart. For this case, the containerd task manager should fail fast if there is ENOENT or ECONNREFUSED error. And if the socket file is deleted during cleanup the exited task, it maybe cause that containerd task manager takes long time to reload the dead shim. For that task.v2 manager, the race case is like: ``` TaskService.Delete TaskManager.Delete(runtime/v2/manager.go) shim.delete(runtime/v2/shim.go) shimv2api.Shutdown(runtime/v2/task/shim.pb.go) <- containerd has been killed or restarted somehow bundle.Delete ``` The shimv2api.Shutdown will cause that the shim deletes socket file (containerd-shim-runc-v2 does). But the bundle is still there. During reloading, the containerd will wait for the socket file appears again in 100 seconds. It is not reasonable. The Reconnect should prevent this case by fast fail. Closes: containerd#5648. Fixes: containerd#5597. Signed-off-by: Wei Fu <fuweid89@gmail.com>

In linux platform, the shim server always listens on the socket before the containerd task manager dial it. It is unlikely that containerd task manager should handle reconnect because the shim can't restart. For this case, the containerd task manager should fail fast if there is ENOENT or ECONNREFUSED error. And if the socket file is deleted during cleanup the exited task, it maybe cause that containerd task manager takes long time to reload the dead shim. For that task.v2 manager, the race case is like: ``` TaskService.Delete TaskManager.Delete(runtime/v2/manager.go) shim.delete(runtime/v2/shim.go) shimv2api.Shutdown(runtime/v2/task/shim.pb.go) <- containerd has been killed or restarted somehow bundle.Delete ``` The shimv2api.Shutdown will cause that the shim deletes socket file (containerd-shim-runc-v2 does). But the bundle is still there. During reloading, the containerd will wait for the socket file appears again in 100 seconds. It is not reasonable. The Reconnect should prevent this case by fast fail. Closes: containerd#5648. Fixes: containerd#5597. Signed-off-by: Wei Fu <fuweid89@gmail.com> (cherry picked from commit f7658e3) Signed-off-by: Wei Fu <fuweid89@gmail.com>

k8s-ci-robot added the needs-ok-to-test label Jun 24, 2021

Einsfier added 2 commits June 24, 2021 15:47

Refactor dialer

3f42de4

Unify timeout dialer and context dialer implementation. Fix ContextDialer to use a real context-aware dialer. Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>

Einsfier force-pushed the fix-shim-reconnect branch from bd35dcc to 467d717 Compare June 24, 2021 07:50

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Jun 24, 2021

Einsfier force-pushed the fix-shim-reconnect branch 2 times, most recently from 91a873d to 7269405 Compare June 24, 2021 08:31

Make govet happy and fix flaky test

4d19774

Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>

Einsfier force-pushed the fix-shim-reconnect branch from 7269405 to 4d19774 Compare June 24, 2021 08:44

fuweid reviewed Jun 24, 2021

View reviewed changes

fuweid added cherry-pick/1.4.x labels Jun 26, 2021

gcapizzi mentioned this pull request Jul 5, 2021

Garden does not recover after a machine restart cloudfoundry/garden-runc-release#195

Closed

mikebrow reviewed Jul 20, 2021

View reviewed changes

crosbymichael requested changes Jul 30, 2021

View reviewed changes

fuweid mentioned this pull request Sep 20, 2021

runtime: should fail fast if dial error on shim #6031

Merged

dmcgowan closed this in f7658e3 Oct 7, 2021

fuweid mentioned this pull request Jan 11, 2023

[release/1.5] runtime: should fail fast if dial error on shim #7953

Merged

Do not tolerate ENOENT while reconnecting shim socket #5648

Do not tolerate ENOENT while reconnecting shim socket #5648

Uh oh!

Conversation

Einsfier commented Jun 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Jun 24, 2021

Uh oh!

Einsfier commented Jun 24, 2021

Uh oh!

theopenlab-ci bot commented Jun 24, 2021

Uh oh!

theopenlab-ci bot commented Jun 24, 2021

Uh oh!

Einsfier commented Jun 24, 2021

Uh oh!

thaJeztah commented Jun 24, 2021

Uh oh!

theopenlab-ci bot commented Jun 24, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fuweid commented Jun 24, 2021

Uh oh!

Einsfier commented Jun 24, 2021

Uh oh!

fuweid commented Jun 26, 2021

Uh oh!

mikebrow left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Einsfier Jul 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

crosbymichael left a comment

Choose a reason for hiding this comment

Uh oh!

mupasak commented Aug 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fuweid commented Aug 14, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Einsfier commented Jun 24, 2021 •

edited

Loading

Einsfier Jul 23, 2021 •

edited

Loading

mupasak commented Aug 9, 2021 •

edited

Loading