-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Do not tolerate ENOENT while reconnecting shim socket #5648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @povsister. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Oh, good. 🤣 |
Unify timeout dialer and context dialer implementation. Fix ContextDialer to use a real context-aware dialer. Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>
Containerd may stuck for hours at startup while reconnecting shim sockets which are unexpected missing. This fix adds a new ContextDialerFunc to allow you customize error handler during the dialing. Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>
bd35dcc to
467d717
Compare
|
Build succeeded.
|
1 similar comment
|
Build succeeded.
|
|
@fuweid All done, PTAL. |
|
/ok-to-test |
91a873d to
7269405
Compare
Signed-off-by: Ethan Chen <pov@mahou-shoujo.moe>
7269405 to
4d19774
Compare
|
Build succeeded.
|
| return AnonDialer(address, timeout) | ||
| ctx, cancel := context.WithTimeout(context.TODO(), timeout) | ||
| defer cancel() | ||
| return dialer.ContextDialerFunc(ctx, address, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just use https://golang.org/pkg/net/#DialTimeout ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use DialTimeout directly, there is no meaning to unify dialer. Also, the dialer implementation is different between *unix and windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found that we can unify the AnonDialer and AnonReconnectDialer to call ContextDialer, then they can be removed from util_unix.go and util_windows.go
Shall I add another commit to unify the Dialer implementation on windwos ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to keep them separated. ContextDialer looks like for grpc call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to keep them separated. ContextDialer looks like for grpc call.
I mean we can use the ContextDialer to unify the AnonDialer and AnonReconnectDialer 's implementation in util on different platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I don't think we should align containerd client's dialer with shim's. DRY is good principle. But in this case, the ContextDialer looks like complicated than before and the AnonDialer should be simple.
I am not windows developer so I am not sure we can remove the AnonDialer from different platforms. Just my two cents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the shim, there is no necessary to use ContextDialer which bring one more goroutines to handle that error.
The DialTimeout is more clear than ContextDialer in this case currently.
Since there are more issues reporting this case, I would like to fix that case and then doing that refactor(unify or something).
cc @containerd/reviewers @containerd/maintainers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also trying to figure out why we drop the given context to call go net.dialer with a timeout value.. which then uses a new background context
We should probably be calling net.DialContext instead but we'd have to go review the history of this to see why that didn't happen..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also trying to figure out why we drop the given context to call go net.dialer with a timeout value.
The reason I guess is that DialContext func has a receiver and it can not be directly used by simply importing net package, maybe the maintainer didn't notice it and invented the wheels by himself.
Using DialContext could save us lots of effort, I will update this PR later, trying to adapt DialContext.
I don't think it can handle this case. The 5142 needs to handle SIGTERM in runc-shim-v1/v2. |
From the log provided in 5142, it is confirmed that the slow-start issues causes systemd to consider containerd service failed to start after 15min. I have a detailed explanation about how Now the shimv2 simply ignores SIGTERM, so it will keep running while containerd daemon is being restarted/stopped by systemd. When host is rebooted, those shims finally got SIGKILLED, I guess that @joedborg also explicitly configured containerd state to a persistent disk, so containerd will also get the slow-cleanup for dead shims. This is what exactly this PR tries to fix. |
|
In the terminating stage (during shutdown) , the systemd might send the TERM first. If there is any process which ignores the signal, it will take time to shutdown. The shimv2-runc is the one process. And then, the dead shim lefts the bundle in the persist disk and the containerd will take more time to recover it, which can be solved by this pr. |
mikebrow
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments/questions
| ) | ||
|
|
||
| // ContextDialer returns a GRPC net.Conn connected to the provided address. | ||
| // It tolerates ENOENT and keeps retrying if provided context has a deadline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with this commit this function now passes along the context it receives..
I don't understand why we would not tolerate ENOENT in the case where the context passed in does not have a deadline. Are we expecting that case to improve the speed in scenarios where we take an ENOENT? Or is the performance issue just the default dial timeout is too long causing the possibility of an ENOENT and thus a retry would reset the very long default again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we should always tolerate ENOENT otherwise we'll get spurious failures modes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we would not tolerate ENOENT in the case where the context passed in does not have a deadline.
I explicitly add a hint saying "... if provided context has a deadline." because the original code does so, but the author didn't give any notification or warning about it.
Actually, To avoid any unexpected failure or inconsistent behavior after this patch, I just rewrite this function to make it "more context-aware" without introducing any observable change from caller's view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok... not sure if the two additional lines of descriptive text for the function help or not
| synC = make(chan *dialResult) | ||
| ) | ||
| // timeoutDialer connects to the provided address with a timeout. | ||
| func timeoutDialer(address string, timeout time.Duration) (conn net.Conn, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dialer/timeoutDialer was deprecated in favor of forcing the caller to set context. In the containerd/containerd repo we only use dialer.Dialer a few times. Perhaps we should stop using Dialer in those and just go ahead and recode to use proper context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However.. it's exported and located under pkg, no one knows if there are any 3rd party depends on it. If any member from containerd determines to completely remove the deprecated Dialer, I am glad to help remove any existing usage in the project scope.
crosbymichael
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to review this one to see if there is an actual issue with the handling. I don't support the statedir not being on tmpfs. That is an invalid configuration and we shouldn't jump though hoops in the code to support this.
|
I have tried changing statedir to be on tmpfs and it seems to work as expected. containerd boot up quickly as soon as the node gets rebooted without the ongoing patchset, though I could not see anything clearly mentioned in any document that mandates us to use tmpfs as state dir. It would be better if we add it to the document . |
|
I would like to fix the issue first and then refactor. any thoughts? |
In linux platform, the shim server always listens on the socket before
the containerd task manager dial it. It is unlikely that containerd task
manager should handle reconnect because the shim can't restart. For this
case, the containerd task manager should fail fast if there is ENOENT or
ECONNREFUSED error.
And if the socket file is deleted during cleanup the exited task, it
maybe cause that containerd task manager takes long time to reload the
dead shim. For that task.v2 manager, the race case is like:
```
TaskService.Delete
TaskManager.Delete(runtime/v2/manager.go)
shim.delete(runtime/v2/shim.go)
shimv2api.Shutdown(runtime/v2/task/shim.pb.go)
<- containerd has been killed or restarted somehow
bundle.Delete
```
The shimv2api.Shutdown will cause that the shim deletes socket file
(containerd-shim-runc-v2 does). But the bundle is still there. During
reloading, the containerd will wait for the socket file appears again
in 100 seconds. It is not reasonable. The Reconnect should prevent this
case by fast fail.
Closes: containerd#5648.
Fixes: containerd#5597.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
In linux platform, the shim server always listens on the socket before
the containerd task manager dial it. It is unlikely that containerd task
manager should handle reconnect because the shim can't restart. For this
case, the containerd task manager should fail fast if there is ENOENT or
ECONNREFUSED error.
And if the socket file is deleted during cleanup the exited task, it
maybe cause that containerd task manager takes long time to reload the
dead shim. For that task.v2 manager, the race case is like:
```
TaskService.Delete
TaskManager.Delete(runtime/v2/manager.go)
shim.delete(runtime/v2/shim.go)
shimv2api.Shutdown(runtime/v2/task/shim.pb.go)
<- containerd has been killed or restarted somehow
bundle.Delete
```
The shimv2api.Shutdown will cause that the shim deletes socket file
(containerd-shim-runc-v2 does). But the bundle is still there. During
reloading, the containerd will wait for the socket file appears again
in 100 seconds. It is not reasonable. The Reconnect should prevent this
case by fast fail.
Closes: containerd#5648.
Fixes: containerd#5597.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
(cherry picked from commit f7658e3)
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This PR aims to fix #5597 , but also introduces some new utils.
I have separated it into several commits to make review easier.
ContextDialerimplementation, now it's a real context-aware dialer, previously it was simply implemented using context deadline which has no response to context cancellation.ContextDialerhas the exactly same behavior as before, which means it toleratesENOENTwhen context has a deadline.timeoutDialerimplementation, now it uses a context with timeout to simulate timeout setting.ContextDialerFunc, allowing customization to dial error handling. The dialer will keep retrying if the error is tolerable. It's caller's responsibility to cancel or set a deadline for the dialer.This PR also fix #5142