bugfix: change the flag of open log fifo to avoid containerd hang on syscall open#4906
Conversation
|
Hi @payall4u. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Build succeeded.
|
|
@payall4u could you mind to rebase from upstream? Thanks |
8e3ff22 to
506f7bd
Compare
|
Build succeeded.
|
Done. Do I need to modify that windows test? |
506f7bd to
fd6deeb
Compare
|
Build succeeded.
|
Please use real name |
|
I tried to add hook to reproduce this issue but failed. I found that fifo can handle this https://github.com/containerd/fifo/blob/master/fifo.go#L228. Not sure what the root cause is. |
Open Fifo and start server. If containerd fails to start shim, the defer will close the fifo. |
|
Sure, shim will open fifo first. I think I got wrong point. |
|
@payall4u Yeah. I guess we should wrap cancelShimLog in ttrpcCloseOnHook when loadShim. Therefore, if the shim connect closes, the goroutine will stop by canceled ctx. |
|
how to reproduce issue: And ctr pprof goroutines to dump and see that And It wastes one thread to hang. @payall4u Your change is good. Would you mind to add the WDRD when loadShim? Thanks! And after this patch, the containerd might still have issue about opening too many fd when use grouping shimV2. But it is better than wasting threads :). |
OK, and I'm checking for the bug shim-v2 opening too many fd recently 😢 . I think it opened by exec. |
you can show the result by |
fd6deeb to
8a0f94c
Compare
|
Build succeeded.
|
Signed-off-by: Zhiyu Li <payall4u@qq.com>
8a0f94c to
957fa33
Compare
|
Build succeeded.
|
|
Cherry picked into #4971 for |
After containerd#4906, containerd opens fifo in read/write mode in linux platform The original comment doesn't correct and is removed by containerd#5174. ``` // original comment // When using a multi-container shim, the fifo of the 2nd to Nth // container will not be opened when the ctx is done. This will // cause an ErrReadClosed that can be ignored. ``` However, we should add comment for checkCopyShimLogError to mention why we call checkCopyShimLogError. The checkCopyShimLogError, it is to prevent the flood of expected error messages after task die and the expected errors depend on platform. Signed-off-by: Wei Fu <fuweid89@gmail.com>
From #4905
If shim never call openLog or died, some thread of containerd will hang on openat.
Considering #3266 and #4046, open the fifo with RDWR and close the fifo-wrapper when shim closed.
Signed-off-by: payall4u payall4u@qq.com