Skip to content

bugfix: change the flag of open log fifo to avoid containerd hang on syscall open#4906

Merged
estesp merged 1 commit intocontainerd:masterfrom
payall4u:bugfix/fix-open-shim-fifo
Feb 1, 2021
Merged

bugfix: change the flag of open log fifo to avoid containerd hang on syscall open#4906
estesp merged 1 commit intocontainerd:masterfrom
payall4u:bugfix/fix-open-shim-fifo

Conversation

@payall4u
Copy link
Contributor

@payall4u payall4u commented Jan 5, 2021

From #4905
If shim never call openLog or died, some thread of containerd will hang on openat.
Considering #3266 and #4046, open the fifo with RDWR and close the fifo-wrapper when shim closed.

Signed-off-by: payall4u payall4u@qq.com

@k8s-ci-robot
Copy link

Hi @payall4u. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Jan 5, 2021

Build succeeded.

@fuweid
Copy link
Member

fuweid commented Jan 12, 2021

@payall4u could you mind to rebase from upstream? Thanks

@fuweid fuweid added the status/needs-update Awaiting contributor update label Jan 12, 2021
@payall4u payall4u force-pushed the bugfix/fix-open-shim-fifo branch from 8e3ff22 to 506f7bd Compare January 12, 2021 12:49
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jan 12, 2021

Build succeeded.

@payall4u
Copy link
Contributor Author

@payall4u could you mind to rebase from upstream? Thanks

Done. Do I need to modify that windows test?

@payall4u payall4u force-pushed the bugfix/fix-open-shim-fifo branch from 506f7bd to fd6deeb Compare January 26, 2021 12:20
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jan 26, 2021

Build succeeded.

Copy link
Member

@estesp estesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AkihiroSuda
Copy link
Member

Signed-off-by: payall4u

Please use real name

@fuweid
Copy link
Member

fuweid commented Jan 27, 2021

I tried to add hook to reproduce this issue but failed. I found that fifo can handle this https://github.com/containerd/fifo/blob/master/fifo.go#L228. Not sure what the root cause is.

@fuweid
Copy link
Member

fuweid commented Jan 27, 2021

if !config.NoSetupLogger {
                        if err := setLogger(ctx, idFlag); err != nil {
                                return err
                        }
                }
                client := NewShimClient(ctx, service, signals)
                if err := client.Serve(); err != nil {
                        if err != context.Canceled {
                                return err
                        }
                }
                select {
                case <-publisher.Done():
                        return nil
                case <-time.After(5 * time.Second):
                        return errors.New("publisher not closed")
                }

Open Fifo and start server. If containerd fails to start shim, the defer will close the fifo.

func (b *binary) Start(ctx context.Context, opts *types.Any, onClose func()) (_ *shim, err error) {
        f, err := openShimLog(ctx, b.bundle, client.AnonDialer)
        if err != nil {
                return nil, errors.Wrap(err, "open shim log pipe")
        }
        defer func() {
                if err != nil {
                        f.Close()
                }
        }()
       ...
       address := strings.TrimSpace(string(out))
        conn, err := client.Connect(address, client.AnonDialer)
        if err != nil {
                return nil, err
        }
        ...
}

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need feedback

@payall4u
Copy link
Contributor Author

payall4u commented Jan 27, 2021

Sure, shim will open fifo first. I think I got wrong point.
It can be reproduced by restart containerd with containerd-shim-runc-v2.
When loadShim fifo.go#L74 some log fifo of container (no shim group leader) will be open by function openShimLog bug it has no shim work on.
I think we should check if container is the leader of group first.

@fuweid
Copy link
Member

fuweid commented Jan 27, 2021

@payall4u Yeah. I guess we should wrap cancelShimLog in ttrpcCloseOnHook when loadShim. Therefore, if the shim connect closes, the goroutine will stop by canceled ctx.

@fuweid
Copy link
Member

fuweid commented Jan 27, 2021

how to reproduce issue:

ctr run -d docker.io/library/busybox:1.25 testing2 sleep 1d
ctr run -d --label io.containerd.runc.v2.group=testing2 docker.io/library/busybox:1.25 testing3 sleep 1d

And ctr pprof goroutines to dump and see that

goroutine 9792 [syscall]:
syscall.Syscall6(0x101, 0xffffffffffffff9c, 0xc0003a8de0, 0x80000, 0x0, 0x0, 0x0, 0x2a8, 0x1, 0x11c0)
        /usr/local/go/src/syscall/asm_linux_amd64.s:41 +0x5
syscall.openat(0xffffffffffffff9c, 0xc00060aeb0, 0x10, 0x80000, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:68 +0xc8
syscall.Open(...)
        /usr/local/go/src/syscall/syscall_linux.go:152
os.openFileNolog(0xc00060aeb0, 0x10, 0x0, 0xc000000000, 0xc00066ef88, 0xc000298f20, 0xc0001c96e0)
        /usr/local/go/src/os/file_unix.go:200 +0x91
os.OpenFile(0xc00060aeb0, 0x10, 0x0, 0x0, 0x0, 0x5593efde6f60, 0xc00010eae0)
        /usr/local/go/src/os/file.go:327 +0x65
github.com/containerd/fifo.openFifo.func2(0x0, 0xc00049d0c0, 0x0, 0xc00047f7a0, 0x5593f15dfd60, 0xc00049d040)
        /root/go/src/github.com/containerd/containerd/vendor/github.com/containerd/fifo/fifo.go:130 +0x371
created by github.com/containerd/fifo.openFifo
        /root/go/src/github.com/containerd/containerd/vendor/github.com/containerd/fifo/fifo.go:123 +0x265

And

➜  containerd git:(master) ps -L -o ppid,pid,spid,comm,wchan:32  -C containerd
 PPID   PID  SPID COMMAND         WCHAN
    1  9458  9458 containerd      futex_wait_queue_me
    1  9458  9459 containerd      futex_wait_queue_me
    1  9458  9460 containerd      futex_wait_queue_me
    1  9458  9462 containerd      futex_wait_queue_me
    1  9458  9463 containerd      futex_wait_queue_me
    1  9458  9464 containerd      futex_wait_queue_me
    1  9458  9477 containerd      ep_poll
    1  9458  9478 containerd      futex_wait_queue_me
    1  9458  9479 containerd      futex_wait_queue_me
    1  9458  9480 containerd      ep_poll
    1  9458  9481 containerd      futex_wait_queue_me
    1  9458  9482 containerd      ep_poll
    1  9458  9483 containerd      futex_wait_queue_me
    1  9458 11016 containerd      futex_wait_queue_me
    1  9458 11017 containerd      futex_wait_queue_me
    1  9458 23746 containerd      pipe_wait
    1  9458 23748 containerd      futex_wait_queue_me
    1  9458 23752 containerd      futex_wait_queue_me
    1  9458 23781 containerd      futex_wait_queue_me
    1  9458 23842 containerd      futex_wait_queue_me
    1  9458 23844 containerd      futex_wait_queue_me
➜  containerd git:(master) cat /proc/23746/stack
[<0>] pipe_wait+0x70/0xc0
[<0>] wait_for_partner+0x26/0x60
[<0>] fifo_open+0x270/0x2d0
[<0>] do_dentry_open+0x21d/0x370
[<0>] vfs_open+0x4f/0x80
[<0>] path_openat+0x6bf/0x18b0
[<0>] do_filp_open+0x9b/0x110
[<0>] do_sys_open+0x1ba/0x2c0
[<0>] SyS_openat+0x14/0x20
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[<0>] 0xffffffffffffffff

It wastes one thread to hang.

@payall4u Your change is good. Would you mind to add the WDRD when loadShim? Thanks!

And after this patch, the containerd might still have issue about opening too many fd when use grouping shimV2. But it is better than wasting threads :).

@payall4u
Copy link
Contributor Author

payall4u commented Jan 27, 2021

Would you mind to add the WDRD when loadShim?

OK, and I'm checking for the bug shim-v2 opening too many fd recently 😢 . I think it opened by exec.

@fuweid
Copy link
Member

fuweid commented Jan 27, 2021

I think it opened by exec.

you can show the result by lsof -p $(pidof containerd).

@fuweid fuweid added this to the 1.5 milestone Jan 27, 2021
@payall4u payall4u force-pushed the bugfix/fix-open-shim-fifo branch from fd6deeb to 8a0f94c Compare January 31, 2021 04:59
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jan 31, 2021

Build succeeded.

Signed-off-by: Zhiyu Li <payall4u@qq.com>
@payall4u payall4u force-pushed the bugfix/fix-open-shim-fifo branch from 8a0f94c to 957fa33 Compare January 31, 2021 11:01
@theopenlab-ci
Copy link

theopenlab-ci bot commented Jan 31, 2021

Build succeeded.

@payall4u payall4u requested a review from fuweid February 1, 2021 03:25
Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@estesp estesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@estesp estesp merged commit 49c5c14 into containerd:master Feb 1, 2021
@payall4u payall4u deleted the bugfix/fix-open-shim-fifo branch February 1, 2021 14:47
@estesp
Copy link
Member

estesp commented Feb 1, 2021

Cherry picked into #4971 for release/1.4

@estesp estesp added cherry-picked/1.4.x PR commits are cherry picked into the release/1.4 branch and removed cherry-pick/1.4.x labels Feb 1, 2021
fuweid added a commit to fuweid/containerd that referenced this pull request Mar 18, 2021
After containerd#4906, containerd opens fifo in read/write mode in linux platform
The original comment doesn't correct and is removed by containerd#5174.

```
// original comment

// When using a multi-container shim, the fifo of the 2nd to Nth
// container will not be opened when the ctx is done. This will
// cause an ErrReadClosed that can be ignored.
```

However, we should add comment for checkCopyShimLogError to mention why
we call checkCopyShimLogError. The checkCopyShimLogError, it is to prevent
the flood of expected error messages after task die and the expected
errors depend on platform.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-picked/1.4.x PR commits are cherry picked into the release/1.4 branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants