
Use non-blocking send and retry for exit events #3540

Merged
crosbymichael merged 4 commits into containerd:master from crosbymichael:shim-hang
Aug 20, 2019

Conversation

@crosbymichael
Member

@crosbymichael crosbymichael commented Aug 15, 2019

Closes #3529

Carry of #3529

Contributor

@jterry75 jterry75 left a comment


Minor nit. LGTM

@theopenlab-ci

theopenlab-ci Bot commented Aug 15, 2019

Build succeeded.

@theopenlab-ci

theopenlab-ci Bot commented Aug 15, 2019

Build succeeded.

@crosbymichael
Member Author

I've been running the repro case from the original PR and I'm unable to reproduce the hang with these changes. I'll keep it going for now though

Member

@estesp estesp left a comment


LGTM

@theopenlab-ci

theopenlab-ci Bot commented Aug 15, 2019

Build succeeded.

@crosbymichael crosbymichael force-pushed the shim-hang branch 2 times, most recently from db8e36e to 9d0ac80 on August 15, 2019 20:28
@theopenlab-ci

theopenlab-ci Bot commented Aug 15, 2019

Build succeeded.

@theopenlab-ci

theopenlab-ci Bot commented Aug 15, 2019

Build succeeded.

@Random-Liu
Member

Random-Liu commented Aug 15, 2019

The CRI plugin doesn't rely on the TaskExit event for container lifecycle management any more, so TaskExit event loss should not be a problem any more.

So LGTM.

@jterry75
Contributor

> The CRI plugin doesn't rely on the TaskExit event for container lifecycle management any more, so TaskExit event loss should not be a problem any more.
>
> So LGTM.

@Random-Liu - If I could +1k I would...

@Ace-Tang
Contributor

This looks like a more reasonable way to send exit events 👍

keloyang and others added 3 commits August 16, 2019 13:55
shim.Reap and shim.Default.Wait may deadlock, use Monitor.Notify
to fix this issue.

Signed-off-by: Shukui Yang <keloyangsk@gmail.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
@theopenlab-ci

theopenlab-ci Bot commented Aug 16, 2019

Build succeeded.

@theopenlab-ci

theopenlab-ci Bot commented Aug 16, 2019

Build succeeded.

@theopenlab-ci

theopenlab-ci Bot commented Aug 16, 2019

Build succeeded.

@crosbymichael
Member Author

There are still some issues with this change that I am working through. It's not yet robust enough to merge.

@crosbymichael
Member Author

Ok, this is solid now and ready for final reviews

Member

@dmcgowan dmcgowan left a comment


LGTM

Member

@cpuguy83 cpuguy83 left a comment


LGTM

@keloyang
Contributor

This change doesn't resolve the problem; on my CentOS 7 machine it's easy to reproduce.

[root@centos3 shim-sigchld]# uname -a
Linux centos3 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

The reason is that the code receives from <-timer.C twice; adding the following change fixes it.

@@ -41,12 +41,13 @@ type subscriber struct {

 func (s *subscriber) close() {
        s.Lock()
+       defer s.Unlock()
+
        if s.closed {
                return
        }
        close(s.c)
        s.closed = true
-       s.Unlock()
 }

 func (s *subscriber) do(fn func()) {

@@ -128,13 +129,14 @@ func (m *Monitor) Subscribe() chan runc.Exit {
 // Unsubscribe to process exit changes
 func (m *Monitor) Unsubscribe(c chan runc.Exit) {
        m.Lock()
+       defer m.Unlock()
+
        s, ok := m.subscribers[c]
        if !ok {
                return
        }
        s.close()
        delete(m.subscribers, c)
-       m.Unlock()
 }

 func (m *Monitor) getSubscribers() map[chan runc.Exit]*subscriber {
@@ -148,18 +150,18 @@ func (m *Monitor) getSubscribers() map[chan runc.Exit]*subscriber {
 }

 func (m *Monitor) notify(e runc.Exit) chan struct{} {
-       const timeout = 10 * time.Millisecond
        // make sure reaper can get exit events as soon as possible,  10ms is long
+       const timeout = time.Millisecond
        var (
                done    = make(chan struct{}, 1)
                timer   = time.NewTimer(timeout)
                success = make(map[chan runc.Exit]struct{})
        )
-       stop(timer)
+       stop(timer, true)

        go func() {
                defer close(done)

                // with consumers present this is not an endless loop; bounding the iteration count theoretically risks failing to deliver
-               for i := 0; i < 100; i++ {
+               for ;; {
                        var (
                                failed      int
                                subscribers = m.getSubscribers()
@@ -173,27 +175,29 @@ func (m *Monitor) notify(e runc.Exit) chan struct{} {
                                                return
                                        }
                                        timer.Reset(timeout)
+
+                                       needRcv := true
                                        select {
                                        case s.c <- e:
                                                success[s.c] = struct{}{}
                                        case <-timer.C:
+                                               needRcv = false
                                                failed++
                                        }
                                        // avoid call <-timer.C twice
-                                       stop(timer)
+                                       stop(timer, needRcv)
                                })
                        }
                        // all subscribers received the message
                        if failed == 0 {
                                return
                        }
                        // make sure reaper can get exit events as soon as possible, 100ms is too long
-                       time.Sleep(100 * time.Millisecond)
                }
        }()
        return done
 }
-func stop(timer *time.Timer) {
-       if !timer.Stop() {
+func stop(timer *time.Timer, needRcv bool) {
+       if !timer.Stop() && needRcv {
                <-timer.C
        }
 }

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
@crosbymichael
Member Author

@keloyang I incorporated your changes into this.

@theopenlab-ci

theopenlab-ci Bot commented Aug 19, 2019

Build succeeded.

@dmcgowan dmcgowan added this to the 1.3 milestone Aug 19, 2019
@dmcgowan
Member

Ping @Ace-Tang @fuweid @keloyang @darfux: can you prioritize an updated review on this today? We would like to get this merged and into the next beta release.

@keloyang
Contributor

Another tiny optimization: we can change bufferSize back to 32; 2048 is a waste.

const bufferSize = 2048

@fuweid
Member

fuweid commented Aug 20, 2019

Thanks @keloyang for pointing it out! I think we should change the channel size back to 32.

LGTM


I also tried changing the timeout from 1 * time.Millisecond to 100 * time.Millisecond to check performance, since I think 1 * time.Millisecond is too short. I found a performance issue in the shim, but it is not related to this PR.


We don't need to read the config every time, because most processes are not the init process.

I think we should file a patch after this PR.

diff --git a/runtime/v1/shim/service.go b/runtime/v1/shim/service.go
index 37eae7f..57f7438 100644
--- a/runtime/v1/shim/service.go
+++ b/runtime/v1/shim/service.go
@@ -515,17 +515,17 @@ func (s *Service) allProcesses() []process.Process {
 }

 func (s *Service) checkProcesses(e runc.Exit) {
-       shouldKillAll, err := shouldKillAllOnExit(s.bundle)
-       if err != nil {
-               log.G(s.context).WithError(err).Error("failed to check shouldKillAll")
-       }

        for _, p := range s.allProcesses() {
                if p.Pid() == e.Pid {
+                       if ip, ok := p.(*process.Init); ok {
+                               // Ensure all children are killed
+                               shouldKillAll, err := shouldKillAllOnExit(s.bundle)
+                               if err != nil {
+                                       log.G(s.context).WithError(err).Error("failed to check shouldKillAll")
+                               }

-                       if shouldKillAll {
-                               if ip, ok := p.(*process.Init); ok {
-                                       // Ensure all children are killed
+                               if shouldKillAll {
                                        if err := ip.KillAll(s.context); err != nil {
                                                log.G(s.context).WithError(err).WithField("id", ip.ID()).
                                                        Error("failed to kill init's children")

After change, the perf looks better.


cc @crosbymichael

@darfux
Contributor

darfux commented Aug 20, 2019

LGTM

@crosbymichael
Member Author

Let's merge this one and add PRs for the rest of the optimizations.

@crosbymichael crosbymichael merged commit 08061c7 into containerd:master Aug 20, 2019
@crosbymichael crosbymichael deleted the shim-hang branch August 20, 2019 13:31
@estesp
Member

estesp commented Aug 20, 2019

Cherry-picked to release/1.2 in #3561

@lowang-bh

Mark, looks like this fixed the known issue.
