🐛 Wait for runnables to stop: fix for #350 and #429 (#664)
dbenque wants to merge 1 commit into kubernetes-sigs:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dbenque. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Welcome @dbenque!

Hi @dbenque. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
97d3e5f to c013ca0 (compare)
/assign @pwittrock

@dbenque can you fix the conflicts?
c013ca0 to 20e791d (compare)
@alvaroaleman the PR has been rebased and the conflicts resolved. Note that I had to rework part of fe4ada0. This PR (#664) gives the same result, but it does not black-hole runnable errors: fe4ada0#diff-77faf6b20512574869434402d5c5b6a2R179. This is important because we want to wait for runnables to stop, so we must catch and handle errors while the runnables are stopping.
/ok-to-test

This reverts many changes in #651
@DirectXMan12 regarding #651, this PR achieves the same result and also catches all runnable errors during the teardown period. With this PR no error is silently dropped, and at the same time we wait for all runnables to stop (or time out).

I like this approach. It seems to me like the error draining works nicely here. The same blocking mentioned in #651 (comment) can occur if more than one controller errors out around L392, but since the stop routine drains the channel while it handles proper shutdown, it won't actually block the controllers from exiting. I like this more than the error signaler, personally. One thing I noticed: the runnables are all wired up, but the manager brings up the metrics endpoint and health probes separately. Do we care about gracefully terminating those as well?

@alexeldeib, you are right; to avoid that, on my side I explicitly disable the metrics.
DirectXMan12 left a comment

Minor comments inline. I agree with @alexeldeib that we should be treating the servers as runnables too -- we don't want goroutine leaks on shutdown again.
```go
		return err
	}
}

func (cm *controllerManager) engageStopProcedure(stopComplete chan struct{}) error {
```

```go
	return cm.waitForRunnableToEnd()
}
```
```go
func (cm *controllerManager) waitForRunnableToEnd() error {
```

this whole section needs an overview comment of the stop procedure stuff
```go
	allStopped := make(chan struct{})

	go func() {
		cm.waitForRunnable.Wait()
```
this'll leak through a timeout

not sure there's a good way around it though
```diff
 go func() {
 	if err := cm.startCache(cm.internalStop); err != nil {
-		cm.errSignal.SignalError(err)
+		cm.errChan <- err
```
I kinda feel like we should never have anything writing to the error channel directly like this, and instead just wrap everything in a runnable to avoid accidentally forgetting to increment the runnable counter.
(as a follow-up PR for someone -- a test that ensures we don't add any additional leaked goroutines in new code would be nice -- just start the manager then stop it, and use
(I'll file an issue)

(#724)
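One way the suggested leak test could be sketched (the helper `checkNoLeak` is hypothetical, not code from the PR or #724), using `runtime.NumGoroutine` as the baseline check:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// checkNoLeak snapshots the goroutine count, runs start then stop, and
// verifies the count settles back to (or below) the baseline. The retry loop
// gives exiting goroutines time to actually be reaped.
func checkNoLeak(start, stop func()) error {
	before := runtime.NumGoroutine()
	start()
	stop()
	for i := 0; i < 100; i++ {
		if runtime.NumGoroutine() <= before {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("goroutine leak: %d before, %d after", before, runtime.NumGoroutine())
}

func main() {
	stopCh := make(chan struct{})
	start := func() {
		go func() { <-stopCh }() // well-behaved goroutine: exits on stop
	}
	stop := func() { close(stopCh) }
	fmt.Println(checkNoLeak(start, stop))
}
```

Counting raw goroutines is crude (other machinery can shift the baseline); a dedicated leak detector would be more robust in a real test suite, but the shape of the check is the same.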
@dbenque: The following test failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
hey, are you still interested in working on this?

@dbenque are you still interested in working on this change?
@dbenque: PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Closing for inactivity, feel free to reopen if necessary. /close

@vincepri: Closed this PR. In response to this:

> Closing for inactivity, feel free to reopen if necessary. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR fixes 🐛 #350 and 🐛 #429.
The manager.Start function now returns only when all Runnables have properly returned, or when the timeout expires.
The timeout value can be configured via manager.Options.