🐛 Wait for runnables to stop: fix for #350 and #429 (#664)
dbenque wants to merge 1 commit into kubernetes-sigs:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dbenque. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Welcome @dbenque!

Hi @dbenque. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
97d3e5f to c013ca0 (compare)
/assign @pwittrock

@dbenque can you fix the conflicts?
c013ca0 to 20e791d (compare)
@alvaroaleman the PR has been rebased and the conflicts resolved. Note that I had to rework part of fe4ada0. This PR (#664) gives the same result, but it does not black-hole runnable errors: fe4ada0#diff-77faf6b20512574869434402d5c5b6a2R179. This is important because we want to wait for runnables to stop, so we must catch and handle errors while the runnables are stopping.
/ok-to-test

This reverts many changes in #651
@DirectXMan12 regarding #651, this PR achieves the same result and also catches all runnable errors during the teardown period. With this PR no error is silently dropped, and at the same time we wait for all runnables to stop (or time out).

I like this approach. It seems to me like the error draining works nicely here. The same blocking mentioned in #651 (comment) can occur if more than one controller errors out around L392, but since the stop routine drains the channel while it handles proper shutdown, it won't actually block the controllers from exiting. I like this more than the error signaler, personally. One thing I noticed: the runnables are all wired up, but the manager brings up the metrics endpoint and health probes separately. Do we care about gracefully terminating those as well?

@alexeldeib, you are right; to avoid that, on my side I explicitly disable the metrics.
DirectXMan12 left a comment

Minor comments inline. I agree with @alexeldeib that we should be treating the servers as runnables too -- we don't want goroutine leaks on shutdown again.
```go
		return err
	}
}

func (cm *controllerManager) engageStopProcedure(stopComplete chan struct{}) error {
```

```go
	return cm.waitForRunnableToEnd()
}
```
```go
func (cm *controllerManager) waitForRunnableToEnd() error {
```

this whole section needs an overview comment of the stop procedure stuff
```go
	allStopped := make(chan struct{})

	go func() {
		cm.waitForRunnable.Wait()
```
this'll leak through a timeout

not sure there's a good way around it though
```diff
 go func() {
 	if err := cm.startCache(cm.internalStop); err != nil {
-		cm.errSignal.SignalError(err)
+		cm.errChan <- err
```
I kinda feel like we should never have anything writing to the error channel directly like this, and instead just wrap everything in a runnable to avoid accidentally forgetting to increment the runnable counter.
(as a follow-up PR for someone -- a test that ensures we don't add any additional leaked goroutines in new code would be nice -- just start the manager then stop it, and use
(I'll file an issue)

(#724)
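One way the suggested leak test could be sketched (the helper `checkNoLeak` is hypothetical, not code from the PR or #724), using `runtime.NumGoroutine` as the baseline check:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// checkNoLeak snapshots the goroutine count, runs start then stop, and
// verifies the count settles back to (or below) the baseline. The retry loop
// gives exiting goroutines time to actually be reaped.
func checkNoLeak(start, stop func()) error {
	before := runtime.NumGoroutine()
	start()
	stop()
	for i := 0; i < 100; i++ {
		if runtime.NumGoroutine() <= before {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("goroutine leak: %d before, %d after", before, runtime.NumGoroutine())
}

func main() {
	stopCh := make(chan struct{})
	start := func() {
		go func() { <-stopCh }() // well-behaved goroutine: exits on stop
	}
	stop := func() { close(stopCh) }
	fmt.Println(checkNoLeak(start, stop))
}
```

Counting raw goroutines is crude (other machinery can shift the baseline); a dedicated leak detector would be more robust in a real test suite, but the shape of the check is the same.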
@dbenque: The following test failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
hey, are you still interested in working on this?

@dbenque are you still interested in working on this change?
@dbenque: PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Closing for inactivity, feel free to reopen if necessary. /close

@vincepri: Closed this PR. In response to this:

> Closing for inactivity, feel free to reopen if necessary. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR fixes 🐛 #350 and 🐛 #429.
The manager.Start function now returns only when all Runnables have properly returned, or when the timeout expires.
The timeout value can be configured via manager.Options.