fix race condition problem in streamwatcher by mandelsoft · Pull Request #98653 · kubernetes/kubernetes

mandelsoft · 2021-02-01T08:25:39Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
The streamwatcher has a race condition that can leave a goroutine blocked
forever when a stream watch is closed.

This happens occasionally when informers are cancelled together with the
watch request via the stop channel. If informers are created and deleted
dynamically, the number of blocked goroutines keeps growing.

The function receive checks under a lock whether the watch has been stopped
before it reports an error from the watch stream to the result channel.
The problem is that the watcher might be stopped by a call to the Stop method
between the check and the send. In the current code, cache.Reflector (method
watchHandler) calls Stop in a defer, which runs after the caller has already
stopped reading from the result channel.
As a result, the stopped flag might be set after the check, and the attempt
to send the error event then blocks forever because there will never be a
receiver again.

The fix introduces a dedicated local stop channel that is closed by the
Stop method and used in a select statement together with the send
operation, so that the loop is reliably aborted.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2021-02-01T08:25:48Z

Hi @mandelsoft. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fedebongio · 2021-02-02T21:24:01Z

/assign @wojtek-t
/cc @lavalamp @yliaog
/triage accepted

lavalamp · 2021-02-02T21:37:48Z

Looks correct, but please make a test?

wojtek-t · 2021-02-03T07:43:47Z

> Looks correct, but please make a test?

+1

mandelsoft · 2021-02-03T12:05:16Z

@lavalamp, @wojtek-t I added a test case for the race condition. It is a bit tricky to force the problematic execution flow, given the several goroutines involved and the Go scheduler.

I hope the test is meaningful. I ran both scenarios, with and without the fix, to check that the test fails without the fix when the race occurs.

yliaog · 2021-02-03T19:35:34Z

/ok-to-test

yliaog · 2021-02-24T03:44:15Z

staging/src/k8s.io/apimachinery/pkg/watch/streamwatcher.go

 	reporter Reporter
 	result   chan Event
-	stopped  bool
+	done     chan struct{}


use <-chan struct{}?
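
One Go detail is relevant to this suggestion: the built-in `close` cannot be applied to a receive-only channel, so a field that `Stop` has to close must keep the bidirectional type. The receive-only view can still be handed to callers. A small sketch, with `stopper`, `Done` and `isStopped` as hypothetical names:

```go
package main

import "fmt"

// stopper is a hypothetical owner of a done channel.
type stopper struct {
	done chan struct{} // bidirectional: the owner must be able to close it
}

func newStopper() *stopper { return &stopper{done: make(chan struct{})} }

// Done exposes the channel as receive-only, so callers can wait on it
// but cannot close it or send on it.
func (s *stopper) Done() <-chan struct{} { return s.done }

// Stop closes the channel through the bidirectional field.
// close(s.Done()) would be rejected by the compiler.
func (s *stopper) Stop() { close(s.done) }

// isStopped polls the receive-only view without blocking.
func isStopped(done <-chan struct{}) bool {
	select {
	case <-done:
		return true
	default:
		return false
	}
}

func main() {
	s := newStopper()
	fmt.Println(isStopped(s.Done())) // false
	s.Stop()
	fmt.Println(isStopped(s.Done())) // true
}
```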

lavalamp · 2021-03-05T00:59:12Z

I never thought I'd say this, but I think I want this fix even if we have to leave out the test. We can't merge in a flaky test.

Can you make this PR have just the fix and a giant "TODO: figure out how to test that ____ can't race with ____"?

mandelsoft · 2021-03-05T15:32:07Z

@lavalamp I just reverted to the original commit containing only the fix.

The streamwatcher has a synchronization problem that may lead to a goroutine blocking forever when closing a stream watch. This happens occasionally when informers are cancelled together with the watch request using the stop channel, which leads to an increasing number of blocked goroutines if informers are dynamically created and deleted again. The function `receive` checks under a lock whether the watch has been stopped before an error is reported to the result channel. The problem is that the watcher might be stopped in between by a call to the `Stop` method. In the current code this is done by the `cache.Reflector` using the streamwatcher, in a defer that is executed after the caller has already stopped reading from the result channel. As a result, the stopped flag might be set after the check, and the attempt to send the error event blocks forever because there will never be a receiver again. The fix introduces a dedicated local stop channel that is closed by the `Stop` method and used in a select statement together with the send operation, so that the loop is reliably aborted.

lavalamp · 2021-03-05T18:07:29Z

staging/src/k8s.io/apimachinery/pkg/watch/streamwatcher.go

+					// As a result the stopping flag might be set after the check
+					// and trying to send the error event blocks this send operation forever,
+					// because there will never be a receiver again.
+					// This results in dead go routines trying to send on the result channel, forever.


I reread this super carefully and given that line 114 wasn't sufficient to solve the problem, are we sure that this is? In either case, there is the possibility that an event is waiting in the result channel after the Stop function is called, no?

mandelsoft · 2021-03-06

The result channel is unbuffered, so every send is a synchronous exchange. Because the send and the done channel share one select, the send is either skipped if the done channel has already been closed, or aborted if the done channel is closed after the send operation has started waiting. If the send is aborted, no handshake takes place and the channel is eventually garbage collected.

But I think I have an even simpler solution that can even omit the stopping function. I have also found a very simple test that runs on my side without false failures.
Because of Go scheduling there might be a very small number of false passes (the test reports ok although the bug is still present), well below 0.1%. After changing from Gosched to 10-millisecond sleeps, no false pass occurs on my machine even with count=10000.
I don't know why I didn't have this idea earlier, but maybe some things need their time.

I'll add this as an additional commit on top.

lavalamp · 2021-03-08T17:25:25Z

/lgtm
/approve

Thank you!

k8s-ci-robot · 2021-03-08T17:25:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, mandelsoft

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/apimachinery/pkg/OWNERS~~ [lavalamp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

To implement controllers that decide dynamically which resources to watch, it must be possible to get rid of dedicated watches and event handlers again. This requires the ability to remove event handlers from SharedIndexInformers. Stopping an informer is not sufficient, because there might be multiple controllers in a controller manager that independently decide which resources to watch. Unfortunately, the ResourceEventHandler interface encourages the use of value objects for handlers (like the ResourceEventHandlerFuncs struct, which uses value receivers to implement the interface). Go does not support comparison of function values, and therefore structs containing them cannot be compared either. Such handlers cannot be matched again after they have been added, so only comparable handlers can be removed. Fortunately, struct-based handlers can also be passed by reference. The user of the interface can therefore still use those handlers and remove them again by switching to a pointer argument. The remove method checks whether a handler can be compared and ignores uncomparable handlers in the removal process. Attempting to remove an uncomparable handler results in an error. Remark: If a complete informer should be disabled as the result of a handler removal, it is highly recommended to incorporate pull request kubernetes#98653, which fixes a race condition when stopping watches for an informer using the stop channel.
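
The comparability point can be illustrated with a small sketch. Everything here is hypothetical (`handlerFuncs`, `isComparable`, `remove`) and is not the API proposed in that pull request; it only shows that a struct with a func field is not comparable while a pointer to it is, using `reflect.Type.Comparable` to check before applying `==`.

```go
package main

import (
	"fmt"
	"reflect"
)

// handlerFuncs is a hypothetical stand-in for a struct of callbacks;
// the func field makes the struct type non-comparable.
type handlerFuncs struct {
	OnAdd func(obj interface{})
}

// isComparable reports whether h can safely be compared with ==.
func isComparable(h interface{}) bool {
	return h != nil && reflect.TypeOf(h).Comparable()
}

// remove returns the handlers without h. It returns an error if h itself
// is not comparable and skips registered handlers that are not comparable.
func remove(handlers []interface{}, h interface{}) ([]interface{}, error) {
	if !isComparable(h) {
		return handlers, fmt.Errorf("handler of type %T is not comparable", h)
	}
	for i, cur := range handlers {
		if isComparable(cur) && cur == h {
			// The full slice expression forces a copy, so the input is not modified.
			return append(handlers[:i:i], handlers[i+1:]...), nil
		}
	}
	return handlers, nil
}

func main() {
	byValue := handlerFuncs{OnAdd: func(interface{}) {}}
	byPointer := &handlerFuncs{OnAdd: func(interface{}) {}}
	handlers := []interface{}{byValue, byPointer}

	_, err := remove(handlers, byValue)
	fmt.Println("remove by value failed:", err != nil) // true
	rest, err := remove(handlers, byPointer)
	fmt.Println("remove by pointer:", len(rest), err) // 1 <nil>
}
```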

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2021

k8s-ci-robot requested review from ncdc and sttts February 1, 2021 08:26

mandelsoft mentioned this pull request Feb 1, 2021

support removal of event handlers from SharedIndexInformers #98657

Closed

k8s-ci-robot assigned wojtek-t Feb 2, 2021

k8s-ci-robot requested review from lavalamp and yliaog February 2, 2021 21:24

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 2, 2021

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 3, 2021

mandelsoft force-pushed the stream branch 2 times, most recently from ef17c4f to 6fe354a Compare February 3, 2021 13:06

k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Feb 3, 2021

yliaog reviewed Feb 24, 2021

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2021

mandelsoft force-pushed the stream branch from 50fbc41 to e57bfd1 Compare March 5, 2021 15:29

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2021

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 5, 2021

mandelsoft force-pushed the stream branch from db3a775 to a81ad38 Compare March 5, 2021 16:15

add comment describing the race condition + TODO for appropriate test

932f98a

mandelsoft force-pushed the stream branch from a81ad38 to 932f98a Compare March 5, 2021 16:59

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2021

lavalamp reviewed Mar 5, 2021

lavalamp reviewed Mar 5, 2021

simplier fix + test for race condition

2355ceb

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2021

k8s-ci-robot assigned lavalamp Mar 8, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 8, 2021

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 8, 2021

k8s-ci-robot merged commit ab7d68a into kubernetes:master Mar 8, 2021

k8s-ci-robot added this to the v1.21 milestone Mar 8, 2021

7 participants
