fix race condition problem in streamwatcher #98653
k8s-ci-robot merged 3 commits into kubernetes:master from
Conversation
Hi @mandelsoft. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Looks correct, but can you please add a test?
+1
@lavalamp, @wojtek-t So, added a test case for the race condition. It is a little bit tricky to enforce the problematic execution flow with several involved go routines and the go scheduling mechanism. I hope the test is meaningful; I tried both scenarios, with and without the fix, to check whether an error occurs if the race happens without the fix.
ef17c4f to 6fe354a
/ok-to-test
```go
	reporter Reporter
	result   chan Event
	stopped  bool
	done     chan struct{}
```
I never thought I'd say this, but I think I want this fix even if we have to leave out the test. We can't merge in a flaky test. Can you make this PR have just the fix and a giant "TODO: figure out how to test that ____ can't race with ____"?
@lavalamp I just reverted to the original commit containing only the fix.
The streamwatcher has a synchronization problem that may lead to a go routine blocking forever when closing a stream watch. This occasionally happens when informers are cancelled together with the watch request using the stop channel, which leads to an increasing number of blocked go routines if informers are dynamically created and deleted again.

The function `receive` checks under a lock whether the watch has been stopped before an error is reported to the result channel. The problem here is that in between the watcher might be stopped by calling the `Stop` method. In the actual code this is done by the `cache.Reflector` using the streamwatcher, by a defer which is executed after the caller has already stopped reading from the result channel. As a result the stopping flag might be set after the check, and trying to send the error event then blocks this send operation forever, because there will never be a receiver again.

The fix introduces a dedicated local stop channel that is closed by the `Stop` method and used in a select statement together with the send operation to finally abort the loop.
```go
// As a result the stopping flag might be set after the check
// and trying to send the error event blocks this send operation forever,
// because there will never be a receiver again.
// This results in dead go routines trying to send on the result channel, forever.
```
I reread this super carefully and given that line 114 wasn't sufficient to solve the problem, are we sure that this is? In either case, there is the possibility that an event is waiting in the result channel after the Stop function is called, no?
The channel size is one, therefore it is a synchronous exchange. Because of the common select, the channel send will either be skipped if the done channel is closed in between, or aborted if the done channel is closed just after the send operation has already been started. If the send is aborted there is no handshake, and the channel is finally garbage collected.
But I think I have an even simpler solution that can even omit the stopping function. And I've found a very simple test that runs on my side without false failures.
But because of the go scheduling there might be very few false accepts (test reports ok, although the error is still present) (<<0.1%). After changing from `Gosched` to 10 ms sleeps, it no longer occurs on my machine with count=10000.
I don't know why I didn't have this idea earlier. But maybe some things need their time.
I'll add this with an additional commit on top.
/lgtm Thank you!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lavalamp, mandelsoft. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
To be able to implement controllers that dynamically decide which resources to watch, it is required to get rid of dedicated watches and event handlers again. This requires the possibility to remove event handlers from SharedIndexInformers again. Stopping an informer is not sufficient, because there might be multiple controllers in a controller manager that independently decide which resources to watch.

Unfortunately the ResourceEventHandler interface encourages the use of value objects for handlers (like the ResourceEventHandlerFuncs struct, which uses value receivers to implement the interface). Go does not support comparison of function pointers, and therefore the comparison of such structs is not possible either. Such handlers cannot be matched again after they have been added, and therefore it is only possible to remove handlers that are comparable. Fortunately struct-based handlers can also be passed by reference. The user of the interface can therefore still use those handlers and remove them again by switching to a pointer argument. The remove method checks whether a handler can be compared and ignores non-comparable handlers in the removal process. Removing a non-comparable handler results in an error return.

Remark: If as the result of a handler removal a complete informer should be disabled, it is highly recommended to incorporate pull request kubernetes#98653, which fixes a race condition when stopping watches for an informer using the stop channel.
What type of PR is this?
/kind bug
What this PR does / why we need it:
The streamwatcher has a synchronization/race-condition problem that may lead to
a go routine blocking forever when closing a stream watch.
This occasionally happens when informers are cancelled together with the
watch request using the stop channel, which leads to an increasing
number of blocked go routines if informers are dynamically created and deleted
again.
The function `receive` checks under a lock whether the watch has been stopped,
before an error from the watch stream is reported to the result channel.
The problem here is that in between the watcher might be stopped by
calling the `Stop` method. In the actual code this is done by the
`cache.Reflector` using the streamwatcher, by a defer (method `watchHandler`)
which is executed after
the caller has already stopped reading from the result channel.
As a result the stopping flag might be set after the check
and trying to send the error event blocks this send operation forever,
because there will never be a receiver again.
The fix introduces a dedicated local stop channel that is closed by the
`Stop` method and used in a select statement together with the send
operation to finally abort the loop.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: