Skip to content

fix race condition problem in streamwatcher#98653

Merged
k8s-ci-robot merged 3 commits intokubernetes:masterfrom
mandelsoft:stream
Mar 8, 2021
Merged

fix race condition problem in streamwatcher#98653
k8s-ci-robot merged 3 commits intokubernetes:masterfrom
mandelsoft:stream

Conversation

@mandelsoft
Copy link
Copy Markdown
Contributor

@mandelsoft mandelsoft commented Feb 1, 2021

What type of PR is this?
/kind bug

What this PR does / why we need it:
The streamwatcher has a synchronization/race-condition problem that may lead to
a go routine blocking forever when closing a stream watch.

This occasionally happens, when informers are cancelled together with the
watch request using the stop channel, which leads to an increasing
number of blocked go routines, if informers are dynamicaly created and deleted
again.

The function receive checks under a lock whether the watch has been stopped,
before an error from the watch stream is reported to the result channel.
The problem here is, that in between the watcher might be stopped by
calling the Stop method. In the actual code this is done by the
cache.Reflector using the streamwatcher by a defer (method watchHandler)
which is executed after
the caller already stopped reading from the result channel.
As a result the stopping flag might be set after the check
and trying to send the error event blocks this send operation forever,
because there will never be a receiver again.

The fix introduces a dedicated local stop channel that is closed by the
Stop method and used in a select statement together with the send
operation to finally abort the loop.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 1, 2021
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @mandelsoft. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 1, 2021
@k8s-ci-robot k8s-ci-robot requested review from ncdc and sttts February 1, 2021 08:26
@fedebongio
Copy link
Copy Markdown
Contributor

/assign @wojtek-t
/cc @lavalamp @yliaog
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 2, 2021
@lavalamp
Copy link
Copy Markdown
Contributor

lavalamp commented Feb 2, 2021

Looks correct, but please make a test?

@wojtek-t
Copy link
Copy Markdown
Member

wojtek-t commented Feb 3, 2021

Looks correct, but please make a test?

+1

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 3, 2021
@mandelsoft
Copy link
Copy Markdown
Contributor Author

mandelsoft commented Feb 3, 2021

@lavalamp, @wojtek-t So, added a test case for the race condition. It is a little bit tricky to enforce the problematic execution flow with several involved go routines and the go scheduling mechanism.

I hope the test is significant, I tried both scenarios: with and without the fix to check whether an error occurs if the race happens without the fix.

@mandelsoft mandelsoft force-pushed the stream branch 2 times, most recently from ef17c4f to 6fe354a Compare February 3, 2021 13:06
@yliaog
Copy link
Copy Markdown
Contributor

yliaog commented Feb 3, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Feb 3, 2021
reporter Reporter
result chan Event
stopped bool
done chan struct{}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use <-chan struct{}?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2021
@lavalamp
Copy link
Copy Markdown
Contributor

lavalamp commented Mar 5, 2021

I never thought I'd say this, but I think I want this fix even if we have to leave out the test. We can't merge in a flaky test.

Can you make this PR have just the fix and a giant "TODO: figure out how to test that ____ can't race with ____"?

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2021
@mandelsoft
Copy link
Copy Markdown
Contributor Author

@lavalamp I just reverted to the original commit containing only the fix.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 5, 2021
The streamwatcher has a synchronization problem that may lead to
a go routine blocking forever when closing a stream watch.

This occasionally happens, when informers are cancelled together with the
watch request using the stop channel, which leads to an increaing
number of blocked go routines, if imformers are dynamicaly created and deleted
again.

The function `receive` checks under a lock whether the watch has been stopped,
before an error is reported to the result channel.
The problem here is, that in between the watcher might be stopped by
calling the `Stop` method. In the actual code this is done by the
`cache.Reflector` using the streamwatcher by a defer which is executed after
the caller already stopped reading from the result channel.
As a result the stopping flag might be set after the check
and trying to send the error event blocks this send operation forever,
because there will never be a receiver again.

The fix introduces a dedicated local stop channel that is closed by the
`Stop` method and used in a select statement together with the send
operation to finally abort the loop.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2021
// As a result the stopping flag might be set after the check
// and trying to send the error event blocks this send operation forever,
// because there will never be a receiver again.
// This results in dead go routines trying to send on the result channel, forever.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reread this super carefully and given that line 114 wasn't sufficient to solve the problem, are we sure that this is? In either case, there is the possibility that an event is waiting in the result channel after the Stop function is called, no?

Copy link
Copy Markdown
Contributor Author

@mandelsoft mandelsoft Mar 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The channel size is one, therefore it is a synchronous exchange. Because of the common select the channel send will either be skipped if the done channel is closed in between, or aborted if the done channel is closed just after the send operation has already been started. If the send is aborted there is no handshake and the channel is finally garbage collected.

But I think I've an even simpler solution that can even omit the stopping function. And I've found a very simple test that runs on my side without false failures.
But because of the go scheduling there might be very few false accepts (test reports ok, although the error is still present). (<<0.1%). After changing from Gosched to 10 Milli sleeps, it even does not occur on my machine with count=10000.
I don't know why I hadn't this idea earlier. But may be some things need their time.

I'll add this with an additional commit on top.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2021
@lavalamp
Copy link
Copy Markdown
Contributor

lavalamp commented Mar 8, 2021

/lgtm
/approve

Thank you!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 8, 2021
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, mandelsoft

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 8, 2021
@k8s-ci-robot k8s-ci-robot merged commit ab7d68a into kubernetes:master Mar 8, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 8, 2021
kevindelgado pushed a commit to kevindelgado/kubernetes that referenced this pull request Apr 5, 2021
To be able to implement controllers that are dynamically deciding
on which resources to watch, it is required to get rid of
dedicated watches and event handlers again. This requires the
possibility to remove event handlers from SharedIndexInformers again.
Stopping an informer is not sufficient, because there might
be multiple controllers in a controller manager that independently
decide which resources to watch.

Unfortunately the ResourceEventHandler interface encourages to use
value objects for handlers (like the ResourceEventHandlerFuncs
struct, that uses value receivers to implement the interface).
Go does not support comparison of function pointers and therefore
the comparison of such structs is not possible, also. Such handlers
cannot be matched again after they have been added and therefore it
is only possible to remove handlers that are comparable.
Fortunately struct based handlers can also be passed by reference.
The user of the interface can therefore still use those
handlers and remove them again, by switching to a pointer argument.

The remove method checks whether a handler can be compared
and ignores uncomparable handlers in the removal process.
Removing of uncomparable handlers result in an error return.

Remark: If as the result of a handler removal a complete informer
should be disabled it is higly recommended to incorporate
pull request kubernetes#98653, which fixes a race condition when stopping
watches for an informer using the stop channel.
kevindelgado pushed a commit to kevindelgado/kubernetes that referenced this pull request Apr 6, 2021
To be able to implement controllers that are dynamically deciding
on which resources to watch, it is required to get rid of
dedicated watches and event handlers again. This requires the
possibility to remove event handlers from SharedIndexInformers again.
Stopping an informer is not sufficient, because there might
be multiple controllers in a controller manager that independently
decide which resources to watch.

Unfortunately the ResourceEventHandler interface encourages to use
value objects for handlers (like the ResourceEventHandlerFuncs
struct, that uses value receivers to implement the interface).
Go does not support comparison of function pointers and therefore
the comparison of such structs is not possible, also. Such handlers
cannot be matched again after they have been added and therefore it
is only possible to remove handlers that are comparable.
Fortunately struct based handlers can also be passed by reference.
The user of the interface can therefore still use those
handlers and remove them again, by switching to a pointer argument.

The remove method checks whether a handler can be compared
and ignores uncomparable handlers in the removal process.
Removing of uncomparable handlers result in an error return.

Remark: If as the result of a handler removal a complete informer
should be disabled it is higly recommended to incorporate
pull request kubernetes#98653, which fixes a race condition when stopping
watches for an informer using the stop channel.
kevindelgado pushed a commit to kevindelgado/kubernetes that referenced this pull request Apr 29, 2021
To be able to implement controllers that are dynamically deciding
on which resources to watch, it is required to get rid of
dedicated watches and event handlers again. This requires the
possibility to remove event handlers from SharedIndexInformers again.
Stopping an informer is not sufficient, because there might
be multiple controllers in a controller manager that independently
decide which resources to watch.

Unfortunately the ResourceEventHandler interface encourages to use
value objects for handlers (like the ResourceEventHandlerFuncs
struct, that uses value receivers to implement the interface).
Go does not support comparison of function pointers and therefore
the comparison of such structs is not possible, also. Such handlers
cannot be matched again after they have been added and therefore it
is only possible to remove handlers that are comparable.
Fortunately struct based handlers can also be passed by reference.
The user of the interface can therefore still use those
handlers and remove them again, by switching to a pointer argument.

The remove method checks whether a handler can be compared
and ignores uncomparable handlers in the removal process.
Removing of uncomparable handlers result in an error return.

Remark: If as the result of a handler removal a complete informer
should be disabled it is higly recommended to incorporate
pull request kubernetes#98653, which fixes a race condition when stopping
watches for an informer using the stop channel.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants