kvclient: fix gRPC stream leak in rangefeed client#80705
craig[bot] merged 1 commit into cockroachdb:master from
Conversation
When the DistSender rangefeed client received a `RangeFeedError` message and propagated a retryable error up the stack, it would fail to close the existing gRPC stream, causing stream/goroutine leaks.

Release note (bug fix): Fixed a goroutine leak when internal rangefeed clients received certain kinds of retriable errors.
TIMEOUT: //pkg/kv/kvserver/rangefeed:rangefeed_test (Summary)
=== RUN TestProcessorMemoryBudgetExceeded
srosenberg
left a comment
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
// Ensure context is cancelled on all errors, to prevent gRPC stream leaks.
ctx, cancel := context.WithCancel(ctx)
defer cancel()
- What if there were no errors? In case of EOF, this function returns `args.Timestamp, nil`. Won't canceling in this case affect other goroutines doing `partialRangeFeed`?
- Any ideas as to why this leak wasn't picked up by any of the unit tests? I am guessing it's because we don't have a specific failpoint which forces an error path. Could we add a regression test for this?
- Looking at this code, I am a bit terrified that we probably have many other instances of this bug. What could we do to both reduce the chance of introducing the error and increase the detection rate? E.g., is there a better way of documenting a context that the callee must cancel in case of an error? Could we add a sweeper which cancels groups which are found to be blocked on `ctx.Done`?
erikgrinaker
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
What if there were no errors? In case of EOF, this function returns args.Timestamp, nil. Won't canceling in this case affect other goroutines doing partialRangeFeed?
Contexts are immutable. We're creating a new context here which is scoped to this function and propagated down to callees. It doesn't affect the caller's context.
Any ideas as to why this leak wasn't picked up by any of the unit tests? I am guessing it's because we don't have a specific failpoint which forces an error path. Could we add a regression for this?
We check for goroutine leaks in tests using leaktest, which compares snapshots of goroutines before and after a test. However, the leak here is temporary -- the streams/goroutines will get cleaned up when the overall rangefeed terminates (when the top-level context cancels), they only leak while the rangefeed is running. It is not clear to me how we would detect this in general (how do we distinguish between valid and invalid goroutines during test execution?), especially not when they're spawned by third-party libraries.
We could write a very targeted regression test for singleRangeFeed that sets up a convoluted scenario to check for this one specific leak. However, I'm not sure if that test is going to bring any additional value over the code comment, since it would essentially just test whether someone removes that context cancellation line. It seems too narrow in scope, but I'll add one if you think it's valuable.
Looking at this code, I am a bit terrified that we probably have many other instances of this bug. What could we do to both reduce the chance of introducing the error and increase the detection rate?
leaktest already accounts for the most egregious offenses, and is mandated for all tests via a linter. We could possibly write something specific to gRPC streams, maybe using a gRPC interceptor or something to keep track of expected streams -- but the way the caller would signal that it's done with a gRPC stream is to cancel the context, so it's a bit of a catch-22. I don't immediately have any good ideas for how we would go about this without better language support (e.g. in Rust it might be possible to enforce this at compile-time).
is there a better way of documenting a context that the callee must cancel in case of an error?
This isn't always the case -- if the error came from the gRPC stream itself then the context doesn't need to be cancelled, and it may not always be appropriate/desirable to cancel it. See the gRPC stream contract here:
could we add a sweeper which cancels groups which are found to be blocked on ctx.Done?
How would we distinguish between the goroutines we want to cancel and the ones we don't? How would that be different from just cancelling the relevant context?
I stressed this for 100k runs both with and without the race detector and didn't see any failures. Will bake this on
erikgrinaker
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
PS: I feel like this sort of thing might be best tested by long-lived end-to-end full-cluster tests. In a steady state, we would expect goroutine counts to remain relatively stable, so we could e.g. assert that goroutine counts (as exposed via metrics) should not increase by too much over some timespan.
srosenberg
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
PS: I feel like this sort of thing might be best tested by long-lived end-to-end full-cluster tests. In a steady state, we would expect goroutine counts to remain relatively stable, so we could e.g. assert that goroutine counts (as exposed via metrics) should not increase by too much over some timespan.
Thanks Erik! From the gRPC stream contract, 2) Cancel the context provided, seems to be allowed even in the presence of other events. In fact, that's what the above defer is doing. I'll think a bit more about the sweeper for gRPC goroutines. I do agree the long-lived end-to-end cluster testing would be an effective way of detecting this leak.
erikgrinaker
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
From the gRPC stream contract, 2) Cancel the context provided, seems to be allowed even in the presence of other events.
Yes, they're not mutually exclusive. If the stream has already been terminated by other means, then cancelling the context has no effect on it.
TFTRs! bors r=knz,tbg,srosenberg

Build succeeded: