kvclient: fix gRPC stream leak in rangefeed client#80705
craig[bot] merged 1 commit into cockroachdb:master from
Conversation
When the DistSender rangefeed client received a `RangeFeedError` message and propagated a retryable error up the stack, it would fail to close the existing gRPC stream, causing stream/goroutine leaks.

Release note (bug fix): Fixed a goroutine leak when internal rangefeed clients received certain kinds of retriable errors.
TIMEOUT: //pkg/kv/kvserver/rangefeed:rangefeed_test (Summary)
=== RUN TestProcessorMemoryBudgetExceeded
srosenberg
left a comment
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
// Ensure context is cancelled on all errors, to prevent gRPC stream leaks.
ctx, cancel := context.WithCancel(ctx)
defer cancel()
- What if there were no errors? In case of EOF, this function returns `args.Timestamp, nil`. Won't canceling in this case affect other goroutines doing `partialRangeFeed`?
- Any ideas as to why this leak wasn't picked up by any of the unit tests? I am guessing it's because we don't have a specific failpoint which forces an error path. Could we add a regression test for this?
- Looking at this code, I am a bit terrified that we probably have many other instances of this bug. What could we do to both reduce the chance of introducing the error and increase the detection rate? E.g., is there a better way of documenting a context that the callee must cancel in case of an error? Could we add a sweeper which cancels groups which are found to be blocked on `ctx.Done`?
erikgrinaker
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
What if there were no errors? In case of EOF, this function returns args.Timestamp, nil. Won't canceling in this case affect other goroutines doing partialRangeFeed?
Contexts are immutable. We're creating a new context here which is scoped to this function and propagated down to callees. It doesn't affect the caller's context.
Any ideas as to why this leak wasn't picked up by any of the unit tests? I am guessing it's because we don't have a specific failpoint which forces an error path. Could we add a regression for this?
We check for goroutine leaks in tests using leaktest, which compares snapshots of goroutines before and after a test. However, the leak here is temporary -- the streams/goroutines will get cleaned up when the overall rangefeed terminates (when the top-level context cancels), they only leak while the rangefeed is running. It is not clear to me how we would detect this in general (how do we distinguish between valid and invalid goroutines during test execution?), especially not when they're spawned by third-party libraries.
We could write a very targeted regression test for singleRangeFeed that sets up a convoluted scenario to check for this one specific leak. However, I'm not sure if that test is going to bring any additional value over the code comment, since it would essentially just test whether someone removes that context cancellation line. It seems too narrow in scope, but I'll add one if you think it's valuable.
Looking at this code, I am a bit terrified that we probably have many other instances of this bug. What could we do to both reduce the chance of introducing the error and increase the detection rate?
leaktest already accounts for the most egregious offenses, and is mandated for all tests via a linter. We could possibly write something specific to gRPC streams, maybe using a gRPC interceptor or something to keep track of expected streams -- but the way the caller would signal that it's done with a gRPC stream is to cancel the context, so it's a bit of a catch-22. I don't immediately have any good ideas for how we would go about this without better language support (e.g. in Rust it might be possible to enforce this at compile-time).
is there a better way of documenting a context that the callee must cancel in case of an error?
This isn't always the case -- if the error came from the gRPC stream itself then the context doesn't need to be cancelled, and it may not always be appropriate/desirable to cancel it. See the gRPC stream contract here:
could we add a sweeper which cancels groups which are found to be blocked on ctx.Done?
How would we distinguish between the goroutines we want to cancel and the ones we don't? How would that be different from just cancelling the relevant context?
I stressed this for 100k runs both with and without the race detector and didn't see any failures. Will bake this on
erikgrinaker
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
PS: I feel like this sort of thing might be best tested by long-lived end-to-end full-cluster tests. In a steady state, we would expect goroutine counts to remain relatively stable, so we could e.g. assert that goroutine counts (as exposed via metrics) should not increase by too much over some timespan.
srosenberg
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
PS: I feel like this sort of thing might be best tested by long-lived end-to-end full-cluster tests. In a steady state, we would expect goroutine counts to remain relatively stable, so we could e.g. assert that goroutine counts (as exposed via metrics) should not increase by too much over some timespan.
Thanks Erik! From the gRPC stream contract, 2) Cancel the context provided, seems to be allowed even in the presence of other events. In fact, that's what the above defer is doing. I'll think a bit more about the sweeper for gRPC goroutines. I do agree the long-lived end-to-end cluster testing would be an effective way of detecting this leak.
erikgrinaker
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @srosenberg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 393 at r1 (raw file):
From the gRPC stream contract, 2) Cancel the context provided, seems to be allowed even in the presence of other events.
Yes, they're not mutually exclusive. If the stream has already been terminated by other means, then cancelling the context has no effect on it.
TFTRs! bors r=knz,tbg,srosenberg

Build succeeded: