Skip to content

grpcinterceptor: Use ctxutil to detect context cancellation#107079

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
miretskiy:grpc
Jul 20, 2023
Merged

grpcinterceptor: Use ctxutil to detect context cancellation#107079
craig[bot] merged 1 commit intocockroachdb:masterfrom
miretskiy:grpc

Conversation

@miretskiy
Copy link
Copy Markdown
Contributor

@miretskiy miretskiy commented Jul 18, 2023

Previous commit #76306 arranged for interceptors to run for all rpcs (local or otherwise). In doing so, it added an additional go routine for each RPC whose purpose was to detect context completion, and to arrange for the cleanup function to run when that happened.

ctxutil.WhenDone (added by #103866/commits/dfe4ef82810aaf81ae21afecce83cbc1fb6783dd) allows for the same functionality without the additional goroutine cost.

Informs #107053

Release note: None

@miretskiy miretskiy requested review from a team, Santamaura and erikgrinaker and removed request for a team July 18, 2023 14:39
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jul 18, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@miretskiy miretskiy force-pushed the grpc branch 2 times, most recently from 64ca14b to 9144c8c Compare July 18, 2023 15:12
@knz
Copy link
Copy Markdown
Contributor

knz commented Jul 19, 2023

We have two problems here.
@yuzefovich as the reviewer of #103866 I'd like you to double check what I write below.

1. Goroutine saving - are you sure?

allows for the same functionality without the additional goroutine cost.

This is not true in general. Have you opened the black box? Look at the go source code, this propagateCancel spawns a goroutine to do the job:

             atomic.AddInt32(&goroutines, +1)
             go func() {
                     select {
                     case <-parent.Done():
                             child.cancel(false, parent.Err())
                     case <-child.Done():
                     }
             }()

2. Possible goroutine leak!

ctxutil.WhenDone (added by dfe4ef8)

I double checked this and I don't like what I am seeing.

If you look at the go source code for this internal propagateCancel" function, you see that it spawns a goroutine. That goroutine only ever terminates if either the child or parent context is canceled.

So what if they don't get canceled at all? Like, if our code paths pass a context.Background or similar?

At the very least with our code base, when someone spawns a goroutine they know they have to listen on the stopper ShouldQuiesce channel. This is also encouraged in the API of our retry package.

This new ctxutil.WhenDone does not encourage that. Worse yet, it terminates immediately (i.e. it does not block execution in the caller) which means there is no synchronization on termination of that implicit goroutine.

Recommendation: remove ctxutil altogether

If the above is confirmed, I would recommend removing the current impl of ctxutil altogether.

If we really need this functionality, we need to build a patch to the go runtime to do what we want (we have a patch infrastructure now! let's use it!) while ensuring that stray goroutines don't leak.

@erikgrinaker
Copy link
Copy Markdown
Contributor

Thanks for looking into this @knz, but I think your concerns can be assuaged. WhenDone() can only take a context with Done() != nil, i.e. a cancellable context derived from WithCancel or WithTimeout:

if parent.Done() == nil {
return ErrNeverCompletes
}

If you look at the go source code for this internal propagateCancel" function, you see that it spawns a goroutine.

Only if the parent context isn't cancellable. We've already established that it must be, so it takes this code path here, which doesn't spawn a goroutine:

https://github.com/golang/go/blob/5fe3f0a265c90a9c0346403742c6cafeb154503b/src/context/context.go#L473-L487

So what if they don't get canceled at all? Like, if our code paths pass a context.Background or similar?

We've established that the context must be cancellable, so the caller is responsible for avoiding goroutine leaks by calling the associated cancel function.

I think the only concern here would be if we use a custom context implementation that overrides Done(), in which case parentCancelCtx won't be able to resolve the parent cancelCtx and attach to it.

That said, I'm not super-thrilled about accessing Go internals like this. This stuff is internal for a reason.

I'll give this a closer review.

@erikgrinaker
Copy link
Copy Markdown
Contributor

FWIW, I also verified that this gets rid of the superfluous goroutine.

@miretskiy
Copy link
Copy Markdown
Contributor Author

@knz -- what @erikgrinaker said.

Have you opened the black box?
Obviously, I have in order to implement ctxutil package. If you do some shenanigans and override context, then
the default code path is taken that does spin up goroutine. But, we don't do that; and in the common case of
context.WithTimeout/WithCancel, the code does not spawn anything.

If we really need this functionality, we need to build a patch to the go runtime to do what we want (we have a patch infrastructure now! let's use it!) while ensuring that stray goroutines don't leak.

Goroutines do not leak with this change. The behavior is no different from the default implementation of context
which spins up go routine if the implementation is anything other than the 2 "approved" cancelers.
The last time context.go changed was the first time go was released. I think a ~20 year lifetime is not that terrible, and of course, we could try patching things locally.

re @erikgrinaker

That said, I'm not super-thrilled about accessing Go internals like this. This stuff is internal for a reason.
Me neither; Having said that, a) this particular implementation does not change frequently -- first time in 1.20; b) we do occasionally access those internal implementations ("git grep linkname") -- it's not done frequently, nor in a wonton way.
If there was another way to do it, I would.

I have searched quite extensively on this topic; there were proposals to extend context.go somehow; there was always a pushback.

@miretskiy
Copy link
Copy Markdown
Contributor Author

We have two problems here.
@yuzefovich as the reviewer of #103866 I'd like you to double check what I write below.

@knz -- I don't think there are two problems here.
The reason why the change reviewed by @yuzefovich was backed out is that while
this implementation is faster than checking for ->Done() (~8x), it is not 128 (or even 1024) times faster since
that's how frequent the Done checks are in vectorized engine.

Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine, and gets the job done.

Somewhat uneasy about ctxutil.WhenDone(), especially considering it's otherwise unused. Can we add in some CrdbTestBuild assertions ensuring that it always does what it's supposed to, either here or separately? We'll also want to let this bake for a while before backporting.

// [1] https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
finishFunc(nil /* err */)
}); err != nil && !errors.Is(err, ctxutil.ErrNeverCompletes) {
panic(err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's propagate the error. I know this can't currently happen, but still, I don't like latent panics and the only caller has an error return path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic reworked.

@knz
Copy link
Copy Markdown
Contributor

knz commented Jul 19, 2023

I am friendly to the change now. You have successfully changed my opinion.

@miretskiy
Copy link
Copy Markdown
Contributor Author

@erikgrinaker @knz
Please take another look. I expanded test coverage in ctxutil package; Added utility function to check
if we can directly propagate context, and added a check which triggers panic under test.

@miretskiy miretskiy requested a review from erikgrinaker July 19, 2023 17:31
@miretskiy miretskiy force-pushed the grpc branch 3 times, most recently from 19c1fb2 to 667bcf9 Compare July 19, 2023 20:10
Previous commit cockroachdb#76306 arranged for interceptors to run
for all rpcs (local or otherwise).  In doing so, it added
an additional goroutine for each RPC whose purpose was to
detect context completion, and to arrange for the cleanup
function to run when that happened.

`ctxutil.WhenDone` (added by cockroachdb#103866/commits/dfe4ef82810aaf81ae21afecce83cbc1fb6783dd)
allows for the same functionality without the additional
goroutine cost.

Informs cockroachdb#107053

Release note: None
@miretskiy
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 20, 2023

Build succeeded:

@craig craig bot merged commit 39124c4 into cockroachdb:master Jul 20, 2023
@miretskiy miretskiy added the backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only label Jul 20, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Aug 2, 2023
…tion

Backport 1/2 commits from
cockroachdb#103866
Backport 1/1 commits from
cockroachdb#107079

Previous commit cockroachdb#76306 arranged for interceptors to run
for all rpcs (local or otherwise).  In doing so, it added
an additional goroutine for each RPC whose purpose was to
detect context completion, and to arrange for the cleanup
function to run when that happened.

`ctxutil.WhenDone` (added by cockroachdb#103866/commits/dfe4ef82810aaf81ae21afecce83cbc1fb6783dd)
allows for the same functionality without the additional
goroutine cost.

Informs cockroachdb#107053

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Aug 2, 2023
…tion

Backport 1/2 commits from
cockroachdb#103866
Backport 1/1 commits from
cockroachdb#107079

Previous commit cockroachdb#76306 arranged for interceptors to run
for all rpcs (local or otherwise).  In doing so, it added
an additional goroutine for each RPC whose purpose was to
detect context completion, and to arrange for the cleanup
function to run when that happened.

`ctxutil.WhenDone` (added by cockroachdb#103866/commits/dfe4ef82810aaf81ae21afecce83cbc1fb6783dd)
allows for the same functionality without the additional
goroutine cost.

Informs cockroachdb#107053

Release note: None
Release justification: Important performance improvement for large scale
rangefeed workloads to remove inadvertently introduced goroutine.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants