[CallAttemptTracer] Fix call attempt tracer lifetimes for retries by yashykt · Pull Request #38729 · grpc/grpc

yashykt · 2025-02-12T21:00:37Z

Fix #38728 heap-use-after-free. Details in the issue.

Also, fix a bug in chttp2 where we are using the parent call tracer instead of the call attempt tracer to record annotations for a stream.

Test is being added in #38437

markdroth · 2025-02-13T00:22:38Z

If we are actually seeing a case where the previous attempt has not yet finished handling all of its callbacks before we start the next attempt, then couldn't we also have this same problem with other events in the filter stack of the first attempt that get the call tracer from call context? For example, what if the compression filter calls RecordSendMessage() on the wrong call tracer instance? Or what if the transport calls RecordOutgoingBytes() on the wrong call tracer instance? It seems like this PR fixes the problem only for the client channel filter.

Unfortunately, I'm not sure we have a good general solution for this in the legacy filter stack. But can you please dig into this to get an idea of whether this kind of case is possible and/or likely to occur in practice? If it's very unlikely, then we may have to live with it for now, because I don't see a good alternative right now.

The good news is that this problem shouldn't occur in the new call v3 stack, because that stack will have a separate arena, and therefore a separate call context, for each call attempt.

yashykt · 2025-02-13T01:16:06Z

If we are actually seeing a case where the previous attempt has not yet finished handling all of its callbacks before we start the next attempt, then couldn't we also have this same problem with other events in the filter stack of the first attempt that get the call tracer from call context?

From my reading of the code, we should only start a new transparent attempt after the previous attempt has gotten the trailing metadata. This doesn't mean that we are necessarily safe since we could possibly run into reordering of batches for a call. For unaries, we are fine, since we should be sending a single batch down. For streaming, I'm not sure that we are safe. I can imagine a case where the closure for sending a message is run after we receive the trailing metadata. To fix this, we would either need to -

Use a separate arena for each call attempt as you suggest.
Make sure that all closures are run before we start a new attempt.

I think (2) should be feasible but I'll leave the judgement to you on how reasonable it is. The good thing is that even if we end up using the wrong call attempt tracer instance, after this PR, we will still be recording trace information with the wrong tracer, but we will atleast not crash.

markdroth · 2025-02-13T16:59:07Z

I think (2) would be a very invasive change to the legacy filter stack retry implementation, and it's really not worth trying to do that at this point, since that code is already very complex and hard to reason about, and it's going to be replaced with much simpler code as part of the call v3 migration anyway.

I think there is one other thing we can do, which is to apply the same work-around that this PR uses in every place where we use the call tracer. In other words, in the compression filters and the transport (and anywhere else we may wind up using the call tracer), the filter or transport or whatever can save its own pointer to the call tracer when the call attempt starts and then continue using that pointer instead of accessing it from call context. That will increase per-call memory (just as this PR does), but we'll be able to make that go away when we finish migrating to the call v3 stack.

yashykt · 2025-02-14T03:20:28Z

done

src/core/ext/transport/chttp2/transport/chttp2_transport.cc

markdroth

Thanks for doing this!

…pc#38729) Fix grpc#38728 heap-use-after-free. Details in the issue. Also, fix a bug in chttp2 where we are using the parent call tracer instead of the call attempt tracer to record annotations for a stream. Test is being added in grpc#38437 Closes grpc#38729 COPYBARA_INTEGRATE_REVIEW=grpc#38729 from yashykt:FixCallAttemptTracer cb09add PiperOrigin-RevId: 729307540

…8729) (#38796) Backport #38729 to v1.71.x. (also backporting [#38774](#38774) as a pre-requisite) --------- Co-authored-by: Vignesh Babu <vigneshbabu@google.com>

[CallAttemptTracer] Fix call attempt tracer lifetimes for retries

8be86f9

yashykt requested a review from markdroth as a code owner February 12, 2025 21:00

yashykt added release notes: yes Indicates if PR needs to be in release notes lang/core labels Feb 12, 2025

grpc-checks bot added bloat/none per-call-memory/neutral per-channel-memory/neutral labels Feb 12, 2025

Explicitly save call attempt tracer wherever we are using it

db2a238

yashykt requested a review from ctiller as a code owner February 14, 2025 02:30

Fix compilation

c8d7905

grpc-checks bot added bloat/improvement and removed bloat/none labels Feb 14, 2025

Fixes

3ff54ed

grpc-checks bot added per-channel-memory/increase and removed per-channel-memory/neutral labels Feb 14, 2025

Fixes

936b4a2

yashykt added the disposition/DO NOT MERGE label Feb 14, 2025

grpc-checks bot added per-channel-memory/neutral and removed per-channel-memory/increase labels Feb 14, 2025

yashykt removed the disposition/DO NOT MERGE label Feb 14, 2025

markdroth reviewed Feb 19, 2025

View reviewed changes

src/core/ext/transport/chttp2/transport/chttp2_transport.cc Show resolved Hide resolved

src/core/ext/transport/chttp2/transport/chttp2_transport.cc Show resolved Hide resolved

yashykt added 2 commits February 20, 2025 00:24

Merge branch 'master' into FixCallAttemptTracer

8a84e1a

Revert unintentional change

7486fd1

grpc-checks bot added bloat/none and removed bloat/improvement labels Feb 20, 2025

Add experiment

48fb0ae

yashykt requested review from gnossen and veblush as code owners February 20, 2025 21:10

yashykt added 2 commits February 20, 2025 21:15

fix

e0cb069

fixes

05623d2

markdroth approved these changes Feb 20, 2025

View reviewed changes

Fix tests

cb09add

github-actions bot added the lang/c++ label Feb 20, 2025

markdroth approved these changes Feb 21, 2025

View reviewed changes

copybara-service bot closed this in 211f9ec Feb 21, 2025

yashykt mentioned this pull request Feb 21, 2025

[CallAttemptTracer] Fix call attempt tracer lifetimes for retries (#38729) #38796

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CallAttemptTracer] Fix call attempt tracer lifetimes for retries#38729

[CallAttemptTracer] Fix call attempt tracer lifetimes for retries#38729
yashykt wants to merge 11 commits intogrpc:masterfrom
yashykt:FixCallAttemptTracer

yashykt commented Feb 12, 2025 •

edited

Loading

Uh oh!

markdroth commented Feb 13, 2025

Uh oh!

yashykt commented Feb 13, 2025

Uh oh!

markdroth commented Feb 13, 2025

Uh oh!

yashykt commented Feb 14, 2025

Uh oh!

Uh oh!

Uh oh!

markdroth left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yashykt commented Feb 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markdroth commented Feb 13, 2025

Uh oh!

yashykt commented Feb 13, 2025

Uh oh!

markdroth commented Feb 13, 2025

Uh oh!

yashykt commented Feb 14, 2025

Uh oh!

Uh oh!

Uh oh!

markdroth left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yashykt commented Feb 12, 2025 •

edited

Loading