Skip to content

[CallAttemptTracer] Fix call attempt tracer lifetimes for retries#38729

Closed
yashykt wants to merge 11 commits intogrpc:masterfrom
yashykt:FixCallAttemptTracer
Closed

[CallAttemptTracer] Fix call attempt tracer lifetimes for retries#38729
yashykt wants to merge 11 commits intogrpc:masterfrom
yashykt:FixCallAttemptTracer

Conversation

@yashykt
Copy link
Copy Markdown
Member

@yashykt yashykt commented Feb 12, 2025

Fix #38728 heap-use-after-free. Details in the issue.

Also, fix a bug in chttp2 where we are using the parent call tracer instead of the call attempt tracer to record annotations for a stream.

Test is being added in #38437

@markdroth
Copy link
Copy Markdown
Member

If we are actually seeing a case where the previous attempt has not yet finished handling all of its callbacks before we start the next attempt, then couldn't we also have this same problem with other events in the filter stack of the first attempt that get the call tracer from call context? For example, what if the compression filter calls RecordSendMessage() on the wrong call tracer instance? Or what if the transport calls RecordOutgoingBytes() on the wrong call tracer instance? It seems like this PR fixes the problem only for the client channel filter.

Unfortunately, I'm not sure we have a good general solution for this in the legacy filter stack. But can you please dig into this to get an idea of whether this kind of case is possible and/or likely to occur in practice? If it's very unlikely, then we may have to live with it for now, because I don't see a good alternative right now.

The good news is that this problem shouldn't occur in the new call v3 stack, because that stack will have a separate arena, and therefore a separate call context, for each call attempt.

@yashykt
Copy link
Copy Markdown
Member Author

yashykt commented Feb 13, 2025

If we are actually seeing a case where the previous attempt has not yet finished handling all of its callbacks before we start the next attempt, then couldn't we also have this same problem with other events in the filter stack of the first attempt that get the call tracer from call context?

From my reading of the code, we should only start a new transparent attempt after the previous attempt has gotten the trailing metadata. This doesn't mean that we are necessarily safe since we could possibly run into reordering of batches for a call. For unaries, we are fine, since we should be sending a single batch down. For streaming, I'm not sure that we are safe. I can imagine a case where the closure for sending a message is run after we receive the trailing metadata. To fix this, we would either need to -

  1. Use a separate arena for each call attempt as you suggest.
  2. Make sure that all closures are run before we start a new attempt.

I think (2) should be feasible but I'll leave the judgement to you on how reasonable it is. The good thing is that even if we end up using the wrong call attempt tracer instance, after this PR, we will still be recording trace information with the wrong tracer, but we will atleast not crash.

@markdroth
Copy link
Copy Markdown
Member

I think (2) would be a very invasive change to the legacy filter stack retry implementation, and it's really not worth trying to do that at this point, since that code is already very complex and hard to reason about, and it's going to be replaced with much simpler code as part of the call v3 migration anyway.

I think there is one other thing we can do, which is to apply the same work-around that this PR uses in every place where we use the call tracer. In other words, in the compression filters and the transport (and anywhere else we may wind up using the call tracer), the filter or transport or whatever can save its own pointer to the call tracer when the call attempt starts and then continue using that pointer instead of accessing it from call context. That will increase per-call memory (just as this PR does), but we'll be able to make that go away when we finish migrating to the call v3 stack.

@yashykt yashykt requested a review from ctiller as a code owner February 14, 2025 02:30
@yashykt
Copy link
Copy Markdown
Member Author

yashykt commented Feb 14, 2025

done

Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!

yashykt added a commit to yashykt/grpc that referenced this pull request Feb 21, 2025
…pc#38729)

Fix grpc#38728 heap-use-after-free. Details in the issue.

Also, fix a bug in chttp2 where we are using the parent call tracer instead of the call attempt tracer to record annotations for a stream.

Test is being added in grpc#38437

Closes grpc#38729

COPYBARA_INTEGRATE_REVIEW=grpc#38729 from yashykt:FixCallAttemptTracer cb09add
PiperOrigin-RevId: 729307540
yashykt added a commit that referenced this pull request Feb 21, 2025
…8729) (#38796)

Backport #38729 to v1.71.x. (also backporting
[#38774](#38774) as a pre-requisite)

---------

Co-authored-by: Vignesh Babu <vigneshbabu@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CallAttemptTracer] heap-use-after-free when testing transparent retries

2 participants