[CallAttemptTracer] Fix call attempt tracer lifetimes for retries #38729
yashykt wants to merge 11 commits into grpc:master from yashykt:FixCallAttemptTracer
Conversation
If we are actually seeing a case where the previous attempt has not yet finished handling all of its callbacks before we start the next attempt, then couldn't we also have this same problem with other events in the filter stack of the first attempt that get the call tracer from call context? For example, what if the compression filter calls …? Unfortunately, I'm not sure we have a good general solution for this in the legacy filter stack. But can you please dig into this to get an idea of whether this kind of case is possible and/or likely to occur in practice? If it's very unlikely, then we may have to live with it for now, because I don't see a good alternative right now. The good news is that this problem shouldn't occur in the new call v3 stack, because that stack will have a separate arena, and therefore a separate call context, for each call attempt.
From my reading of the code, we should only start a new transparent attempt after the previous attempt has gotten the trailing metadata. That doesn't mean we are necessarily safe, since we could still run into reordering of batches for a call. For unaries we are fine, since we should be sending only a single batch down. For streaming, I'm not sure that we are safe: I can imagine a case where the closure for sending a message runs after we receive the trailing metadata. To fix this, we would either need to -
I think (2) should be feasible, but I'll leave the judgment to you on how reasonable it is. The good thing is that even if we end up using the wrong call attempt tracer instance, then after this PR we will still record trace information against the wrong tracer, but at least we will not crash.
I think (2) would be a very invasive change to the legacy filter stack retry implementation, and it's really not worth trying to do that at this point, since that code is already very complex and hard to reason about, and it's going to be replaced with much simpler code as part of the call v3 migration anyway. I think there is one other thing we can do, which is to apply the same work-around that this PR uses in every place where we use the call tracer. In other words, in the compression filters and the transport (and anywhere else we may wind up using the call tracer), the filter or transport or whatever can save its own pointer to the call tracer when the call attempt starts and then continue using that pointer instead of accessing it from call context. That will increase per-call memory (just as this PR does), but we'll be able to make that go away when we finish migrating to the call v3 stack. |
done |
[CallAttemptTracer] Fix call attempt tracer lifetimes for retries (grpc#38729)

Fix grpc#38728 heap-use-after-free. Details in the issue.
Also, fix a bug in chttp2 where we are using the parent call tracer instead of the call attempt tracer to record annotations for a stream.
Test is being added in grpc#38437

Closes grpc#38729
COPYBARA_INTEGRATE_REVIEW=grpc#38729 from yashykt:FixCallAttemptTracer cb09add
PiperOrigin-RevId: 729307540