Fix race in bm_chttp2_transport#20091
Conversation
|
|
| gpr_event* stream_cancel_done = new gpr_event; | ||
| gpr_event_init(stream_cancel_done); | ||
| std::unique_ptr<Closure> stream_cancel_closure = MakeClosure( | ||
| [&](grpc_error* error) { gpr_event_set(stream_cancel_done, (void*)1); }); |
There was a problem hiding this comment.
Nit: Should use static_cast instead of a c-style cast.
There was a problem hiding this comment.
Also: grpc_error* error is being captured, but not used - is it needed?
There was a problem hiding this comment.
Where is the c-style cast? Note that I am following the convention of the rest of this file.
I added an assert for the grpc_error. Thanks for catching that.
There was a problem hiding this comment.
oh, you mean the (void*)1? I can change that if you feel strongly, but note that it is currently consistent with the rest of this file.
There was a problem hiding this comment.
In this case, my preference is to go with what the style guide says (eschewing the use of C-casts).
There was a problem hiding this comment.
Hm so the static_cast fails to compile: https://source.cloud.google.com/results/invocations/4303f8e3-d019-486c-8573-f251f4450aa1/targets/grpc%2Fcore%2Fpull_request%2Flinux%2Fgrpc_bazel_build/log
There was a problem hiding this comment.
Considering static_cast is failing to compile and the c-style casts are scattered throughout our tests, I'm filing a bug to do this cleanup later and submitting this PR for now.
There was a problem hiding this comment.
Turns out there was already a bug filed for this. Re-opening: #14392
There was a problem hiding this comment.
Ah, right - since it's a integer it would need to be a reinterpret_cast.
There was a problem hiding this comment.
Yeah I think I'll do that in a followup.
|
@arjunroy I added some changes to solve another bug in the benchmark; namely, waiting for the stream to be cancelled before attempting to destroy it. PTAL. |
1dae715 to
f171523
Compare
f171523 to
6dfe27a
Compare

This PR fixes a race in bm_chttp2_transport whereby the transport op was going out of scope while perform_stream_op_locked was trying to access it. The issue is that the benchmark was not waiting for the cancel stream op to complete before exiting. There's also another issue of trying to destroy the stream before the stream has been cancelled.
Fixes #19877
Note to self: I suspect the benchmarks that I didn't modify may suffer from this race as well, but they're of a different structure from the benchmark that's actually failing in our tests, so I have to experiment more with how to fix them. For now, I fixed 3/5 of the benchmarks, and it's important to get this fix in soon since the bug causes near-continuous failures on master.