-
Notifications
You must be signed in to change notification settings - Fork 11.1k
UnimplementedAsyncRequest constructor races with GenericAsyncRequest causing crashes #32356
Description
What version of gRPC and what language are you using?
The bug seems to be very old, so the exact version does not matter.
What operating system (Linux, Windows,...) and version?
Linux
What runtime / compiler are you using (e.g. python version or version of gcc)
We use clang, but the compiler doesn't matter.
What did you do?
We call cq->Next(&tag, ok) from multiple threads.
What did you expect to see?
We expect to only ever see values of tags that we provided.
What did you see instead?
Recently we started to occasionally see our grpc servers crashing right after the call to cq->Next. Debugging showed, that Next returns true, indicating the event is available, however tag turned out to be nullptr. Since we cast it to our interface and call a method on it, this caused crashes when loading a vptr. Further debugging showed that the call to FinalizeResult in CompletionQueue::AsyncNextInternal returns true, but the tag is nullptr. When inspecting core_cq_tag we observed the object is actually UnimplementedAsyncRequest, however UnimplementedAsyncRequest::FinalizeResult should always returns false, so it was puzzling at first how it could ever return true.
Turns out there's a race in UnimplementedAsyncRequest constructor. When it calls to GenericAsyncRequest, it starts the request right there, however the vptr is pointing to GenericAsyncRequest at that time. Later UnimplementedAsyncRequest updates vptr to its vtable, but it looks like the request may already start being handled on another thread, and AsyncNextInternal manages to catch it in a state when vptr is still pointing to GenericAsyncRequest, calling GenericAsyncRequest::FinalizeResult. It returns true, bypassing UnimplementedAsyncRequest override, and it indeed has tag initialized to nullptr by default.
It looks like our users tried using a new grpc service, which was not served by our server yet, this triggered a lot of unimplemented requests, which started triggering the race under heavy load. Workaround of delaying grpc_server_request_call in GenericAsyncRequest until UnimplementedAsyncRequest constructor body stopped crashes for us, confirming that the problem is indeed with UnimplementedAsyncRequest.