[EventEngine] Port GrpcPolledFdFactoryPosix fix to EE #34025
yijiem merged 11 commits into grpc:master
Conversation
```
void (*event_engine_grpc_ares_test_only_inject_config)(ares_channel* channel) =
    noop_inject_channel_config;

bool g_event_engine_grpc_ares_test_only_force_tcp = false;
```
Will it ever be possible to compile this and the older grpc_ares_wrapper code into the same binary? If so, we would have an ODR violation here.
I suppose we have the same issue with noop_inject_channel_config?
Ah, never mind, I see the name is different.
```
@@ -444,19 +445,32 @@ TEST_F(CancelDuringAresQuery, TestQueryFailsBecauseTcpServerClosesSocket) {
      CloseSocketUponReceivingBytesFromPeer);

PosixEventPoller* poller_;
// TODO(yijiem): make this test work with the EE DNS resolver by supporting
// fds that are used/owned by grpc - we (grpc) will close them rather than
// c-ares
std::unordered_set<ares_socket_t> owned_fds_;
```
One sharp edge on this API that I think becomes apparent with this change:
GrpcPolledFdFactoryPosix will accumulate memory on each lookup, which won't be freed until the factory is destroyed. So it can't be used as a long-lived object. For example, we'd have a leak if the client channel resolver code were changed to create one resolver and keep using it.
To double check: no code will ever use this GrpcPolledFdFactoryPosix as a long-lived object, right? From my reading, I think we are good as of today. Wondering if there's any better enforcement than a comment to prevent this kind of mistake in the future, though.
Each client channel resolver indirectly creates a GrpcPolledFdFactoryPosix, so it will be destroyed when the request finishes.
Yeah, I think this is the overhead of this workaround, as we shift the responsibility of closing sockets at the end of the request from c-ares to gRPC, and we bookkeep these sockets for that purpose. IMO this has the benefit of simplicity with a slight memory overhead, but I wouldn't worry too much about it, since the number of sockets that c-ares creates is probably low (e.g. maybe 1 UDP and 1 TCP socket per server).
So this might be the upper bound of the size of all sets:
<# of client channels> * <# of servers-per-channel> * 2 sockets per-server
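The bookkeeping pattern being discussed can be sketched as below. This is a minimal runnable illustration, not the real gRPC code: the class and member names here (`PolledFdFactorySketch`, `closed_`) are hypothetical stand-ins, and a vector records "closed" fds in place of `close(2)` so the sketch runs anywhere. The real `GrpcPolledFdFactoryPosix` closes the tracked sockets when it is destroyed at the end of the request.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Stand-in for the c-ares socket typedef.
using ares_socket_t = int;

// Sketch of the ownership bookkeeping: the factory records every socket it
// hands out in owned_fds_ and "closes" them all when it is destroyed. If the
// factory were long-lived, owned_fds_ would only grow, which is the leak
// concern raised above.
class PolledFdFactorySketch {
 public:
  explicit PolledFdFactorySketch(std::vector<ares_socket_t>* closed)
      : closed_(closed) {}

  ~PolledFdFactorySketch() {
    // gRPC, not c-ares, closes the sockets at the end of the request.
    for (ares_socket_t fd : owned_fds_) closed_->push_back(fd);
  }

  void NewGrpcPolledFdLocked(ares_socket_t as) {
    // insert().second is false if the fd was already registered; the real
    // code asserts on this, which is what the thread below digs into.
    assert(owned_fds_.insert(as).second);
  }

  std::size_t owned_count() const { return owned_fds_.size(); }

 private:
  std::unordered_set<ares_socket_t> owned_fds_;
  std::vector<ares_socket_t>* closed_;
};
```

Because the set is only drained in the destructor, the factory's lifetime must stay bounded by a single request, which is exactly the invariant this thread is asking to preserve.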
> Each client channel resolver indirectly creates a GrpcPolledFdFactoryPosix, so it will be destroyed when the request finishes.

Sounds good. I guess we just need to make sure it stays that way. cc @markdroth
Just to note, I'm not really concerned about the memory overhead. What I really mean to point out is that there is the possibility of a memory leak if, in the future, we ever mistakenly use this API as a long-lived object: then we'd have a leak and could eventually OOM or run out of file descriptors.
Also note this isn't a new problem; it existed before this PR anyway, because we use ARES_FLAG_STAYOPEN, so most file descriptors are not closed until the c-ares channel is destroyed.
I just meant that this PR made the problem more obvious to me.
Yeah, I was also slightly concerned that we keep sockets unclosed until we are done with the request. Then I realized that it is the existing behavior.
```
}

GrpcPolledFd* NewGrpcPolledFdLocked(ares_socket_t as) override {
  GPR_ASSERT(owned_fds_.insert(as).second);
```
Interesting note: `GPR_ASSERT(owned_fds_.insert(as).second);` here is not correct for EE's AresResolver. EE's AresResolver reuses the same channel for all 3 types of requests, so it will very likely use the same socket for different requests. If the previous GrpcPolledFdPosix has been destroyed, it would create a new one, but GrpcPolledFdFactoryPosix has already seen the socket.
That is interesting. So if I understand correctly, we could create a GrpcPolledFdPosix twice over the same socket?
If so, do we have a guarantee that the first GrpcPolledFdPosix will not mutate the socket in such a way that it becomes non-reusable? E.g. do we have a guarantee that OrphanHandle and anything else will not call shutdown?
Yeah, we call OrphanHandle with a valid release_fd pointer, and that will prevent the socket from being shut down or closed.
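The hazard discussed above can be shown with `std::unordered_set` directly: once a socket number has been seen, a second insert of the same number returns `second == false`, so a `GPR_ASSERT` on it would fire even when the fd is being legitimately reused across requests on the same channel. `RegisterFd` here is a hypothetical helper, not a gRPC function.

```cpp
#include <unordered_set>

// Demonstrates why asserting on insert().second is wrong when a channel
// reuses socket numbers across requests: the second insert of the same
// value reports "not newly inserted", even though the previous
// GrpcPolledFd for that fd may already have been destroyed.
bool RegisterFd(std::unordered_set<int>& owned_fds, int as) {
  return owned_fds.insert(as).second;
}
```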
Port grpc#33871 to EE's GrpcPolledFdFactoryPosix.