
[EventEngine] Port GrpcPolledFdFactoryPosix fix to EE#34025

Merged
yijiem merged 11 commits into grpc:master from yijiem:port_grpcpolledfdposix_fix
Aug 11, 2023

Conversation

@yijiem
Member

@yijiem yijiem commented Aug 9, 2023

Port #33871 to EE's GrpcPolledFdFactoryPosix.

@yijiem yijiem added the "release notes: no" label (Indicates if PR should not be in release notes) on Aug 9, 2023
void (*event_engine_grpc_ares_test_only_inject_config)(ares_channel* channel) =
noop_inject_channel_config;

bool g_event_engine_grpc_ares_test_only_force_tcp = false;
Contributor

will it ever be possible to compile this and the older grpc_ares_wrapper code into the same binary?

If so we would have an ODR violation here?

I suppose we have the same issue with noop_inject_channel_config ?

Contributor

Ah, never mind, I see the name is different.

@@ -444,19 +445,32 @@ TEST_F(CancelDuringAresQuery, TestQueryFailsBecauseTcpServerClosesSocket) {
CloseSocketUponReceivingBytesFromPeer);
// TODO(yijiem): make this test work with the EE DNS resolver by supporting
Contributor

this TODO can be removed

Member Author

Done.

PosixEventPoller* poller_;
// fds that are used/owned by grpc - we (grpc) will close them rather than
// c-ares
std::unordered_set<ares_socket_t> owned_fds_;
Contributor

One sharp edge on this API that I think becomes apparent with this change:

The GrpcPolledFdFactoryPosix will accumulate memory on each lookup, which won't be freed until it is destroyed. So it can't be used as a long-lived object. For example, we'd have a leak if the client channel resolver code was changed to create one resolver and keep reusing it.

To double check, no code will ever use this GrpcPolledFdFactoryPosix as a long-lived object, right? From my reading I think we are good as of today. Wondering if there's any better enforcement than a comment to prevent this kind of mistake in the future though.

Member Author

Each client channel resolver indirectly creates a GrpcPolledFdFactoryPosix, so it will be destroyed when the request finishes.

Yeah, I think this is the overhead of this workaround, as we shift the responsibility of closing sockets at the end of the request from c-ares to gRPC, and we are bookkeeping these sockets for that purpose. IMO, this has the benefit of simplicity with slight memory overhead, but I wouldn't worry too much about it since the number of sockets that c-ares creates is probably low (e.g. maybe 1 UDP and 1 TCP socket per server).

So this might be the upper bound of the size of all sets:

<# of client channels> * <# of servers-per-channel> * 2 sockets per-server

Contributor

@apolcyn apolcyn Aug 11, 2023


Each client channel resolver indirectly creates a GrpcPolledFdFactoryPosix, so it will be destroyed when the request finishes.

Sounds good. I guess, we just need to make sure it stays that way. cc @markdroth

Just to note, I'm not really concerned about the memory overhead. What I really mean to point out is that there is the possibility of a memory leak if, in the future, we ever mistakenly use this API as a long-lived object; then we'd have a leak and could eventually OOM or run out of file descriptors.

Contributor

Also note this isn't a new problem - i.e. this problem existed before this PR anyway, because we use ARES_FLAG_STAYOPEN. So most file descriptors will not be closed until the c-ares channel is destroyed anyway.

I just meant that this PR made the problem more obvious to me.

Member Author

Yeah, I was also slightly concerned that we keep sockets unclosed until we are done with the request. Then I realized that this is the existing behavior.

}

GrpcPolledFd* NewGrpcPolledFdLocked(ares_socket_t as) override {
GPR_ASSERT(owned_fds_.insert(as).second);
Member Author

Interesting note: asserting GPR_ASSERT(owned_fds_.insert(as).second); here is not correct for EE's AresResolver. EE's AresResolver reuses the same channel for all 3 types of requests, so it will very likely use the same socket for different requests. If the previous GrpcPolledFdPosix has been destroyed, it would create a new one, but GrpcPolledFdFactoryPosix has already seen that socket.

Contributor

That is interesting. So if I understand correctly, we may create a GrpcPolledFdPosix twice over the same socket?

If so, do we have a guarantee that the first GrpcPolledFdPosix will not mutate the socket in such a way that it becomes non-reusable? E.g. do we have a guarantee that OrphanHandle and anything else will not call shutdown?

Member Author

Yeah, we call OrphanHandle with a valid release_fd pointer, and that prevents the socket from being shut down or closed.

@yijiem yijiem merged commit 67ad297 into grpc:master Aug 11, 2023
@copybara-service copybara-service bot added the "imported" label (Specifies if the PR has been imported to the internal repository) on Aug 11, 2023
eugeneo pushed a commit to eugeneo/grpc that referenced this pull request Aug 21, 2023
Port grpc#33871 to EE's GrpcPolledFdFactoryPosix.


Labels

bloat/low
imported (Specifies if the PR has been imported to the internal repository)
lang/c++
lang/core
per-call-memory/neutral
per-channel-memory/neutral
release notes: no (Indicates if PR should not be in release notes)
