
[EventEngine] Port GrpcPolledFdFactoryPosix fix to EE#34025

Merged
yijiem merged 11 commits into grpc:master from yijiem:port_grpcpolledfdposix_fix
Aug 11, 2023

Conversation

@yijiem
Member

@yijiem yijiem commented Aug 9, 2023

Port #33871 to EE's GrpcPolledFdFactoryPosix.

@yijiem yijiem added the "release notes: no" label (Indicates if PR should not be in release notes) on Aug 9, 2023
void (*event_engine_grpc_ares_test_only_inject_config)(ares_channel* channel) =
noop_inject_channel_config;

bool g_event_engine_grpc_ares_test_only_force_tcp = false;
Contributor

will it ever be possible to compile this and the older grpc_ares_wrapper code into the same binary?

If so we would have an ODR violation here?

I suppose we have the same issue with noop_inject_channel_config ?

Contributor

Ah, never mind, I see the name is different.

@@ -444,19 +445,32 @@ TEST_F(CancelDuringAresQuery, TestQueryFailsBecauseTcpServerClosesSocket) {
CloseSocketUponReceivingBytesFromPeer);
// TODO(yijiem): make this test work with the EE DNS resolver by supporting
Contributor

this TODO can be removed

Member Author

Done.

PosixEventPoller* poller_;
// fds that are used/owned by grpc - we (grpc) will close them rather than
// c-ares
std::unordered_set<ares_socket_t> owned_fds_;
Contributor

One sharp edge on this API that I think becomes apparent with this change:

The GrpcPolledFdFactoryPosix will accumulate memory on each lookup, which won't be freed until it is destroyed. So it can't be used as a long-lived object. For example, we'd have a leak if the client channel resolver code was changed to create one resolver and keep reusing it.

To double check, no code will ever use this GrpcPolledFdFactoryPosix as a long-lived object, right? From my reading I think we are good as of today. Wondering if there's any better enforcement than a comment to prevent this kind of mistake in the future though.

Member Author

Each client channel resolver indirectly creates a GrpcPolledFdFactoryPosix, so it will be destroyed when the request finishes.

Yeah, I think this is the overhead of this workaround, as we shift the responsibility of closing sockets at the end of the request from c-ares to gRPC, and we are bookkeeping these sockets for that purpose. IMO, this has the benefit of simplicity with slight memory overhead, but I wouldn't worry too much about it since the number of sockets that c-ares creates is probably low (e.g. maybe 1 UDP and 1 TCP socket per server).

So this might be the upper bound of the size of all sets:

<# of client channels> * <# of servers-per-channel> * 2 sockets per-server

Contributor

@apolcyn apolcyn Aug 11, 2023


Each client channel resolver indirectly creates a GrpcPolledFdFactoryPosix, so it will be destroyed when the request finishes.

Sounds good. I guess, we just need to make sure it stays that way. cc @markdroth

Just to note, I'm not really concerned about the memory overhead. What I really mean to point out is that there is the possibility of a memory leak if, in the future, we ever mistakenly use this API as a long-lived object; then we'd have a leak and could eventually OOM or run out of file descriptors.

Contributor

Also note this isn't a new problem - i.e. this problem existed before this PR anyway, because we use ARES_FLAG_STAYOPEN. So most file descriptors will not be closed until the c-ares channel is destroyed anyway.

I just meant that this PR made the problem more obvious to me.

Member Author

Yeah, I was also slightly concerned that we keep sockets unclosed until we are done with the request. Then I realized that this is the existing behavior.

}

GrpcPolledFd* NewGrpcPolledFdLocked(ares_socket_t as) override {
GPR_ASSERT(owned_fds_.insert(as).second);
Member Author

Interesting note: asserting GPR_ASSERT(owned_fds_.insert(as).second); here is not correct for EE's AresResolver. EE's AresResolver reuses the same channel for all 3 types of requests, so it will very likely use the same socket for different requests. If the previous GrpcPolledFdPosix has been destroyed, it would create a new one, but GrpcPolledFdFactoryPosix has already seen that socket.

Contributor

That is interesting. So if I understand correctly, we may create a GrpcPolledFdPosix twice over the same socket?

If so, do we have a guarantee that the first GrpcPolledFdPosix will not mutate the socket in such a way that it becomes non-reusable? E.g. do we have a guarantee that OrphanHandle and anything else will not call shutdown?

Member Author

Yeah, we call OrphanHandle with a valid release_fd pointer, and that prevents the socket from being shut down or closed.

@yijiem yijiem merged commit 67ad297 into grpc:master Aug 11, 2023
@copybara-service copybara-service bot added the "imported" label (Specifies if the PR has been imported to the internal repository) on Aug 11, 2023
eugeneo pushed a commit to eugeneo/grpc that referenced this pull request Aug 21, 2023
Port grpc#33871 to EE's GrpcPolledFdFactoryPosix.


Labels

bloat/low
imported (Specifies if the PR has been imported to the internal repository)
lang/c++
lang/core
per-call-memory/neutral
per-channel-memory/neutral
release notes: no (Indicates if PR should not be in release notes)
