Skip to content

[windows DNS] Simplify c-ares Windows code#33965

Merged
apolcyn merged 40 commits intogrpc:masterfrom
apolcyn:simpler_windows_dns
Aug 30, 2023
Merged

[windows DNS] Simplify c-ares Windows code#33965
apolcyn merged 40 commits intogrpc:masterfrom
apolcyn:simpler_windows_dns

Conversation

@apolcyn
Copy link
Copy Markdown
Contributor

@apolcyn apolcyn commented Aug 2, 2023

A set of simplifications to make this code easier to reason about:

  • Replace SockToPolledFdMap with std::map

  • Make the c-ares close callback do nothing. Instead, let the ares wrapper code destroy polled fds as it normally does, and let everything that hasn't been registered for I/O get destroyed in the GrpcPolledFdFactoryWindows dtor.

  • Get rid of GrpcPolledFdWindowsWrapper

  • Move socket_notify_on_write to the RegisterForOnWriteableLocked method. This makes for a nice invariant that no async callback is pending unless a RegisterForOnWriteableLocked or RegisterForOnReadableLocked callback is pending.

Related: internal issue b/293321613

@apolcyn apolcyn added release notes: yes Indicates if PR needs to be in release notes release notes: no Indicates if PR should not be in release notes and removed release notes: yes Indicates if PR needs to be in release notes release notes: no Indicates if PR should not be in release notes labels Aug 2, 2023
@apolcyn apolcyn marked this pull request as ready for review August 3, 2023 15:51
@apolcyn apolcyn requested a review from markdroth as a code owner August 3, 2023 15:51
Copy link
Copy Markdown
Member

@yijiem yijiem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Alex. While I understand the simplification of changing from using async WSA send to sync WSA send on a non-blocking socket, I'm wondering why are we making this change since we have already developed the machinery for doing the async WSA send with IOCP. Is there a bug in our implementation?

GetName(), buf.len, bytes_sent_ptr != nullptr ? *bytes_sent_ptr : 0,
overlapped, out, *wsa_error_code);
return out;
last_wsa_send_result_, *wsa_error_code);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be ret, *wsa_error_code);?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted this change so this is the way it was before

"fd:|%s| created with params af:%d type:%d protocol:%d",
polled_fd->GetName(), af, type, protocol);
map->AddNewSocket(s, polled_fd);
auto insert_result = self->sockets_.insert({s, polled_fd});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe just one line GPR_ASSERT(self->sockets_.insert({s, polled_fd}).second); ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


GrpcPolledFdWindows(ares_socket_t as, Mutex* mu, int address_family,
int socket_type)
int socket_type, std::function<void()> on_shutdown_locked)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we prefer absl::AnyInvocable<void()> over std::function<void()>: go/totw/191.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (!connect_done_) {
GPR_ASSERT(!pending_continue_register_for_on_writeable_locked_);
pending_continue_register_for_on_writeable_locked_ = true;
grpc_socket_notify_on_write(winsocket_, &on_tcp_connect_locked_);
Copy link
Copy Markdown
Member

@yijiem yijiem Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add this here? AFAICT, ConnectTCP() already calls grpc_socket_notify_on_write() for an async connect: https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc#L567

Copy link
Copy Markdown
Contributor Author

@apolcyn apolcyn Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with grpc_socket_notify_on_write within the connect callback is that we are registering an async callback without taking a ref.

When RegisterForOnWriteableLocked is called, the c-ares wrapper code takes a matching ref which will not get released until the writable callback is called. We should be holding a ref around the connect callback, so it's convenient to piggyback on the ref held over RegisterForOnWriteableLocked.

Copy link
Copy Markdown
Member

@yijiem yijiem Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this seems very subtle. Did the issue happen when we are shutting down between an async TCP connect and its completion? What exactly happened? Did the OnTcpConnect callback get called later because the poller was shutting down and causing crash or further failures?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense. Suggest add a comment here: // Register an async OnTcpConnect callback here since we are guaranteed to hold a ref of the c-ares wrapper before write_closure_ is called. .

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the comment

@yijiem yijiem requested a review from drfloob August 4, 2023 22:11
@apolcyn
Copy link
Copy Markdown
Contributor Author

apolcyn commented Aug 23, 2023

Thanks Alex. While I understand the simplification of changing from using async WSA send to sync WSA send on a non-blocking socket, I'm wondering why are we making this change since we have already developed the machinery for doing the async WSA send with IOCP. Is there a bug in our implementation?

Yeah, doing that in this PR too might have been overzealous and it's independent from the rest of the changes here. I've reverted back to the async WSA send approach (I don't actually see a bug in this code, it's just a lot of machinery)

#include "src/core/lib/iomgr/sockaddr_windows.h"
#include "src/core/lib/iomgr/socket_windows.h"
#include "src/core/lib/iomgr/tcp_windows.h"
#include "src/core/lib/iomgr/timer.h"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

// grpc_winsocket_shutdown calls closesocket which invalidates our
// socket -> polled_fd mapping because the socket handle can be henceforth
// reused.
self->sockets_.erase(s);
Copy link
Copy Markdown
Member

@yijiem yijiem Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we leak memory of the GrpcPolledFdWindows without a delete? Can we use std::map<SOCKET, std::unique_ptr<GrpcPolledFdWindows>> instead so that we don't need to worry about having to call delete explicitly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are erasing the polled fd from the map, then it's because ShutdownLocked was called on the polled fd. This means that the c-ares wrapper actually owns the polled fd - it deletes it here.

Note that the polled fd factory only deletes those polled fds not still remaining in the map at the end of the resolution - i.e. polled fds created by the c-ares library and not ever seen by the c-ares grpc wrapper.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks for the explanation! I have follow-up questions: Is it because some operations on the socket encountered some errors so that c-ares decided to close it without returning it back to the c-ares wrapper through ares_getsock()? In that case, why don't we just actually close that socket in CloseSocket and destroy the GrpcPolledFdWindows but wait till the end when the factory is destroyed?

Copy link
Copy Markdown
Contributor Author

@apolcyn apolcyn Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it because some operations on the socket encountered some errors so that c-ares decided to close it without returning it back to the c-ares wrapper through ares_getsock()?

Yes, this is the practical case I'm thinking of.

In that case, why don't we just actually close that socket in CloseSocket and destroy the GrpcPolledFdWindows but wait till the end when the factory is destroyed?

We could do this, but I don't see the benefit, and it would add extra complexity because we would need to track additional state per polled fd: something like bool owned_by_c_ares_lib_ which would start out true and be set false in NewGrpcPolledFdLocked.

CloseSocket would then need to check owned_by_c_ares_lib_, and if true we'd need to:

  1. shut down the endpoint

  2. remove the entry from the sockets_ map

  3. delete the endpoint

IMO, that would be more complex than the current model of this PR, where we just shutdown/destroy everything left in sockets_ at the end.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, I'm all for simplifying this code.

int ConnectUDP(WSAErrorContext* wsa_error_ctx, const struct sockaddr* target,
ares_socklen_t target_len) {
GRPC_CARES_TRACE_LOG("fd:%s ConnectUDP", GetName());
GPR_ASSERT(!connect_done_);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry no real reason - undid this change

if (!connect_done_) {
GPR_ASSERT(!pending_continue_register_for_on_writeable_locked_);
pending_continue_register_for_on_writeable_locked_ = true;
grpc_socket_notify_on_write(winsocket_, &on_tcp_connect_locked_);
Copy link
Copy Markdown
Member

@yijiem yijiem Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this seems very subtle. Did the issue happen when we are shutting down between an async TCP connect and its completion? What exactly happened? Did the OnTcpConnect callback get called later because the poller was shutting down and causing crash or further failures?

@apolcyn
Copy link
Copy Markdown
Contributor Author

apolcyn commented Aug 25, 2023

Ok, this seems very subtle. Did the issue happen when we are shutting down between an async TCP connect and its completion? What exactly happened? Did the OnTcpConnect callback get called later because the poller was shutting down and causing crash or further failures?

I don't actually have a repro of the original bugs seen in the wild, and I actually have not repro'd a crash here yet.

Just reasoning about the code though - before this change there is nothing to prevent the TcpConnect callback from being invoked after the polled fd is destroyed.

This reverts commit fad4beb.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bloat/none imported Specifies if the PR has been imported to the internal repository lang/c++ lang/core per-call-memory/neutral per-channel-memory/neutral platform/Windows release notes: yes Indicates if PR needs to be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants