[windows DNS] Simplify c-ares Windows code#33965
Conversation
yijiem
left a comment
There was a problem hiding this comment.
Thanks Alex. While I understand the simplification of changing from using async WSA send to sync WSA send on a non-blocking socket, I'm wondering why are we making this change since we have already developed the machinery for doing the async WSA send with IOCP. Is there a bug in our implementation?
| GetName(), buf.len, bytes_sent_ptr != nullptr ? *bytes_sent_ptr : 0, | ||
| overlapped, out, *wsa_error_code); | ||
| return out; | ||
| last_wsa_send_result_, *wsa_error_code); |
There was a problem hiding this comment.
Should this be ret, *wsa_error_code);?
There was a problem hiding this comment.
reverted this change so this is the way it was before
| "fd:|%s| created with params af:%d type:%d protocol:%d", | ||
| polled_fd->GetName(), af, type, protocol); | ||
| map->AddNewSocket(s, polled_fd); | ||
| auto insert_result = self->sockets_.insert({s, polled_fd}); |
There was a problem hiding this comment.
nit: maybe just one line GPR_ASSERT(self->sockets_.insert({s, polled_fd}).second); ?
|
|
||
| GrpcPolledFdWindows(ares_socket_t as, Mutex* mu, int address_family, | ||
| int socket_type) | ||
| int socket_type, std::function<void()> on_shutdown_locked) |
There was a problem hiding this comment.
I think we prefer absl::AnyInvocable<void()> over std::function<void()>: go/totw/191.
| if (!connect_done_) { | ||
| GPR_ASSERT(!pending_continue_register_for_on_writeable_locked_); | ||
| pending_continue_register_for_on_writeable_locked_ = true; | ||
| grpc_socket_notify_on_write(winsocket_, &on_tcp_connect_locked_); |
There was a problem hiding this comment.
Why do we need to add this here? AFAICT, ConnectTCP() already calls grpc_socket_notify_on_write() for an async connect: https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc#L567
There was a problem hiding this comment.
The problem with grpc_socket_notify_on_write within the connect callback is that we are registering an async callback without taking a ref.
When RegisterForOnWriteableLocked is called, the c-ares wrapper code takes a matching ref which will not get released until the writable callback is called. We should be holding a ref around the connect callback, so it's convenient to piggyback on the ref held over RegisterForOnWriteableLocked.
There was a problem hiding this comment.
Ok, this seems very subtle. Did the issue happen when we are shutting down between an async TCP connect and its completion? What exactly happened? Did the OnTcpConnect callback get called later because the poller was shutting down and causing crash or further failures?
There was a problem hiding this comment.
Yeah, that makes sense. Suggest add a comment here: // Register an async OnTcpConnect callback here since we are guaranteed to hold a ref of the c-ares wrapper before write_closure_ is called. .
Yeah, doing that in this PR too might have been overzealous and it's independent from the rest of the changes here. I've reverted back to the async WSA send approach (I don't actually see a bug in this code, it's just a lot of machinery) |
| #include "src/core/lib/iomgr/sockaddr_windows.h" | ||
| #include "src/core/lib/iomgr/socket_windows.h" | ||
| #include "src/core/lib/iomgr/tcp_windows.h" | ||
| #include "src/core/lib/iomgr/timer.h" |
| // grpc_winsocket_shutdown calls closesocket which invalidates our | ||
| // socket -> polled_fd mapping because the socket handle can be henceforth | ||
| // reused. | ||
| self->sockets_.erase(s); |
There was a problem hiding this comment.
Didn't we leak memory of the GrpcPolledFdWindows without a delete? Can we use std::map<SOCKET, std::unique_ptr<GrpcPolledFdWindows>> instead so that we don't need to worry about having to call delete explicitly?
There was a problem hiding this comment.
If we are erasing the polled fd from the map, then it's because ShutdownLocked was called on the polled fd. This means that the c-ares wrapper actually owns the polled fd - it deletes it here.
Note that the polled fd factory only deletes those polled fds not still remaining in the map at the end of the resolution - i.e. polled fds created by the c-ares library and not ever seen by the c-ares grpc wrapper.
There was a problem hiding this comment.
Got it. Thanks for the explanation! I have follow-up questions: Is it because some operations on the socket encountered some errors so that c-ares decided to close it without returning it back to the c-ares wrapper through ares_getsock()? In that case, why don't we just actually close that socket in CloseSocket and destroy the GrpcPolledFdWindows but wait till the end when the factory is destroyed?
There was a problem hiding this comment.
Is it because some operations on the socket encountered some errors so that c-ares decided to close it without returning it back to the c-ares wrapper through ares_getsock()?
Yes, this is the practical case I'm thinking of.
In that case, why don't we just actually close that socket in CloseSocket and destroy the GrpcPolledFdWindows but wait till the end when the factory is destroyed?
We could do this, but I don't see the benefit, and it would add extra complexity because we would need to track additional state per polled fd: something like bool owned_by_c_ares_lib_ which would start out true and be set false in NewGrpcPolledFdLocked.
CloseSocket would then need to check owned_by_c_ares_lib_, and if true we'd need to:
-
shut down the endpoint
-
remove the entry from the
sockets_map -
delete the endpoint
IMO, that would be more complex than the current model of this PR, where we just shutdown/destroy everything left in sockets_ at the end.
There was a problem hiding this comment.
SGTM, I'm all for simplifying this code.
| int ConnectUDP(WSAErrorContext* wsa_error_ctx, const struct sockaddr* target, | ||
| ares_socklen_t target_len) { | ||
| GRPC_CARES_TRACE_LOG("fd:%s ConnectUDP", GetName()); | ||
| GPR_ASSERT(!connect_done_); |
There was a problem hiding this comment.
sorry no real reason - undid this change
| if (!connect_done_) { | ||
| GPR_ASSERT(!pending_continue_register_for_on_writeable_locked_); | ||
| pending_continue_register_for_on_writeable_locked_ = true; | ||
| grpc_socket_notify_on_write(winsocket_, &on_tcp_connect_locked_); |
There was a problem hiding this comment.
Ok, this seems very subtle. Did the issue happen when we are shutting down between an async TCP connect and its completion? What exactly happened? Did the OnTcpConnect callback get called later because the poller was shutting down and causing crash or further failures?
I don't actually have a repro of the original bugs seen in the wild, and I actually have not repro'd a crash here yet. Just reasoning about the code though - before this change there is nothing to prevent the |
This reverts commit fad4beb.
A set of simplifications to make this code easier to reason about:
Replace
SockToPolledFdMapwithstd::mapMake the c-ares
closecallback do nothing. Instead, let the ares wrapper code destroy polled fds as it normally does, and let everything that hasn't been registered for I/O get destroyed in theGrpcPolledFdFactoryWindowsdtor.Get rid of
GrpcPolledFdWindowsWrapperMove
socket_notify_on_writeto theRegisterForOnWriteableLockedmethod. This makes for a nice invariant that no async callback is pending unless aRegisterForOnWriteableLockedorRegisterForOnReadableLockedcallback is pending.Related: internal issue b/293321613