[windows DNS] Simplify c-ares Windows code by apolcyn · Pull Request #33965 · grpc/grpc

apolcyn · 2023-08-02T19:13:07Z

A set of simplifications to make this code easier to reason about:

Replace SockToPolledFdMap with std::map
Make the c-ares close callback do nothing. Instead, let the ares wrapper code destroy polled fds as it normally does, and let everything that hasn't been registered for I/O get destroyed in the GrpcPolledFdFactoryWindows dtor.
Get rid of GrpcPolledFdWindowsWrapper
Move socket_notify_on_write to the RegisterForOnWriteableLocked method. This makes for a nice invariant that no async callback is pending unless a RegisterForOnWriteableLocked or RegisterForOnReadableLocked callback is pending.

Related: internal issue b/293321613

yijiem

Thanks Alex. I understand the simplification of switching from async WSA send to sync WSA send on a non-blocking socket, but I'm wondering why we are making this change, since we have already developed the machinery for async WSA send with IOCP. Is there a bug in our implementation?

yijiem · 2023-08-04T04:58:26Z

src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc

        GetName(), buf.len, bytes_sent_ptr != nullptr ? *bytes_sent_ptr : 0,
-        overlapped, out, *wsa_error_code);
-    return out;
+        last_wsa_send_result_, *wsa_error_code);


Should this be ret, *wsa_error_code);?

reverted this change so this is the way it was before

yijiem · 2023-08-04T16:55:06Z

src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc

        "fd:|%s| created with params af:%d type:%d protocol:%d",
        polled_fd->GetName(), af, type, protocol);
-    map->AddNewSocket(s, polled_fd);
+    auto insert_result = self->sockets_.insert({s, polled_fd});


nit: maybe just one line GPR_ASSERT(self->sockets_.insert({s, polled_fd}).second); ?
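The one-line form suggested above relies on std::map::insert returning a pair whose .second is false when the key is already present. A minimal sketch of that pattern follows. SKETCH_CHECK, SocketHandle, PolledFd, TryAdd, and AddNewSocket are hypothetical stand-ins and not names from the PR. The sketch assumes, as is the case for GPR_ASSERT, that the check macro evaluates its argument in every build mode; a plain assert() would skip the insert when NDEBUG is defined.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <map>

// Hypothetical stand-in for GPR_ASSERT: always evaluates its argument,
// including in builds that define NDEBUG.
#define SKETCH_CHECK(cond)                               \
  do {                                                   \
    if (!(cond)) {                                       \
      std::fprintf(stderr, "check failed: %s\n", #cond); \
      std::abort();                                      \
    }                                                    \
  } while (0)

using SocketHandle = unsigned long long;  // stand-in for SOCKET
struct PolledFd {                         // stand-in for GrpcPolledFdWindows
  int id;
};
using SocketMap = std::map<SocketHandle, PolledFd*>;

// Returns true only if the handle was not tracked yet; an existing entry
// is left untouched by std::map::insert.
bool TryAdd(SocketMap& sockets, SocketHandle s, PolledFd* fd) {
  return sockets.insert({s, fd}).second;
}

// One-line form: insert and check for a duplicate key in a single statement.
void AddNewSocket(SocketMap& sockets, SocketHandle s, PolledFd* fd) {
  SKETCH_CHECK(sockets.insert({s, fd}).second);
}
```

The two-statement form in the diff (store insert_result, then assert on it) behaves the same way; the one-liner only saves a local variable.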

yijiem · 2023-08-04T18:12:44Z

src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc

-
  GrpcPolledFdWindows(ares_socket_t as, Mutex* mu, int address_family,
-                      int socket_type)
+                      int socket_type, std::function<void()> on_shutdown_locked)


I think we prefer absl::AnyInvocable<void()> over std::function<void()>: go/totw/191.

yijiem · 2023-08-04T20:57:59Z

src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc

+    if (!connect_done_) {
+      GPR_ASSERT(!pending_continue_register_for_on_writeable_locked_);
      pending_continue_register_for_on_writeable_locked_ = true;
+      grpc_socket_notify_on_write(winsocket_, &on_tcp_connect_locked_);


Why do we need to add this here? AFAICT, ConnectTCP() already calls grpc_socket_notify_on_write() for an async connect: https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc#L567

The problem with grpc_socket_notify_on_write within the connect callback is that we are registering an async callback without taking a ref.

When RegisterForOnWriteableLocked is called, the c-ares wrapper code takes a matching ref which will not get released until the writable callback is called. We should be holding a ref around the connect callback, so it's convenient to piggyback on the ref held over RegisterForOnWriteableLocked.

Ok, this seems very subtle. Did the issue happen when we are shutting down between an async TCP connect and its completion? What exactly happened? Did the OnTcpConnect callback get called later because the poller was shutting down and causing crash or further failures?

Yeah, that makes sense. I suggest adding a comment here: // Register an async OnTcpConnect callback here since we are guaranteed to hold a ref of the c-ares wrapper before write_closure_ is called.

added the comment
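The ref-piggybacking idea in this thread can be sketched in isolation. Everything below is a simplified model and not the gRPC code: FakePoller, PolledFd, and DemoCallbackRunsBeforeDestruction are hypothetical names, and std::shared_ptr stands in for whatever ref the c-ares wrapper actually takes. The point it illustrates is that the connect callback is registered only inside RegisterForOnWriteable, so the ref held until the write closure runs also covers the pending connect callback.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical poller: stores callbacks and runs them later.
class FakePoller {
 public:
  void NotifyOnWrite(std::function<void()> cb) {
    pending_.push_back(std::move(cb));
  }
  void RunAll() {
    std::vector<std::function<void()>> cbs = std::move(pending_);
    pending_.clear();
    for (auto& cb : cbs) cb();
  }

 private:
  std::vector<std::function<void()>> pending_;
};

// Hypothetical polled fd. The caller's write closure is assumed to hold a ref
// that keeps this object alive until the closure has run.
class PolledFd {
 public:
  PolledFd(FakePoller* poller, bool* destroyed)
      : poller_(poller), destroyed_(destroyed) {}
  ~PolledFd() { *destroyed_ = true; }

  void RegisterForOnWriteable(std::function<void()> write_closure) {
    assert(!pending_register_);
    pending_register_ = true;
    write_closure_ = std::move(write_closure);
    // The async callback is registered here, under the caller's ref, for both
    // the "connect still pending" and "already connected" cases.
    poller_->NotifyOnWrite([this] { OnWritable(); });
  }

 private:
  void OnWritable() {
    connect_done_ = true;
    pending_register_ = false;
    std::function<void()> closure = std::move(write_closure_);
    write_closure_ = nullptr;
    closure();
    // `closure` may hold the last ref; no member access after this point.
  }

  FakePoller* poller_;
  bool* destroyed_;
  bool connect_done_ = false;
  bool pending_register_ = false;
  std::function<void()> write_closure_;
};

// Returns true if the object stayed alive until its callback had run.
bool DemoCallbackRunsBeforeDestruction() {
  FakePoller poller;
  bool destroyed = false;
  bool destroyed_when_callback_ran = true;
  {
    auto fd = std::make_shared<PolledFd>(&poller, &destroyed);
    fd->RegisterForOnWriteable(
        [ref = fd, &destroyed, &destroyed_when_callback_ran] {
          destroyed_when_callback_ran = destroyed;
        });
  }  // The caller's own handle goes away; the closure still holds a ref.
  const bool alive_while_pending = !destroyed;
  poller.RunAll();
  return alive_while_pending && !destroyed_when_callback_ran && destroyed;
}
```

If the connect callback were instead registered from a path that holds no ref, nothing in this model would stop the object from being destroyed while the poller still had a callback pointing at it.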

apolcyn · 2023-08-23T18:00:27Z

Thanks Alex. I understand the simplification of switching from async WSA send to sync WSA send on a non-blocking socket, but I'm wondering why we are making this change, since we have already developed the machinery for async WSA send with IOCP. Is there a bug in our implementation?

Yeah, doing that in this PR too might have been overzealous, and it's independent of the rest of the changes here. I've reverted to the async WSA send approach (I don't actually see a bug in this code; it's just a lot of machinery).

yijiem · 2023-08-24T21:18:21Z

src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc

 #include "src/core/lib/iomgr/sockaddr_windows.h"
 #include "src/core/lib/iomgr/socket_windows.h"
 #include "src/core/lib/iomgr/tcp_windows.h"
+#include "src/core/lib/iomgr/timer.h"


Is this still needed?

yijiem · 2023-08-24T22:35:28Z

src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc

+      // grpc_winsocket_shutdown calls closesocket which invalidates our
+      // socket -> polled_fd mapping because the socket handle can be henceforth
+      // reused.
+      self->sockets_.erase(s);


Don't we leak the GrpcPolledFdWindows here, since there is no delete? Could we use std::map<SOCKET, std::unique_ptr<GrpcPolledFdWindows>> instead so that we don't have to call delete explicitly?

If we are erasing the polled fd from the map, then it's because ShutdownLocked was called on the polled fd. This means that the c-ares wrapper actually owns the polled fd - it deletes it here.

Note that the polled fd factory only deletes those polled fds still remaining in the map at the end of the resolution, i.e. polled fds created by the c-ares library and never seen by the c-ares grpc wrapper.

Got it. Thanks for the explanation! I have follow-up questions: Is it because some operations on the socket encountered errors, so that c-ares decided to close it without returning it to the c-ares wrapper through ares_getsock()? In that case, why don't we actually close that socket in CloseSocket and destroy the GrpcPolledFdWindows there, instead of waiting until the factory is destroyed?

Is it because some operations on the socket encountered some errors so that c-ares decided to close it without returning it back to the c-ares wrapper through ares_getsock()?

Yes, this is the practical case I'm thinking of.

In that case, why don't we actually close that socket in CloseSocket and destroy the GrpcPolledFdWindows there, instead of waiting until the factory is destroyed?

We could do this, but I don't see the benefit, and it would add extra complexity because we would need to track additional state per polled fd: something like bool owned_by_c_ares_lib_ which would start out true and be set false in NewGrpcPolledFdLocked.

CloseSocket would then need to check owned_by_c_ares_lib_, and if true we'd need to:

shut down the endpoint

remove the entry from the sockets_ map

delete the endpoint

IMO, that would be more complex than the current model of this PR, where we just shutdown/destroy everything left in sockets_ at the end.

SGTM, I'm all for simplifying this code.
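The ownership model agreed on in this thread can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's code: Factory, PolledFd, SocketHandle, and the method names are hypothetical, and a live-object counter replaces real socket cleanup. Entries erased during shutdown are assumed to be deleted by the wrapper, and the factory destructor deletes only what is still in the map.

```cpp
#include <cassert>
#include <map>

using SocketHandle = unsigned long long;  // stand-in for SOCKET

// Stand-in for GrpcPolledFdWindows; tracks how many instances are alive.
struct PolledFd {
  explicit PolledFd(int* live) : live_(live) { ++*live_; }
  ~PolledFd() { --*live_; }
  int* live_;
};

class Factory {
 public:
  explicit Factory(int* live_count) : live_count_(live_count) {}
  Factory(const Factory&) = delete;
  Factory& operator=(const Factory&) = delete;

  // Destroys only what is still in the map: polled fds that were created for
  // the c-ares library but never shut down through the wrapper.
  ~Factory() {
    for (auto& entry : sockets_) delete entry.second;
  }

  // Called when the c-ares library opens a socket.
  bool OnSocketCreated(SocketHandle s) {
    auto* fd = new PolledFd(live_count_);
    if (!sockets_.insert({s, fd}).second) {
      delete fd;  // handle already tracked; do not leak the new object
      return false;
    }
    return true;
  }

  // Hands the polled fd to the wrapper; the mapping stays until shutdown.
  PolledFd* Lookup(SocketHandle s) const {
    auto it = sockets_.find(s);
    return it == sockets_.end() ? nullptr : it->second;
  }

  // Called from the polled fd's shutdown path. The handle value may be reused
  // afterwards, so the mapping is dropped. No delete here: in this model the
  // wrapper owns the object at this point.
  void OnShutdown(SocketHandle s) { sockets_.erase(s); }

 private:
  int* live_count_;
  std::map<SocketHandle, PolledFd*> sockets_;
};

// Returns the number of PolledFd objects still alive after teardown.
int DemoLiveCountAfterFactoryDestroyed() {
  int live = 0;
  {
    Factory factory(&live);
    factory.OnSocketCreated(1);  // later seen and shut down by the wrapper
    factory.OnSocketCreated(2);  // never seen by the wrapper
    PolledFd* seen = factory.Lookup(1);
    factory.OnShutdown(1);
    delete seen;  // the wrapper deletes what it owns
  }               // the factory dtor deletes the entry for handle 2
  return live;
}
```

A std::map of std::unique_ptr values, as suggested earlier in the thread, would not fit this model directly, because erase() would then destroy an object the wrapper is still expected to delete.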

yijiem · 2023-08-24T22:42:57Z

src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc

  int ConnectUDP(WSAErrorContext* wsa_error_ctx, const struct sockaddr* target,
                 ares_socklen_t target_len) {
    GRPC_CARES_TRACE_LOG("fd:%s ConnectUDP", GetName());
-    GPR_ASSERT(!connect_done_);


Why remove this?

sorry no real reason - undid this change

yijiem · 2023-08-24T23:23:04Z

src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc

+    if (!connect_done_) {
+      GPR_ASSERT(!pending_continue_register_for_on_writeable_locked_);
      pending_continue_register_for_on_writeable_locked_ = true;
+      grpc_socket_notify_on_write(winsocket_, &on_tcp_connect_locked_);


Ok, this seems very subtle. Did the issue happen when we are shutting down between an async TCP connect and its completion? What exactly happened? Did the OnTcpConnect callback get called later because the poller was shutting down and causing crash or further failures?

apolcyn · 2023-08-25T18:26:58Z

Ok, this seems very subtle. Did the issue happen when we are shutting down between an async TCP connect and its completion? What exactly happened? Did the OnTcpConnect callback get called later because the poller was shutting down and causing crash or further failures?

I don't actually have a repro of the original bugs seen in the wild, and I actually have not repro'd a crash here yet.

Just reasoning about the code though - before this change there is nothing to prevent the TcpConnect callback from being invoked after the polled fd is destroyed.

apolcyn added 5 commits August 1, 2023 23:42

simplify c-ares windows code

6449e84

progress, using delay timer

143dc82

move connect to on_writable

0c84ba4

simplify code

07df437

format

de9d71b

github-actions bot added the lang/core label Aug 2, 2023

grpc-checks bot added the per-call-memory/neutral, per-channel-memory/neutral, and bloat/none labels Aug 2, 2023

apolcyn added 16 commits August 2, 2023 20:18

fixes

fbeb2a8

format code

764e2c5

fix

26bf142

fix

0e1ef14

fix

5d44b33

sanity

9f51ed0

format

02f5754

fix

2601bf6

fixes

13177a5

format

8453d00

fix

9bca55f

fix

4f38b14

fix

1fd7513

fix

1d2a2ad

fixes

e3cf57c

format

cf0b7dd

apolcyn marked this pull request as ready for review August 3, 2023 15:51

apolcyn requested a review from markdroth as a code owner August 3, 2023 15:51

yijiem reviewed Aug 4, 2023

yijiem requested a review from drfloob August 4, 2023 22:11

apolcyn added 7 commits August 22, 2023 23:30

keep TCP write machinery as is

a2f83d2

remove dead code

a7baac8

fix log

da6472d

use any invocable

0c48324

fix build

41b4b5d

fix build

511bb8d

put original logs back in

cd90ba5

yijiem reviewed Aug 24, 2023

apolcyn added 2 commits August 25, 2023 18:21

undo needless chanbge

9fda93d

Remove header

0e4c9d5

add comment

5c3cebf

yijiem approved these changes Aug 25, 2023

apolcyn added 2 commits August 30, 2023 04:41

Merge remote-tracking branch 'origin/master' into simpler_windows_dns

490b352

test change

fad4beb

github-actions bot added the lang/c++ label Aug 30, 2023

Revert "test change"

c4263e0

This reverts commit fad4beb.

apolcyn merged commit 5d85d7d into grpc:master Aug 30, 2023

copybara-service bot added the imported label Aug 30, 2023

apolcyn mentioned this pull request Aug 30, 2023

[DNS test] unskip a test on windows #34209

Merged

apolcyn added a commit that referenced this pull request Aug 31, 2023

[DNS test] unskip a test on windows (#34209)

2d2e989

Unskip since #33965 merged

This was referenced Sep 26, 2023

GRPC C++ crash in SockToPolledFdMap::CloseSocket() for Windows #34483

Closed

free_base crash for Windows C++ GRPC inside GrpcPolledFdWindows::ContinueRegisterForOnReadableLocked() #34540

Closed

apolcyn mentioned this pull request Jan 31, 2024

Sporadic crashes in grpc_core::GrpcPolledFdWindows::SendVUDP #22555

Closed

ti-chi-bot bot mentioned this pull request Jul 29, 2025

update v1.59.0 tikv/grpc#55

Open
