Skip to content

Add a backup poller to c-ares resolver to take better advantage of c-ares query timeout/retry logic#17688

Merged
apolcyn merged 1 commit intogrpc:masterfrom
apolcyn:ares_timeouts
May 8, 2019
Merged

Add a backup poller to c-ares resolver to take better advantage of c-ares query timeout/retry logic#17688
apolcyn merged 1 commit intogrpc:masterfrom
apolcyn:ares_timeouts

Conversation

@apolcyn
Copy link
Copy Markdown
Contributor

@apolcyn apolcyn commented Jan 11, 2019

Follows the suggestion in #17436 (comment)

Fixes #18637

In case of non-responsive DNS servers, dropped packets, etc., c-ares has intelligent timeout and retry logic, which we can take better advantage of by polling ares_process_fd on time intervals. Currently, we already are kind of using c-ares timeouts, but by not polling on either a steady interval or on one given by ares_timeout(), we're partially snuffing c-ares's internal timeout/retry logic - i.e. we happen to check for timeouts only when fd events happen, but we should allow c-ares to check for timeouts even when they don't.

The change in this PR makes the following manual test pass, where before it failed:

  1. Edit your /etc/resolv.conf file to add nameserver <random ipv4 address> at the top of your nameserver list. For example, edit your /etc/resolv.conf file so that it changes from something like this:
nameserver AAAA
nameserver BBBB
nameserver CCCC

to:

nameserver <bogus name server>
nameserver AAAA
nameserver BBBB
nameserver CCCC
  1. Run the interop client:
make interop_client -j8
bins/opt/interop_client --server_host=grpc-test.sandbox.googleapis.com --server_port=443 --use_tls=true  --service_account_key_file=service_account/GrpcTesting-726eb1347f15.json --test_case=jwt_token_creds

After 10 seconds of hanging (during DNS resolution to the non-responsive nameserver at the top of the list), it will fail with:

E0403 11:29:30.422813019   14512 interop_client.cc:138]      Error status code: 14 (expected: 0), message: DNS resolution failed, debug string: {"created":"@1554316170.422439999","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2715,"referenced_errors":[{"created":"@1554316170.422427149","description":"Resolver transient failure","file":"src/core/ext/filters/client_channel/resolving_lb_policy.cc","file_line":335,"referenced_errors":[{"created":"@1554316170.422412643","description":"DNS resolution failed","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc","file_line":326,"grpc_status":14,"referenced_errors":[{"created":"@1554316170.422325321","description":"C-ares status is not ARES_SUCCESS: DNS query cancelled","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":240,"referenced_errors":[{"created":"@1554316170.422320413","description":"C-ares status is not ARES_SUCCESS: DNS query cancelled","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":240}]}]}]}]}
Aborted

After applying the change in this PR, the test passes, because c-ares will time out the queries to the non-responsive DNS server (by default, c-ares sets query timeouts starting at 5 seconds, and then increases that exponentially). The backup poller added here will give c-ares a chance to transition over to the next nameserver in the list and thus be able to carry out resolution successfully.

Also, I think that this will fix #18126. I was able to capture the test flake in that issue with cares_resolver tracing on, and the saw this log:

NUnitLite 3.10.1 (.NET Framework 4.5)
Copyright (c) 2018 Charlie Poole, Rob Prouse
Runtime Environment
   OS Version: Microsoft Windows NT 6.1.7601.65536
  CLR Version: 4.0.30319.42000
Test Files
    T:/src/github/grpc/workspace_csharp_windows_opt_native/src/csharp/Grpc.IntegrationTesting/bin/Release/net45/Grpc.IntegrationTesting.exe
Test Filters
    Test: Grpc.IntegrationTesting.ExternalDnsClientServerTest
=> Grpc.IntegrationTesting.ExternalDnsClientServerTest
D0403 01:34:30.139648 Grpc.Core.Internal.UnmanagedLibrary Attempting to load native library "T:\src\github\grpc\workspace_csharp_windows_opt_native\src\csharp\Grpc.IntegrationTesting\bin\Release\net45\grpc_csharp_ext.x64.dll"
D0403 01:34:30.249023 Grpc.Core.Internal.NativeExtension gRPC native library loaded successfully.
D0403 01:34:30.249023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\dns_resolver_ares.cc:436: Using ares dns resolver
=> Grpc.IntegrationTesting.ExternalDnsClientServerTest.EmptyUnary
D0403 01:34:30.467773 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\dns_resolver_ares.cc:176: (c-ares resolver) resolver:0000000019FB3A30 AresDnsResolver::StartLocked() is called.
D0403 01:34:30.467773 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_wrapper.cc:536: (c-ares resolver) request:0000000000C41B40 c-ares grpc_dns_lookup_ares_locked_impl name=loopback6.unittest.grpc.io:54268, default_port=https
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:143: (c-ares resolver) request:0000000000C41B40 grpc_ares_ev_driver_create_locked
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:285: (c-ares resolver) Connect: fd:|c-ares socket: 968|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:251: (c-ares resolver) SendV called on fd:|c-ares socket: 968|. Current write state: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:224: (c-ares resolver) WSASend: name:c-ares socket: 968. buf len:44. bytes sent: 44. overlapped 0000000000000000. return val: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:251: (c-ares resolver) SendV called on fd:|c-ares socket: 968|. Current write state: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:224: (c-ares resolver) WSASend: name:c-ares socket: 968. buf len:44. bytes sent: 44. overlapped 0000000000000000. return val: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:289: (c-ares resolver) request:0000000000C41B40 new fd: c-ares socket: 968
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:94: (c-ares resolver) request:0000000000C41B40 Ref ev_driver 0000000000C08D80
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:308: (c-ares resolver) request:0000000000C41B40 notify read on: c-ares socket: 968
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:128: (c-ares resolver) RegisterForOnReadableLocked: WSARecvFrom error:|Overlapped I/O operation is in progress.
|. fd:|c-ares socket: 968|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:361: (c-ares resolver) request:0000000000C41B40 ev_driver=0000000000C08D80 grpc_ares_ev_driver_start_locked. timeout in 10217 ms
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:94: (c-ares resolver) request:0000000000C41B40 Ref ev_driver 0000000000C08D80
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\dns_resolver_ares.cc:395: (c-ares resolver) resolver:0000000019FB3A30 Started resolving. pending_request_:0000000000C41B40
=> Grpc.IntegrationTesting.ExternalDnsClientServerTest
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:332: (c-ares resolver) OnIocpReadable finishing. read buf length now:|118|. :fd:|c-ares socket: 968|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:225: (c-ares resolver) request:0000000000C41B40 readable on c-ares socket: 968
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:177: (c-ares resolver) RecvFrom called on fd:|c-ares socket: 968|. Current read buf length:|118|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:251: (c-ares resolver) SendV called on fd:|c-ares socket: 968|. Current write state: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:224: (c-ares resolver) WSASend: name:c-ares socket: 968. buf len:64. bytes sent: 64. overlapped 0000000000000000. return val: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:177: (c-ares resolver) RecvFrom called on fd:|c-ares socket: 968|. Current read buf length:|0|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:94: (c-ares resolver) request:0000000000C41B40 Ref ev_driver 0000000000C08D80
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:308: (c-ares resolver) request:0000000000C41B40 notify read on: c-ares socket: 968
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:128: (c-ares resolver) RegisterForOnReadableLocked: WSARecvFrom error:|Overlapped I/O operation is in progress.
|. fd:|c-ares socket: 968|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:101: (c-ares resolver) request:0000000000C41B40 Unref ev_driver 0000000000C08D80
=> Grpc.IntegrationTesting.ExternalDnsClientServerTest.EmptyUnary
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:332: (c-ares resolver) OnIocpReadable finishing. read buf length now:|114|. :fd:|c-ares socket: 968|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:225: (c-ares resolver) request:0000000000C41B40 readable on c-ares socket: 968
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:177: (c-ares resolver) RecvFrom called on fd:|c-ares socket: 968|. Current read buf length:|114|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:251: (c-ares resolver) SendV called on fd:|c-ares socket: 968|. Current write state: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:224: (c-ares resolver) WSASend: name:c-ares socket: 968. buf len:73. bytes sent: 73. overlapped 0000000000000000. return val: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:177: (c-ares resolver) RecvFrom called on fd:|c-ares socket: 968|. Current read buf length:|0|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:94: (c-ares resolver) request:0000000000C41B40 Ref ev_driver 0000000000C08D80
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:308: (c-ares resolver) request:0000000000C41B40 notify read on: c-ares socket: 968
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:128: (c-ares resolver) RegisterForOnReadableLocked: WSARecvFrom error:|Overlapped I/O operation is in progress.
|. fd:|c-ares socket: 968|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:101: (c-ares resolver) request:0000000000C41B40 Unref ev_driver 0000000000C08D80
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:332: (c-ares resolver) OnIocpReadable finishing. read buf length now:|162|. :fd:|c-ares socket: 968|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:225: (c-ares resolver) request:0000000000C41B40 readable on c-ares socket: 968
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:177: (c-ares resolver) RecvFrom called on fd:|c-ares socket: 968|. Current read buf length:|162|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:251: (c-ares resolver) SendV called on fd:|c-ares socket: 968|. Current write state: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:224: (c-ares resolver) WSASend: name:c-ares socket: 968. buf len:68. bytes sent: 68. overlapped 0000000000000000. return val: 0
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:177: (c-ares resolver) RecvFrom called on fd:|c-ares socket: 968|. Current read buf length:|0|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:94: (c-ares resolver) request:0000000000C41B40 Ref ev_driver 0000000000C08D80
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:308: (c-ares resolver) request:0000000000C41B40 notify read on: c-ares socket: 968
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:128: (c-ares resolver) RegisterForOnReadableLocked: WSARecvFrom error:|Overlapped I/O operation is in progress.
|. fd:|c-ares socket: 968|
D0403 01:34:30.483398 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:101: (c-ares resolver) request:0000000000C41B40 Unref ev_driver 0000000000C08D80
D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:332: (c-ares resolver) OnIocpReadable finishing. read buf length now:|157|. :fd:|c-ares socket: 968|
D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:225: (c-ares resolver) request:0000000000C41B40 readable on c-ares socket: 968
D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:177: (c-ares resolver) RecvFrom called on fd:|c-ares socket: 968|. Current read buf length:|157|
D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_wrapper.cc:239: (c-ares resolver) request:0000000000C41B40 on_hostbyname_done_locked host=loopback6.unittest.grpc.io C-ares status is not ARES_SUCCESS: DNS server returned answer with no data
D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:177: (c-ares resolver) RecvFrom called on fd:|c-ares socket: 968|. Current read buf length:|0|
D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:94: (c-ares resolver) request:0000000000C41B40 Ref ev_driver 0000000000C08D80
D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:308: (c-ares resolver) request:0000000000C41B40 notify read on: c-ares socket: 968
D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver_windows.cc:128: (c-ares resolver) RegisterForOnReadableLocked: WSARecvFrom error:|Overlapped I/O operation is in progress.
|. fd:|c-ares socket: 968|
D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:101: (c-ares resolver) request:0000000000C41B40 Unref ev_driver 0000000000C08D80
running empty_unary
=> Grpc.IntegrationTesting.ExternalDnsClientServerTest
D0403 01:34:40.530273 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\dns_resolver_ares.cc:208: (c-ares resolver) resolver:0000000019FB3A30 re-resolution timer fired. error: "Cancelled". shutdown_initiated_: 1
D0403 01:34:40.530273 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\dns_resolver_ares.cc:167: (c-ares resolver) resolver:0000000019FB3A30 destroying AresDnsResolver
Errors, Failures and Warnings
1) Error : Grpc.IntegrationTesting.ExternalDnsClientServerTest.EmptyUnary
Grpc.Core.RpcException : Status(StatusCode=Unavailable, Detail="DNS resolution failed")
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Grpc.Core.Internal.AsyncCall`2.UnaryCall(TRequest msg)
   at Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.<BlockingUnaryCall>b__3_0[TRequest,TResponse](TRequest req, ClientInterceptorContext`2 ctx)
   at Grpc.Core.ClientBase.ClientBaseConfiguration.ClientBaseConfigurationInterceptor.BlockingUnaryCall[TRequest,TResponse](TRequest request, ClientInterceptorContext`2 context, BlockingUnaryCallContinuation`2 continuation)
   at Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[TRequest,TResponse](Method`2 method, String host, CallOptions options, TRequest request)
   at Grpc.Testing.TestService.TestServiceClient.EmptyCall(Empty request, CallOptions options)
   at Grpc.Testing.TestService.TestServiceClient.EmptyCall(Empty request, Metadata headers, Nullable`1 deadline, CancellationToken cancellationToken)
   at Grpc.IntegrationTesting.InteropClient.RunEmptyUnary(TestServiceClient client)
Run Settings
    Number of Test Workers: 1
    Work Directory: T:\src\github\grpc\workspace_csharp_windows_opt_native
    Internal Trace: Off
Test Run Summary
  Overall result: Failed
  Test Count: 1, Passed: 0, Failed: 1, Warnings: 0, Inconclusive: 0, Skipped: 0
    Failed Tests - Failures: 0, Errors: 1, Invalid: 0
  Start time: 2019-04-03 08:34:30Z
    End time: 2019-04-03 08:34:40Z
    Duration: 10.558 seconds

In particular, in that log, the jump between 01:34:30 and 01:34:40:

D0403 01:34:30.499023 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\grpc_ares_ev_driver.cc:101: (c-ares resolver) request:0000000000C41B40 Unref ev_driver 0000000000C08D80
running empty_unary
=> Grpc.IntegrationTesting.ExternalDnsClientServerTest
D0403 01:34:40.530273 0 T:\src\github\grpc\workspace_csharp_windows_opt_native\src\core\ext\filters\client_channel\resolver\dns\c_ares\dns_resolver_ares.cc:208: (c-ares resolver) resolver:0000000019FB3A30 re-resolution timer fired. error: "Cancelled". shutdown_initiated_: 1

... would be caused by the 10 second overall resolution timeout being triggered. Therefore, it seems that test flakes in that issue are likely due to transient failure to deliver DNS queries/responses (DNS queries always inherently risk being dropped anyways since UDP).

FTR it shouldn't block this change because it's pre-existing, but one thing that work on this change made brought to light is this c-ares issue: https://c-ares.haxx.se/mail/c-ares-archive-2015-03/0009.shtml


edit: I think that this will fix #18126, not necessarily #18411


This change is Reviewable

@apolcyn apolcyn added release notes: yes Indicates if PR needs to be in release notes lang/core labels Jan 11, 2019
@apolcyn apolcyn changed the title WIP: Revise c-ares timeouts to use c-ares's internal timeout/retry logic Revise c-ares timeouts to use c-ares's internal timeout/retry logic Jan 11, 2019
@apolcyn apolcyn force-pushed the ares_timeouts branch 2 times, most recently from b5b8faf to d0bd0da Compare January 11, 2019 04:49
@apolcyn apolcyn changed the title Revise c-ares timeouts to use c-ares's internal timeout/retry logic WIP: Revise c-ares timeouts to use c-ares's internal timeout/retry logic Jan 15, 2019
@apolcyn apolcyn changed the title WIP: Revise c-ares timeouts to use c-ares's internal timeout/retry logic Add a backup poller to c-ares to take better advantage of timeout/retry logic Jan 15, 2019
@apolcyn apolcyn changed the title Add a backup poller to c-ares to take better advantage of timeout/retry logic Add a backup driver to c-ares resolver to take better advantage of timeout/retry logic Jan 15, 2019
@apolcyn apolcyn force-pushed the ares_timeouts branch 2 times, most recently from f251eb2 to 683f135 Compare January 23, 2019 07:31
@apolcyn
Copy link
Copy Markdown
Contributor Author

apolcyn commented Jan 23, 2019

@apolcyn apolcyn changed the title Add a backup driver to c-ares resolver to take better advantage of timeout/retry logic Add a backup driver to c-ares resolver to take better advantage of c-ares's query timeout/retry logic Apr 3, 2019
@apolcyn apolcyn changed the title Add a backup driver to c-ares resolver to take better advantage of c-ares's query timeout/retry logic Add a backup driver to c-ares resolver to take better advantage of c-ares query timeout/retry logic Apr 3, 2019
@apolcyn apolcyn changed the title Add a backup driver to c-ares resolver to take better advantage of c-ares query timeout/retry logic Add a backup poller to c-ares resolver to take better advantage of c-ares query timeout/retry logic Apr 4, 2019
@apolcyn
Copy link
Copy Markdown
Contributor Author

apolcyn commented Apr 26, 2019

Friendly ping.

Also updated to add an automated version of the manual test described here.

@jtattermusch
Copy link
Copy Markdown
Contributor

Can we get this reviewed and merged? This seems to be fixing #18637 which has been failing regularly for a while.

Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay on this. The overall change looks good; all my comments are minor. Please let me know if you have any questions.

Reviewed 3 of 5 files at r1, 17 of 17 files at r2.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.cc, line 250 at r2 (raw file):

}

static void on_ares_backup_poll_alarm_locked(void* arg, grpc_error* error) {

Please add a comment here explaining some of what you currently describe in the PR description. Otherwise, it's not obvious to future readers why this is being done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc, line 179 at r2 (raw file):

        "RecvFrom called on fd:|%s|. Current read buf length:|%d|", GetName(),
        GRPC_SLICE_LENGTH(read_buf_));
    if (!read_buf_has_data_) {

Please let me know if I'm missing something here, but the changes in this file look unrelated to the rest of this PR. If so, maybe split them into a separate PR?


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 29 at r2 (raw file):

#include "src/core/lib/iomgr/resolve_address.h"

#define GRPC_DNS_ARES_DEFAULT_QUERY_TIMEOUT_MS 120000

Is this change intentional? 2 minutes seems like a very long time.


test/core/iomgr/resolve_address_test.cc, line 86 at r2 (raw file):

static void poll_pollset_until_request_done(args_struct* args) {
  grpc_core::ExecCtx exec_ctx;
  // Try to give enough time for c-ares to run through it's retries

s/it's/its/


test/core/iomgr/resolve_address_test.cc, line 379 at r2 (raw file):

    test_unparseable_hostports();
    if (gpr_stricmp(resolver_type, "ares") == 0) {
      // This behavior expecation is specific to c-ares.

s/expecation/expectation/


test/cpp/naming/dns_test_util.cc, line 27 at r2 (raw file):

#include "test/cpp/naming/dns_test_util.h"

#ifdef GPR_WINDOWS

Do we already have individual tests that use these kind of compile-time conditionals for portability? If not, we should probably not start doing this now. Our general rule (which we're already bending somewhat for the c-ares resolver) has been that the only directories we allow this in are iomgr, gpr, and gprpp.


test/cpp/naming/dns_test_util.cc, line 70 at r2 (raw file):

  int val = 1;
  if (setsockopt(tcp_socket_, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val)) !=
      0) {

Looks like the only difference between this case and the Windows case is the return value. Maybe use the same approach we're using for BAD_SOCKET_RETURN_VAL here, to avoid duplication?


test/cpp/naming/gen_build_yaml.py, line 125 at r2 (raw file):

      ] + [
          {
          'name': 'cancel_ares_query_test',

Why is this going away?


test/cpp/naming/resolver_component_test.cc, line 562 at r2 (raw file):

    abort();
  }
  if (FLAGS_local_dns_server_address != "") {

Why is this going away?


test/cpp/naming/resolver_component_test.cc, line 613 at r2 (raw file):

                                         grpc_combiner_scheduler(args.lock)),
                     GRPC_ERROR_NONE);
  resolver->StartLocked();

This line should presumably go away, now that we're bouncing into the combiner to do this.

Copy link
Copy Markdown
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 11 of 20 files reviewed, 10 unresolved discussions (waiting on @AspirinSJL and @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.cc, line 250 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add a comment here explaining some of what you currently describe in the PR description. Otherwise, it's not obvious to future readers why this is being done.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc, line 179 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please let me know if I'm missing something here, but the changes in this file look unrelated to the rest of this PR. If so, maybe split them into a separate PR?

This was subtle, but it's actually needed for Windows to continue to function with this PR.

The problem stems from the fact that currently, the windows recv_from override detects a presence or absence of received data based only on the length of read_buf_.

read_buf_ has non-zero length in two situations:

  1. After calling RegisterForOnReadable, we alloc a blank slice for read_buf_ for the purpose of reading data into through the Windows RecvFrom API
  2. After we getting an I/O event and read data into the fd, read_buf_ is filled with the data received on the socket

We should only be considering the GrpcPolledFd to be "ready" in situation 2), but the current code before this PR also considers it "ready" in situation 1)

This is actually a bug even in the existing code, but it's just very hard to contrive a way to trigger it. By introducing the backup poller, tests immediately start to hit this bug, and so it needed to be fixed for this PR.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 29 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Is this change intentional? 2 minutes seems like a very long time.

Yeah this is intended. My idea here is to basically let the c-ares library take control over the timeout logic and not interfere with it, and let this timeout on the overall name resolution process just be a hard, worst case scenario sort of thing that shouldn't ever be reached in practice.

By default, c-ares times out individual queries by starting out with a 5 second timeout, and then increasing that exponentially on subsequent retries, with some amount of jitter. By default there's a limit of 4 retries per query. So on default settings (these settings can be overridden by environment variables that c-ares respects), that sums to ~75 seconds. 2 minutes is just rounding up from there, but I could also see shortening to e.g. 80 or 90 seconds.

I made a comment on the channel arg to explain that this is only an upper bound, and that the actual DNS query timeout may be shorter than this due to c-ares library internal timeout logic.

Please let me know....


test/core/iomgr/resolve_address_test.cc, line 86 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/it's/its/

Done.


test/core/iomgr/resolve_address_test.cc, line 379 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/expecation/expectation/

Done.


test/cpp/naming/dns_test_util.cc, line 27 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Do we already have individual tests that use these kind of compile-time conditionals for portability? If not, we should probably not start doing this now. Our general rule (which we're already bending somewhat for the c-ares resolver) has been that the only directories we allow this in are iomgr, gpr, and gprpp.

Mostly all of the c-ares tests already do this: e.g. address_sorting_test.cc, cancel_ares_query_test.cc, resolver_component_test.cc

FWIW I've seen some other tests e.g. with GPR_WINDOWS macro logic:
end2end/fixtures/h2_full+trace.cc
end2end/fixtures/h2_sockpair+trace.cc


test/cpp/naming/dns_test_util.cc, line 70 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Looks like the only difference between this case and the Windows case is the return value. Maybe use the same approach we're using for BAD_SOCKET_RETURN_VAL here, to avoid duplication?

The other thing is the type of val: Windows takes a const char*, and linux takes a void* over an int..... IMO getting this to be portable is more trouble than it's worth (plus I think it's not possible because of ABI differences).


test/cpp/naming/gen_build_yaml.py, line 125 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why is this going away?

Whoops, good catch. Sorry this was accidentally removed and not cleaned up for the PR diff.


test/cpp/naming/resolver_component_test.cc, line 562 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why is this going away?

This log is effectively just moved to within the } else if (FLAGS_inject_broken_nameserver_list == "False") { branch of the RunResolvesRelevantRecordsTest


test/cpp/naming/resolver_component_test.cc, line 613 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This line should presumably go away, now that we're bouncing into the combiner to do this.

Whoops, fixed

Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!

Reviewed 9 of 10 files at r3.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @AspirinSJL)

@apolcyn
Copy link
Copy Markdown
Contributor Author

apolcyn commented May 8, 2019

Thanks for review. Rebased/squashed down commits.

@apolcyn
Copy link
Copy Markdown
Contributor Author

apolcyn commented May 8, 2019

basic c/c++ macos: #18968
basic objc macos: #18997
basic php macos: #18968
basic python macos: #18968
basic ruby macos: #18968
interop cloud to cloud: #18967
interop cloud to prod: #18967

@apolcyn apolcyn merged commit 9278411 into grpc:master May 8, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Aug 6, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

lang/core release notes: yes Indicates if PR needs to be in release notes

Projects

None yet

3 participants