Add a backup poller to c-ares resolver to take better advantage of c-ares query timeout/retry logic#17688
Conversation
6a6f5e8 to
9595ef5
Compare
b5b8faf to
d0bd0da
Compare
f251eb2 to
683f135
Compare
|
test failures:
|
|
Friendly ping. Also updated to add an automated version of the manual test described here. |
|
Can we get this reviewed and merged? This seems to be fixing #18637 which has been failing regularly for a while. |
markdroth
left a comment
There was a problem hiding this comment.
Sorry for the delay on this. The overall change looks good; all my comments are minor. Please let me know if you have any questions.
Reviewed 3 of 5 files at r1, 17 of 17 files at r2.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @apolcyn and @AspirinSJL)
src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.cc, line 250 at r2 (raw file):
} static void on_ares_backup_poll_alarm_locked(void* arg, grpc_error* error) {
Please add a comment here explaining some of what you currently describe in the PR description. Otherwise, it's not obvious to future readers why this is being done.
src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc, line 179 at r2 (raw file):
"RecvFrom called on fd:|%s|. Current read buf length:|%d|", GetName(), GRPC_SLICE_LENGTH(read_buf_)); if (!read_buf_has_data_) {
Please let me know if I'm missing something here, but the changes in this file look unrelated to the rest of this PR. If so, maybe split them into a separate PR?
src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 29 at r2 (raw file):
#include "src/core/lib/iomgr/resolve_address.h" #define GRPC_DNS_ARES_DEFAULT_QUERY_TIMEOUT_MS 120000
Is this change intentional? 2 minutes seems like a very long time.
test/core/iomgr/resolve_address_test.cc, line 86 at r2 (raw file):
static void poll_pollset_until_request_done(args_struct* args) { grpc_core::ExecCtx exec_ctx; // Try to give enough time for c-ares to run through it's retries
s/it's/its/
test/core/iomgr/resolve_address_test.cc, line 379 at r2 (raw file):
test_unparseable_hostports(); if (gpr_stricmp(resolver_type, "ares") == 0) { // This behavior expecation is specific to c-ares.
s/expecation/expectation/
test/cpp/naming/dns_test_util.cc, line 27 at r2 (raw file):
#include "test/cpp/naming/dns_test_util.h" #ifdef GPR_WINDOWS
Do we already have individual tests that use these kind of compile-time conditionals for portability? If not, we should probably not start doing this now. Our general rule (which we're already bending somewhat for the c-ares resolver) has been that the only directories we allow this in are iomgr, gpr, and gprpp.
test/cpp/naming/dns_test_util.cc, line 70 at r2 (raw file):
int val = 1; if (setsockopt(tcp_socket_, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val)) != 0) {
Looks like the only difference between this case and the Windows case is the return value. Maybe use the same approach we're using for BAD_SOCKET_RETURN_VAL here, to avoid duplication?
test/cpp/naming/gen_build_yaml.py, line 125 at r2 (raw file):
] + [ { 'name': 'cancel_ares_query_test',
Why is this going away?
test/cpp/naming/resolver_component_test.cc, line 562 at r2 (raw file):
abort(); } if (FLAGS_local_dns_server_address != "") {
Why is this going away?
test/cpp/naming/resolver_component_test.cc, line 613 at r2 (raw file):
grpc_combiner_scheduler(args.lock)), GRPC_ERROR_NONE); resolver->StartLocked();
This line should presumably go away, now that we're bouncing into the combiner to do this.
apolcyn
left a comment
There was a problem hiding this comment.
Reviewable status: 11 of 20 files reviewed, 10 unresolved discussions (waiting on @AspirinSJL and @markdroth)
src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.cc, line 250 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Please add a comment here explaining some of what you currently describe in the PR description. Otherwise, it's not obvious to future readers why this is being done.
Done.
src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_windows.cc, line 179 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Please let me know if I'm missing something here, but the changes in this file look unrelated to the rest of this PR. If so, maybe split them into a separate PR?
This was subtle, but it's actually needed for Windows to continue to function with this PR.
The problem stems from the fact that currently, the windows recv_from override detects a presence or absence of received data based only on the length of read_buf_.
read_buf_ has non-zero length in two situations:
- After calling
RegisterForOnReadable, we alloc a blank slice forread_buf_for the purpose of reading data into through the WindowsRecvFromAPI - After we getting an I/O event and read data into the fd,
read_buf_is filled with the data received on the socket
We should only be considering the GrpcPolledFd to be "ready" in situation 2), but the current code before this PR also considers it "ready" in situation 1)
This is actually a bug even in the existing code, but it's just very hard to contrive a way to trigger it. By introducing the backup poller, tests immediately start to hit this bug, and so it needed to be fixed for this PR.
src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 29 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Is this change intentional? 2 minutes seems like a very long time.
Yeah this is intended. My idea here is to basically let the c-ares library take control over the timeout logic and not interfere with it, and let this timeout on the overall name resolution process just be a hard, worst case scenario sort of thing that shouldn't ever be reached in practice.
By default, c-ares times out individual queries by starting out with a 5 second timeout, and then increasing that exponentially on subsequent retries, with some amount of jitter. By default there's a limit of 4 retries per query. So on default settings (these settings can be overridden by environment variables that c-ares respects), that sums to ~75 seconds. 2 minutes is just rounding up from there, but I could also see shortening to e.g. 80 or 90 seconds.
I made a comment on the channel arg to explain that this is only an upper bound, and that the actual DNS query timeout may be shorter than this due to c-ares library internal timeout logic.
Please let me know....
test/core/iomgr/resolve_address_test.cc, line 86 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
s/it's/its/
Done.
test/core/iomgr/resolve_address_test.cc, line 379 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
s/expecation/expectation/
Done.
test/cpp/naming/dns_test_util.cc, line 27 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Do we already have individual tests that use these kind of compile-time conditionals for portability? If not, we should probably not start doing this now. Our general rule (which we're already bending somewhat for the c-ares resolver) has been that the only directories we allow this in are iomgr, gpr, and gprpp.
Mostly all of the c-ares tests already do this: e.g. address_sorting_test.cc, cancel_ares_query_test.cc, resolver_component_test.cc
FWIW I've seen some other tests e.g. with GPR_WINDOWS macro logic:
end2end/fixtures/h2_full+trace.cc
end2end/fixtures/h2_sockpair+trace.cc
test/cpp/naming/dns_test_util.cc, line 70 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Looks like the only difference between this case and the Windows case is the return value. Maybe use the same approach we're using for
BAD_SOCKET_RETURN_VALhere, to avoid duplication?
The other thing is the type of val: Windows takes a const char*, and linux takes a void* over an int..... IMO getting this to be portable is more trouble than it's worth (plus I think it's not possible because of ABI differences).
test/cpp/naming/gen_build_yaml.py, line 125 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Why is this going away?
Whoops, good catch. Sorry this was accidentally removed and not cleaned up for the PR diff.
test/cpp/naming/resolver_component_test.cc, line 562 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Why is this going away?
This log is effectively just moved to within the } else if (FLAGS_inject_broken_nameserver_list == "False") { branch of the RunResolvesRelevantRecordsTest
test/cpp/naming/resolver_component_test.cc, line 613 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This line should presumably go away, now that we're bouncing into the combiner to do this.
Whoops, fixed
markdroth
left a comment
There was a problem hiding this comment.
This looks great!
Reviewed 9 of 10 files at r3.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @AspirinSJL)
|
Thanks for review. Rebased/squashed down commits. |
Follows the suggestion in #17436 (comment)
Fixes #18637
In case of non-responsive DNS servers, dropped packets, etc., c-ares has intelligent timeout and retry logic, which we can take better advantage of by polling
ares_process_fdon time intervals. Currently, we already are kind of using c-ares timeouts, but by not polling on either a steady interval or on one given byares_timeout(), we're partially snuffing c-ares's internal timeout/retry logic - i.e. we happen to check for timeouts only when fd events happen, but we should allow c-ares to check for timeouts even when they don't.The change in this PR makes the following manual test pass, where before it failed:
/etc/resolv.conffile to addnameserver <random ipv4 address>at the top of yournameserverlist. For example, edit your/etc/resolv.conffile so that it changes from something like this:to:
After 10 seconds of hanging (during DNS resolution to the non-responsive nameserver at the top of the list), it will fail with:
After applying the change in this PR, the test passes, because c-ares will time out the queries to the non-responsive DNS server (by default, c-ares sets query timeouts starting at 5 seconds, and then increases that exponentially). The backup poller added here will give c-ares a chance to transition over to the next nameserver in the list and thus be able to carry out resolution successfully.
Also, I think that this will fix #18126. I was able to capture the test flake in that issue with
cares_resolvertracing on, and the saw this log:In particular, in that log, the jump between 01:34:30 and 01:34:40:
... would be caused by the 10 second overall resolution timeout being triggered. Therefore, it seems that test flakes in that issue are likely due to transient failure to deliver DNS queries/responses (DNS queries always inherently risk being dropped anyways since UDP).
FTR it shouldn't block this change because it's pre-existing, but one thing that work on this change made brought to light is this c-ares issue: https://c-ares.haxx.se/mail/c-ares-archive-2015-03/0009.shtml
edit: I think that this will fix #18126, not necessarily #18411
This change is