
[c-ares] update version to 1.34.5#39508

Closed
apolcyn wants to merge 41 commits into grpc:master from apolcyn:update_ares

Conversation

@apolcyn
Contributor

@apolcyn apolcyn commented May 7, 2025

Fixes #39026

A few noteworthy changes here besides the version bump itself:

  • Reorder the way we pull in `ares.h` in various places, so that we make sure to pull in the necessary Windows headers ahead of time. In older c-ares, the `ares.h` public header was more self-contained w.r.t. Windows headers, including `winsock2.h`. E.g. see the [old way of inclusion](https://github.com/c-ares/c-ares/blob/6360e96b5cf8e5980c887ce58ef727e53d77243a/include/ares.h#L64) vs. the [new way of inclusion](https://github.com/c-ares/c-ares/blob/42ddbc14ec008e738fa44aa2c16e74cad93742c2/include/ares.h#L52). I'm not crazy about this fix but couldn't see a much cleaner way (alternatively, we could maybe set `CARES_HAVE_WINSOCK2_H` in `port_platform.h`)

  • Updated the hand-crafted `config_windows/ares_config.h` to set newly required build defs for the c-ares build itself to work on Windows

  • Fix the Ruby macOS cross-compilation build to correctly set the `SYSTEM` Makefile var to `Darwin` (so that we pull in `config_darwin/ares_config.h`; so far we've been getting away with the Linux build mode)

  • Change the Ruby macOS MINGW32 cross-compilation build to use our hand-crafted `config_windows/ares_config.h` header

  • Explicitly link `iphlpapi` on Windows. I think we used to pick this up during the c-ares library build from [this pragma](https://github.com/c-ares/c-ares/blob/6360e96b5cf8e5980c887ce58ef727e53d77243a/include/ares.h#L71), but now that's [hidden under a build define](https://github.com/c-ares/c-ares/blob/42ddbc14ec008e738fa44aa2c16e74cad93742c2/include/ares.h#L59)
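The include-reordering in the first bullet can be sketched roughly like this (an illustrative fragment, not the actual gRPC source; the exact guard and file are assumptions):

```c
/* Illustrative sketch only: on Windows, winsock2.h must now be included
 * before ares.h, since newer c-ares no longer pulls it in from its public
 * header unless CARES_HAVE_WINSOCK2_H is defined at build time. */
#ifdef _WIN32
#include <winsock2.h> /* must precede ares.h (and windows.h) */
#endif
#include <ares.h>
```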

@apolcyn apolcyn added the `release notes: yes` (indicates if PR needs to be in release notes) and `lang/core` labels May 7, 2025
@apolcyn apolcyn changed the title Update c-ares [c-ares] update version to 1.34.5 May 7, 2025
@apolcyn
Contributor Author

apolcyn commented May 8, 2025

New failure in Windows builds: a lot of missing macro definitions

Failure on macOS builds:

In file included from third_party/cares/cares/src/lib/ares_addrinfo2hostent.c:30:
In file included from third_party/cares/cares/src/lib/ares_private.h:36:
third_party/cares/cares/src/lib/ares_setup.h:112:12: fatal error: 'malloc.h' file not found

copybara-service bot pushed a commit that referenced this pull request Jun 10, 2025
…ts are complete

This is a slight simplification and bug fix intended to unblock c-ares 1.34.5 upgrade: #39508

Currently there seems to be a bug causing //test/cpp/naming:resolver_component_tests_runner_invoker@experiment=no_event_engine_dns to fail on Windows in that PR:

- c-ares opens fd 1 to some server
- c-ares opens fd 2 to query a new server b/c old one is not responding
- some answer comes in on fd 2 so c-ares loses interest on fd 1

We'll now shut down fd 1, which can have the effect of prematurely cancelling the overall request depending on the ordering of events (reading another query's answer on fd 2 vs. when we run the on_readable callback following shutdown of fd 1)

----
DO NOT SUBMIT. This PR is for testing purposes only. [cl/769335111](http://cl/769335111)

PiperOrigin-RevId: 769335111
copybara-service bot pushed a commit that referenced this pull request Jun 10, 2025
… complete

This is a slight simplification and bug fix intended to unblock c-ares 1.34.5 upgrade: #39508

The bug was seen on a Windows test with the iomgr c-ares resolver: //test/cpp/naming:resolver_component_tests_runner_invoker@experiment=no_event_engine_dns

But it *looks* like the same bug exists in both the iomgr and EE variants:

- c-ares opens fd 1 to some server
- c-ares opens fd 2 to query a new server b/c old one is not responding
- some answer comes in on fd 2 so c-ares loses interest on fd 1

We'll now shut down fd 1, which can have the effect of prematurely cancelling the overall request depending on the ordering of events (reading another query's answer on fd 2 vs. when we run the on_readable callback following shutdown of fd 1)

----
DO NOT SUBMIT. This PR is for testing purposes only. [cl/769335111](http://cl/769335111)

PiperOrigin-RevId: 769335111
copybara-service bot pushed a commit that referenced this pull request Jun 10, 2025
… complete

This is a slight simplification and bug fix intended to unblock c-ares 1.34.5 upgrade: #39508

The bug was seen on a Windows test with the iomgr c-ares resolver: //test/cpp/naming:resolver_component_tests_runner_invoker@experiment=no_event_engine_dns

But it *looks* like the same bug exists in both the iomgr and EE variants:

- c-ares opens fd 1 to some server
- c-ares opens fd 2 to query a new server b/c old one is not responding
- some answer comes in on fd 2 so c-ares loses interest on fd 1

We'll now shut down fd 1, which can have the effect of prematurely cancelling the overall request (see https://github.com/grpc/grpc/blob/06de3cc2ec8e73199a1be98f390cdabaedbb7364/src/core/lib/event_engine/ares_resolver.cc#L590) depending on the ordering of events (reading another query's answer on fd 2 vs. when we run the on_readable callback following shutdown of fd 1)

----
DO NOT SUBMIT. This PR is for testing purposes only. [cl/769335111](http://cl/769335111)

PiperOrigin-RevId: 769335111
copybara-service bot pushed a commit that referenced this pull request Jul 15, 2025
… complete

This is a slight simplification and bug fix intended to unblock c-ares 1.34.5 upgrade: #39508

The bug can be seen by running any of the iomgr c-ares resolver `resolver_component_tests_runner_invoker`: //test/cpp/naming:resolver_component_tests_runner_invoker@experiment=no_event_engine_dns

But it *looks* like the same bug exists in both the iomgr and EE variants:

1) c-ares initiates A and AAAA queries
2) c-ares opens fd 1 to some server that acts as a black hole
3) the query to server 1 times out, so c-ares opens fd 2 to query a new server b/c the old one is not responding

c-ares 1.34.5 will internally increment a `consec_failures` field on the connection object when queries on fd 1 time out (see https://github.com/c-ares/c-ares/blob/d3a507e920e7af18a5efb7f9f1d8044ed4750013/src/lib/ares_process.c#L670)

4) answer to one of the queries comes in on fd 2

After processing the answer, c-ares [checks for connections to close out](https://github.com/c-ares/c-ares/blob/d3a507e920e7af18a5efb7f9f1d8044ed4750013/src/lib/ares_process.c#L233) and closes connections that have `consec_failures > 0`, see https://github.com/c-ares/c-ares/blob/d3a507e920e7af18a5efb7f9f1d8044ed4750013/src/lib/ares_close_sockets.c#L119

5) Now, c-ares will not indicate any interest in fd 1 from `ares_getsock` APIs

So, we'll now shut down fd 1, which can have the effect of prematurely cancelling the overall request (see https://github.com/grpc/grpc/blob/06de3cc2ec8e73199a1be98f390cdabaedbb7364/src/core/lib/event_engine/ares_resolver.cc#L590) depending on the ordering of events (reading another query's answer on fd 2 vs. when we run the on_readable callback following shutdown of fd 1)

I guess this was not an issue before c-ares 1.34.5 b/c earlier c-ares did not have `consec_failures` tracking and so did not close sockets that had timed out earlier, after processing answers that came in on new fds, see https://github.com/c-ares/c-ares/blob/6360e96b5cf8e5980c887ce58ef727e53d77243a/src/lib/ares_process.c#L1528

----
DO NOT SUBMIT. This PR is for testing purposes only. [cl/769335111](http://cl/769335111)

PiperOrigin-RevId: 769335111
copybara-service bot pushed a commit that referenced this pull request Jul 15, 2025
… complete

This is a bug fix intended to unblock c-ares 1.34.5 upgrade: #39508

The bug can be seen by running any of the iomgr c-ares resolver `resolver_component_tests_runner_invoker`: //test/cpp/naming:resolver_component_tests_runner_invoker@experiment=no_event_engine_dns

In the general case, the bug works like this:

1) c-ares initiates A and AAAA queries
2) c-ares opens fd 1 to some server that acts as a black hole
3) the query to server 1 times out, so c-ares opens fd 2 to query a new server b/c the old one is not responding

c-ares 1.34.5 will internally increment a `consec_failures` field on the connection object when queries on fd 1 time out (see https://github.com/c-ares/c-ares/blob/d3a507e920e7af18a5efb7f9f1d8044ed4750013/src/lib/ares_process.c#L670)

4) answer to one of the queries comes in on fd 2

After processing the answer, c-ares [checks for connections to close out](https://github.com/c-ares/c-ares/blob/d3a507e920e7af18a5efb7f9f1d8044ed4750013/src/lib/ares_process.c#L233) and closes connections that have `consec_failures > 0`, see https://github.com/c-ares/c-ares/blob/d3a507e920e7af18a5efb7f9f1d8044ed4750013/src/lib/ares_close_sockets.c#L119

5) Now, c-ares will not indicate any interest in fd 1 from `ares_getsock` APIs

So, we'll now shut down fd 1, which can have the effect of prematurely cancelling the overall request (see https://github.com/grpc/grpc/blob/06de3cc2ec8e73199a1be98f390cdabaedbb7364/src/core/lib/event_engine/ares_resolver.cc#L590) depending on the ordering of events (reading another query's answer on fd 2 vs. when we run the on_readable callback following shutdown of fd 1)

I guess this was not an issue before c-ares 1.34.5 b/c earlier c-ares did not have `consec_failures` tracking and so did not close sockets that had timed out earlier, after processing answers that came in on new fds, see https://github.com/c-ares/c-ares/blob/6360e96b5cf8e5980c887ce58ef727e53d77243a/src/lib/ares_process.c#L1528

---

Note: EE c-ares resolver does not have this early-cancellation issue b/c it shuts down no-longer-interesting fds with an OK status: https://github.com/grpc/grpc/blob/79769b35d04535259592ac1b0a98d65f63203f06/src/core/lib/event_engine/ares_resolver.cc#L547

----
DO NOT SUBMIT. This PR is for testing purposes only. [cl/769335111](http://cl/769335111)

PiperOrigin-RevId: 769335111
copybara-service bot pushed a commit that referenced this pull request Jul 16, 2025
This is a bug fix intended to unblock c-ares 1.34.5 upgrade: #39508

The bug can be seen by running any of the iomgr c-ares resolver `resolver_component_tests_runner_invoker`: //test/cpp/naming:resolver_component_tests_runner_invoker@experiment=no_event_engine_dns

In the general case, the bug works like this:

1) c-ares initiates A and AAAA queries
2) c-ares opens fd 1 to some server that acts as a black hole
3) the query to server 1 times out, so c-ares opens fd 2 to query a new server b/c the old one is not responding

c-ares 1.34.5 will internally increment a `consec_failures` field on the connection object when queries on fd 1 time out (see https://github.com/c-ares/c-ares/blob/d3a507e920e7af18a5efb7f9f1d8044ed4750013/src/lib/ares_process.c#L670)

4) answer to one of the queries comes in on fd 2

After processing the answer, c-ares [checks for connections to close out](https://github.com/c-ares/c-ares/blob/d3a507e920e7af18a5efb7f9f1d8044ed4750013/src/lib/ares_process.c#L233) and closes connections that have `consec_failures > 0`, see https://github.com/c-ares/c-ares/blob/d3a507e920e7af18a5efb7f9f1d8044ed4750013/src/lib/ares_close_sockets.c#L119

5) Now, c-ares will not indicate any interest in fd 1 from `ares_getsock` APIs

So, we'll now shut down fd 1, which can have the effect of prematurely cancelling the overall request (see https://github.com/grpc/grpc/blob/06de3cc2ec8e73199a1be98f390cdabaedbb7364/src/core/lib/event_engine/ares_resolver.cc#L590) depending on the ordering of events (reading another query's answer on fd 2 vs. when we run the on_readable callback following shutdown of fd 1)

I guess this was not an issue before c-ares 1.34.5 b/c earlier c-ares did not have `consec_failures` tracking and so did not close sockets that had timed out earlier, after processing answers that came in on new fds, see https://github.com/c-ares/c-ares/blob/6360e96b5cf8e5980c887ce58ef727e53d77243a/src/lib/ares_process.c#L1528

---

Note: EE c-ares resolver does not have this early-cancellation issue b/c it shuts down no-longer-interesting fds with an OK status: https://github.com/grpc/grpc/blob/79769b35d04535259592ac1b0a98d65f63203f06/src/core/lib/event_engine/ares_resolver.cc#L547
PiperOrigin-RevId: 783612611
@apolcyn apolcyn marked this pull request as ready for review July 16, 2025 06:01
Contributor

@veblush veblush left a comment


Please update

bazel_dep(name = "c-ares", version = "1.19.1", repo_name = "com_github_cares_cares")

by following up bazelbuild/bazel-central-registry#5184.

Note that this can be done in a separate PR.

asheshvidyut pushed a commit to asheshvidyut/grpc that referenced this pull request Jul 22, 2025
Fixes grpc#39026

A few noteworthy changes here besides the version bump itself:

- Reorder the way we pull in `ares.h` in various places, so that we make sure to pull in necessary windows header ahead of time. In older c-ares, the `ares.h` public header was more self-contained w.r.t. windows headers including `winsock2.h`. E.g. see [old way of inclusion](https://github.com/c-ares/c-ares/blob/6360e96b5cf8e5980c887ce58ef727e53d77243a/include/ares.h#L64) vs. [new way of inclusion](https://github.com/c-ares/c-ares/blob/42ddbc14ec008e738fa44aa2c16e74cad93742c2/include/ares.h#L52). I'm not crazy about this fix here but couldn't see a much cleaner way (we could maybe set `CARES_HAVE_WINSOCK2_H` in `port_platform.h` alternatively)

- Updated the hand-crafted `config_windows/ares_config.h` to set newly required build defs for the c-ares build itself to work on Windows

- Fix ruby macos cross-compilation build to correctly set `SYSTEM` Makefile var to `Darwin` (so that we pull in `config_darwin/ares_config.h`, so far we've been getting away with the linux build mode)

- Change ruby macos MINGW32 cross-compilation build to use our hand crafted `config_windows/ares_config.h` header

- Explicitly link `iphlpapi` on Windows. I think we used to pick this up during c-ares library build from [this pragma](https://github.com/c-ares/c-ares/blob/6360e96b5cf8e5980c887ce58ef727e53d77243a/include/ares.h#L71), but now that's [hidden under build define](https://github.com/c-ares/c-ares/blob/42ddbc14ec008e738fa44aa2c16e74cad93742c2/include/ares.h#L59)

Closes grpc#39508

PiperOrigin-RevId: 783875246

Development

Successfully merging this pull request may close these issues.

Upgrade third_party/cares submodule

2 participants