Skip to content

[c-ares DNS resolver] Fix file descriptor use-after-close bug when c-ares writes succeed but subsequent read fails#33871

Merged
apolcyn merged 47 commits intogrpc:masterfrom
apolcyn:fix_double_close
Jul 31, 2023
Merged

[c-ares DNS resolver] Fix file descriptor use-after-close bug when c-ares writes succeed but subsequent read fails#33871
apolcyn merged 47 commits intogrpc:masterfrom
apolcyn:fix_double_close

Conversation

@apolcyn
Copy link
Copy Markdown
Contributor

@apolcyn apolcyn commented Jul 25, 2023

Normally, c-ares related fds are destroyed after all DNS resolution is finished in this code path. Also there are some fds that c-ares may fail to open or write to initially, and c-ares will close them internally before grpc ever knows about them.

But if:

  1. c-ares opens a socket and successfully writes a request on it
  2. then a subsequent read fails

Then c-ares will close the fd in this code path, but gRPC will have a reference on the fd and will still use it afterwards.

Fix here is to leverage the c-ares socket-override API to properly track fd ownership between c-ares and grpc.

Related: internal issue b/292203138

@apolcyn apolcyn added the release notes: yes Indicates if PR needs to be in release notes label Jul 25, 2023
* - disable nagle */
grpc_error_handle err;
err = grpc_set_socket_nonblocking(fd, true);
if (!err.ok()) return -1;
Copy link
Copy Markdown
Member

@yijiem yijiem Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to close the socket when the set failed? Same below.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, looks like there is also this ares_set_socket_configure_callback which might be used for this purpose? And c-ares will call close socket when it returns -1: https://source.corp.google.com/piper///depot/google3/third_party/ares/src/lib/ares_process.c;l=1154;bpv=0;bpt=1;rcl=550417489

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed

Copy link
Copy Markdown
Contributor Author

@apolcyn apolcyn Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the ares_set_socket_configure_callback, thanks for pointing out - but I think it may be simpler to just do the configuration in here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend utilizing it since it seems to be specifically designed for this purpose. It would also call our version's of Close() (though it does not make any difference). But I wouldn't block on that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. And I agree now the config fallback does seem make this easier to read

Copy link
Copy Markdown
Member

@yijiem yijiem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix Alex!


// fds that are used/owned by grpc - we (grpc) will close them rather than
// c-ares
std::set<ares_socket_t> owned_fds_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe to use std::unordered_set<ares_socket_t> since it has on-average constant time complexity on search and insertion and also the ordering is not needed here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@apolcyn apolcyn changed the title [c-ares DNS resolver] Fix file descriptor double-close bug when c-ares writes succeed but subsequent read fails [c-ares DNS resolver] Fix file descriptor use-after-close bug when c-ares writes succeed but subsequent read fails Jul 26, 2023
@apolcyn
Copy link
Copy Markdown
Contributor Author

apolcyn commented Jul 26, 2023

Thanks for the pointer to use ARES_FLAG_USEVC @yijiem . The new test which uses that here alongside the socket use-after-close detector repro'd the bug ~90% with the fix removed.

@yijiem
Copy link
Copy Markdown
Member

yijiem commented Jul 26, 2023

Thanks for the pointer to use ARES_FLAG_USEVC @yijiem . The new test which uses that here alongside the socket use-after-close detector repro'd the bug ~100% with the fix removed.

Awesome! What's the repro'd error? It would be nice to link the error logs here. Thanks again!

Copy link
Copy Markdown
Member

@yijiem yijiem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test strategy looks great! Thanks!

~SocketUseAfterCloseDetector();

private:
std::thread* thread_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe use std::unique_ptr<std::thread> so that we don't need to manually delete it after join.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@apolcyn
Copy link
Copy Markdown
Contributor Author

apolcyn commented Jul 31, 2023

Awesome! What's the repro'd error? It would be nice to link the error logs here. Thanks again!

With the fix temporarily reverted the fix (e.g. by reverting src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver_posix.cc back to its state before this PR -- I happened to try commit 7c21997dbadaee392e6fa2a65602741eebf7d119), then I'm hitting a ~90% failure rate when running bazel test //test/cpp/naming:cancel_ares_query_test@poller=epoll1 --runs_per_test=100 locally:

For example:

bazel test //test/cpp/naming:cancel_ares_query_test@poller=epoll1 --runs_per_test=100
...
INFO: Elapsed time: 18.084s, Critical Path: 7.87s
INFO: 101 processes: 101 linux-sandbox.
INFO: Build completed, 1 test FAILED, 101 total actions
//test/cpp/naming:cancel_ares_query_test@poller=epoll1                   FAILED in 89 out of 100 in 7.7s
  Stats over 100 runs: max = 7.7s, min = 1.9s, avg = 3.6s, dev = 1.0s

Spot checking a couple of different failure logs, notice how SocketUseAfterCloseDetector is catching the bug:

One failure:

Executing tests from //test/cpp/naming:cancel_ares_query_test@poller=epoll1
-----------------------------------------------------------------------------
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from CancelDuringAresQuery
[ RUN      ] CancelDuringAresQuery.TestCancelActiveDNSQuery
[       OK ] CancelDuringAresQuery.TestCancelActiveDNSQuery (27 ms)
[ RUN      ] CancelDuringAresQuery.TestFdsAreDeletedFromPollsetSet
[       OK ] CancelDuringAresQuery.TestFdsAreDeletedFromPollsetSet (29 ms)
[ RUN      ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionIsGraceful
E0731 18:13:34.152610802      12 cq_verifier.cc:363]                   Verify tag(1)-✅ for 10000ms
[       OK ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionIsGraceful (141 ms)
[ RUN      ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionWithQueryTimeoutIsGraceful
E0731 18:13:34.289009177      12 cq_verifier.cc:363]                   Verify tag(1)-✅ for 10000ms
[       OK ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionWithQueryTimeoutIsGraceful (35 ms)
[ RUN      ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionWithZeroQueryTimeoutIsGraceful
E0731 18:13:34.320605985      12 cq_verifier.cc:363]                   Verify tag(1)-✅ for 10000ms
[       OK ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionWithZeroQueryTimeoutIsGraceful (135 ms)
[ RUN      ] CancelDuringAresQuery.TestQueryFailsBecauseTcpServerClosesSocket
E0731 18:13:34.461933496      12 cq_verifier.cc:363]                   Verify tag(1)-✅ for 10000ms
test/core/util/socket_use_after_close_detector.cc:164: Failure
Value of: (*__errno_location ()) == 11 || (*__errno_location ()) == 11
  Actual: false
Expected: true
OpenAndCloseSocketsStressLoop accept on socket 8 failed in an unexpected way. errno: 22. Socket use-after-close bugs are likely.
[  FAILED  ] CancelDuringAresQuery.TestQueryFailsBecauseTcpServerClosesSocket (79 ms)
[----------] 6 tests from CancelDuringAresQuery (449 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (452 ms total)
[  PASSED  ] 5 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CancelDuringAresQuery.TestQueryFailsBecauseTcpServerClosesSocket

 1 FAILED TEST

Another failure:

exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //test/cpp/naming:cancel_ares_query_test@poller=epoll1
-----------------------------------------------------------------------------
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from CancelDuringAresQuery
[ RUN      ] CancelDuringAresQuery.TestCancelActiveDNSQuery
[       OK ] CancelDuringAresQuery.TestCancelActiveDNSQuery (51 ms)
[ RUN      ] CancelDuringAresQuery.TestFdsAreDeletedFromPollsetSet
[       OK ] CancelDuringAresQuery.TestFdsAreDeletedFromPollsetSet (35 ms)
[ RUN      ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionIsGraceful
E0731 18:13:43.842740332      12 cq_verifier.cc:363]                   Verify tag(1)-✅ for 10000ms
[       OK ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionIsGraceful (145 ms)
[ RUN      ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionWithQueryTimeoutIsGraceful
E0731 18:13:43.986466251      12 cq_verifier.cc:363]                   Verify tag(1)-✅ for 10000ms
[       OK ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionWithQueryTimeoutIsGraceful (47 ms)
[ RUN      ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionWithZeroQueryTimeoutIsGraceful
E0731 18:13:44.039301241      12 cq_verifier.cc:363]                   Verify tag(1)-✅ for 10000ms
[       OK ] CancelDuringAresQuery.TestHitDeadlineAndDestroyChannelDuringAresResolutionWithZeroQueryTimeoutIsGraceful (142 ms)
[ RUN      ] CancelDuringAresQuery.TestQueryFailsBecauseTcpServerClosesSocket
E0731 18:13:44.173374582      12 cq_verifier.cc:363]                   Verify tag(1)-✅ for 10000ms
test/core/util/socket_use_after_close_detector.cc:164: Failure
Value of: (*__errno_location ()) == 11 || (*__errno_location ()) == 11
  Actual: false
Expected: true
OpenAndCloseSocketsStressLoop accept on socket 8 failed in an unexpected way. errno: 22. Socket use-after-close bugs are likely.
[  FAILED  ] CancelDuringAresQuery.TestQueryFailsBecauseTcpServerClosesSocket (76 ms)
[----------] 6 tests from CancelDuringAresQuery (498 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (503 ms total)
[  PASSED  ] 5 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CancelDuringAresQuery.TestQueryFailsBecauseTcpServerClosesSocket

 1 FAILED TEST

@apolcyn apolcyn merged commit 76203ba into grpc:master Jul 31, 2023
yijiem added a commit that referenced this pull request Jul 31, 2023
… when c-ares writes succeed but subsequent read fails (#33871)"

This reverts commit 76203ba.
ananda1066 added a commit that referenced this pull request Aug 1, 2023
… when c-ares writes succeed but subsequent read fails" (#33934)

Reverts #33871 due to build failures in google3.

Co-authored-by: Yijie Ma <yijiem@google.com>
@copybara-service copybara-service bot added the imported Specifies if the PR has been imported to the internal repository label Aug 1, 2023
apolcyn added a commit that referenced this pull request Aug 1, 2023
…descriptor use-after-close bug when c-ares writes succeed but subsequent read fails" (#33934)" (#33942)

Rolls forward #33871

Second and third commits here fix internal build issues

In particular, add a `// IWYU pragma: no_include <ares_build.h>` since
`ares.h` [includes that
anyways](https://github.com/c-ares/c-ares/blob/bad62225b7f6b278b92e8e85a255600b629ef517/include/ares.h#L23)
(and seems unlikely for that to change since it would be breaking)
yijiem added a commit that referenced this pull request Aug 11, 2023
Port #33871 to EE's
GrpcPolledFdFactoryPosix.

<!--

If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.

If your pull request is for a specific language, please add the
appropriate
lang label.

-->
eugeneo pushed a commit to eugeneo/grpc that referenced this pull request Aug 21, 2023
Port grpc#33871 to EE's
GrpcPolledFdFactoryPosix.

<!--

If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.

If your pull request is for a specific language, please add the
appropriate
lang label.

-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bloat/low imported Specifies if the PR has been imported to the internal repository lang/c++ lang/core per-call-memory/neutral per-channel-memory/neutral release notes: yes Indicates if PR needs to be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants