Skip to content

fqdn/dnsproxy/proxy_test: increase timeout for DNS TCP client exchanges#12305

Merged
aanm merged 1 commit intomasterfrom
pr/qmonnet/dnsproxy_test_timeout
Jun 29, 2020
Merged

fqdn/dnsproxy/proxy_test: increase timeout for DNS TCP client exchanges#12305
aanm merged 1 commit intomasterfrom
pr/qmonnet/dnsproxy_test_timeout

Conversation

@qmonnet
Copy link
Copy Markdown
Member

@qmonnet qmonnet commented Jun 26, 2020

Under heavy load, the round-trip-time (RTT) for DNS requests between a TCP client and a DNS proxy may exceed the 100ms timeout specified when creating the client in the dnsproxy tests.

This was observed on the test-PR #12298, with a RTT value going up to 296ms (under exceptional memory strain).

This might be the cause for the rare flakes reported in #12042. Let's increase this timeout. The timeout is only used a couple of times in the tests, so increasing it by a few hundred milliseconds would have no visible impact. And because we expect all requests from the TCP client to succeed on the L4 anyway (i.e. it should never time out in our tests), this should not prolong at all the execution of tests in the normal case.

Note that this timeout was at 1 second originally, and was reduced in commit 2b4badb ("dnsproxy: Correct tests to account for matchPattern generation. Reduce timeout time").

Let's also retrieve and print the RTT value for that request in case of error, to get more info if this change were not enough to fix the flake.

Hopefully fixes: #12042

Under heavy load, the round-trip-time (RTT) for DNS requests between a
TCP client and a DNS proxy may exceed the 100ms timeout specified when
creating the client in the dnsproxy tests.

This was observed on the test-PR #12298, with a RTT value going up to
296ms (under exceptional memory strain).

This might be the cause for the rare flakes reported in #12042. Let's
increase this timeout. The timeout is only used a couple of times in the
tests, so increasing it by a few hundred milliseconds would have no
visible impact. And because we expect all requests from the TCP client
to succeed on the L4 anyway (i.e. it should never time out in our
tests), this should not prolong at all the execution of tests in the
normal case.

Let's also retrieve and print the RTT value for that request in case of
error, to get more info if this change were not enough to fix the flake.

Hopefully fixes: #12042
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
@qmonnet qmonnet added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. needs-backport/1.7 ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Jun 26, 2020
@qmonnet qmonnet requested review from a team June 26, 2020 13:54
@qmonnet
Copy link
Copy Markdown
Member Author

qmonnet commented Jun 26, 2020

test-focus RuntimePrivilegedUnitTests

@coveralls
Copy link
Copy Markdown

coveralls commented Jun 26, 2020

Coverage Status

Coverage increased (+0.007%) to 36.944% when pulling ef6f7e0 on pr/qmonnet/dnsproxy_test_timeout into db88419 on master.

@qmonnet
Copy link
Copy Markdown
Member Author

qmonnet commented Jun 29, 2020

test-focus RuntimePrivilegedUnitTests

@aanm aanm merged commit c93459e into master Jun 29, 2020
@aanm aanm deleted the pr/qmonnet/dnsproxy_test_timeout branch June 29, 2020 11:56
qmonnet added a commit that referenced this pull request Jul 21, 2020
Follow-up to #12305, where we raised the timeout from 100ms to 500ms.
This seemed to suppress most of the flakes reported in #12042, but we
saw one again recently: Try restoring the timeout value to its original
value of 1 second.

Most of the time the RTT time for the exchange is way below 100ms anyway
and we won't have a difference on tests duration. In the worst and very
unlikely case where all DNS TCP exchanges are super-slow, we only have 5
exchanges in the tests and cannot spend more than a total 5 seconds on
them (or one would timeout and the test fail).

Fixes: #12042
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
rolinh pushed a commit that referenced this pull request Jul 23, 2020
Follow-up to #12305, where we raised the timeout from 100ms to 500ms.
This seemed to suppress most of the flakes reported in #12042, but we
saw one again recently: Try restoring the timeout value to its original
value of 1 second.

Most of the time the RTT time for the exchange is way below 100ms anyway
and we won't have a difference on tests duration. In the worst and very
unlikely case where all DNS TCP exchanges are super-slow, we only have 5
exchanges in the tests and cannot spend more than a total 5 seconds on
them (or one would timeout and the test fail).

Fixes: #12042
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
pchaigno pushed a commit that referenced this pull request Jul 23, 2020
[ upstream commit fff58ef ]

Follow-up to #12305, where we raised the timeout from 100ms to 500ms.
This seemed to suppress most of the flakes reported in #12042, but we
saw one again recently: Try restoring the timeout value to its original
value of 1 second.

Most of the time the RTT time for the exchange is way below 100ms anyway
and we won't have a difference on tests duration. In the worst and very
unlikely case where all DNS TCP exchanges are super-slow, we only have 5
exchanges in the tests and cannot spend more than a total 5 seconds on
them (or one would timeout and the test fail).

Fixes: #12042
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
christarazi pushed a commit that referenced this pull request Jul 23, 2020
[ upstream commit fff58ef ]

Follow-up to #12305, where we raised the timeout from 100ms to 500ms.
This seemed to suppress most of the flakes reported in #12042, but we
saw one again recently: Try restoring the timeout value to its original
value of 1 second.

Most of the time the RTT time for the exchange is way below 100ms anyway
and we won't have a difference on tests duration. In the worst and very
unlikely case where all DNS TCP exchanges are super-slow, we only have 5
exchanges in the tests and cannot spend more than a total 5 seconds on
them (or one would timeout and the test fail).

Fixes: #12042
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
qmonnet added a commit that referenced this pull request Jul 30, 2020
[ upstream commit fff58ef ]

Follow-up to #12305, where we raised the timeout from 100ms to 500ms.
This seemed to suppress most of the flakes reported in #12042, but we
saw one again recently: Try restoring the timeout value to its original
value of 1 second.

Most of the time the RTT time for the exchange is way below 100ms anyway
and we won't have a difference on tests duration. In the worst and very
unlikely case where all DNS TCP exchanges are super-slow, we only have 5
exchanges in the tests and cannot spend more than a total 5 seconds on
them (or one would timeout and the test fail).

Fixes: #12042
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
joestringer pushed a commit that referenced this pull request Jul 30, 2020
[ upstream commit fff58ef ]

Follow-up to #12305, where we raised the timeout from 100ms to 500ms.
This seemed to suppress most of the flakes reported in #12042, but we
saw one again recently: Try restoring the timeout value to its original
value of 1 second.

Most of the time the RTT time for the exchange is way below 100ms anyway
and we won't have a difference on tests duration. In the worst and very
unlikely case where all DNS TCP exchanges are super-slow, we only have 5
exchanges in the tests and cannot spend more than a total 5 seconds on
them (or one would timeout and the test fail).

Fixes: #12042
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! release-note/ci This PR makes changes to the CI.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI: RuntimePrivilegedUnitTests: timeout in DNSProxyTestSuite.TestRespondViaCorrectProtocol

5 participants