[DEBUG] proxy_test: evaluate rtt value for DNS requests#12298
Closed
qmonnet wants to merge 1 commit intocilium:masterfrom
Closed
[DEBUG] proxy_test: evaluate rtt value for DNS requests#12298qmonnet wants to merge 1 commit intocilium:masterfrom
qmonnet wants to merge 1 commit intocilium:masterfrom
Conversation
Member
Author
|
test-only --focus="RuntimePrivilegedUnitTests" |
Member
Author
|
test-focus RuntimePrivilegedUnitTests |
As an attempt to debug a flake in the runtime privileged tests where the DNS request times out, runs the relevant test many times (10^6) and observe the RTT duration to see if it sometimes get longer than the current timeout value used for the server (100ms). On my local setup I get 1 run out of 20k at more than 10ms, and 1 in 1M above 30ms, so the 100ms seems way enough... But let's see what Jenkins says. Tweak test files and Makefile to run just the function we want, and make sure we fail at the end so we get logs. Signed-off-by: Quentin Monnet <quentin@isovalent.com>
459032c to
d6391e8
Compare
Member
Author
|
test-focus RuntimePrivilegedUnitTests https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/319/ Tried to run the test a million time; It stopped at 790k, having exhausted the available memory. Under heavy memory pressure, timeout rose frequently to 100ms+, and up to 296ms. |
qmonnet
added a commit
that referenced
this pull request
Jun 26, 2020
Under heavy load, the round-trip-time (RTT) for DNS requests between a TCP client and a DNS proxy may exceed the 100ms timeout specified when creating the client in the dnsproxy tests. This was observed on the test-PR #12298, with a RTT value going up to 296ms (under exceptional memory strain). This might be the cause for the rare flakes reported in #12042. Let's increase this timeout. The timeout is only used a couple of times in the tests, so increasing it by a few hundred milliseconds would have no visible impact. And because we expect all requests from the TCP client to succeed on the L4 anyway (i.e. it should never time out in our tests), this should not prolong at all the execution of tests in the normal case. Let's also retrieve and print the RTT value for that request in case of error, to get more info if this change were not enough to fix the flake. Hopefully fixes: #12042 Signed-off-by: Quentin Monnet <quentin@isovalent.com>
aanm
pushed a commit
that referenced
this pull request
Jun 29, 2020
Under heavy load, the round-trip-time (RTT) for DNS requests between a TCP client and a DNS proxy may exceed the 100ms timeout specified when creating the client in the dnsproxy tests. This was observed on the test-PR #12298, with a RTT value going up to 296ms (under exceptional memory strain). This might be the cause for the rare flakes reported in #12042. Let's increase this timeout. The timeout is only used a couple of times in the tests, so increasing it by a few hundred milliseconds would have no visible impact. And because we expect all requests from the TCP client to succeed on the L4 anyway (i.e. it should never time out in our tests), this should not prolong at all the execution of tests in the normal case. Let's also retrieve and print the RTT value for that request in case of error, to get more info if this change were not enough to fix the flake. Hopefully fixes: #12042 Signed-off-by: Quentin Monnet <quentin@isovalent.com>
christarazi
pushed a commit
that referenced
this pull request
Jun 30, 2020
[ upstream commit c93459e ] Under heavy load, the round-trip-time (RTT) for DNS requests between a TCP client and a DNS proxy may exceed the 100ms timeout specified when creating the client in the dnsproxy tests. This was observed on the test-PR #12298, with a RTT value going up to 296ms (under exceptional memory strain). This might be the cause for the rare flakes reported in #12042. Let's increase this timeout. The timeout is only used a couple of times in the tests, so increasing it by a few hundred milliseconds would have no visible impact. And because we expect all requests from the TCP client to succeed on the L4 anyway (i.e. it should never time out in our tests), this should not prolong at all the execution of tests in the normal case. Let's also retrieve and print the RTT value for that request in case of error, to get more info if this change were not enough to fix the flake. Hopefully fixes: #12042 Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Chris Tarazi <chris@isovalent.com>
joestringer
pushed a commit
that referenced
this pull request
Jun 30, 2020
[ upstream commit c93459e ] Under heavy load, the round-trip-time (RTT) for DNS requests between a TCP client and a DNS proxy may exceed the 100ms timeout specified when creating the client in the dnsproxy tests. This was observed on the test-PR #12298, with a RTT value going up to 296ms (under exceptional memory strain). This might be the cause for the rare flakes reported in #12042. Let's increase this timeout. The timeout is only used a couple of times in the tests, so increasing it by a few hundred milliseconds would have no visible impact. And because we expect all requests from the TCP client to succeed on the L4 anyway (i.e. it should never time out in our tests), this should not prolong at all the execution of tests in the normal case. Let's also retrieve and print the RTT value for that request in case of error, to get more info if this change were not enough to fix the flake. Hopefully fixes: #12042 Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi
pushed a commit
that referenced
this pull request
Jun 30, 2020
[ upstream commit c93459e ] Under heavy load, the round-trip-time (RTT) for DNS requests between a TCP client and a DNS proxy may exceed the 100ms timeout specified when creating the client in the dnsproxy tests. This was observed on the test-PR #12298, with a RTT value going up to 296ms (under exceptional memory strain). This might be the cause for the rare flakes reported in #12042. Let's increase this timeout. The timeout is only used a couple of times in the tests, so increasing it by a few hundred milliseconds would have no visible impact. And because we expect all requests from the TCP client to succeed on the L4 anyway (i.e. it should never time out in our tests), this should not prolong at all the execution of tests in the normal case. Let's also retrieve and print the RTT value for that request in case of error, to get more info if this change were not enough to fix the flake. Hopefully fixes: #12042 Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Chris Tarazi <chris@isovalent.com>
joestringer
pushed a commit
that referenced
this pull request
Jun 30, 2020
[ upstream commit c93459e ] Under heavy load, the round-trip-time (RTT) for DNS requests between a TCP client and a DNS proxy may exceed the 100ms timeout specified when creating the client in the dnsproxy tests. This was observed on the test-PR #12298, with a RTT value going up to 296ms (under exceptional memory strain). This might be the cause for the rare flakes reported in #12042. Let's increase this timeout. The timeout is only used a couple of times in the tests, so increasing it by a few hundred milliseconds would have no visible impact. And because we expect all requests from the TCP client to succeed on the L4 anyway (i.e. it should never time out in our tests), this should not prolong at all the execution of tests in the normal case. Let's also retrieve and print the RTT value for that request in case of error, to get more info if this change were not enough to fix the flake. Hopefully fixes: #12042 Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Chris Tarazi <chris@isovalent.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
As an attempt to debug a flake in the runtime privileged tests where the DNS request times out, runs the relevant test many times (10^6) and observe the RTT duration to see if it sometimes get longer than the current timeout value used for the server (100ms).
On my local setup I get 1 run out of 20k at more than 10ms, and 1 in 1M above 30ms, so the 100ms seems way enough... But let's see what Jenkins says.
Tweak test files and Makefile to run just the function we want, and make sure we fail at the end so we get logs.