[DEBUG] proxy_test: evaluate rtt value for DNS requests #12298

Closed
qmonnet wants to merge 1 commit into cilium:master from qmonnet:test/qmonnet/privileged_tests_dnsproxy

@qmonnet qmonnet commented Jun 26, 2020

In an attempt to debug a flake in the runtime privileged tests where the DNS request times out, run the relevant test many times (10^6) and observe the RTT to see whether it sometimes exceeds the timeout currently used for the server (100ms).

On my local setup, about 1 run in 20k takes more than 10ms and 1 in 1M takes more than 30ms, so 100ms seems more than enough... But let's see what Jenkins says.

Tweak test files and Makefile to run just the function we want, and make sure we fail at the end so we get logs.

@qmonnet qmonnet added dont-merge/preview-only Only for preview or testing, don't merge it. release-note/misc This PR makes changes that have no direct user impact. labels Jun 26, 2020
qmonnet commented Jun 26, 2020

test-only --focus="RuntimePrivilegedUnitTests"
[So apparently this isn't supported for runtime tests yet 😢]

coveralls commented Jun 26, 2020

Coverage Status

Coverage increased (+0.02%) to 36.942% when pulling d6391e8 on qmonnet:test/qmonnet/privileged_tests_dnsproxy into 9dfdbda on cilium:master.

qmonnet commented Jun 26, 2020

test-focus RuntimePrivilegedUnitTests

In an attempt to debug a flake in the runtime privileged tests where the
DNS request times out, run the relevant test many times (10^6) and
observe the RTT to see whether it sometimes exceeds the timeout
currently used for the server (100ms).

On my local setup, about 1 run in 20k takes more than 10ms and 1 in 1M
takes more than 30ms, so 100ms seems more than enough... But let's see
what Jenkins says.

Tweak test files and Makefile to run just the function we want, and make
sure we fail at the end so we get logs.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
@qmonnet qmonnet force-pushed the test/qmonnet/privileged_tests_dnsproxy branch from 459032c to d6391e8 Compare June 26, 2020 10:40
qmonnet commented Jun 26, 2020

test-focus RuntimePrivilegedUnitTests

https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated-Focus/319/
cbaab92c_RuntimePrivilegedUnitTests_Run_Tests.zip

Tried to run the test a million times; it stopped at 790k, having exhausted the available memory. Under heavy memory pressure, the RTT frequently rose to 100ms+, and up to 296ms.

qmonnet added a commit that referenced this pull request Jun 26, 2020
Under heavy load, the round-trip-time (RTT) for DNS requests between a
TCP client and a DNS proxy may exceed the 100ms timeout specified when
creating the client in the dnsproxy tests.

This was observed on the test PR #12298, with an RTT value going up to
296ms (under exceptional memory strain).

This might be the cause of the rare flakes reported in #12042. Let's
increase this timeout. The timeout is only used a couple of times in the
tests, so increasing it by a few hundred milliseconds would have no
visible impact. And because we expect all requests from the TCP client
to succeed at L4 anyway (i.e. the client should never time out in our
tests), this should not prolong test execution at all in the normal
case.

Let's also retrieve and print the RTT value for that request in case of
error, to get more information if this change is not enough to fix the
flake.

Hopefully fixes: #12042
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
aanm pushed a commit that referenced this pull request Jun 29, 2020
@qmonnet qmonnet closed this Jun 29, 2020
@qmonnet qmonnet deleted the test/qmonnet/privileged_tests_dnsproxy branch June 29, 2020 12:19
christarazi pushed a commit that referenced this pull request Jun 30, 2020
[ upstream commit c93459e ]

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
joestringer pushed a commit that referenced this pull request Jun 30, 2020
[ upstream commit c93459e ]

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>