
test: Add simple retries for flaky Helm operations #11762

Merged

nebril merged 1 commit into cilium:master from christarazi:pr/christarazi/add-retries-helm-k8s-tests on Jun 2, 2020

Conversation

@christarazi (Member) commented May 28, 2020

This PR adds retry logic to Helm operations. It is a low-risk, cheap attempt to reduce flakes caused by network errors such as timeouts.

See the commit messages.
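
A minimal sketch of the kind of retry this PR describes, written here as a standalone Go program; runHelmWithRetry, the attempt count, and the backoff are illustrative stand-ins, not the test suite's actual helpers:

```go
// Hypothetical sketch (not from this PR): retry a helm invocation a few
// times to absorb transient network errors such as timeouts.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runHelmWithRetry runs `helm` with the given arguments, retrying up to
// maxAttempts times with a fixed backoff between attempts.
func runHelmWithRetry(args ...string) error {
	const (
		maxAttempts = 3
		backoff     = 5 * time.Second
	)
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err := exec.Command("helm", args...).CombinedOutput()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("attempt %d/%d: %w: %s", attempt, maxAttempts, err, out)
		time.Sleep(backoff)
	}
	return lastErr
}

func main() {
	if err := runHelmWithRetry("repo", "add", "cilium", "https://helm.cilium.io"); err != nil {
		fmt.Println("helm failed after retries:", err)
	}
}
```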

@christarazi added the area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!), and release-note/ci (This PR makes changes to the CI.) labels on May 28, 2020
@maintainer-s-little-helper

Please set the appropriate release note label.

@christarazi (Member, Author)

test-me-please

@coveralls commented May 28, 2020

Coverage Status

Coverage increased (+0.01%) to 36.918% when pulling 67c6ed2 on christarazi:pr/christarazi/add-retries-helm-k8s-tests into b7be1c0 on cilium:master.

@christarazi (Member, Author) commented May 28, 2020

Commit d170fcb01d3c5cf233808aeaa8889a86da380b01 affects the K8s CI tests, and they have passed.

@christarazi changed the title from "test: Wrap helm inside Eventually clauses" to "test: Add simple retries for common flakey operations" on May 28, 2020
@christarazi (Member, Author)

test-me-please

@christarazi force-pushed the pr/christarazi/add-retries-helm-k8s-tests branch from 8c4f78f to eca7ff4 on May 28, 2020 at 22:50
@christarazi (Member, Author)

test-me-please

@christarazi force-pushed the pr/christarazi/add-retries-helm-k8s-tests branch from eca7ff4 to a197c04 on May 29, 2020 at 04:58
@christarazi (Member, Author)

retest-runtime

@pchaigno (Member) left a comment


I couldn't find whether Helm has a built-in retry mechanism, so 👍 on the first commit. docker pull does retry on connection breakages, though (I tried locally by blocking outgoing connections for a couple of minutes). I don't think the second commit is worth it; it's just likely to take more time whenever the connection is persistently down.

@christarazi (Member, Author)

> I couldn't find whether Helm has a built-in retry mechanism, so +1 on the first commit. docker pull does retry on connection breakages, though (I tried locally by blocking outgoing connections for a couple of minutes). I don't think the second commit is worth it; it's just likely to take more time whenever the connection is persistently down.

Great, thanks for confirming that. Will remove.

@christarazi changed the title from "test: Add simple retries for common flakey operations" to "test: Add simple retries for common flaky operations" on May 29, 2020
This commit is an attempt to add retry logic to Helm operations in the
Kubernetes test suite.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi force-pushed the pr/christarazi/add-retries-helm-k8s-tests branch from a197c04 to 67c6ed2 on June 1, 2020 at 05:00
@christarazi (Member, Author) commented Jun 1, 2020

test-me-please

Edit: K8s-1.11-Kernel-netnext provisioning failure

@christarazi changed the title from "test: Add simple retries for common flaky operations" to "test: Add simple retries for flaky Helm operations" on Jun 1, 2020
@christarazi (Member, Author) commented Jun 1, 2020

retest-net-next

Edit: K8s-1.11-Kernel-netnext provisioning failure

@christarazi marked this pull request as ready for review on June 1, 2020 at 07:22
@christarazi requested a review from a team as a code owner on June 1, 2020 at 07:22
@errordeveloper (Contributor)

Just for the record, can we clarify: is this mostly due to chart repo availability? I've seen 500s occasionally.

@christarazi (Member, Author)

retest-net-next

@christarazi (Member, Author)

> Just for the record, can we clarify: is this mostly due to chart repo availability? I've seen 500s occasionally.

I have not personally seen these failures, but I was asked to help with this. If you have seen them and they manifest themselves as 500s, I'd be happy to clarify the PR or the commit.

@christarazi (Member, Author)

retest-4.19

@errordeveloper (Contributor)

@christarazi I've not actually seen issues in a CI context; I just recall seeing occasional 500s when running helm repo add, and I know that GitHub Pages can be a little unreliable at times.

@b3a-dev (Contributor) commented Jun 1, 2020

@errordeveloper , @christarazi an example of this in CI context: https://jenkins.cilium.io/job/Ginkgo-CI-Tests-k8s1.11-Pipeline/453/testReport/junit/Suite-k8s-1/11/K8sUpdates_Tests_upgrade_and_downgrade_from_a_Cilium_stable_image_to_master/

Stderr: Error: looks like "https://helm.cilium.io" is not a valid chart repository or cannot be reached: Get https://helm.cilium.io/index.yaml: dial tcp: lookup helm.cilium.io on 147.75.207.208:53: read udp 147.75.69.147:52976->147.75.207.208:53: i/o timeout

@errordeveloper (Contributor)

@b3a-dev that is actually a DNS issue; if that is what we are seeing, there are bigger fish to fry (which is not really news to me)...

@pchaigno (Member) commented Jun 1, 2020

> @b3a-dev that is actually a DNS issue; if that is what we are seeing, there are bigger fish to fry (which is not really news to me)...

@errordeveloper Could you elaborate? That looks like a connectivity blip, with the DNS lookup simply being the first thing to hit it. The retry doesn't seem like such a bad idea to handle that case.
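
For illustration: the failure quoted above is a DNS i/o timeout, which Go surfaces as a net.Error whose Timeout() method returns true, precisely the class of transient error a blind retry can absorb. A minimal sketch, with isTransient as a hypothetical helper:

```go
// Hypothetical sketch (not from this PR): classify a DNS lookup failure
// as transient so a retry loop knows it is worth another attempt.
package main

import (
	"errors"
	"fmt"
	"net"
)

// isTransient reports whether err is a network timeout, such as the
// "read udp ... i/o timeout" DNS failure quoted in this thread.
func isTransient(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	_, err := net.LookupHost("helm.cilium.io")
	if err != nil {
		fmt.Println("lookup failed; transient:", isTransient(err))
	}
}
```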

@errordeveloper (Contributor)

> @errordeveloper Could you elaborate? That looks like a connectivity blip, with the DNS lookup simply being the first thing to hit it. The retry doesn't seem like such a bad idea to handle that case.

For sure, retrying is a good fix in any case! I just mean that we should also stop using unreliable DNS providers; we should use either 1.1.1.1 or Google DNS.

@pchaigno (Member) commented Jun 1, 2020

> I just mean that we should also stop using unreliable DNS providers; we should use either 1.1.1.1 or Google DNS.

Yeah, sure. I don't know what we currently use in VMs and on the hosts, but if it's something flaky, let's switch!

@christarazi (Member, Author)

I believe we use 8.8.8.8 (Google DNS).
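
For illustration only: the thread here is about the VM/host resolver configuration, but a Go test harness could also pin lookups to a specific public resolver. A minimal sketch, assuming 1.1.1.1 as suggested above:

```go
// Hypothetical sketch (not from this PR): pin DNS lookups in a Go program
// to a specific public resolver instead of the host's /etc/resolv.conf.
package main

import (
	"context"
	"fmt"
	"net"
)

func main() {
	resolver := &net.Resolver{
		PreferGo: true, // use Go's resolver so the custom Dial below is honored
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			var d net.Dialer
			// Ignore the resolver's default address; always ask 1.1.1.1.
			return d.DialContext(ctx, network, "1.1.1.1:53")
		},
	}
	addrs, err := resolver.LookupHost(context.Background(), "helm.cilium.io")
	fmt.Println(addrs, err)
}
```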

@christarazi (Member, Author)

Also, CI has flaked many times on the Policy tests, and those failures are unrelated to this PR. So this is ready to merge, pending whether we want to reword the commit message.

@nebril (Member) left a comment


LGTM, thanks!

@nebril merged commit 7d26df1 into cilium:master on Jun 2, 2020
@christarazi deleted the pr/christarazi/add-retries-helm-k8s-tests branch on June 2, 2020 at 17:07

Labels

area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!), release-note/ci (This PR makes changes to the CI.)
