
test: Add simple retries for flaky Helm operations #11762

Merged

nebril merged 1 commit into cilium:master from christarazi:pr/christarazi/add-retries-helm-k8s-tests on Jun 2, 2020

Conversation

@christarazi (Member) commented May 28, 2020

This PR adds retry logic to Helm operations. It is a low-risk, cheap attempt to reduce flakes caused by network errors such as timeouts.

See the commit messages.
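
A minimal sketch of the kind of retry this PR describes, written here as a standalone Go program; runHelmWithRetry, the attempt count, and the backoff are illustrative stand-ins, not the test suite's actual helpers:

```go
// Hypothetical sketch (not from this PR): retry a helm invocation a few
// times to absorb transient network errors such as timeouts.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runHelmWithRetry runs `helm` with the given arguments, retrying up to
// maxAttempts times with a fixed backoff between attempts.
func runHelmWithRetry(args ...string) error {
	const (
		maxAttempts = 3
		backoff     = 5 * time.Second
	)
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err := exec.Command("helm", args...).CombinedOutput()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("attempt %d/%d: %w: %s", attempt, maxAttempts, err, out)
		time.Sleep(backoff)
	}
	return lastErr
}

func main() {
	if err := runHelmWithRetry("repo", "add", "cilium", "https://helm.cilium.io"); err != nil {
		fmt.Println("helm failed after retries:", err)
	}
}
```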

@christarazi added the area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!), and release-note/ci (This PR makes changes to the CI.) labels on May 28, 2020
@maintainer-s-little-helper

Please set the appropriate release note label.

@christarazi (Member, Author)

test-me-please

@coveralls commented May 28, 2020

Coverage Status

Coverage increased (+0.01%) to 36.918% when pulling 67c6ed2 on christarazi:pr/christarazi/add-retries-helm-k8s-tests into b7be1c0 on cilium:master.

@christarazi (Member, Author) commented May 28, 2020

Commit d170fcb01d3c5cf233808aeaa8889a86da380b01 affects the K8s CI tests, and they have passed.

@christarazi changed the title from "test: Wrap helm inside Eventually clauses" to "test: Add simple retries for common flakey operations" on May 28, 2020
@christarazi (Member, Author)

test-me-please

@christarazi force-pushed the pr/christarazi/add-retries-helm-k8s-tests branch from 8c4f78f to eca7ff4 on May 28, 2020 at 22:50
@christarazi (Member, Author)

test-me-please

@christarazi force-pushed the pr/christarazi/add-retries-helm-k8s-tests branch from eca7ff4 to a197c04 on May 29, 2020 at 04:58
@christarazi (Member, Author)

retest-runtime

@pchaigno (Member) left a comment


I couldn't find whether Helm has a built-in retry mechanism, so 👍 on the first commit. docker pull does retry on connection breakages, though (I tried locally by blocking outgoing connections for a couple of minutes). I don't think the second commit is worth it; it's just likely to take more time whenever the connection is persistently down.

@christarazi (Member, Author)

> I couldn't find whether Helm has a built-in retry mechanism, so +1 on the first commit. docker pull does retry on connection breakages, though (I tried locally by blocking outgoing connections for a couple of minutes). I don't think the second commit is worth it; it's just likely to take more time whenever the connection is persistently down.

Great, thanks for confirming that. Will remove.

@christarazi changed the title from "test: Add simple retries for common flakey operations" to "test: Add simple retries for common flaky operations" on May 29, 2020
This commit is an attempt to add retry logic to Helm operations in the
Kubernetes test suite.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi force-pushed the pr/christarazi/add-retries-helm-k8s-tests branch from a197c04 to 67c6ed2 on June 1, 2020 at 05:00
@christarazi (Member, Author) commented Jun 1, 2020

test-me-please

Edit: K8s-1.11-Kernel-netnext provisioning failure

@christarazi changed the title from "test: Add simple retries for common flaky operations" to "test: Add simple retries for flaky Helm operations" on Jun 1, 2020
@christarazi (Member, Author) commented Jun 1, 2020

retest-net-next

Edit: K8s-1.11-Kernel-netnext provisioning failure

@christarazi marked this pull request as ready for review on June 1, 2020 at 07:22
@christarazi requested a review from a team as a code owner on June 1, 2020 at 07:22
@errordeveloper (Contributor)

Just for the record, can we clarify: is this mostly due to chart repo availability? I've seen 500s occasionally.

@christarazi (Member, Author)

retest-net-next

@christarazi (Member, Author)

> Just for the record, can we clarify: is this mostly due to chart repo availability? I've seen 500s occasionally.

I have not personally seen these failures, but I was asked to help with this. If you have seen them and they manifest themselves as 500s, I'd be happy to clarify the PR or the commit.

@christarazi (Member, Author)

retest-4.19

@errordeveloper (Contributor)

@christarazi I've not actually seen issues in a CI context; I just recall seeing occasional 500s when running helm repo add, and I know that GitHub Pages can be a little unreliable at times.

@b3a-dev (Contributor) commented Jun 1, 2020

@errordeveloper , @christarazi an example of this in CI context: https://jenkins.cilium.io/job/Ginkgo-CI-Tests-k8s1.11-Pipeline/453/testReport/junit/Suite-k8s-1/11/K8sUpdates_Tests_upgrade_and_downgrade_from_a_Cilium_stable_image_to_master/

Stderr: Error: looks like "https://helm.cilium.io" is not a valid chart repository or cannot be reached: Get https://helm.cilium.io/index.yaml: dial tcp: lookup helm.cilium.io on 147.75.207.208:53: read udp 147.75.69.147:52976->147.75.207.208:53: i/o timeout

@errordeveloper (Contributor)

@b3a-dev that is actually a DNS issue; if that is what we are seeing, there are bigger fish to fry (which is not really news to me)...

@pchaigno (Member) commented Jun 1, 2020

> @b3a-dev that is actually a DNS issue; if that is what we are seeing, there are bigger fish to fry (which is not really news to me)...

@errordeveloper Could you elaborate? That looks like a connectivity blip, with the DNS lookup simply being the first thing to hit it. The retry doesn't seem like such a bad idea to handle that case.
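
For illustration: the failure quoted above is a DNS i/o timeout, which Go surfaces as a net.Error whose Timeout() method returns true, precisely the class of transient error a blind retry can absorb. A minimal sketch, with isTransient as a hypothetical helper:

```go
// Hypothetical sketch (not from this PR): classify a DNS lookup failure
// as transient so a retry loop knows it is worth another attempt.
package main

import (
	"errors"
	"fmt"
	"net"
)

// isTransient reports whether err is a network timeout, such as the
// "read udp ... i/o timeout" DNS failure quoted in this thread.
func isTransient(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	_, err := net.LookupHost("helm.cilium.io")
	if err != nil {
		fmt.Println("lookup failed; transient:", isTransient(err))
	}
}
```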

@errordeveloper (Contributor)

> @errordeveloper Could you elaborate? That looks like a connectivity blip, with the DNS lookup simply being the first thing to hit it. The retry doesn't seem like such a bad idea to handle that case.

For sure, retrying is a good fix in any case! I just mean that we should also stop using unreliable DNS providers; we should use either 1.1.1.1 or Google DNS.

@pchaigno (Member) commented Jun 1, 2020

> I just mean that we should also stop using unreliable DNS providers; we should use either 1.1.1.1 or Google DNS.

Yeah, sure. I don't know what we currently use in VMs and on the hosts, but if it's something flaky, let's switch!

@christarazi (Member, Author)

I believe we use 8.8.8.8 (Google DNS).
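
For illustration only: the thread here is about the VM/host resolver configuration, but a Go test harness could also pin lookups to a specific public resolver. A minimal sketch, assuming 1.1.1.1 as suggested above:

```go
// Hypothetical sketch (not from this PR): pin DNS lookups in a Go program
// to a specific public resolver instead of the host's /etc/resolv.conf.
package main

import (
	"context"
	"fmt"
	"net"
)

func main() {
	resolver := &net.Resolver{
		PreferGo: true, // use Go's resolver so the custom Dial below is honored
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			var d net.Dialer
			// Ignore the resolver's default address; always ask 1.1.1.1.
			return d.DialContext(ctx, network, "1.1.1.1:53")
		},
	}
	addrs, err := resolver.LookupHost(context.Background(), "helm.cilium.io")
	fmt.Println(addrs, err)
}
```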

@christarazi (Member, Author)

Also, CI has flaked many times on the Policy tests, and those failures are unrelated to this PR. So this is ready to merge, pending whether we want to reword the commit message.

@nebril (Member) left a comment


LGTM, thanks!

@nebril merged commit 7d26df1 into cilium:master on Jun 2, 2020
@christarazi deleted the pr/christarazi/add-retries-helm-k8s-tests branch on June 2, 2020 at 17:07

Labels

area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!), release-note/ci (This PR makes changes to the CI.)
