test: Add simple retries for flaky Helm operations #11762
nebril merged 1 commit into cilium:master from
Conversation
|
Please set the appropriate release note label. |
|
test-me-please |
|
d170fcb01d3c5cf233808aeaa8889a86da380b01 affects K8s CI tests and they have passed |
|
test-me-please |
Force-pushed from 8c4f78f to eca7ff4
|
test-me-please |
Force-pushed from eca7ff4 to a197c04
|
retest-runtime |
pchaigno left a comment
I couldn't find whether Helm has a built-in retry mechanism, so 👍 on the first commit. docker pull does retry on connection breakages though (tried locally by blocking outgoing connections for a couple of minutes). I don't think the second commit is worth it; it's just likely to take more time whenever the connection is persistently down.
Great, thanks for confirming that. Will remove. |
This commit is an attempt to add retry logic to Helm operations in the Kubernetes test suite. Signed-off-by: Chris Tarazi <chris@isovalent.com>
Force-pushed from a197c04 to 67c6ed2
|
test-me-please Edit: K8s-1.11-Kernel-netnext provisioning failure |
|
retest-net-next Edit: K8s-1.11-Kernel-netnext provisioning failure |
|
Just for the record, can we clarify - is this most of the time due to chart repo availability? I've seen 500s occasionally. |
|
retest-net-next |
I have not personally seen these failures, but I was asked to help with this. If you have seen them and they manifest themselves as 500s, I'd be happy to clarify the PR or the commit. |
|
retest-4.19 |
|
@christarazi I've not seen issues in CI context actually, I just recall seeing occasional 500s when running |
|
@errordeveloper , @christarazi an example of this in CI context: https://jenkins.cilium.io/job/Ginkgo-CI-Tests-k8s1.11-Pipeline/453/testReport/junit/Suite-k8s-1/11/K8sUpdates_Tests_upgrade_and_downgrade_from_a_Cilium_stable_image_to_master/
|
|
@b3a-dev that is a DNS issue actually. If that is what we are seeing, there are bigger fish to fry (which is not really news to me)...
@errordeveloper Could you elaborate? That looks like a connectivity blip, with the DNS lookup simply being the first thing to hit it. The retry doesn't seem like a bad idea to handle that case.
For sure, retrying is a good fix in any case! I just mean that we should also stop using unreliable DNS providers; we should use either 1.1.1.1 or Google DNS.
Yeah, sure. I don't know what we currently use in VMs and on the hosts, but if it's something flaky, let's switch! |
|
I believe we use 8.8.8.8 (Google DNS) |
|
Also, CI has flaked many times on the Policy tests, and the failures are unrelated to this PR. So this is ready to merge, pending whether we want to reword the commit message, etc.
|
Failure in v1.7 backports due to the issue this PR fixes: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/19728/testReport/junit/Suite-k8s-1/17/K8sUpdates_Tests_upgrade_and_downgrade_from_a_Cilium_stable_image_to_master/ Marking as backport to v1.7
This PR adds retry logic to Helm operations. It is a low-risk, cheap attempt to help reduce flakes in the case of network errors such as timeouts.
See commit msgs.