Normalize retry/timeouts for kubernetes installation#385
Normalize retry/timeouts for kubernetes installation#385ArangoGutierrez merged 3 commits intoNVIDIA:mainfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR standardizes retry counts and timeouts across Kubernetes provisioning steps to normalize total wait durations and improve transient failure handling. Key changes include:
- Increased delay parameters for
with_retrycalls in the Kubernetes template - Refactored
with_retryimplementation to capture output and detect transient errors - Aligned retry count and interval in Go’s
waitForNodeRebootto match total timeout
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/provisioner/templates/kubernetes.go | Updated all with_retry calls from 20s to 30s, added ` |
| pkg/provisioner/templates/common.go | Rewrote with_retry loop: renamed counter, captured command output, and added transient error detection |
| pkg/provisioner/provisioner.go | Changed waitForNodeReboot from 30×10s retries to 10×30s |
Comments suppressed due to low confidence (2)
pkg/provisioner/templates/common.go:85
- Always returning
1when retries are exhausted loses the original exit code, which can be useful for diagnostics. Consider returning the last command’s exit code ($rc) instead.
return 1
pkg/provisioner/provisioner.go:100
- [nitpick] The loop variable
iis generic; consider renaming toattemptorretryCountto clarify its purpose in retry logic.
for i := 0; i < maxRetries; i++ {
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
There was a problem hiding this comment.
Pull Request Overview
This PR standardizes retry counts and timeouts for Kubernetes installation steps, adds an explicit API server availability check, and aligns node-reboot retry logic.
- Unifies retry intervals to 30s (up from 20s) across Calico installation and CRD waits
- Adds explicit
kubectl versionretry to ensure API server readiness - Wraps node taint/label commands in
with_retryand suppresses errors; adjusts reboot retry loop
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/provisioner/templates/kubernetes.go | Increased retry intervals, added API server wait, and applied retries with error suppression for taint/label steps |
| pkg/provisioner/provisioner.go | Reduced maxRetries to 10 and increased retryInterval to 30s for node-reboot polling |
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
There was a problem hiding this comment.
Pull Request Overview
This PR standardizes retry counts and timeouts for Kubernetes installation commands and improves version normalization logic.
- Increased retry intervals from 20s to 30s for kubectl commands and node reboot waits
- Added explicit wait for kube-apiserver availability
- Normalized Kubernetes and MicroK8s version handling, stripping leading “v” and defaulting when empty
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/provisioner/templates/kubernetes.go | Updated with_retry timeouts, added kube-apiserver wait, enhanced version templating for microk8s |
| pkg/provisioner/provisioner.go | Adjusted node reboot retry parameters (interval and count) |
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
shivakunv
left a comment
There was a problem hiding this comment.
LGTM . except for some copilot comments
Pull Request Test Coverage Report for Build 15389641778Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
No description provided.