Skip to content

Normalize retry/timeouts for kubernetes installation#385

Merged
ArangoGutierrez merged 3 commits intoNVIDIA:mainfrom
ArangoGutierrez:v0212
Jun 2, 2025
Merged

Normalize retry/timeouts for kubernetes installation#385
ArangoGutierrez merged 3 commits intoNVIDIA:mainfrom
ArangoGutierrez:v0212

Conversation

@ArangoGutierrez
Copy link
Collaborator

No description provided.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR standardizes retry counts and timeouts across Kubernetes provisioning steps to normalize total wait durations and improve transient failure handling. Key changes include:

  • Increased delay parameters for with_retry calls in the Kubernetes template
  • Refactored with_retry implementation to capture output and detect transient errors
  • Aligned retry count and interval in Go’s waitForNodeReboot to match total timeout

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
pkg/provisioner/templates/kubernetes.go Updated all with_retry calls from 20s to 30s, added `
pkg/provisioner/templates/common.go Rewrote with_retry loop: renamed counter, captured command output, and added transient error detection
pkg/provisioner/provisioner.go Changed waitForNodeReboot from 30×10s retries to 10×30s
Comments suppressed due to low confidence (2)

pkg/provisioner/templates/common.go:85

  • Always returning 1 when retries are exhausted loses the original exit code, which can be useful for diagnostics. Consider returning the last command’s exit code ($rc) instead.
return 1

pkg/provisioner/provisioner.go:100

  • [nitpick] The loop variable i is generic; consider renaming to attempt or retryCount to clarify its purpose in retry logic.
for i := 0; i < maxRetries; i++ {

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR standardizes retry counts and timeouts for Kubernetes installation steps, adds an explicit API server availability check, and aligns node-reboot retry logic.

  • Unifies retry intervals to 30s (up from 20s) across Calico installation and CRD waits
  • Adds explicit kubectl version retry to ensure API server readiness
  • Wraps node taint/label commands in with_retry and suppresses errors; adjusts reboot retry loop

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/provisioner/templates/kubernetes.go Increased retry intervals, added API server wait, and applied retries with error suppression for taint/label steps
pkg/provisioner/provisioner.go Reduced maxRetries to 10 and increased retryInterval to 30s for node-reboot polling

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez requested review from Copilot and elezar June 2, 2025 12:00
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR standardizes retry counts and timeouts for Kubernetes installation commands and improves version normalization logic.

  • Increased retry intervals from 20s to 30s for kubectl commands and node reboot waits
  • Added explicit wait for kube-apiserver availability
  • Normalized Kubernetes and MicroK8s version handling, stripping leading “v” and defaulting when empty

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
pkg/provisioner/templates/kubernetes.go Updated with_retry timeouts, added kube-apiserver wait, enhanced version templating for microk8s
pkg/provisioner/provisioner.go Adjusted node reboot retry parameters (interval and count)

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copy link
Contributor

@shivakunv shivakunv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM . except for some copilot comments

@ArangoGutierrez ArangoGutierrez merged commit c1c5e8c into NVIDIA:main Jun 2, 2025
19 checks passed
@coveralls
Copy link

coveralls commented Jun 3, 2025

Pull Request Test Coverage Report for Build 15389641778

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 7 of 11 (63.64%) changed or added relevant lines in 2 files are covered.
  • 23 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 14.98%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/provisioner/provisioner.go 0 2 0.0%
pkg/provisioner/templates/kubernetes.go 7 9 77.78%
Files with Coverage Reduction New Missed Lines %
pkg/provisioner/templates/kubernetes.go 23 52.69%
Totals Coverage Status
Change from base Build 15363939553: 0.0%
Covered Lines: 262
Relevant Lines: 1749

💛 - Coveralls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants