Bug fixes after v0.2.8 release#368

Merged
ArangoGutierrez merged 6 commits into NVIDIA:main from ArangoGutierrez:fix/e2e/drivergpu
May 26, 2025

Conversation

@ArangoGutierrez
Collaborator

This pull request makes changes across several areas of the codebase. All of them fix problems introduced by changes made during the last release cut.

Error Handling and Status Updates:

  • Added logic to set a "Degraded" condition with detailed error information when provisioning fails, and updated the cache file with the provisioning status in runProvision (cmd/cli/create/create.go).
  • Enhanced the getProviderStatus function to include the reason for a degraded status in the status message (internal/instances/instances.go).

CLI Simplification:

  • Removed the envFile flag and its associated logic from the delete command, simplifying the deletion process to rely solely on the instance-id flag (cmd/cli/delete/delete.go).

Provisioning Templates:

  • Updated the NVIDIA driver installation script to include additional checks (e.g., loading the NVIDIA module and starting nvidia-persistenced) and improved driver validation (pkg/provisioner/templates/nv-driver.go).
  • Adjusted Kubernetes provisioning logic to retry Calico resource creation up to 5 times and updated the legacy Kubernetes version check to v1.32.0 (pkg/provisioner/templates/kubernetes.go).

Testing Enhancements:

  • Introduced additional end-to-end tests for AWS environments, including validation of environment provisioning and Kubernetes cluster setup (tests/aws_test.go).

These changes collectively improve the robustness, usability, and maintainability of the codebase while enhancing the test coverage and reliability of the provisioning process.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez force-pushed the fix/e2e/drivergpu branch 4 times, most recently from 9e86a23 to a33a293, on May 26, 2025 10:07

Copilot AI left a comment

Pull Request Overview

This pull request fixes several issues from the previous release by updating error handling in provisioning, simplifying the CLI, and improving provisioning and testing logic. Key changes include enhanced degraded condition handling in provisioning, updated installation and validation logic for NVIDIA drivers and Kubernetes resources, and removal of legacy flag usage in the delete command.

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/e2e_test.go | Added SSH key loading and validation for end-to-end tests |
| tests/data/*.yml | Updated ingress IP ranges in AWS test configuration files |
| tests/aws_test.go | Added end-to-end test enhancements for AWS environments |
| pkg/provisioner/templates/nv-driver*.go | Updated NVIDIA driver installation to use nvidia-driver syntax and added module checks |
| pkg/provisioner/templates/kubernetes_test.go | Adjusted Kubernetes version comparisons and legacy initialization flags |
| pkg/provisioner/templates/kubernetes.go | Increased retry counts for Calico resource creation and updated legacy version check logic |
| pkg/provisioner/templates/kernel.go | Removed extraneous output after initiating reboot during kernel upgrade |
| internal/instances/instances.go & cmd/cli/list/list.go | Removed warnings for cache files without instance IDs |
| cmd/cli/delete/delete.go | Removed the envFile flag and associated processing |
| cmd/cli/create/create.go | Enhanced error handling by setting a degraded condition and updating the cache file on provisioning failure |
| .github/workflows/e2e.yaml | Modified E2E job to securely handle SSH keys via a temporary file |

Comments suppressed due to low confidence (2)

cmd/cli/create/create.go:204

  • [nitpick] After updating the cache file with the degraded status, adding an informational log entry could improve traceability for provisioning failures.
if err = p.Run(opts.cfg); err != nil {

pkg/provisioner/templates/kubernetes_test.go:84

  • The test now sets UseLegacyInit to true whereas it was previously false. Please confirm that this change in default behavior is intended and update documentation if necessary.
UseLegacyInit:         true,

@ArangoGutierrez ArangoGutierrez merged commit f5b9d4b into NVIDIA:main May 26, 2025
15 checks passed