Hold all nvidia software to the same version #4458

Merged
samskillman merged 1 commit into GoogleCloudPlatform:release-candidate from samskillman:fix/resolve-version-mismatch
Jul 25, 2025

Conversation

@samskillman

Collaborator

Without this, any combination of "update & upgrade" can upgrade parts of the NVIDIA software stack and leave them out of sync with the rest. Only libnvidia-compute-570-server causes immediate errors, but it is best to keep everything in sync with the image until an upgrade can be done across all instances.
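As a rough illustration of what "holding" the stack could look like, here is a minimal Python sketch. It is not the change made in this PR (the blueprint diff is not shown here). It assumes a Debian/Ubuntu image managed with apt, and it assumes that every package whose name contains "nvidia" belongs to the stack. The helper names and the sample version strings are hypothetical; only `libnvidia-compute-570-server` comes from the description above.

```python
import re
import shlex

# Assumption: any installed package with "nvidia" in its name is part of the stack.
NVIDIA_PATTERN = re.compile(r"nvidia", re.IGNORECASE)


def parse_installed(dpkg_output: str) -> dict:
    """Parse the output of `dpkg-query -W -f='${Package} ${Version}\\n'`
    into a {package_name: version} mapping."""
    installed = {}
    for line in dpkg_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            installed[parts[0]] = parts[1]
    return installed


def nvidia_packages(installed: dict) -> list:
    """Return the sorted names of installed packages that look like NVIDIA packages."""
    return sorted(name for name in installed if NVIDIA_PATTERN.search(name))


def hold_command(packages: list) -> str:
    """Build an `apt-mark hold` command that pins the given packages at
    their currently installed versions. Returns "" if there is nothing to hold."""
    if not packages:
        return ""
    return "apt-mark hold " + " ".join(shlex.quote(p) for p in packages)


if __name__ == "__main__":
    # Hypothetical dpkg-query output; the version strings are made up.
    sample = (
        "libnvidia-compute-570-server 570.0.0-example\n"
        "nvidia-dkms-570-server 570.0.0-example\n"
        "curl 8.5.0-example\n"
    )
    print(hold_command(nvidia_packages(parse_installed(sample))))
```

Running the generated command as root before any "update & upgrade" would keep the listed packages at the versions shipped in the image; `apt-mark unhold` with the same package list reverses it once a coordinated upgrade across all instances is possible.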

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@samskillman samskillman requested a review from nick-stroud July 25, 2025 21:33
@samskillman samskillman requested a review from a team as a code owner July 25, 2025 21:34
@samskillman samskillman added the release-bugfix Added to release notes under the "Bug fixes" heading. label Jul 25, 2025
@samskillman samskillman force-pushed the fix/resolve-version-mismatch branch from 8d54ae3 to 075f4fe Compare July 25, 2025 21:43
Comment thread examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml Outdated
@samskillman samskillman force-pushed the fix/resolve-version-mismatch branch 2 times, most recently from c20e9a3 to 96b935d Compare July 25, 2025 22:11
@samskillman samskillman force-pushed the fix/resolve-version-mismatch branch from 96b935d to 3289f0b Compare July 25, 2025 22:12
@samskillman samskillman merged commit 5540f11 into GoogleCloudPlatform:release-candidate Jul 25, 2025
12 of 63 checks passed
@samskillman samskillman deleted the fix/resolve-version-mismatch branch July 25, 2025 22:22
Labels

release-bugfix Added to release notes under the "Bug fixes" heading.

2 participants