
Block broken release of nvidia-container-toolkit #4145

Merged: tpdownes merged 1 commit into GoogleCloudPlatform:release-candidate from tpdownes:block_nvidia_package on May 20, 2025


tpdownes commented May 19, 2025

Contributor

The 1.17.7 release of nvidia-container-toolkit contains a regression which breaks running GPU-enabled jobs under enroot in Slurm.

While we wait for an updated package, this configuration blocks clusters from installing or upgrading to this package. If the package is already installed, this change does nothing. It should be forward-compatible in the sense that it will not block new releases with semantically higher versions.

The specific file nvidia-container-cli is installed during the build stage of Slurm, so I believe this should result in new clusters running nvidia-container-toolkit=1.17.6-1 (and other packages with similar names and an identical version). When a version above 1.17.7-1 is released, the build process should naturally select it.
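As a rough illustration of the intended selection behavior (not the actual configuration in this PR), the sketch below blocks exactly one version and picks the highest remaining candidate. The names `BLOCKED`, `parse_version`, and `select_candidate` are hypothetical, the list of available versions is made up, and the numeric comparison is a simplification of real package-manager version ordering.

```python
from typing import List, Optional, Set, Tuple

# Assumption: only this exact version is blocked, as described above.
BLOCKED: Set[str] = {"1.17.7-1"}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.17.6-1' into (1, 17, 6, 1) for a simple numeric comparison.

    Simplification: real package version ordering has more rules than this.
    """
    upstream, _, revision = version.partition("-")
    parts = [int(p) for p in upstream.split(".")]
    parts.append(int(revision) if revision else 0)
    return tuple(parts)


def select_candidate(
    available: List[str], blocked: Set[str] = BLOCKED
) -> Optional[str]:
    """Return the highest available version that is not blocked, if any."""
    allowed = [v for v in available if v not in blocked]
    return max(allowed, key=parse_version, default=None)


# Today: the broken release is skipped and the previous one is chosen.
print(select_candidate(["1.17.5-1", "1.17.6-1", "1.17.7-1"]))  # 1.17.6-1

# Later: a hypothetical newer release is selected without changing the block.
print(select_candidate(["1.17.6-1", "1.17.7-1", "1.17.8-1"]))  # 1.17.8-1
```

Because the block names a single version rather than an upper bound, nothing needs to be reverted once a fixed release is published.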

This mitigates #4144

These changes were manually tested before/after #4146, which successfully identified the broken functionality.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

The 1.17.7 release of nvidia-container-toolkit contains a regression
which breaks running GPU-enabled jobs under enroot in Slurm.

- NVIDIA/nvidia-container-toolkit#1091
- NVIDIA/nvidia-container-toolkit#1093
- NVIDIA/enroot#232

While we wait for an updated package, this configuration will block
clusters from installing or upgrading to this package. If it is already
installed, this change does nothing. It should be forward-compatible in
the sense that it will not block new releases with semantically higher
versions.

This mitigates GoogleCloudPlatform#4144
@tpdownes tpdownes requested a review from samskillman as a code owner May 19, 2025 18:44
@tpdownes tpdownes added the release-bugfix Added to release notes under the "Bug fixes" heading. label May 19, 2025
@tpdownes tpdownes requested a review from a team as a code owner May 19, 2025 18:44
@tpdownes tpdownes merged commit 6ecc75b into GoogleCloudPlatform:release-candidate May 20, 2025
21 of 69 checks passed
@tpdownes tpdownes deleted the block_nvidia_package branch May 20, 2025 02:01

Labels

release-bugfix Added to release notes under the "Bug fixes" heading.


2 participants