Block broken release of nvidia-container-toolkit #4152

Merged: tpdownes merged 1 commit into GoogleCloudPlatform:main from tpdownes:hotfix_nvidia_package on May 20, 2025
@tpdownes

Contributor

The 1.17.7 release of nvidia-container-toolkit contains a regression which breaks running GPU-enabled jobs under enroot in Slurm.

While we wait for an updated package, this configuration blocks clusters from installing or upgrading to this package. If the package is already installed, this change does nothing. It should be forward-compatible in the sense that it will not block new releases with semantically higher versions.

The specific file nvidia-container-cli is installed during the build stage of Slurm, so I believe this should result in new clusters running nvidia-container-toolkit=1.17.6-1 (and other packages with similar names and an identical version). When a version above 1.17.7-1 is released, the build process should naturally select it.
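
The selection behavior described above can be sketched in a few lines of Python. This is only an illustration of the intended outcome, not the configuration added by this PR and not the package manager's real algorithm: the helper names `parse` and `select_version` are hypothetical, and version comparison is simplified to dotted integers plus a numeric revision.

```python
from typing import Iterable, Optional, Tuple

# The release this PR blocks (taken from the description above).
BLOCKED = "1.17.7-1"


def parse(version: str) -> Tuple[int, ...]:
    """Turn '1.17.6-1' into (1, 17, 6, 1) so versions can be ordered.

    Simplification: real package managers apply richer comparison rules.
    """
    upstream, _, revision = version.partition("-")
    parts = [int(p) for p in upstream.split(".")]
    parts.append(int(revision) if revision else 0)
    return tuple(parts)


def select_version(available: Iterable[str], blocked: str = BLOCKED) -> Optional[str]:
    """Return the highest available version that is not the blocked one."""
    candidates = [v for v in available if v != blocked]
    return max(candidates, key=parse, default=None)


if __name__ == "__main__":
    today = ["1.17.5-1", "1.17.6-1", "1.17.7-1"]
    print(select_version(today))   # 1.17.6-1

    later = today + ["1.17.8-1"]
    print(select_version(later))   # 1.17.8-1
```

With only the broken release blocked, the newest remaining version wins, and a later release above 1.17.7-1 is picked up without further changes.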

This mitigates #4144

These changes were manually tested before/after #4146, which successfully identified the broken functionality.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

The 1.17.7 release of nvidia-container-toolkit contains a regression
which breaks running GPU-enabled jobs under enroot in Slurm.

- NVIDIA/nvidia-container-toolkit#1091
- NVIDIA/nvidia-container-toolkit#1093
- NVIDIA/enroot#232

While we wait for an updated package, this configuration will block
clusters from installing or upgrading to this package. If it is already
installed, this change does nothing. It should be forward-compatible in
the sense that it will not block new releases with semantically higher
versions.

This mitigates GoogleCloudPlatform#4144
tpdownes requested review from a team and samskillman as code owners May 20, 2025 16:49
tpdownes added the release-bugfix label (Added to release notes under the "Bug fixes" heading.) May 20, 2025
tpdownes enabled auto-merge May 20, 2025 16:57
tpdownes self-assigned this May 20, 2025
tpdownes merged commit 8b7aae6 into GoogleCloudPlatform:main May 20, 2025
14 of 66 checks passed
tpdownes deleted the hotfix_nvidia_package branch May 20, 2025 17:02
3 participants