Bring parity of functionality to both A3U and A4 #4023

Merged: tpdownes merged 3 commits into GoogleCloudPlatform:develop from samskillman:feat/a3u-a4-parity on May 7, 2025

Conversation

@samskillman samskillman (Collaborator) commented Apr 29, 2025

  • Accelerator images for A4
  • persistenced enabled for A4
  • NVIDIA repo pinning
  • SocketsPerBoard 2 for A3U
  • Enroot Config for A3U
  • A4 ResumeTimeout set to 1200 to match A3U
  • Disk sizes 100GB for compute nodes
  • Add DWS Flex for A3U
  • Incorporate accelerator image patch for A4
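
For readers less familiar with Slurm: two of the items above name standard `slurm.conf` parameters (`ResumeTimeout`, in seconds, and the per-node `SocketsPerBoard`). A minimal sketch of what the resulting Slurm configuration might contain, assuming the blueprint settings end up rendered into `slurm.conf`. The node name `a3u-node-0` is hypothetical, and the blueprint keys that produce these values are not shown in this description.

```text
# Hypothetical slurm.conf fragment, for illustration only.
# Seconds allowed between a node resume request and the node becoming usable.
ResumeTimeout=1200
# Node name is made up; only SocketsPerBoard=2 comes from the list above.
NodeName=a3u-node-0 SocketsPerBoard=2
```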

Marking as a breaking change as we are switching the base image being used in A4.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@samskillman samskillman requested a review from a team as a code owner April 29, 2025 22:49
@samskillman samskillman added the release-version-updates Added to release notes under the "Version Updates" heading. label Apr 29, 2025
@samskillman samskillman (Collaborator, Author) commented

/gcbrun

@samskillman samskillman force-pushed the feat/a3u-a4-parity branch 4 times, most recently from da16d6b to bcd6972 Compare April 30, 2025 22:39
@samskillman samskillman added release-improvements Added to release notes under the "Improvements" heading. and removed release-version-updates Added to release notes under the "Version Updates" heading. labels Apr 30, 2025
tpdownes previously approved these changes May 1, 2025
@samskillman samskillman added the release-breaking-changes Prevents "smooth" re-deploy across versions label May 1, 2025
@samskillman samskillman marked this pull request as draft May 2, 2025 01:41
@tpdownes tpdownes dismissed their stale review May 2, 2025 02:02

Manual testing has shown problems with NVIDIA library version mismatches

@samskillman samskillman force-pushed the feat/a3u-a4-parity branch from bcd6972 to 4e35202 Compare May 6, 2025 20:56
@samskillman samskillman marked this pull request as ready for review May 6, 2025 20:56
Comment thread examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml Outdated
Co-authored-by: Tom Downes <tpdownes@users.noreply.github.com>
@samskillman samskillman requested a review from tpdownes May 6, 2025 22:28
tpdownes previously approved these changes May 7, 2025

@tpdownes tpdownes (Contributor) left a comment

Minor change request. Please wait until the PR-test-ml-a4-highgpu-slurm test passes to merge.

Comment thread examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml Outdated
….yaml

Co-authored-by: Tom Downes <tpdownes@users.noreply.github.com>

@tpdownes tpdownes (Contributor) left a comment

Tests passed before the final commit which only changed whitespace in a comment.

@tpdownes tpdownes merged commit 052abf8 into GoogleCloudPlatform:develop May 7, 2025
13 of 65 checks passed
@samskillman samskillman deleted the feat/a3u-a4-parity branch May 7, 2025 03:13

Labels

  • release-breaking-changes: Prevents "smooth" re-deploy across versions
  • release-improvements: Added to release notes under the "Improvements" heading.

3 participants