Add a4-high-vm blueprint #3751

Merged
samskillman merged 1 commit into GoogleCloudPlatform:develop from samskillman:feat/a4-highgpu-vm
Mar 5, 2025

Conversation

@samskillman

Collaborator

Manually deployed a 2-node cluster. NCCL tests were successful with:

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests/
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ -j
export NCCL_NET=gIB
mpirun -n 16 -N 8 -x NCCL_NET --host a4high-vm-0:8,a4high-vm-1:8 ./build/all_gather_perf -b 8 -e 16G -f 2 -g 1
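
As a rough sketch (not part of this PR), the mpirun line above could be generated for other node counts with a small shell helper. The function name build_nccl_cmd and its arguments are hypothetical. It only prints the command, and it assumes Open MPI's -N (ranks per node) together with the nccl-tests flags used above: -b and -e for the minimum and maximum message size, -f for the size multiplier, and -g for GPUs per process.

```shell
# Hypothetical helper: prints (does not run) an mpirun command shaped like
# the one used in the manual test above.
# Usage: build_nccl_cmd <ranks_per_node> <host> [<host> ...]
build_nccl_cmd() {
  gpus_per_node="$1"
  shift

  host_list=""
  count=0
  for h in "$@"; do
    # Build "host:slots" pairs, comma-separated, as expected by --host.
    host_list="${host_list:+${host_list},}${h}:${gpus_per_node}"
    count=$((count + 1))
  done

  # Total ranks = ranks per node * number of nodes.
  total=$((gpus_per_node * count))

  echo "mpirun -n ${total} -N ${gpus_per_node} -x NCCL_NET --host ${host_list} ./build/all_gather_perf -b 8 -e 16G -f 2 -g 1"
}
```

Calling build_nccl_cmd 8 a4high-vm-0 a4high-vm-1 reproduces the command from the test above. NCCL_NET still has to be exported beforehand so that -x NCCL_NET can forward it to the ranks.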

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@samskillman samskillman requested review from a team and chengcongdu as code owners March 4, 2025 02:46
@samskillman samskillman added the release-key-new-features label (Added to release notes under the "Key New Features" heading.) Mar 4, 2025

@harshthakkar01 harshthakkar01 left a comment

Contributor

For VM deployments, we don't have a separate variables file to deploy with; see https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/machine-learning/a3-ultragpu-8g

Should we follow the same format?

@samskillman

Collaborator Author

> For VM deployments, we don't have a separate variables file to deploy with; see https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/machine-learning/a3-ultragpu-8g
>
> Should we follow the same format?

I prefer having a deployment file. We could/should backport to a3-ultra.

@samskillman samskillman force-pushed the feat/a4-highgpu-vm branch 2 times, most recently from 2822c7c to 047ec6f Compare March 5, 2025 02:51
Comment thread examples/machine-learning/a4-highgpu-8g/a4high-vm.yaml Outdated
@harshthakkar01 harshthakkar01 enabled auto-merge March 5, 2025 17:59
@samskillman samskillman disabled auto-merge March 5, 2025 17:59
@samskillman samskillman enabled auto-merge March 5, 2025 17:59
@samskillman samskillman merged commit cc08bee into GoogleCloudPlatform:develop Mar 5, 2025
@samskillman samskillman deleted the feat/a4-highgpu-vm branch March 5, 2025 18:49
Labels

release-key-new-features Added to release notes under the "Key New Features" heading.

2 participants