Skip to content

Better integrate, optimize, and control the gIB NCCL RDMA plugin installer#4069

Merged
ighosh98 merged 13 commits into
GoogleCloudPlatform:developfrom
ndebuhr:feat/tighter-gib-nccl-integration
May 9, 2025
Merged

Better integrate, optimize, and control the gIB NCCL RDMA plugin installer#4069
ighosh98 merged 13 commits into
GoogleCloudPlatform:developfrom
ndebuhr:feat/tighter-gib-nccl-integration

Conversation

@ndebuhr

@ndebuhr ndebuhr commented May 6, 2025

Copy link
Copy Markdown
Contributor

This PR enables faster NCCL RDMA upgrades, a subtle warning for gIB version changes, and integration of the gIB installer code into the kubectl-apply module.

Manually tested changes and a variety of scenarios using Terraform v1.11.4.

Faster NCCL RDMA installation

Previously, when the gIB NCCL plugin version was changed, the rolling update strategy used a maxUnavailable of 1 (the default value). This means that each node is updated one-by-one. With image pull times often exceeding 10 seconds, the rollout can leave large clusters in a broken, mixed-NCCL state for hours. Increasing the maxUnavailable to 16 reduces this rollout time by over an order of magnitude, while still maintaining the gradual/rolling update characteristics.

Subtle warning for gIB version changes

An environment with inconsistent NCCL versions can lead to tricky errors. As such, NCCL version changes need to be handled carefully. While the DaemonSet rollout and installer script should fully and cleanly update the environment when the installer image is upgraded, potential unmanaged NCCL environment variables/configurations and other "residuals" could cause issues. This is not so large a concern as to warrant deployment blocking, so a check block is added to give a warning when the gIB version changes. Like other check blocks in the repo, this warning is wedged between the terraform apply logs and the blueprint outputs after a gcluster deploy. I don't believe the warning needs more emphasis than this.

Integration of the gIB installer code

While largely a prerequisite for the previously-mentioned changes, I believe there's some value in this in it's own right. The integration of the gIB installer code into the kubectl-apply module seems like a cleaner and easier UX, while still fully allowing for the old approach of custom apply_manifests. The NCCL RDMA installer manifests across the a3-ultra and a4 blueprints are also identical, other than the gke-accelerator node affinity, so there's a bit of code deduplication value here.

There's strong precedent for this sort of shift in the kubectl-apply module already (Jobset, Kueue,. etc.).

Additional design notes

  1. These changes do not affect default behavior of existing modules (gib.install is default false) or blueprints (apply_manifests works like it always has).
  2. Loosening up the template_vars data type (kubectl-apply module) was required for the gIB NCCL installer integration, but I believe this is a sensible change in it's own right to enable more advanced templating on top of the manifests. Since the data typing was relaxed, rather than changed, this should not introduce any issues.
  3. The default accelerator selector is set to the superset of all relevant accelerators (a3-ultra and a4) so that the setup still works if an accelerator_selector variable is not supplied. However, an accelerator_selector setting is explicitly provided in the a3-ultra and a4 blueprints for clarity.

Submission Checklist

  • ✅ Fork your PR branch from the Toolkit "develop" branch (not main)
  • ✅ Test all changes with pre-commit in a local branch #
  • ✅ Confirm that "make tests" passes all tests
  • ✅ Add or modify unit tests to cover code changes
  • ✅ Ensure that unit test coverage remains above 80%
  • ✅ Update all applicable documentation
  • ✅ Follow Cluster Toolkit Contribution guidelines #

ndebuhr added 7 commits May 5, 2025 15:30
…ule (similar to other systems like Jobset and Kueue)
…a one-by-one node update of the DaemonSet can take hours on large clusters (10+ second image pull time)
…p elements must have the same type and templates will commonly need multiple types (e.g., a nodeSelectors list and a version string)
…, to help avoid environment inconsistency issues
…Set node affinity) in the a3-ultra and a4 blueprints
@ndebuhr ndebuhr requested review from a team and samskillman as code owners May 6, 2025 14:37

@ighosh98 ighosh98 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nccl-installer is generally tied to the machine family architecture. I don't think we should be moving it into the manifests folder. For instance we may have a different number of GPUs in a future machine family which can impact parts of the nccl installer

For example: Take a look at this nccl installer
We need to think of a different solution here.

Comment thread modules/management/kubectl-apply/main.tf
ndebuhr added 3 commits May 6, 2025 21:20
…64, arm64, amd64-diagnostic, and arm64-diagnostic)
…cheduling requirements in the gIB installer, via a variable for nodeAffinity (and sensible gpu=true default value)
@ndebuhr

ndebuhr commented May 6, 2025

Copy link
Copy Markdown
Contributor Author

Thanks @ighosh98. Future compatibility around subshapes and cpu architectures is not a lot of parameterization in my opinion - added and tested a couple of additional commits, and they cover the full installer diffs.

I personally think this now represents the right amount of configurability, but with significant code similarities/duplication making this still worth templatizing in the kubectl-apply module. We can always push folks to the existing apply_manifests for weird one-off cases.

Thoughts?

@ighosh98 ighosh98 added release-key-new-features Added to release notes under the "Key New Features" heading. release-module-improvements Added to release notes under the "Module Improvements" heading. labels May 7, 2025
@ighosh98

ighosh98 commented May 7, 2025

Copy link
Copy Markdown
Contributor

Thanks @ighosh98. Future compatibility around subshapes and cpu architectures is not a lot of parameterization in my opinion - added and tested a couple of additional commits, and they cover the full installer diffs.

I personally think this now represents the right amount of configurability, but with significant code similarities/duplication making this still worth templatizing in the kubectl-apply module. We can always push folks to the existing apply_manifests for weird one-off cases.

Thoughts?

I agree with the idea to templatize the nccl installer. I would prefer that we can define the path to be tied strictly to a machine family and have unique nccl installers per machine family. The two obvious solutions are:

  1. We make the apply manifest block a bit more generic and allow users to pass params corresponding to the manifest as well.
  2. We need to have a mechanism to dynamically identify the folder path for a machine family the user wants to deploy.
  3. We should have a mechanism that allows users to specify the desired path in the manifest as a parameter. Ideally this should be in the chosen folder itself. Or atleast they should be separated and tied strictly to a cluster type.

Would have to think how to implement one of these solutions.

I am inclined to use the third approach as it seems like the easiest one to implement, fairly intuitive to use and is very similar to the kueue configuration solution we have in place today. Let me know if this approach sounds good to you.

@ndebuhr

ndebuhr commented May 7, 2025

Copy link
Copy Markdown
Contributor Author

@ighosh98 Sounds good to me! Revised and retested PR is ready for your review.

@ighosh98 ighosh98 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of hard coding values, please use the config_template_var constuct. We have used a similar solution for wiring Kueue configurations

Rest looks good to me.

Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread modules/management/kubectl-apply/variables.tf Outdated
@ighosh98

ighosh98 commented May 7, 2025

Copy link
Copy Markdown
Contributor

@ighosh98

ighosh98 commented May 7, 2025

Copy link
Copy Markdown
Contributor

/gcbrun

ighosh98
ighosh98 previously approved these changes May 7, 2025

@ighosh98 ighosh98 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Let's wait for the tests to pass before merging.

Comment thread examples/gke-a4/nccl-installer.yaml.tftpl
@ndebuhr

ndebuhr commented May 7, 2025

Copy link
Copy Markdown
Contributor Author

@ighosh98 Thanks for the comments and call. Made that last change that we discussed - to force specification of the gIB version and accelerator count in the blueprint.

@ighosh98 ighosh98 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will wait for all tests to pass before we merge the PR.

@ighosh98

ighosh98 commented May 7, 2025

Copy link
Copy Markdown
Contributor

/gcbrun

@ighosh98 ighosh98 enabled auto-merge May 7, 2025 20:25
@ighosh98 ighosh98 disabled auto-merge May 7, 2025 20:26
@SwarnaBharathiMantena SwarnaBharathiMantena self-requested a review May 9, 2025 17:15
@ighosh98 ighosh98 merged commit c6fa3d3 into GoogleCloudPlatform:develop May 9, 2025
13 of 66 checks passed
@ndebuhr ndebuhr deleted the feat/tighter-gib-nccl-integration branch May 10, 2025 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-key-new-features Added to release notes under the "Key New Features" heading. release-module-improvements Added to release notes under the "Module Improvements" heading.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants