Better integrate, optimize, and control the gIB NCCL RDMA plugin installer#4069
Conversation
…ule (similar to other systems like Jobset and Kueue)
…a one-by-one node update of the DaemonSet can take hours on large clusters (10+ second image pull time)
…p elements must have the same type and templates will commonly need multiple types (e.g., a nodeSelectors list and a version string)
…NCCL RDMA installer template
…, to help avoid environment inconsistency issues
…Set node affinity) in the a3-ultra and a4 blueprints
…se (repo convention)
ighosh98
left a comment
There was a problem hiding this comment.
The nccl-installer is generally tied to the machine family architecture. I don't think we should be moving it into the manifests folder. For instance we may have a different number of GPUs in a future machine family which can impact parts of the nccl installer
For example: Take a look at this nccl installer
We need to think of a different solution here.
… support future subshapes
…64, arm64, amd64-diagnostic, and arm64-diagnostic)
…cheduling requirements in the gIB installer, via a variable for nodeAffinity (and sensible gpu=true default value)
|
Thanks @ighosh98. Future compatibility around subshapes and cpu architectures is not a lot of parameterization in my opinion - added and tested a couple of additional commits, and they cover the full installer diffs. I personally think this now represents the right amount of configurability, but with significant code similarities/duplication making this still worth templatizing in the Thoughts? |
I agree with the idea to templatize the nccl installer. I would prefer that we can define the path to be tied strictly to a machine family and have unique nccl installers per machine family. The two obvious solutions are:
Would have to think how to implement one of these solutions. I am inclined to use the third approach as it seems like the easiest one to implement, fairly intuitive to use and is very similar to the kueue configuration solution we have in place today. Let me know if this approach sounds good to you. |
…lders, but retain the new controls and templatization
|
@ighosh98 Sounds good to me! Revised and retested PR is ready for your review. |
ighosh98
left a comment
There was a problem hiding this comment.
Instead of hard coding values, please use the config_template_var constuct. We have used a similar solution for wiring Kueue configurations
Rest looks good to me.
|
Sharing code pointers on how Kueue integration implemented the construct: |
|
/gcbrun |
ighosh98
left a comment
There was a problem hiding this comment.
LGTM
Let's wait for the tests to pass before merging.
…erator count, if the gIB installer is used
|
@ighosh98 Thanks for the comments and call. Made that last change that we discussed - to force specification of the gIB version and accelerator count in the blueprint. |
ighosh98
left a comment
There was a problem hiding this comment.
Will wait for all tests to pass before we merge the PR.
|
/gcbrun |
This PR enables faster NCCL RDMA upgrades, a subtle warning for gIB version changes, and integration of the gIB installer code into the
kubectl-applymodule.Manually tested changes and a variety of scenarios using Terraform v1.11.4.
Faster NCCL RDMA installation
Previously, when the gIB NCCL plugin version was changed, the rolling update strategy used a
maxUnavailableof 1 (the default value). This means that each node is updated one-by-one. With image pull times often exceeding 10 seconds, the rollout can leave large clusters in a broken, mixed-NCCL state for hours. Increasing themaxUnavailableto 16 reduces this rollout time by over an order of magnitude, while still maintaining the gradual/rolling update characteristics.Subtle warning for gIB version changes
An environment with inconsistent NCCL versions can lead to tricky errors. As such, NCCL version changes need to be handled carefully. While the DaemonSet rollout and installer script should fully and cleanly update the environment when the installer image is upgraded, potential unmanaged NCCL environment variables/configurations and other "residuals" could cause issues. This is not so large a concern as to warrant deployment blocking, so a
checkblock is added to give a warning when the gIB version changes. Like othercheckblocks in the repo, this warning is wedged between the terraform apply logs and the blueprint outputs after agcluster deploy. I don't believe the warning needs more emphasis than this.Integration of the gIB installer code
While largely a prerequisite for the previously-mentioned changes, I believe there's some value in this in it's own right. The integration of the gIB installer code into the
kubectl-applymodule seems like a cleaner and easier UX, while still fully allowing for the old approach of customapply_manifests. The NCCL RDMA installer manifests across the a3-ultra and a4 blueprints are also identical, other than thegke-acceleratornode affinity, so there's a bit of code deduplication value here.There's strong precedent for this sort of shift in the
kubectl-applymodule already (Jobset, Kueue,. etc.).Additional design notes
gib.installis defaultfalse) or blueprints (apply_manifestsworks like it always has).template_varsdata type (kubectl-applymodule) was required for the gIB NCCL installer integration, but I believe this is a sensible change in it's own right to enable more advanced templating on top of the manifests. Since the data typing was relaxed, rather than changed, this should not introduce any issues.accelerator_selectorvariable is not supplied. However, anaccelerator_selectorsetting is explicitly provided in the a3-ultra and a4 blueprints for clarity.Submission Checklist