Revamp GKE A3 High blueprint and align integration tests#5246
Conversation
Summary of ChangesHello @shubpal07, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly modernizes the GKE A3 High blueprint by reorganizing its structure, enhancing its documentation, and integrating advanced features. The changes aim to align the blueprint with the standards of other GKE A* family offerings, providing users with a more robust and feature-rich solution for high-performance ML training. Key additions include support for Kueue's Topology Aware Scheduling for efficient GPU workload management and the implementation of Cluster Health Services for automated GPU diagnostics, ensuring optimal cluster performance and reliability. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request is a significant and valuable modernization of the GKE A3 High blueprint. The restructuring into a dedicated directory, addition of comprehensive documentation, and integration of advanced features like Kueue for Topology Aware Scheduling and Cluster Health Services (CHS) are excellent improvements that align it with the standards of other A* family blueprints.
My review focuses on a few key areas to further enhance the quality of these changes:
- Security: I've identified a couple of instances where permissions (both for a GCP IAM role and a Kubernetes ClusterRole) are overly broad. My suggestions aim to tighten these permissions by following the principle of least privilege.
- Efficiency and Reliability: The new CronJob for health checks can be made much more efficient and reliable by using a pre-built container image instead of installing dependencies on every run, aligning with guidelines for complex inline scripts.
- Maintainability: I've pointed out a minor issue with an outdated API version in the Kueue configuration to ensure future compatibility, and highlighted the need for consistent placeholder formatting as per repository rules.
Overall, this is a strong contribution. Addressing these points will improve the security, performance, and long-term maintainability of this blueprint.
5fd13b1 to
e0fc336
Compare
|
Integration test passing for GKE A3 high onspot |
|
PR tests passed: |
Change-Id: Ie064dcac4ec7c7e23909024c6c4f537275f045f2 Change-Id: If4ba715f16360e098952303c6b5749e663f829d7 Creating NCCL manifests for A3 high Change-Id: I63de674d221515e23bedce7301e2e0054fc8996f use kueue version 0.14.4 and apiVersion: kueue.x-k8s.io/v1beta1 Change-Id: Ibe5e2798b0cac34d5165359a1510fead0bcc9aa6 Adding nccl test bug fixes Change-Id: I77841eab68eb26570363374f28ee925b561955f8
…support Change-Id: I655e4ad823c9c43e1be8c82f65b5ac76e28a4fe6
Change-Id: I9c02cddd1823123fc20cb97c5938c9b7295fc509
Change-Id: Iee74231bb752d534febc7c0936c550e0841d0d52
Change-Id: I4efab20e43a0546929a3fb15f155647e7618a4ad
Change-Id: I36c7a7526c56c06635ded61e56e27c39ec835a23
Change-Id: I6bd9337d9c3fff5b713d410bb7d575f201e2ae86
Change-Id: I570e96127745f22a8e41e3158ef033315f92d2cf
Change-Id: Iae317eff6c4a2e6706d7d1e82edd263d3bd27a59
Change-Id: Ia1f4814708496bb2cacf9cc64cc9ff9dc90860d8
ee8a17b to
eb636b1
Compare
Change-Id: I9d9a739fa05666ab620236f4121034521776d674
|
Integration test updates:
|
Align GKE A3 High Blueprint with other A* machines
Summary
This PR refactors the GKE A3 High GPU blueprint to align its architecture with the deployment patterns used by other A3 (Mega/Ultra) and A4 machines. Instead of relying on the node pool module to dynamically fetch upstream manifests, we now natively bundle the necessary GPUDirect manifests (NRI device injector, TCPX installer, NCCL config) and apply them explicitly during the blueprint creation process. We have alse created separate folder, deployment file and README for A3 high.
Motivation
Previously, the A3 High blueprint relied on the
gke-node-poolmodule to pull GPUDirect manifests directly from public GitHub URLs.In this PR we natively bundle manifests for several critical reasons:PodInitializingcrash loop caused by a bug in the upstreamenable-nriinitContainer. By natively including the manifests, we insulate deployments from upstream breaking changes. https://docs.cloud.google.com/kubernetes-engine/docs/troubleshooting/gpus#tcpx-daemon-upgrade-failure.yamlor.tftplfiles gives operators direct control to patch configurations and customize components specifically for their workloads.Changes
examples/gke-a3-highgpu/): Added explicit, localized, and tested versions ofnccl-config.yaml,nri-device-injector.yaml, andnccl-tcpx-installer.yaml.tftplto the blueprint directory.examples/gke-a3-highgpu/gke-a3-highgpu.yaml):workload_component_installstep to natively apply the GPUDirect manifests directly to the cluster during the blueprint creation phase.install_gpu_direct_manifests: falseto thea3_highgpu_poolto disable the legacy remote download behavior.modules/compute/gke-node-pool): Introduced theinstall_gpu_direct_manifestsvariable (defaulting totruefor backwards compatibility) to gracefully opt-out of legacy URL manifest injections.ResourceFlavors,ClusterQueues, andLocalQueuesdirectly on the cluster alongside the GPUDirect infrastructure.Testing & Validation
containerdnode initialization errors.Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.