Revamp GKE A3 High blueprint and align integration tests #5246

Merged
shubpal07 merged 14 commits into GoogleCloudPlatform:develop from shubpal07:shubham/a3-high-upgrade on Apr 3, 2026

@shubpal07 shubpal07 commented Feb 16, 2026

Contributor

Align GKE A3 High Blueprint with other A* machines

Summary

This PR refactors the GKE A3 High GPU blueprint to align its architecture with the deployment patterns used by other A3 (Mega/Ultra) and A4 machines. Instead of relying on the node pool module to dynamically fetch upstream manifests, we now bundle the necessary GPUDirect manifests (NRI device injector, TCPX installer, NCCL config) with the blueprint and apply them explicitly during the blueprint creation process. We have also created a separate folder, deployment file, and README for A3 High.

Motivation

Previously, the A3 High blueprint relied on the gke-node-pool module to pull GPUDirect manifests directly from public GitHub URLs. In this PR we bundle the manifests natively for several reasons:

  1. Architectural Alignment: This natively aligns A3 High with other modern accelerator blueprints (like Mega and Ultra), creating a standardized topology for all A* instances where GPUDirect assets are managed as local, version-controlled module assets.
  2. Improved Stability & Predictability: Relying on upstream remote manifests introduced runtime instability. A recent example was integration tests failing due to a PodInitializing crash loop caused by a bug in the upstream enable-nri initContainer. By including the manifests natively, we insulate deployments from upstream breaking changes. See https://docs.cloud.google.com/kubernetes-engine/docs/troubleshooting/gpus#tcpx-daemon-upgrade-failure
  3. Enhanced Customization: Bundling manifests as local .yaml or .tftpl files gives operators direct control to patch configurations and customize components specifically for their workloads.

Changes

  • Natively Added GPUDirect Manifests (examples/gke-a3-highgpu/): Added explicit, localized, and tested versions of nccl-config.yaml, nri-device-injector.yaml, and nccl-tcpx-installer.yaml.tftpl to the blueprint directory.
  • Refactored Blueprint Deployment (examples/gke-a3-highgpu/gke-a3-highgpu.yaml):
    • Instructed the workload_component_install step to apply the bundled GPUDirect manifests to the cluster during the blueprint creation phase.
    • Passed install_gpu_direct_manifests: false to the a3_highgpu_pool to disable the legacy remote download behavior.
  • Module Support for Local Manifests (modules/compute/gke-node-pool): Introduced the install_gpu_direct_manifests variable (defaulting to true for backwards compatibility) so that blueprints can opt out of the legacy URL-based manifest injection.
  • Inbuilt Kueue Support: Integrated and configured complete Kueue mechanisms native to the blueprint, enabling automated job scheduling. The deployment now establishes Kueue ResourceFlavors, ClusterQueues, and LocalQueues directly on the cluster alongside the GPUDirect infrastructure.
  • Extensive Documentation Added: Introduced comprehensive guides to assist users with validating their node configurations using NCCL tests:
    • Single-Node Test Plan: A guide for verifying NCCL intra-node bandwidth topologies.
    • Multi-Node Test Plan: A comprehensive playbook covering inter-node testing and GPUDirect TCPX benchmarks.
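
The blueprint changes listed above can be sketched roughly as follows. This is a minimal illustration, not the merged file: the module IDs a3_highgpu_pool and workload_component_install and the manifest file names come from this description, while the gke_cluster ID, the group name, the use: wiring, and the apply_manifests / ghpc_stage layout are assumptions based on how other Toolkit GKE blueprints are commonly written.

```yaml
deployment_groups:
- group: primary  # assumed group name
  modules:
  # Node pool: opt out of the legacy remote-manifest download.
  - id: a3_highgpu_pool
    source: modules/compute/gke-node-pool
    use: [gke_cluster]  # assumed ID of the cluster module
    settings:
      machine_type: a3-highgpu-8g
      install_gpu_direct_manifests: false

  # Apply the bundled GPUDirect manifests from the blueprint directory.
  - id: workload_component_install
    source: modules/management/kubectl-apply
    use: [gke_cluster]  # assumed ID of the cluster module
    settings:
      apply_manifests:
      - source: $(ghpc_stage("./nccl-config.yaml"))
      - source: $(ghpc_stage("./nri-device-injector.yaml"))
      # Template file; any template variables it needs are not shown here.
      - source: $(ghpc_stage("./nccl-tcpx-installer.yaml.tftpl"))
```

Because install_gpu_direct_manifests defaults to true, existing blueprints that do not set it should keep the previous remote-download behavior.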

Testing & Validation

  • A3 Spot Integration Tests: Multi-node integration tests on A3 spot instances successfully completed without containerd node initialization errors.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Contributor

Summary of Changes

Hello @shubpal07, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly modernizes the GKE A3 High blueprint by reorganizing its structure, enhancing its documentation, and integrating advanced features. The changes aim to align the blueprint with the standards of other GKE A* family offerings, providing users with a more robust and feature-rich solution for high-performance ML training. Key additions include support for Kueue's Topology Aware Scheduling for efficient GPU workload management and the implementation of Cluster Health Services for automated GPU diagnostics, ensuring optimal cluster performance and reliability.

Highlights

  • Blueprint Restructuring: The GKE A3 High blueprint has been moved into its own dedicated directory (examples/gke-a3-highgpu/) to improve organization and modularity, aligning it with other GKE A* family blueprints.
  • Enhanced Documentation: Comprehensive documentation has been added, including a new README.md within the blueprint's directory, providing detailed instructions for deployment, configuration, and performance verification.
  • Kueue Topology Aware Scheduling (TAS) Integration: The blueprint now includes support for Kueue, enabling Topology Aware Scheduling for optimized workload placement on A3 High GPU nodes, configured via new Kubernetes resources.
  • Cluster Health Services (CHS) Implementation: Automated GPU health checks and diagnostic reporting are introduced through Cluster Health Services (CHS), configured as a Kubernetes CronJob with associated permissions and persistent storage.
  • Updated Blueprint Configuration: The main blueprint file (gke-a3-highgpu.yaml) has been updated with new variables and configurations to support Kueue, CHS, GCS FUSE CSI, and more flexible node pool settings like reservation affinity.
  • Integration Test Alignment: Existing integration tests (gke-a3-highgpu-onspot.yaml, gke-a3-highgpu.yaml, gke-a3-highgpu-onspot.yml, gke-a3-highgpu.yml) have been updated to reflect the new blueprint path and incorporate new test variables and validation for Kueue configuration.
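
For readers unfamiliar with the Kueue objects mentioned in the highlights, a minimal sketch of a ResourceFlavor, ClusterQueue, and LocalQueue for A3 High nodes might look like the following. All object names, the node label value, and the quota are illustrative assumptions (two a3-highgpu-8g nodes with 8 GPUs each); the actual kueue-configuration.yaml.tftpl in this PR substitutes the GPU count and accelerator type through template variables and also defines a Topology resource, which is omitted here.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a3-high-flavor            # hypothetical name
spec:
  nodeLabels:
    # Assumed accelerator label for A3 High (H100 80GB) nodes.
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: a3-high-cluster-queue     # hypothetical name
spec:
  namespaceSelector: {}           # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a3-high-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16          # assumed: 2 nodes x 8 GPUs
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: a3-high-local-queue       # hypothetical name
spec:
  clusterQueue: a3-high-cluster-queue
```

A workload would then opt in to this queue by carrying the label kueue.x-k8s.io/queue-name: a3-high-local-queue.
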
Changelog
  • examples/README.md
    • Updated the description for the gke-a3-highgpu.yaml blueprint to reflect its new capabilities and refer to the dedicated deployment guide.
    • Removed the whatismyip.com reference from the authorized CIDR note.
    • Adjusted the link to the gke-a3-highgpu.yaml blueprint to its new directory path.
  • examples/gke-a3-highgpu/README.md
    • Added a new, comprehensive deployment guide for the A3 High GKE cluster, detailing prerequisites, configuration, deployment steps, NCCL performance verification, and cleanup procedures.
  • examples/gke-a3-highgpu/chs-cronjob.yaml.tftpl
    • Added a new Kubernetes CronJob template for Cluster Health Services (CHS), which periodically runs diagnostic checks on GPU nodes and saves results.
  • examples/gke-a3-highgpu/chs-permissions.yaml.tftpl
    • Added new Kubernetes ServiceAccount, ClusterRole, and ClusterRoleBinding templates to grant necessary permissions for the Cluster Health Services (CHS) CronJob.
  • examples/gke-a3-highgpu/chs-pvc.yaml.tftpl
    • Added a new Kubernetes PersistentVolumeClaim template for storing output from the Cluster Health Services (CHS) CronJob.
  • examples/gke-a3-highgpu/gke-a3-highgpu-deployment.yaml
    • Added a new deployment configuration file with default variables for project ID, deployment name, region, zone, authorized CIDR, static node count, reservation, and optional periodic health checks.
  • examples/gke-a3-highgpu/gke-a3-highgpu.yaml
    • Renamed and moved the blueprint file from examples/gke-a3-highgpu.yaml to examples/gke-a3-highgpu/gke-a3-highgpu.yaml.
    • Introduced new variables for Kueue configuration path, static node count, reservation, accelerator type, GKE version prefix, and various Cluster Health Services (CHS) settings.
    • Added the container.admin IAM role to the node_pool_service_account.
    • Enabled GCS FUSE CSI for the GKE cluster.
    • Configured the GKE node pool to use specified accelerator types and reservation affinity.
    • Integrated kubectl-apply manifests for CHS permissions, PVC, and CronJob, enabling conditional deployment based on enable_periodic_health_checks.
    • Updated Kueue installation to include a configurable path for its configuration and dynamic template variables for GPU count and accelerator type.
  • examples/gke-a3-highgpu/kueue-configuration.yaml.tftpl
    • Added new Kubernetes resource templates for Kueue, including Topology, ResourceFlavor, ClusterQueue, and LocalQueue, to enable Topology Aware Scheduling for A3 High GPUs.
  • examples/gke-a3-highgpu/read-chs-logs-job.yaml
    • Added a new Kubernetes Job to facilitate reading the output logs generated by the Cluster Health Services (CHS) CronJob from the PVC.
  • tools/cloud-build/daily-tests/builds/gke-a3-highgpu-onspot.yaml
    • Updated the EXAMPLE_BP variable to reflect the new path of the gke-a3-highgpu.yaml blueprint.
  • tools/cloud-build/daily-tests/builds/gke-a3-highgpu.yaml
    • Updated the EXAMPLE_BP variable to reflect the new path of the gke-a3-highgpu.yaml blueprint.
  • tools/cloud-build/daily-tests/tests/gke-a3-highgpu-onspot.yml
    • Updated the blueprint_yaml path to the new location of the gke-a3-highgpu.yaml blueprint.
    • Removed the network variable and added new variables for static_node_count, instance_type, accelerator_type, and num_gpus.
    • Modified cli_deployment_vars to use the new static_node_count variable and removed network_name.
    • Added test-validation/test-gke-kueue-config.yml to the post_deploy_tests.
  • tools/cloud-build/daily-tests/tests/gke-a3-highgpu.yml
    • Updated the blueprint_yaml path to the new location of the gke-a3-highgpu.yaml blueprint.
    • Removed reservation_affinity and network variables, and added new variables for extended_reservation, static_node_count, instance_type, accelerator_type, and num_gpus.
    • Modified cli_deployment_vars to use the new static_node_count and reservation variables, and removed network_name.
    • Added test-validation/test-gke-kueue-config.yml to the post_deploy_tests.
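
As an illustration of the deployment file described in the changelog, a sketch with the listed variables could look like this. The variable names static_node_count, reservation, and enable_periodic_health_checks appear elsewhere in this summary; authorized_cidr and every value shown are assumptions for illustration and do not reflect the placeholders used in the merged file.

```yaml
vars:
  project_id: my-project              # hypothetical value
  deployment_name: gke-a3-high        # hypothetical value
  region: us-central1                 # hypothetical value
  zone: us-central1-a                 # hypothetical value
  authorized_cidr: 203.0.113.10/32    # documentation-range IP, replace with your own
  static_node_count: 2                # hypothetical value
  reservation: my-reservation         # hypothetical value
  enable_periodic_health_checks: false
```

Assuming the usual Toolkit workflow, such a file would be passed alongside the blueprint, for example with gcluster deploy -d gke-a3-highgpu-deployment.yaml gke-a3-highgpu.yaml.
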
Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request is a significant and valuable modernization of the GKE A3 High blueprint. The restructuring into a dedicated directory, addition of comprehensive documentation, and integration of advanced features like Kueue for Topology Aware Scheduling and Cluster Health Services (CHS) are excellent improvements that align it with the standards of other A* family blueprints.

My review focuses on a few key areas to further enhance the quality of these changes:

  • Security: I've identified a couple of instances where permissions (both for a GCP IAM role and a Kubernetes ClusterRole) are overly broad. My suggestions aim to tighten these permissions by following the principle of least privilege.
  • Efficiency and Reliability: The new CronJob for health checks can be made much more efficient and reliable by using a pre-built container image instead of installing dependencies on every run, aligning with guidelines for complex inline scripts.
  • Maintainability: I've pointed out a minor issue with an outdated API version in the Kueue configuration to ensure future compatibility, and highlighted the need for consistent placeholder formatting as per repository rules.

Overall, this is a strong contribution. Addressing these points will improve the security, performance, and long-term maintainability of this blueprint.
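
To illustrate the least-privilege point in general terms, a narrowly scoped ClusterRole might look like the sketch below. The role name and the specific rules are hypothetical: the permissions the CHS CronJob actually needs, and the rules in chs-permissions.yaml.tftpl, are not shown in this conversation.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chs-node-reader   # hypothetical name
rules:
# Grant read-only access to nodes instead of a wildcard rule.
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
```

The general idea is to enumerate the resources and verbs a workload uses rather than granting "*" on all API groups.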

Comment thread examples/gke-a3-highgpu/gke-a3-highgpu.yaml
Comment thread examples/gke-a3-highgpu/chs-cronjob.yaml.tftpl Outdated
Comment thread examples/gke-a3-highgpu/chs-permissions.yaml.tftpl Outdated
Comment thread examples/gke-a3-highgpu/gke-a3-highgpu-deployment.yaml Outdated
Comment thread examples/gke-a3-highgpu/kueue-configuration.yaml.tftpl Outdated
@shubpal07 shubpal07 force-pushed the shubham/a3-high-upgrade branch 3 times, most recently from 5fd13b1 to e0fc336 Compare March 17, 2026 10:58
@shubpal07 shubpal07 marked this pull request as ready for review March 20, 2026 12:31
@shubpal07 shubpal07 requested review from a team and samskillman as code owners March 20, 2026 12:31
@shubpal07 shubpal07 self-assigned this Mar 20, 2026
@shubpal07 shubpal07 added the release-improvements Added to release notes under the "Improvements" heading. label Mar 20, 2026
@shubpal07

Contributor Author

The integration test is passing for GKE A3 High on Spot.

@shubpal07 shubpal07 changed the title Modernize GKE A3 High blueprint and align integration tests Revamp GKE A3 High blueprint and align integration tests Mar 20, 2026
@shubpal07

Contributor Author

PR tests passed:

  1. PR-test-gke-a3-highgpu
  2. PR-test-gke-a3-highgpu-onspot

Comment thread examples/gke-a3-highgpu/kueue-configuration.yaml.tftpl Outdated
Comment thread examples/gke-a3-highgpu/gke-a3-highgpu-deployment.yaml Outdated
Comment thread examples/gke-a3-highgpu/README.md Outdated
@shubpal07 shubpal07 requested a review from vikramvs-gg March 27, 2026 14:30
vikramvs-gg previously approved these changes Mar 30, 2026
Change-Id: Ie064dcac4ec7c7e23909024c6c4f537275f045f2

Change-Id: If4ba715f16360e098952303c6b5749e663f829d7

Creating NCCL manifests for A3 high

Change-Id: I63de674d221515e23bedce7301e2e0054fc8996f

use kueue version 0.14.4 and apiVersion: kueue.x-k8s.io/v1beta1

Change-Id: Ibe5e2798b0cac34d5165359a1510fead0bcc9aa6

Adding nccl test bug fixes

Change-Id: I77841eab68eb26570363374f28ee925b561955f8
…support

Change-Id: I655e4ad823c9c43e1be8c82f65b5ac76e28a4fe6
Change-Id: I9c02cddd1823123fc20cb97c5938c9b7295fc509
Change-Id: Iee74231bb752d534febc7c0936c550e0841d0d52
Change-Id: I4efab20e43a0546929a3fb15f155647e7618a4ad
Change-Id: I36c7a7526c56c06635ded61e56e27c39ec835a23
Change-Id: I6bd9337d9c3fff5b713d410bb7d575f201e2ae86
Change-Id: I570e96127745f22a8e41e3158ef033315f92d2cf
Change-Id: Iae317eff6c4a2e6706d7d1e82edd263d3bd27a59
Change-Id: Ia1f4814708496bb2cacf9cc64cc9ff9dc90860d8
@shubpal07 shubpal07 force-pushed the shubham/a3-high-upgrade branch from ee8a17b to eb636b1 Compare March 31, 2026 10:28
Comment thread tools/cloud-build/daily-tests/builds/gke-a3-highgpu.yaml
Comment thread tools/cloud-build/daily-tests/tests/gke-a3-highgpu.yml
Change-Id: I9d9a739fa05666ab620236f4121034521776d674
Comment thread modules/management/kubectl-apply/main.tf
@shubpal07

Contributor Author

Integration test updates:

  • Most tests passed
  • The failing tests are unrelated to the current changes; they have been failing because of missing reservations and stockout issues.

@shubpal07 shubpal07 merged commit 5a7be76 into GoogleCloudPlatform:develop Apr 3, 2026
39 of 86 checks passed
simrankaurb pushed a commit to simrankaurb/cluster-toolkit that referenced this pull request Apr 7, 2026
3 participants