Revamp GKE A3 High blueprint and align integration tests by shubpal07 · Pull Request #5246 · GoogleCloudPlatform/cluster-toolkit

shubpal07 · 2026-02-16T18:30:22Z

Align GKE A3 High Blueprint with other A* machines

Summary

This PR refactors the GKE A3 High GPU blueprint to align its architecture with the deployment patterns used by other A3 (Mega/Ultra) and A4 machines. Instead of relying on the node pool module to dynamically fetch upstream manifests, we now natively bundle the necessary GPUDirect manifests (NRI device injector, TCPX installer, NCCL config) and apply them explicitly during the blueprint creation process. We have also created a separate folder, deployment file, and README for A3 High.

Motivation

Previously, the A3 High blueprint relied on the gke-node-pool module to pull GPUDirect manifests directly from public GitHub URLs. In this PR, we bundle the manifests natively for several reasons:

Architectural Alignment: This natively aligns A3 High with other modern accelerator blueprints (like Mega and Ultra), creating a standardized topology for all A* instances where GPUDirect assets are managed as local, version-controlled module assets.
Improved Stability & Predictability: Relying on upstream remote manifests introduced runtime instability. A recent example was integration tests failing due to a PodInitializing crash loop caused by a bug in the upstream enable-nri initContainer. By natively including the manifests, we insulate deployments from upstream breaking changes. See https://docs.cloud.google.com/kubernetes-engine/docs/troubleshooting/gpus#tcpx-daemon-upgrade-failure
Enhanced Customization: Bundling manifests as local .yaml or .tftpl files gives operators direct control to patch configurations and customize components specifically for their workloads.

Changes

Natively Added GPUDirect Manifests (examples/gke-a3-highgpu/): Added explicit, localized, and tested versions of nccl-config.yaml, nri-device-injector.yaml, and nccl-tcpx-installer.yaml.tftpl to the blueprint directory.
Refactored Blueprint Deployment (examples/gke-a3-highgpu/gke-a3-highgpu.yaml):
- Configured the workload_component_install step to apply the bundled GPUDirect manifests to the cluster during the blueprint creation phase.
- Passed install_gpu_direct_manifests: false to the a3_highgpu_pool to disable the legacy remote download behavior.
Module Support for Local Manifests (modules/compute/gke-node-pool): Introduced the install_gpu_direct_manifests variable (defaulting to true for backwards compatibility) so that blueprints can opt out of the legacy URL-based manifest injection.
Built-in Kueue Support: Integrated Kueue configuration into the blueprint to enable automated job scheduling. The deployment now creates Kueue ResourceFlavors, ClusterQueues, and LocalQueues directly on the cluster alongside the GPUDirect infrastructure.
Extensive Documentation Added: Introduced comprehensive guides to assist users with validating their node configurations using NCCL tests:
- Single-Node Test Plan: A guide for verifying NCCL intra-node bandwidth topologies.
- Multi-Node Test Plan: A comprehensive playbook covering inter-node testing and GPUDirect TCPX benchmarks.
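
The blueprint wiring described in the Changes list could look roughly like the fragment below. This is an illustrative sketch, not the merged file: the cluster module ID `gke_cluster`, the `use` lists, the group name, and the use of `modules/management/kubectl-apply` with an `apply_manifests` list for the `workload_component_install` step are assumptions based on how other Cluster Toolkit GKE blueprints are commonly written. Only `a3_highgpu_pool`, `workload_component_install`, `install_gpu_direct_manifests`, `modules/compute/gke-node-pool`, and the three manifest file names are taken from this PR description.

```yaml
# Sketch only: module IDs other than a3_highgpu_pool and
# workload_component_install, and all `use` lists, are hypothetical.
deployment_groups:
- group: primary
  modules:
  - id: a3_highgpu_pool
    source: modules/compute/gke-node-pool
    use: [gke_cluster]  # hypothetical ID of the GKE cluster module
    settings:
      # Opt out of the legacy behavior that downloads GPUDirect
      # manifests from remote URLs (the variable defaults to true).
      install_gpu_direct_manifests: false

  - id: workload_component_install
    source: modules/management/kubectl-apply  # assumed module for this step
    use: [gke_cluster]
    settings:
      # Apply the manifests bundled in the blueprint directory instead.
      # Any template variables needed by the .tftpl file are omitted here
      # because they are not listed in the PR description.
      apply_manifests:
      - source: $(ghpc_stage("./nccl-config.yaml"))
      - source: $(ghpc_stage("./nri-device-injector.yaml"))
      - source: $(ghpc_stage("./nccl-tcpx-installer.yaml.tftpl"))
```

Because the variable defaults to true, existing blueprints that do not set it would keep the previous remote-download behavior.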

Testing & Validation

A3 Spot Integration Tests: Multi-node integration tests on A3 spot instances completed successfully without containerd node initialization errors.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

Fork your PR branch from the Toolkit "develop" branch (not main)
Test all changes with pre-commit in a local branch
Confirm that "make tests" passes all tests
Add or modify unit tests to cover code changes
Ensure that unit test coverage remains above 80%
Update all applicable documentation
Follow Cluster Toolkit Contribution guidelines

gemini-code-assist · 2026-02-16T18:30:44Z

Summary of Changes

Hello @shubpal07, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly modernizes the GKE A3 High blueprint by reorganizing its structure, enhancing its documentation, and integrating advanced features. The changes aim to align the blueprint with the standards of other GKE A* family offerings, providing users with a more robust and feature-rich solution for high-performance ML training. Key additions include support for Kueue's Topology Aware Scheduling for efficient GPU workload management and the implementation of Cluster Health Services for automated GPU diagnostics, ensuring optimal cluster performance and reliability.

Highlights

Blueprint Restructuring: The GKE A3 High blueprint has been moved into its own dedicated directory (examples/gke-a3-highgpu/) to improve organization and modularity, aligning it with other GKE A* family blueprints.
Enhanced Documentation: Comprehensive documentation has been added, including a new README.md within the blueprint's directory, providing detailed instructions for deployment, configuration, and performance verification.
Kueue Topology Aware Scheduling (TAS) Integration: The blueprint now includes support for Kueue, enabling Topology Aware Scheduling for optimized workload placement on A3 High GPU nodes, configured via new Kubernetes resources.
Cluster Health Services (CHS) Implementation: Automated GPU health checks and diagnostic reporting are introduced through Cluster Health Services (CHS), configured as a Kubernetes CronJob with associated permissions and persistent storage.
Updated Blueprint Configuration: The main blueprint file (gke-a3-highgpu.yaml) has been updated with new variables and configurations to support Kueue, CHS, GCS FUSE CSI, and more flexible node pool settings like reservation affinity.
Integration Test Alignment: Existing integration tests (gke-a3-highgpu-onspot.yaml, gke-a3-highgpu.yaml, gke-a3-highgpu-onspot.yml, gke-a3-highgpu.yml) have been updated to reflect the new blueprint path and incorporate new test variables and validation for Kueue configuration.

Changelog

examples/README.md
- Updated the description for the gke-a3-highgpu.yaml blueprint to reflect its new capabilities and refer to the dedicated deployment guide.
- Removed the whatismyip.com reference from the authorized CIDR note.
- Adjusted the link to the gke-a3-highgpu.yaml blueprint to its new directory path.
examples/gke-a3-highgpu/README.md
- Added a new, comprehensive deployment guide for the A3 High GKE cluster, detailing prerequisites, configuration, deployment steps, NCCL performance verification, and cleanup procedures.
examples/gke-a3-highgpu/chs-cronjob.yaml.tftpl
- Added a new Kubernetes CronJob template for Cluster Health Services (CHS), which periodically runs diagnostic checks on GPU nodes and saves results.
examples/gke-a3-highgpu/chs-permissions.yaml.tftpl
- Added new Kubernetes ServiceAccount, ClusterRole, and ClusterRoleBinding templates to grant necessary permissions for the Cluster Health Services (CHS) CronJob.
examples/gke-a3-highgpu/chs-pvc.yaml.tftpl
- Added a new Kubernetes PersistentVolumeClaim template for storing output from the Cluster Health Services (CHS) CronJob.
examples/gke-a3-highgpu/gke-a3-highgpu-deployment.yaml
- Added a new deployment configuration file with default variables for project ID, deployment name, region, zone, authorized CIDR, static node count, reservation, and optional periodic health checks.
examples/gke-a3-highgpu/gke-a3-highgpu.yaml
- Renamed and moved the blueprint file from examples/gke-a3-highgpu.yaml to examples/gke-a3-highgpu/gke-a3-highgpu.yaml.
- Introduced new variables for Kueue configuration path, static node count, reservation, accelerator type, GKE version prefix, and various Cluster Health Services (CHS) settings.
- Added the container.admin IAM role to the node_pool_service_account.
- Enabled GCS FUSE CSI for the GKE cluster.
- Configured the GKE node pool to use specified accelerator types and reservation affinity.
- Integrated kubectl-apply manifests for CHS permissions, PVC, and CronJob, enabling conditional deployment based on enable_periodic_health_checks.
- Updated Kueue installation to include a configurable path for its configuration and dynamic template variables for GPU count and accelerator type.
examples/gke-a3-highgpu/kueue-configuration.yaml.tftpl
- Added new Kubernetes resource templates for Kueue, including Topology, ResourceFlavor, ClusterQueue, and LocalQueue, to enable Topology Aware Scheduling for A3 High GPUs.
examples/gke-a3-highgpu/read-chs-logs-job.yaml
- Added a new Kubernetes Job to facilitate reading the output logs generated by the Cluster Health Services (CHS) CronJob from the PVC.
tools/cloud-build/daily-tests/builds/gke-a3-highgpu-onspot.yaml
- Updated the EXAMPLE_BP variable to reflect the new path of the gke-a3-highgpu.yaml blueprint.
tools/cloud-build/daily-tests/builds/gke-a3-highgpu.yaml
- Updated the EXAMPLE_BP variable to reflect the new path of the gke-a3-highgpu.yaml blueprint.
tools/cloud-build/daily-tests/tests/gke-a3-highgpu-onspot.yml
- Updated the blueprint_yaml path to the new location of the gke-a3-highgpu.yaml blueprint.
- Removed the network variable and added new variables for static_node_count, instance_type, accelerator_type, and num_gpus.
- Modified cli_deployment_vars to use the new static_node_count variable and removed network_name.
- Added test-validation/test-gke-kueue-config.yml to the post_deploy_tests.
tools/cloud-build/daily-tests/tests/gke-a3-highgpu.yml
- Updated the blueprint_yaml path to the new location of the gke-a3-highgpu.yaml blueprint.
- Removed reservation_affinity and network variables, and added new variables for extended_reservation, static_node_count, instance_type, accelerator_type, and num_gpus.
- Modified cli_deployment_vars to use the new static_node_count and reservation variables, and removed network_name.
- Added test-validation/test-gke-kueue-config.yml to the post_deploy_tests.
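
As a rough illustration of the Kueue resources listed for kueue-configuration.yaml.tftpl, a rendered configuration might resemble the sketch below. All resource names, the topology node labels, the namespace, and the quota are assumptions chosen for illustration; the actual template parameterizes the GPU count and accelerator type through template variables that are not reproduced here. The `kueue.x-k8s.io/v1beta1` API version follows the commit note in this PR about Kueue 0.14.4.

```yaml
# Sketch only: names, labels, and quota values are hypothetical.
apiVersion: kueue.x-k8s.io/v1beta1
kind: Topology
metadata:
  name: gke-default            # hypothetical name
spec:
  levels:                      # assumed GKE topology labels, widest first
  - nodeLabel: cloud.google.com/gce-topology-block
  - nodeLabel: cloud.google.com/gce-topology-subblock
  - nodeLabel: cloud.google.com/gce-topology-host
  - nodeLabel: kubernetes.io/hostname
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a3-high-flavor         # hypothetical name
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-h100-80gb
  topologyName: gke-default    # ties the flavor to the Topology above
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: a3-high-cluster-queue  # hypothetical name
spec:
  namespaceSelector: {}        # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a3-high-flavor
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16       # e.g. 2 nodes x 8 GPUs; illustrative only
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: a3-high-local-queue    # hypothetical name
  namespace: default
spec:
  clusterQueue: a3-high-cluster-queue
```

In a setup like this, a workload would typically opt in by carrying the `kueue.x-k8s.io/queue-name` label set to the LocalQueue name.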

Activity

No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request is a significant and valuable modernization of the GKE A3 High blueprint. The restructuring into a dedicated directory, addition of comprehensive documentation, and integration of advanced features like Kueue for Topology Aware Scheduling and Cluster Health Services (CHS) are excellent improvements that align it with the standards of other A* family blueprints.

My review focuses on a few key areas to further enhance the quality of these changes:

Security: I've identified a couple of instances where permissions (both for a GCP IAM role and a Kubernetes ClusterRole) are overly broad. My suggestions aim to tighten these permissions by following the principle of least privilege.
Efficiency and Reliability: The new CronJob for health checks can be made much more efficient and reliable by using a pre-built container image instead of installing dependencies on every run, aligning with guidelines for complex inline scripts.
Maintainability: I've pointed out a minor issue with an outdated API version in the Kueue configuration to ensure future compatibility, and highlighted the need for consistent placeholder formatting as per repository rules.

Overall, this is a strong contribution. Addressing these points will improve the security, performance, and long-term maintainability of this blueprint.

shubpal07 · 2026-03-20T13:22:58Z

Integration test passing for GKE A3 high onspot

shubpal07 · 2026-03-20T14:56:29Z

PR tests passed:

Change-Id: Ie064dcac4ec7c7e23909024c6c4f537275f045f2
Change-Id: If4ba715f16360e098952303c6b5749e663f829d7
Creating NCCL manifests for A3 high
Change-Id: I63de674d221515e23bedce7301e2e0054fc8996f
use kueue version 0.14.4 and apiVersion: kueue.x-k8s.io/v1beta1
Change-Id: Ibe5e2798b0cac34d5165359a1510fead0bcc9aa6
Adding nccl test bug fixes
Change-Id: I77841eab68eb26570363374f28ee925b561955f8

shubpal07 · 2026-04-02T14:29:17Z

Integration test updates:

Most tests passed
The failing tests are unrelated to the current changes; they have been failing because the reservation is not present and because of stockout issues.

gemini-code-assist Bot reviewed Feb 16, 2026

shubpal07 force-pushed the shubham/a3-high-upgrade branch 3 times, most recently from 5fd13b1 to e0fc336 on March 17, 2026 10:58

shubpal07 marked this pull request as ready for review March 20, 2026 12:31

shubpal07 requested review from a team and samskillman as code owners March 20, 2026 12:31

shubpal07 self-assigned this Mar 20, 2026

shubpal07 added the release-improvements label (Added to release notes under the "Improvements" heading.) Mar 20, 2026

shubpal07 requested review from SwarnaBharathiMantena, agrawalkhushi18 and vikramvs-gg March 20, 2026 13:30

shubpal07 changed the title ~~Modernize GKE A3 High blueprint and align integration tests~~ Revamp GKE A3 High blueprint and align integration tests Mar 20, 2026

shubpal07 mentioned this pull request Mar 24, 2026

Migrate kubectl_apply_manifest module to helm #5282

Merged

vikramvs-gg reviewed Mar 25, 2026

Comment thread examples/gke-a3-highgpu/kueue-configuration.yaml.tftpl Outdated

vikramvs-gg reviewed Mar 25, 2026

Comment thread examples/gke-a3-highgpu/gke-a3-highgpu-deployment.yaml Outdated

vikramvs-gg reviewed Mar 25, 2026

Comment thread tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-gke-a3-high.yml

vikramvs-gg reviewed Mar 25, 2026

Comment thread examples/gke-a3-highgpu/README.md Outdated

shubpal07 requested a review from vikramvs-gg March 27, 2026 14:30

vikramvs-gg previously approved these changes Mar 30, 2026

shubpal07 added 7 commits March 31, 2026 10:27

Adding single node nccl test plan for a3 high and fast socket plugin …

e5a37b5

…support Change-Id: I655e4ad823c9c43e1be8c82f65b5ac76e28a4fe6

nccl config changes

8c04b02

Change-Id: I9c02cddd1823123fc20cb97c5938c9b7295fc509

Multi-node nccl test manifests

a9a0eee

Adding Muti and single node NCCL test plans

37e5703

Remove fast socket changes

16453f8

Change-Id: Iee74231bb752d534febc7c0936c550e0841d0d52

pre-commit changes

b1b6d7e

Change-Id: I4efab20e43a0546929a3fb15f155647e7618a4ad

shubpal07 added 6 commits March 31, 2026 10:27

integration test changes

92a6fce

Change-Id: I36c7a7526c56c06635ded61e56e27c39ec835a23

Add var for controlling GPUDrircet injections

add1d24

Change-Id: I6bd9337d9c3fff5b713d410bb7d575f201e2ae86

exclude A3 high

2e13d06

Remove Single Node Test Plan

b6e2e81

Change-Id: I570e96127745f22a8e41e3158ef033315f92d2cf

Adding topology in Kueue config

c7f0773

Change-Id: Iae317eff6c4a2e6706d7d1e82edd263d3bd27a59

Adding readme nits

eb636b1

Change-Id: Ia1f4814708496bb2cacf9cc64cc9ff9dc90860d8

shubpal07 force-pushed the shubham/a3-high-upgrade branch from ee8a17b to eb636b1 on March 31, 2026 10:28

agrawalkhushi18 reviewed Apr 1, 2026

Comment thread tools/cloud-build/daily-tests/builds/gke-a3-highgpu.yaml

agrawalkhushi18 reviewed Apr 1, 2026

Comment thread tools/cloud-build/daily-tests/tests/gke-a3-highgpu.yml

Handling Kueue config template vars null case

19ee3fc

Change-Id: I9d9a739fa05666ab620236f4121034521776d674

shubpal07 dismissed vikramvs-gg’s stale review via 19ee3fc April 2, 2026 06:27

shubpal07 requested review from agrawalkhushi18 and vikramvs-gg April 2, 2026 06:31

shubpal07 commented Apr 2, 2026

Comment thread modules/management/kubectl-apply/main.tf

agrawalkhushi18 approved these changes Apr 2, 2026

vikramvs-gg approved these changes Apr 3, 2026

shubpal07 merged commit 5a7be76 into GoogleCloudPlatform:develop Apr 3, 2026
39 of 86 checks passed

simrankaurb pushed a commit to simrankaurb/cluster-toolkit that referenced this pull request Apr 7, 2026

Revamp GKE A3 High blueprint and align integration tests (GoogleCloud…

43ac117

…Platform#5246)

shubpal07 merged 14 commits into GoogleCloudPlatform:develop from shubpal07:shubham/a3-high-upgrade

3 participants
