
Revamp GKE A3 Mega blueprint and align integration tests #5483

Merged
agrawalkhushi18 merged 4 commits into GoogleCloudPlatform:develop from agrawalkhushi18:a3m-revamp
Apr 16, 2026

@agrawalkhushi18

Summary

This PR bundles the GPUDirect manifests directly in the repository and refactors the A3 Mega blueprint to improve stability and predictability and to align it with the blueprints for other A* accelerator instances.

Motivation

  • Architectural Alignment: Standardizes A3 Mega with other modern accelerator blueprints (A3 High, Ultra), where GPUDirect assets are version-controlled locally.
  • Improved Stability: Removes the runtime dependency on upstream remote manifests, insulating deployments from breaking upstream changes.
  • Enhanced Customization: Gives direct control to patch configurations and customize components for specific workloads.

Changes

  • Added GPUDirect Manifests (examples/gke-a3-megagpu/): nccl-tcpxo-installer.yaml.tftpl, nri-device-injector.yaml, and nccl-test-latest.yaml
  • Refactored Blueprint (gke-a3-megagpu.yaml): Updated workload component install step to apply GPUDirect manifests directly during deployment; set install_gpu_direct_manifests: false on node pool
  • Enhanced Documentation: Comprehensive README with deployment instructions, prerequisites, and NCCL validation procedures
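
For reviewers who want a picture of the blueprint change without opening the diff, the fragment below is a rough sketch of what the refactor might look like. It is not copied from gke-a3-megagpu.yaml: the module ids, the `use` references, and the relative manifest paths are assumptions for illustration, and any template variables the `.tftpl` file needs are omitted. Only `install_gpu_direct_manifests: false` and the manifest file names come from this PR.

```yaml
# Illustrative fragment only; ids and paths are assumptions, not the merged blueprint.
deployment_groups:
- group: primary
  modules:
  - id: a3_megagpu_pool                # hypothetical module id
    source: modules/compute/gke-node-pool
    use: [gke_cluster]                 # hypothetical id of the cluster module (not shown)
    settings:
      machine_type: a3-megagpu-8g
      # Setting named in this PR: the node pool no longer installs GPUDirect manifests itself.
      install_gpu_direct_manifests: false

  - id: workload_component_install     # hypothetical module id
    source: modules/management/kubectl-apply
    use: [gke_cluster]
    settings:
      apply_manifests:
      # Locally bundled manifests added by this PR (paths assumed relative to the blueprint).
      - source: ./nccl-tcpxo-installer.yaml.tftpl
      - source: ./nri-device-injector.yaml
```

The intent of this shape is that the manifests applied to the cluster are the ones checked into the repository, so a change upstream cannot alter a deployment without a corresponding commit here.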

Testing

A3M Spot Integration Tests: Validated with improved test automation including explicit DaemonSet verification and pod readiness checks.
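
As a hedged illustration of what "explicit DaemonSet verification and pod readiness checks" could look like in an Ansible-based test, the tasks below wait on a DaemonSet rollout and then on pod readiness using standard `kubectl` subcommands. The DaemonSet name, namespace, label selector, and timeouts are assumptions and may differ from what the integration test in this PR actually uses.

```yaml
# Sketch of readiness checks as Ansible tasks.
# Resource name, namespace, label, and timeouts are assumed, not taken from the PR.
- name: Wait for the TCPXO installer DaemonSet to finish rolling out
  ansible.builtin.command:
    cmd: kubectl rollout status daemonset/nccl-tcpxo-installer -n kube-system --timeout=600s
  changed_when: false   # read-only check, never reports a change

- name: Wait for the installer pods to report Ready
  ansible.builtin.command:
    cmd: kubectl wait --for=condition=Ready pod -l name=nccl-tcpxo-installer -n kube-system --timeout=600s
  changed_when: false
```

Both commands exit non-zero when the timeout expires, so a test written this way fails at the readiness step instead of later, when the NCCL workload cannot find its dependencies.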

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the GKE A3 Mega blueprint to enhance deployment stability and predictability. By transitioning to locally version-controlled GPUDirect manifests and refining the integration test suite, the changes ensure a more robust and customizable environment for A3 accelerator instances.

Highlights

  • Native GPUDirect Manifests: Bundled GPUDirect TCPXO and NRI device injector manifests directly into the repository to improve stability and version control.
  • Blueprint Refactoring: Updated the A3 Mega blueprint to apply GPUDirect manifests during deployment and disabled automatic manifest installation on node pools for better control.
  • Enhanced Integration Testing: Improved A3 Mega spot integration tests with explicit DaemonSet readiness checks and updated cleanup procedures.
  • Documentation Updates: Provided comprehensive deployment and verification instructions for A3 Mega clusters in the updated README.

@agrawalkhushi18 added the release-improvements label (Added to release notes under the "Improvements" heading.) on Apr 11, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive example for deploying A3 Mega GKE clusters, featuring detailed documentation, infrastructure blueprints, and manifests for GPU performance testing. Key additions include the NCCL TCPXO installer and NRI device injector manifests, alongside updates to the validation Ansible playbook to ensure robust testing. Review feedback highlights several opportunities to improve maintainability and portability: specifically, addressing a 'bash-ism' in the installer script that could fail in non-bash environments, removing redundant sudo commands, and converting static manifests into templates to avoid hardcoding accelerator types and software versions.

Comment thread examples/gke-a3-megagpu/nccl-tcpxo-installer.yaml.tftpl
Comment thread examples/gke-a3-megagpu/nri-device-injector.yaml
Comment thread examples/gke-a3-megagpu/nccl-test-latest.yaml
@agrawalkhushi18 agrawalkhushi18 marked this pull request as ready for review April 13, 2026 04:58
@agrawalkhushi18 agrawalkhushi18 requested review from a team and samskillman as code owners April 13, 2026 04:58
Comment thread examples/gke-a3-megagpu/gke-a3-megagpu-deployment.yaml
Comment thread examples/gke-a3-megagpu/gke-a3-megagpu.yaml Outdated
@agrawalkhushi18 agrawalkhushi18 merged commit 2ee5a53 into GoogleCloudPlatform:develop Apr 16, 2026
14 of 80 checks passed

Labels

release-improvements Added to release notes under the "Improvements" heading.

2 participants