Fix A3 HighGPU test by pinning GKE version to 1.33 to resolve COS incompatibility #5673

Merged
kadupoornima merged 1 commit into GoogleCloudPlatform:develop from kadupoornima:a3h on May 18, 2026

@kadupoornima (Contributor) commented May 15, 2026

Context / Issue:
The gke-a3-highgpu-onspot integration tests have been failing during the infrastructure provisioning phase. The root cause is an incompatibility with the Container-Optimized OS (COS) version bundled in recent GKE versions: GKE 1.35.x ships COS 125, but the a3-highgpu-8g machine type currently supports only COS 121 or lower.

Changes in this PR:

  • Pinned GKE Version: Updated the version_prefix in the gke-a3-highgpu blueprint to "1.33." to ensure the node pool uses a compatible COS version (<= 121).
  • Release Channel Override: Added configuration to manage the release_channel (e.g., setting it to UNSPECIFIED or STABLE). This prevents GKE from ignoring the 1.33 prefix and overriding it with the newer 1.35 default target from the Rapid/Regular channels.
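
The effect of a version prefix such as "1.33." can be illustrated with a minimal Python sketch. It is not the toolkit's or GKE's actual implementation: the helper names `pick_version` and `_version_key` and the version strings in `AVAILABLE` are hypothetical, and the sketch only assumes that the newest available version starting with the prefix is the one selected.

```python
from typing import Optional, Sequence


def _version_key(version: str) -> tuple:
    """Turn a string like '1.33.5-gke.1000' into a sortable tuple of ints."""
    core, _, build = version.partition("-gke.")
    return tuple(int(part) for part in core.split(".")) + (int(build or 0),)


def pick_version(available: Sequence[str], prefix: str) -> Optional[str]:
    """Return the newest version that starts with `prefix`, or None."""
    matches = [v for v in available if v.startswith(prefix)]
    return max(matches, key=_version_key, default=None)


# Hypothetical version list, for illustration only.
AVAILABLE = [
    "1.32.9-gke.1000",
    "1.33.4-gke.1200",
    "1.33.5-gke.1000",
    "1.35.1-gke.1500",
]

print(pick_version(AVAILABLE, "1.33."))  # 1.33.5-gke.1000
```

The trailing dot in the prefix matters in this sketch: "1.3" would also match "1.33.x" and "1.35.x", while "1.33." matches only the 1.33 minor version.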

@kadupoornima requested a review from a team as a code owner May 15, 2026 17:46
@kadupoornima added the release-version-updates label May 15, 2026
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the GKE configuration for the a3-highgpu example. It aligns the cluster with a newer version, enforces a regular release channel, and introduces specific maintenance exclusions alongside enabling auto-upgrades to ensure better lifecycle management of the node pool.

Highlights

  • GKE Version Update: Updated the GKE version prefix from 1.32 to 1.33 in the configuration.
  • Maintenance and Upgrade Policy: Configured the release channel to REGULAR, added a long-term maintenance exclusion, and enabled auto-upgrades for the GKE node pool.

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request updates the GKE version prefix to 1.33 and introduces several configuration changes to the A3 high-GPU example, including setting the release channel to REGULAR, enabling auto-upgrades for the node pool, and adding maintenance exclusions. A critical issue was identified in the maintenance_exclusions block, which is missing the mandatory end_time field required by the underlying Terraform provider and GKE API, even when using the UNTIL_END_OF_SUPPORT behavior.
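
The check described in this review can be sketched in a few lines of Python, taking at face value the review's claim that every maintenance exclusion needs an `end_time`. The function name `exclusions_missing_end_time` and the `name` and `start_time` keys are hypothetical; only `end_time` is named in the review, and the sketch does not reflect the provider's real validation logic.

```python
from typing import Iterable, List, Mapping


def exclusions_missing_end_time(
    exclusions: Iterable[Mapping[str, str]],
) -> List[int]:
    """Return the indexes of exclusion entries without a non-empty end_time."""
    return [i for i, entry in enumerate(exclusions) if not entry.get("end_time")]


# Hypothetical entries, for illustration only.
exclusions = [
    {"name": "pin-1-33", "start_time": "2026-05-15T00:00:00Z"},
    {
        "name": "bounded",
        "start_time": "2026-05-15T00:00:00Z",
        "end_time": "2026-11-15T00:00:00Z",
    },
]

print(exclusions_missing_end_time(exclusions))  # [0]
```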

Comment thread examples/gke-a3-highgpu/gke-a3-highgpu.yaml
@kadupoornima changed the title from "gke version fix" to "Fix A3 HighGPU test by pinning GKE version to 1.33 to resolve COS incompatibility" May 15, 2026
@kadupoornima added the release-bugfix label May 15, 2026
@kadupoornima enabled auto-merge (squash) May 15, 2026 17:52

@agrawalkhushi18 (Contributor) left a comment

LGTM!

@kadupoornima merged commit a592e6b into GoogleCloudPlatform:develop May 18, 2026
22 of 91 checks passed
@kadupoornima deleted the a3h branch May 22, 2026 07:59
kadupoornima added a commit to kadupoornima/cluster-toolkit that referenced this pull request May 25, 2026

Labels

  • release-bugfix: Added to release notes under the "Bug fixes" heading.
  • release-version-updates: Added to release notes under the "Version Updates" heading.

2 participants