Fix A3 HighGPU test by pinning GKE version to 1.33 to resolve COS incompatibility #5673
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the GKE configuration for the a3-highgpu example. It aligns the cluster with a newer version, enforces a regular release channel, and introduces specific maintenance exclusions alongside enabling auto-upgrades to ensure better lifecycle management of the node pool.
Code Review
This pull request updates the GKE version prefix to 1.33 and introduces several configuration changes to the A3 high-GPU example, including setting the release channel to REGULAR, enabling auto-upgrades for the node pool, and adding maintenance exclusions. A critical issue was identified in the maintenance_exclusions block, which is missing the mandatory end_time field required by the underlying Terraform provider and GKE API, even when using the UNTIL_END_OF_SUPPORT behavior.
Merged commit a592e6b into GoogleCloudPlatform:develop
Context / Issue:

The `gke-a3-highgpu-onspot` integration tests have been failing during the infrastructure provisioning phase. The root cause is an incompatibility with the Container-Optimized OS (COS) version bundled in recent GKE versions. Specifically, GKE `1.35.x` versions use COS 125, but the `a3-highgpu-8g` machine type currently only supports COS version 121 or lower.

Changes in this PR:

- Pinned `version_prefix` in the `gke-a3-highgpu` blueprint to `"1.33."` to ensure the node pool uses a compatible COS version (<= 121).
- Updated `release_channel` (e.g., setting it to `UNSPECIFIED` or `STABLE`). This prevents GKE from ignoring the `1.33` prefix and overriding it with the newer `1.35` default target from the Rapid/Regular channels.
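
For reviewers who want to picture the change, the two settings described above might look roughly like the following in the blueprint. This is a minimal sketch, not the actual diff: the group name, module `id`, and `source` path are assumptions for illustration, and `UNSPECIFIED` is just one of the channel values the description mentions. Only `version_prefix` and `release_channel` are named in this PR.

```yaml
# Hypothetical excerpt of the gke-a3-highgpu blueprint.
# Group name, module id, and source path are assumptions, not copied from the PR diff.
deployment_groups:
  - group: primary                        # hypothetical group name
    modules:
      - id: a3_highgpu_cluster            # hypothetical module id
        source: modules/scheduler/gke-cluster
        settings:
          # Select a 1.33.x GKE version, which the PR description says
          # ships a COS version (<= 121) that a3-highgpu-8g supports.
          version_prefix: "1.33."
          # One of the values mentioned in the PR description; chosen so the
          # channel default (1.35) does not override the pinned prefix.
          release_channel: UNSPECIFIED
```

Whichever channel value is used, the point from the description is the same: the channel's default target must not be allowed to supersede the `1.33.` prefix.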