
NAP support on GKE Clusters (gke-cluster module) #5420

Merged: SwarnaBharathiMantena merged 32 commits into GoogleCloudPlatform:develop from SwarnaBharathiMantena:swarnabm/nap_gke_cluster on May 13, 2026

SwarnaBharathiMantena (Contributor) commented Mar 28, 2026

GKE Node Auto-Provisioning (NAP) Support: Added support for GKE Node Auto-Provisioning (NAP) in the gke-cluster module, enabling dynamic node pool provisioning based on workload requirements.

Terraform Configuration Updates: Updated Terraform variables and main configuration to include cluster_autoscaling settings, allowing for dynamic resource limits and auto-provisioning defaults.

Go Configuration Logic: Implemented Go-based logic to parse machine types and inject internal variables for accelerator configuration, ensuring proper NAP setup.

Testing: Added comprehensive unit tests to validate the new autoscaling configuration logic and accelerator extraction.
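
The machine-type parsing described above might look roughly like the following Go sketch. It is not the code from this PR: the `acceleratorInfo` type, the `machineAccelerators` table, and `resolveAccelerator` are hypothetical names, and the three table entries are a hand-picked sample rather than the contents of pkg/config/machine_mappings.json.

```go
package main

import (
	"errors"
	"strings"
)

// acceleratorInfo is a hypothetical record describing the accelerators
// attached to one node of a given machine type.
type acceleratorInfo struct {
	Type  string // accelerator type, e.g. "nvidia-l4"
	Count int    // accelerators per node
}

// machineAccelerators is an illustrative sample, not the PR's mapping file.
// Keys are machine types; values are accelerator details. Keeping the two
// apart avoids passing a machine type where an accelerator type is expected.
var machineAccelerators = map[string]acceleratorInfo{
	"a2-highgpu-1g": {Type: "nvidia-tesla-a100", Count: 1},
	"a3-highgpu-8g": {Type: "nvidia-h100-80gb", Count: 8},
	"g2-standard-4": {Type: "nvidia-l4", Count: 1},
}

// resolveAccelerator looks up the accelerator for machineType.
// The boolean is false when the machine type has no entry in the table;
// an error is returned only for empty input.
func resolveAccelerator(machineType string) (acceleratorInfo, bool, error) {
	mt := strings.ToLower(strings.TrimSpace(machineType))
	if mt == "" {
		return acceleratorInfo{}, false, errors.New("machine type must not be empty")
	}
	info, ok := machineAccelerators[mt]
	return info, ok, nil
}
```

With this sample table, `resolveAccelerator("a3-highgpu-8g")` yields the `nvidia-h100-80gb` entry with a count of 8, while an unlisted type such as `n2-standard-4` reports that no accelerator is known.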

SwarnaBharathiMantena added the release-module-improvements label (added to release notes under the "Module Improvements" heading) Mar 28, 2026

gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for GKE Node Auto-Provisioning (NAP) within the gke-cluster module. It enhances the cluster configuration by allowing users to define autoscaling limits and defaults, which are then processed and validated through new Go-based configuration logic. This change enables more flexible and efficient resource management for GKE clusters by automatically adjusting node pools based on workload demands.

Highlights

  • GKE Node Auto-Provisioning (NAP) Support: Added support for GKE Node Auto-Provisioning (NAP) in the gke-cluster module, allowing for dynamic node pool provisioning based on workload requirements.
  • Terraform Configuration Updates: Updated Terraform variables and main configuration to include cluster_autoscaling settings, enabling dynamic resource limits and auto-provisioning defaults.
  • Go Configuration Logic: Implemented Go-based logic to parse machine types and inject internal variables for accelerator configuration, ensuring proper NAP setup.
  • Testing: Added comprehensive unit tests to validate the new autoscaling configuration logic and accelerator extraction.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request implements GKE Node Auto-Provisioning (NAP) and Cluster Autoscaling by adding Terraform configurations, input variables, and Go logic to process accelerator limits. Feedback highlights a critical logic error where machine types are incorrectly mapped to accelerator types, which will cause Terraform failures and requires updates to the unit tests. Additionally, it is recommended to replace hardcoded resource limits and disk settings with configurable variables to improve the module's flexibility and maintainability.

Comment thread modules/scheduler/gke-cluster/main.tf Outdated
Comment thread pkg/config/autoscaling.go Outdated
Comment thread pkg/config/autoscaling_test.go Outdated
Comment thread modules/scheduler/gke-cluster/main.tf Outdated
Comment thread modules/scheduler/gke-cluster/main.tf Outdated

SwarnaBharathiMantena (Contributor, Author) commented

/gemini summary

gemini-code-assist (Contributor) commented

Summary of Changes

This pull request introduces support for GKE Node Auto-Provisioning (NAP) within the gke-cluster module. It enhances the cluster configuration by allowing users to define autoscaling limits and defaults, which are then processed and validated through new Go-based configuration logic. This change enables more flexible and efficient resource management for GKE clusters by automatically adjusting node pools based on workload demands.

Highlights

  • GKE Node Auto-Provisioning (NAP) Support: Added support for GKE Node Auto-Provisioning (NAP) in the gke-cluster module, enabling dynamic node pool provisioning based on workload requirements.
  • Terraform Configuration Updates: Updated Terraform variables and main configuration to include cluster_autoscaling settings, allowing for dynamic resource limits and auto-provisioning defaults.
  • Go Configuration Logic: Implemented Go-based logic to parse machine types and inject internal variables for accelerator configuration, ensuring proper NAP setup.
  • Testing: Added comprehensive unit tests to validate the new autoscaling configuration logic and accelerator extraction.

Activity

  • Gemini Code Assist provided an initial summary and identified several critical and high-priority issues regarding resource type mapping and test validation.
  • The author requested a summary of the pull request.
  • Review comments were generated highlighting necessary fixes for accelerator type mapping, hardcoded values, and test cases.

SwarnaBharathiMantena (Contributor, Author) commented

/gemini review

SwarnaBharathiMantena added the release-key-new-features label (added to release notes under the "Key New Features" heading) and removed the release-module-improvements label Apr 23, 2026

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces support for GKE Node Auto-Provisioning (NAP) and Cluster Autoscaling. Key changes include a new cluster_autoscaling variable in the GKE module, dynamic resource limit configuration, and Go-based logic to resolve accelerator details from machine types. Feedback focuses on the hardcoding of disk parameters (type and size) in the auto-provisioning defaults, which could cause compatibility issues, and a restrictive default for accelerator scaling limits that may hinder cluster expansion.
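
One way to address the hardcoded disk parameters noted above is to treat them as optional settings that fall back to a default only when the user leaves them unset. The Go sketch below is illustrative only: the `autoProvisioningDefaults` type, its field names, and the fallback values `pd-balanced` and 100 GB are assumptions, not values taken from this PR.

```go
package main

// autoProvisioningDefaults is a hypothetical subset of user-facing settings.
// Zero values mean "not set by the user".
type autoProvisioningDefaults struct {
	DiskType   string
	DiskSizeGB int
}

// Assumed fallbacks for illustration; a real module would document its own.
const (
	fallbackDiskType   = "pd-balanced"
	fallbackDiskSizeGB = 100
)

// withDefaults fills in only the fields the user left unset, so an explicit
// disk type or size always takes precedence over the fallback.
func withDefaults(d autoProvisioningDefaults) autoProvisioningDefaults {
	if d.DiskType == "" {
		d.DiskType = fallbackDiskType
	}
	if d.DiskSizeGB <= 0 {
		d.DiskSizeGB = fallbackDiskSizeGB
	}
	return d
}
```

Under this approach a user whose machine family does not support the fallback disk type can override it without editing the module.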

Comment thread modules/scheduler/gke-cluster/main.tf Outdated
Comment thread modules/scheduler/gke-cluster/main.tf Outdated
Comment thread pkg/config/autoscaling.go Outdated
SwarnaBharathiMantena marked this pull request as ready for review April 27, 2026 07:35
SwarnaBharathiMantena requested a review from a team as a code owner April 27, 2026 07:35
Comment thread pkg/config/autoscaling.go Outdated
Comment thread pkg/config/autoscaling.go
Comment thread pkg/config/autoscaling.go Outdated
Comment thread pkg/config/autoscaling.go Outdated
Comment thread pkg/config/autoscaling.go Outdated
Comment thread pkg/config/autoscaling.go Outdated

SwarnaBharathiMantena (Contributor, Author) commented

/gemini review

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces support for GKE Node Auto-Provisioning (NAP) and Cluster Autoscaling by adding a new cluster_autoscaling input variable and implementing logic to dynamically configure resource limits and defaults. The changes include updates to the GKE cluster Terraform module and new Go functionality to resolve accelerator details from machine types. Feedback focuses on ensuring correct GKE-specific resource types for TPUs, avoiding duplicate resource limits for CPU and memory, failing fast on invalid accelerator configurations, and improving Go code maintainability by reusing existing configuration structures.
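
Two of the points above, avoiding duplicate CPU and memory limits and failing fast on invalid accelerator configurations, can be illustrated with a small Go sketch. Everything here is an assumption made for illustration: the `resourceLimit` type and the `mergeLimits` helper are hypothetical and are not taken from pkg/config/autoscaling.go. Only the resource type, minimum, and maximum shape follows the `resource_limits` block of the Terraform `google_container_cluster` resource.

```go
package main

import (
	"fmt"
	"sort"
)

// resourceLimit is a hypothetical mirror of one resource_limits entry.
type resourceLimit struct {
	ResourceType string
	Minimum      int
	Maximum      int
}

// mergeLimits combines user-supplied limits with derived ones (for example
// accelerator limits computed from machine types). Entries are keyed by
// resource type, so "cpu" or "memory" can never appear twice; a user entry
// replaces a derived entry of the same type. Invalid entries abort with an
// error instead of being silently dropped.
func mergeLimits(user, derived []resourceLimit) ([]resourceLimit, error) {
	byType := make(map[string]resourceLimit)
	// Derived entries go first so that user entries overwrite them.
	for _, group := range [][]resourceLimit{derived, user} {
		for _, l := range group {
			if l.ResourceType == "" {
				return nil, fmt.Errorf("resource limit has an empty resource type")
			}
			if l.Minimum < 0 || l.Maximum < l.Minimum {
				return nil, fmt.Errorf("invalid limit for %q: minimum=%d maximum=%d",
					l.ResourceType, l.Minimum, l.Maximum)
			}
			byType[l.ResourceType] = l
		}
	}
	out := make([]resourceLimit, 0, len(byType))
	for _, l := range byType {
		out = append(out, l)
	}
	// Sort by resource type so the result is deterministic.
	sort.Slice(out, func(i, j int) bool { return out[i].ResourceType < out[j].ResourceType })
	return out, nil
}
```

Returning an error on the first bad entry surfaces the problem at configuration time rather than leaving it for Terraform to reject during deployment.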

Comment thread pkg/config/autoscaling.go
Comment thread modules/scheduler/gke-cluster/main.tf Outdated
Comment thread pkg/config/autoscaling.go
Comment thread pkg/config/autoscaling.go Outdated
SwarnaBharathiMantena force-pushed the swarnabm/nap_gke_cluster branch from 179e8ce to 44f608b May 6, 2026 08:32
Comment thread pkg/config/autoscaling.go Outdated

scaliby (Contributor) left a comment

LGTM!

SwarnaBharathiMantena (Contributor, Author) commented

The failing tests are either capacity issues or known issues the team is working on.

Comment thread pkg/config/machine_mappings.json Outdated
Comment thread pkg/config/machine_mappings.json
Comment thread pkg/config/machine_mappings.json
Comment thread pkg/config/machine_mappings.json Outdated
…treamline autoscaling to be optional, and add fractional GPU shorthands
Comment thread modules/scheduler/gke-cluster/README.md Outdated
Comment thread pkg/config/autoscaling.go Outdated
kadupoornima previously approved these changes May 12, 2026
… enabled flag, rename cache, and update v5litepod naming
SwarnaBharathiMantena enabled auto-merge (squash) May 12, 2026 16:34

SwarnaBharathiMantena (Contributor, Author) commented

Test failures:

Reservation does not exist

  1. PR-test-gke-a3-highgpu
  2. PR-test-gke-h4d

Capacity issues

  3. PR-test-gke-a4-onspot

Known issues

  4. PR-test-gke-g4-onspot
  5. PR-test-gke-tpu-7x
  6. PR-test-gke-tpu-v6e-flex
  7. PR-test-gke-a3-highgpu-onspot

Neelabh94 (Contributor) left a comment

Added some nit comments on improvements. They can be addressed in a follow-up PR.

LGTM!

Comment thread modules/internal/tpu-definition/main.tf
Comment thread modules/scheduler/gke-cluster/main.tf
SwarnaBharathiMantena merged commit 6a44cf8 into GoogleCloudPlatform:develop May 13, 2026
31 of 82 checks passed
kadupoornima pushed a commit to kadupoornima/cluster-toolkit that referenced this pull request May 25, 2026

Labels: release-key-new-features (added to release notes under the "Key New Features" heading)

5 participants