DWS Flex start support in TPU 7x and v6e #5111

Merged
shubpal07 merged 1 commit into GoogleCloudPlatform:develop from shubpal07:shubham/dws-tpu
Jan 28, 2026

@shubpal07 shubpal07 commented Jan 16, 2026

This PR enables TPU v6e and 7x support for GKE DWS Flex Start by introducing a dedicated blueprint and updating the gke-node-pool module. By exposing per-zone autoscaling variables, we satisfy API requirements for fixed-size TPU slices. This allows clusters to start with zero TPU nodes and dynamically scale up on demand, optimizing resource utilization and cost for large-scale AI/ML training workloads.
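A minimal sketch of the scale-from-zero idea described above, assuming the per-zone maximum must equal the number of nodes in one fixed-size slice and the minimum is zero. The helper name `flex_start_autoscaling`, the `"AxB"` topology string format, and the `chips_per_node` parameter are illustrative assumptions and are not taken from the gke-node-pool module; only the two `autoscaling_*_node_count` names appear in this PR.

```python
import math


def flex_start_autoscaling(topology: str, chips_per_node: int) -> dict:
    """Return per-zone autoscaling bounds for one fixed-size TPU slice.

    Hypothetical helper: assumes the slice node count is the product of
    the topology dimensions divided by the chips attached to each node.
    """
    # "4x4" -> [4, 4]; "2x2x4" -> [2, 2, 4]
    dims = [int(d) for d in topology.lower().split("x")]
    total_chips = math.prod(dims)
    if chips_per_node <= 0 or total_chips % chips_per_node != 0:
        raise ValueError("topology must divide evenly into nodes")
    slice_nodes = total_chips // chips_per_node
    return {
        # Start with no TPU nodes; nodes appear only when a workload asks.
        "autoscaling_min_node_count": 0,
        # Assumed requirement: the maximum matches the full slice size.
        "autoscaling_max_node_count": slice_nodes,
    }
```

Under these assumptions, a `"4x4"` topology with 4 chips per node yields a minimum of 0 and a maximum of 4 nodes.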

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist
Contributor

Summary of Changes

Hello @shubpal07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates DWS Flex Start functionality for TPU v6e within GKE, allowing for efficient, on-demand scaling of TPU resources. It achieves this by introducing a new blueprint specifically tailored for TPU v6e Flex Start and by enhancing the existing gke-node-pool module to support per-zone autoscaling configurations. The primary goal is to provide a cost-effective solution for AI/ML workloads by enabling TPU node pools to scale from zero nodes based on demand.

Highlights

  • TPU v6e Flex Start Support: Introduced comprehensive support for GKE Dynamic Workload Scheduler (DWS) Flex Start with TPU v6e, enabling dynamic scaling of TPU node pools.
  • New Blueprint for TPU v6e: Added a dedicated blueprint (gke-tpu-v6e-deployment.yaml, gke-tpu-v6e.yaml) to provision GKE clusters with TPU v6e and DWS Flex Start capabilities.
  • Dynamic Scaling from Zero Nodes: Enabled clusters to start with zero TPU nodes and dynamically scale up on demand, optimizing resource utilization and cost for large-scale AI/ML training workloads.
  • GKE Node Pool Module Update: Modified the gke-node-pool module to expose per-zone autoscaling variables (autoscaling_min_node_count, autoscaling_max_node_count) to satisfy API requirements for fixed-size TPU slices.
  • Enhanced Documentation: Updated documentation to guide users on creating and testing TPU v6e Flex clusters, including steps for enabling Flex Start and observing dynamic scaling behavior.

@shubpal07 shubpal07 self-assigned this Jan 16, 2026
@shubpal07 shubpal07 added the release-improvements label (Added to release notes under the "Improvements" heading.) Jan 16, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully enables DWS Flex Start support for TPU v6e by introducing a new example blueprint and updating the gke-node-pool module. The changes are well-structured and align with the project's conventions.

My review includes a few suggestions to improve documentation clarity and fix a potential bug in the autoscaling logic within the gke-node-pool module. Specifically, the logic for handling mutual exclusivity between per-zone and total autoscaling settings could lead to an invalid configuration if not all per-zone variables are set.
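The mutual-exclusivity concern can be illustrated with a short sketch. This is not the module's Terraform logic; it is a Python illustration under stated assumptions. The parameters `autoscaling_min_node_count` and `autoscaling_max_node_count` are named in this PR, while `total_min_nodes`, `total_max_nodes`, and the function `resolve_autoscaling` are hypothetical names chosen for the example.

```python
from typing import Optional


def resolve_autoscaling(
    autoscaling_min_node_count: Optional[int] = None,
    autoscaling_max_node_count: Optional[int] = None,
    total_min_nodes: Optional[int] = None,  # hypothetical name
    total_max_nodes: Optional[int] = None,  # hypothetical name
) -> dict:
    """Pick exactly one autoscaling mode or reject the configuration."""
    per_zone = [autoscaling_min_node_count, autoscaling_max_node_count]
    total = [total_min_nodes, total_max_nodes]
    per_zone_set = [v is not None for v in per_zone]
    total_set = [v is not None for v in total]

    # The two families of settings must not be mixed.
    if any(per_zone_set) and any(total_set):
        raise ValueError("per-zone and total limits are mutually exclusive")
    # A half-specified pair is the case the review warns about.
    if any(per_zone_set) and not all(per_zone_set):
        raise ValueError("set both per-zone limits or neither")
    if any(total_set) and not all(total_set):
        raise ValueError("set both total limits or neither")

    if all(per_zone_set):
        if per_zone[0] > per_zone[1]:
            raise ValueError("per-zone min exceeds max")
        return {"mode": "per_zone", "min": per_zone[0], "max": per_zone[1]}
    if all(total_set):
        if total[0] > total[1]:
            raise ValueError("total min exceeds max")
        return {"mode": "total", "min": total[0], "max": total[1]}
    return {"mode": "unset"}
```

Rejecting a half-specified pair up front avoids emitting a configuration in which one mode is only partly defined.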

Additionally, as per the repository's style guide (line 33), please ensure that the new example blueprint (gke-tpu-v6e) is added to the index in examples/README.md.

@shubpal07
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively enables DWS Flex Start support for TPU v6e. The changes are well-structured, with a new dedicated blueprint, necessary module updates, and clear documentation. I have included a few medium-severity suggestions to improve naming consistency and documentation links.

@shubpal07 shubpal07 changed the title from "DWS Flex start support in TPU v6e" to "DWS Flex start support in TPU 7x and v6e" Jan 19, 2026
@shubpal07
Contributor Author

Added a new commit to support DWS Flex Start on TPU 7x, with example blueprints and docs.

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

LGTM

@shubpal07 shubpal07 added the do-not-merge label (Block merging of this PR) Jan 20, 2026
Change-Id: I879423f59924badb4d28b4fb892e4701a4281364

Change-Id: Ic16fcbb5b0ed510960c97f80ac916e331026234e

Change-Id: I436f5bb6cee13efd083965493d56dd12b03fbaca

Change-Id: Id77229ac117ff851539372c1477e5665527bde16

Change-Id: Ifa27a86ad463a8d29ebdd927010c1dc0abbf644d

Change-Id: I87dc2a3b2bc8d813496a85789c132096ef5d070c

Change-Id: I1c1cc689b0a46d3376682877b65ed1283c05c5fe

Change-Id: I72773411f5e569e3562c722eaa0bdbad7b1e3fcd

Change-Id: I46912057ac6fb59dc1f106eaf01cc73db7a97432

Change-Id: Iada75244c08618606f84769cf3c29186d6ad1b87
@shubpal07 shubpal07 enabled auto-merge January 28, 2026 07:22
@shubpal07 shubpal07 merged commit 6602621 into GoogleCloudPlatform:develop Jan 28, 2026
9 of 72 checks passed
@shubpal07 shubpal07 removed the do-not-merge label (Block merging of this PR) Jan 29, 2026
Labels

release-improvements Added to release notes under the "Improvements" heading.

3 participants