Adding GKE TPU DWS Queued Provisioning support for v6e and 7x #5218

Merged
shubpal07 merged 2 commits into GoogleCloudPlatform:develop from shubpal07:shubham/dws-qp-tpu
Feb 12, 2026

@shubpal07 shubpal07 commented Feb 10, 2026

Contributor

This PR implements and standardizes support for GKE TPU Dynamic Workload Scheduler (DWS) Flex Start with Queued Provisioning (QP). It enables queued provisioning for TPU v6e and TPU 7x hardware, ensuring large-scale training jobs only start when the full required topology is secured.

Key Changes

1. New Blueprints and Examples

  • Added dedicated QP blueprints for TPU v6e and TPU 7x under examples/gke-consumption-options/dws-flex-start-queued-provisioning/

  • Created a specialized Kueue template (tpu-dws-queues.yaml.tftpl) to
    manage ProvisioningRequestConfig and AdmissionCheck for TPU
    resources.

  • Included E2E test jobs (JobSets) for both hardware generations with
    correct annotations (maxRunDurationSeconds), tolerations, and node
    selectors.
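
As a rough illustration of the three pieces a custom job needs (annotation, toleration, node selectors), the sketch below assembles a pod-template fragment as a plain Python dict. It is not copied from the test jobs in this PR. The annotation key `provreq.kueue.x-k8s.io/maxRunDurationSeconds`, the toleration on `cloud.google.com/gke-queued`, the TPU selector keys, and the sample values (`tpu-v6e-slice`, `4x4`, 3600 seconds) are assumptions, and the helper name `queued_tpu_pod_template` is hypothetical.

```python
def queued_tpu_pod_template(accelerator, topology, max_run_seconds):
    """Build an illustrative pod-template fragment for a queued TPU job."""
    return {
        "metadata": {
            "annotations": {
                # Assumed Kueue annotation that carries the run-duration limit.
                "provreq.kueue.x-k8s.io/maxRunDurationSeconds": str(max_run_seconds),
            }
        },
        "spec": {
            "nodeSelector": {
                # Label named in the PR description.
                "cloud.google.com/gke-queued": "true",
                # Assumed TPU selector keys and values.
                "cloud.google.com/gke-tpu-accelerator": accelerator,
                "cloud.google.com/gke-tpu-topology": topology,
            },
            "tolerations": [
                {
                    # Assumed to match the taint that GKE manages on queued nodes.
                    "key": "cloud.google.com/gke-queued",
                    "operator": "Equal",
                    "value": "true",
                    "effect": "NoSchedule",
                }
            ],
        },
    }


template = queued_tpu_pod_template("tpu-v6e-slice", "4x4", 3600)
```

The fragment would sit under the replicated job's pod template in a JobSet; the exact nesting depends on the JobSet API version in use.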

2. Core Module Enhancements

  • gke-node-pool:
    • Automated the injection of the cloud.google.com/gke-queued:
      "true" label. While GKE manages the taint, the label is critical
      for nodeSelectors in JobSets to reliably target provisioned
      resources.
    • Updated placement policy preconditions to allow COMPACT
      placement specifically for TPUs when using Queued Provisioning.
  • tpu-definition:
    • Added a kubernetes_label output to centralize TPU metadata
      labeling.
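
The two gke-node-pool changes above can be modelled in a few lines of Python. This is a sketch of the described behaviour, not the module's Terraform: the function names, the merge order, and the assumption that COMPACT placement remains rejected for non-TPU pools under Queued Provisioning are all illustrative.

```python
def merged_kubernetes_labels(user_labels, tpu_labels, enable_queued_provisioning):
    """Merge node labels, adding the gke-queued label when QP is enabled."""
    labels = dict(user_labels)
    labels.update(tpu_labels)  # assumed order: TPU labels override user labels
    if enable_queued_provisioning:
        labels["cloud.google.com/gke-queued"] = "true"
    return labels


def compact_placement_allowed(placement_type, enable_queued_provisioning, is_tpu):
    """Assumed precondition: COMPACT plus QP is accepted only for TPU pools."""
    if placement_type != "COMPACT" or not enable_queued_provisioning:
        return True
    return is_tpu


labels = merged_kubernetes_labels(
    {"team": "ml"},
    {"cloud.google.com/gke-tpu-topology": "4x4"},
    enable_queued_provisioning=True,
)
```

With these inputs, `labels` carries the user label, the TPU topology label, and `cloud.google.com/gke-queued: "true"`, which is the label a JobSet nodeSelector would match.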

3. Documentation

  • Established a modular documentation hierarchy:
    • Parent README: Overview of the DWS QP consumption model.
    • Child READMEs: Hardware-specific guides for v6e and 7x,
      including deployment steps, custom job requirements, and
      scale-up/down verification.

Verification Results

  • E2E Validation: Successfully verified the full lifecycle on live
    TPU v6e and TPU 7x clusters.
    • Flow: Job Submission -> ProvisioningRequest created/accepted ->
      Cluster Autoscaler scale-up (0 -> 4 nodes) -> Job execution ->
      Job completion -> Automatic scale-down to zero.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@shubpal07 shubpal07 requested review from a team and samskillman as code owners February 10, 2026 07:01
@gemini-code-assist

Contributor

Summary of Changes

Hello @shubpal07, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for GKE TPU Dynamic Workload Scheduler (DWS) Flex Start with Queued Provisioning for TPU v6e and 7x hardware. The changes enable large-scale training jobs to initiate only when the full required topology is secured, enhancing resource management and job reliability. This involves adding new examples, refining core GKE modules, standardizing TPU configurations across the repository, and updating documentation to guide users through the new capabilities.

Highlights

  • New Blueprints and Examples: Added dedicated Queued Provisioning (QP) blueprints and examples for GKE TPU v6e and TPU 7x, including a specialized Kueue template for managing ProvisioningRequestConfig and AdmissionCheck for TPU resources. E2E test jobs (JobSets) with correct annotations, tolerations, and node selectors were also included for both hardware generations.
  • Core Module Enhancements: The gke-node-pool module was updated to integrate a new tpu_topology variable for node labeling, automate the injection of the cloud.google.com/gke-queued: 'true' label, and update placement policy preconditions to allow COMPACT placement specifically for TPUs when using Queued Provisioning. The tpu-definition module consolidated topology logic and added a kubernetes_label output for centralized TPU metadata labeling.
  • Repository-wide Standardization: All existing TPU blueprints (Standard, Advanced, and Flex Start) were updated to utilize the new tpu_topology variable, ensuring consistent labeling behavior across all consumption models.
  • Documentation: Established a modular documentation hierarchy, including a parent README for the DWS QP consumption model and child READMEs with hardware-specific guides for v6e and 7x, covering deployment steps, custom job requirements, and scale-up/down verification.

Changelog

  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/README.md
    • Updated documentation formatting for clarity.
    • Added new sections for hardware-specific guides for TPU v6e and 7x.
  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/README.md
    • Added new documentation for deploying GKE clusters with TPU 7x nodes using DWS Queued Provisioning.
  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/gke-tpu-7x-deployment.yaml
    • Added new deployment configuration for TPU 7x with DWS Queued Provisioning.
  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/gke-tpu-7x.yaml
    • Added new blueprint for GKE TPU 7x cluster with DWS Queued Provisioning.
  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-7x/tpu-7x-test-job.yaml
    • Added new sample JobSet for testing TPU 7x DWS Queued Provisioning.
  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-v6e/README.md
    • Added new documentation for deploying GKE clusters with TPU v6e nodes using DWS Queued Provisioning.
  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-v6e/gke-tpu-v6e-deployment.yaml
    • Added new deployment configuration for TPU v6e with DWS Queued Provisioning.
  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-v6e/gke-tpu-v6e.yaml
    • Added new blueprint for GKE TPU v6e cluster with DWS Queued Provisioning.
  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/gke-tpu-v6e/tpu-v6e-test-job.yaml
    • Added new sample JobSet for testing TPU v6e DWS Queued Provisioning.
  • examples/gke-consumption-options/dws-flex-start-queued-provisioning/tpu-dws-queues.yaml.tftpl
    • Added new Kueue template for managing ProvisioningRequestConfig and AdmissionCheck for TPU resources.
  • examples/gke-consumption-options/dws-flex-start/gke-tpu-7x/gke-tpu-7x.yaml
    • Updated to include the tpu_topology variable in the GKE node pool configuration.
  • examples/gke-consumption-options/dws-flex-start/gke-tpu-v6e/gke-tpu-v6e.yaml
    • Updated to include the tpu_topology variable in the GKE node pool configuration.
  • examples/gke-tpu-7x/gke-tpu-7x-advanced.yaml
    • Updated to include the tpu_topology variable in the GKE node pool configuration.
  • examples/gke-tpu-7x/gke-tpu-7x.yaml
    • Updated to include the tpu_topology variable in the GKE node pool configuration.
  • examples/gke-tpu-v6/gke-tpu-v6.yaml
    • Updated to include the tpu_topology variable in the GKE node pool configuration.
  • modules/compute/gke-node-pool/README.md
    • Documented the new tpu_topology input variable.
  • modules/compute/gke-node-pool/main.tf
    • Added a local variable kubernetes_labels to merge various Kubernetes labels, including cloud.google.com/gke-queued when queued provisioning is enabled.
    • Passed the tpu_topology variable to the internal tpu module.
    • Updated the labels attribute of the google_container_node_pool resource to use the merged kubernetes_labels.
    • Refined the placement_policy precondition to allow COMPACT placement when enable_queued_provisioning is true and TPUs are used.
  • modules/compute/gke-node-pool/variables.tf
    • Added the tpu_topology input variable.
  • modules/internal/tpu-definition/README.md
    • Documented the new tpu_topology input variable.
    • Documented the new kubernetes_label output.
  • modules/internal/tpu-definition/main.tf
    • Consolidated the logic for determining tpu_topology from either a direct variable or the placement_policy.
  • modules/internal/tpu-definition/outputs.tf
    • Updated the tpu_topology output to use the consolidated local variable.
    • Added a new kubernetes_label output to provide a map of Kubernetes labels for TPU discovery.
  • modules/internal/tpu-definition/variables.tf
    • Added the tpu_topology input variable.
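
The consolidated topology logic and the new `kubernetes_label` output described in the changelog might look roughly like this in Python. The precedence (direct variable first, then `placement_policy`), the `tpu_topology` field inside the policy, and the label key `cloud.google.com/gke-tpu-topology` are assumptions, and both function names are hypothetical.

```python
def resolve_tpu_topology(tpu_topology, placement_policy):
    """Pick the topology from the direct variable, else from the placement policy."""
    if tpu_topology:
        return tpu_topology
    if placement_policy:
        # Assumed field name inside the placement policy object.
        return placement_policy.get("tpu_topology")
    return None


def tpu_kubernetes_label(topology):
    """Return a label map for TPU discovery, empty when no topology is known."""
    if not topology:
        return {}
    return {"cloud.google.com/gke-tpu-topology": topology}


direct = resolve_tpu_topology("4x4", {"type": "COMPACT", "tpu_topology": "2x2"})
fallback = resolve_tpu_topology(None, {"type": "COMPACT", "tpu_topology": "2x2"})
```

Resolving the value in one place means the module output and the node label cannot drift apart, which appears to be the point of the consolidation.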

Activity

  • The pull request was created by shubpal07, introducing new features and updates.
  • The pull request is currently awaiting review and further activity.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces significant new capabilities by adding support for GKE TPU DWS with Queued Provisioning for v6e and 7x hardware. The changes are extensive, including new blueprints, examples, and documentation, as well as thoughtful refactoring of core modules like gke-node-pool and tpu-definition to standardize TPU configuration and improve maintainability. While the core module changes are excellent, the new documentation and example files have several minor issues, including unresolved template variables, duplicated instructions, and inconsistent resource naming. Addressing these documentation and consistency issues, as detailed in the review comments, will significantly improve the usability and clarity of this new feature.

@shubpal07 shubpal07 self-assigned this Feb 10, 2026
@shubpal07 shubpal07 added the release-key-new-features Added to release notes under the "Key New Features" heading. label Feb 10, 2026

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

As the examples/README.md file displays all existing examples information, I think it helps to highlight these new blueprints here as well: https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples#gke-consumption-options-

Maybe a statement that highlights that this folder includes A3U, TPU v6e, and TPU 7x examples.

@shubpal07

Contributor Author

Babysit test results:
🟢 success PR-test-gke
🟢 success PR-test-gke-a2-highgpu-kueue
🔴 failure PR-test-gke-a2-highgpu-kueue-onspot (try #2)
🟢 success PR-test-gke-a3-highgpu
🟢 success PR-test-gke-a3-highgpu-onspot
🟢 success PR-test-gke-a3-megagpu
🟢 success PR-test-gke-a3-megagpu-onspot (try #2)
🟢 success PR-test-gke-a3-ultragpu-onspot
🟢 success PR-test-gke-a4
🟢 success PR-test-gke-a4-onspot
🔴 failure PR-test-gke-a4x (try #2)
🟢 success PR-test-gke-g4
🟢 success PR-test-gke-h4d
🟢 success PR-test-gke-h4d-onspot
🟢 success PR-test-gke-inactive-reservation
🟢 success PR-test-gke-managed-hyperdisk
🟢 success PR-test-gke-managed-lustre
🟢 success PR-test-gke-storage
🟢 success PR-test-gke-tpu-7x (try #2)
🟢 success PR-test-gke-tpu-v6e
🟢 success PR-test-gke-tpu-v6e-flex
🟢 success PR-test-ml-gke
🟢 success PR-test-ml-gke-e2e
🟢 success PR-test-slurm-gke

@shubpal07

Contributor Author

> As the examples/README.md file displays all existing examples information, I think it helps to highlight these new blueprints here as well: https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples#gke-consumption-options-
>
> Maybe a statement that highlights that this folder includes A3U, TPU v6e, and TPU 7x examples.

Thanks for mentioning this, @SwarnaBharathiMantena. Agreed.
Added the same in a new commit.

…er toolkit

Change-Id: I1f8e2443f5e16b5ceb07ac04c0257164766a2bf2

Change-Id: I4a08edd537023e4a483106382dfe0af1d5e7b51a

Change-Id: I1a7e5b00a8c684e3158e05e9f3b26f06cb29aa0b

Adding example/readme changes

Change-Id: Ie73fa56436d9a8d17cffbdf572c94e3ae1d1eab2

Change-Id: Ide0ae3a0b78753048cd359192eeed18b8479ae37
Change-Id: I5aad03dc9ef9ed455de497aa3df3b1071d3413ea
@shubpal07 shubpal07 merged commit bf7c3e1 into GoogleCloudPlatform:develop Feb 12, 2026
11 of 79 checks passed
kadupoornima pushed a commit to kadupoornima/cluster-toolkit that referenced this pull request Feb 17, 2026
Labels

release-key-new-features Added to release notes under the "Key New Features" heading.
