TPU v6e DWS flex integration tests #5135

Merged
shubpal07 merged 1 commit into GoogleCloudPlatform:develop from shubpal07:shubham/dws-tpuv6e-integ
Feb 2, 2026
@shubpal07 shubpal07 commented Jan 27, 2026

Overview

This PR adds an integration test for GKE TPU v6e that uses the Dynamic Workload Scheduling (DWS) Flex Start
consumption model.

Key Changes

  1. TPU v6e Flex Integration Test
  • Cloud Build Orchestration: Added tools/cloud-build/daily-tests/builds/gke-tpu-v6e-flex.yaml to manage the
    end-to-end lifecycle in CI/CD.
  • Test Configuration: Created tools/cloud-build/daily-tests/tests/gke-tpu-v6e-flex.yml with optimized settings
    for TPU v6e topologies.
  • Robust Validation: Implemented test-validation/test-gke-tpu-flex-autoscaling.yml, which validates each stage
    of the Flex Start lifecycle:
    • Verifies the cluster starts with 0 TPU nodes.
    • Detects the TriggeredScaleUp event with robust regex matching.
    • Ensures nodes are correctly labeled with cloud.google.com/gke-flex-start=true.
    • Captures logs from all job containers before scale-down occurs.
    • Verifies the cluster successfully scales back to 0 nodes after completion.
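
The checks listed above can be sketched as pure functions over the JSON that `kubectl get nodes -o json` and `kubectl get events -o json` return. This is only an illustration, not the contents of test-gke-tpu-flex-autoscaling.yml: the function names, the sample data, the use of the `cloud.google.com/gke-tpu-accelerator` label to pick out TPU nodes, and the event message pattern are all assumptions. Only the `cloud.google.com/gke-flex-start=true` label and the `TriggeredScaleUp` event reason come from the description.

```python
import re

# Label named in the PR description.
FLEX_START_LABEL = "cloud.google.com/gke-flex-start"
# Assumption: TPU nodes are picked out by this GKE node label.
TPU_LABEL = "cloud.google.com/gke-tpu-accelerator"
# Assumption: the autoscaler event message contains a phrase like this one.
SCALE_UP_PATTERN = re.compile(r"pod\s+triggered\s+scale-?up", re.IGNORECASE)


def tpu_nodes(node_list):
    """Return the nodes in a node-list document that carry the TPU label."""
    return [
        node for node in node_list.get("items", [])
        if TPU_LABEL in node.get("metadata", {}).get("labels", {})
    ]


def has_scale_up_event(event_list):
    """True if any event has reason TriggeredScaleUp and a matching message."""
    return any(
        event.get("reason") == "TriggeredScaleUp"
        and SCALE_UP_PATTERN.search(event.get("message", ""))
        for event in event_list.get("items", [])
    )


def all_flex_start(nodes):
    """True if there is at least one node and every node is labeled flex-start=true."""
    return bool(nodes) and all(
        node["metadata"]["labels"].get(FLEX_START_LABEL) == "true"
        for node in nodes
    )


# Illustrative snapshots (hypothetical data, not captured from a real cluster).
before = {"items": [{"metadata": {"name": "cpu-0", "labels": {}}}]}
during = {"items": [
    {"metadata": {"name": "cpu-0", "labels": {}}},
    {"metadata": {"name": "tpu-0", "labels": {
        TPU_LABEL: "tpu-v6e-slice", FLEX_START_LABEL: "true"}}},
]}
events = {"items": [{
    "reason": "TriggeredScaleUp",
    "message": "pod triggered scale-up: [{example-node-pool 0->1 (max: 4)}]",
}]}

assert len(tpu_nodes(before)) == 0        # cluster starts with 0 TPU nodes
assert has_scale_up_event(events)         # scale-up event was observed
assert all_flex_start(tpu_nodes(during))  # new nodes carry the Flex Start label
```

The final scale-down check would reuse `tpu_nodes` on a later snapshot and expect an empty list again.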

Verification Results

Cloud Build
Successfully validated via:

gcloud builds submit --config tools/cloud-build/daily-tests/builds/gke-tpu-v6e-flex.yaml

Local Execution
Successfully verified the full validation suite running from a local Cloud Workstation.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@shubpal07 shubpal07 self-assigned this Jan 27, 2026
@shubpal07 shubpal07 requested review from a team and samskillman as code owners January 27, 2026 08:03
@shubpal07 shubpal07 added the release-improvements Added to release notes under the "Improvements" heading. label Jan 27, 2026
@gemini-code-assist

Summary of Changes

Hello @shubpal07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the GKE TPU offerings by introducing a new integration test for TPU v6e using the Dynamic Workload Scheduling (DWS) Flex Start consumption model. It provides a complete blueprint and documentation for deploying and managing dynamically scaled TPU clusters, alongside a detailed guide for local test execution. The changes ensure robust validation of the Flex Start lifecycle, from initial node provisioning to automatic scale-down, and integrate these tests into the CI/CD pipeline.

Highlights

  • TPU v6e Flex Integration Test: Introduced a new integration test for GKE TPU v6e utilizing the Dynamic Workload Scheduling (DWS) Flex Start consumption model, including Cloud Build orchestration and optimized test configurations.
  • Robust Flex Start Validation: Implemented comprehensive validation for the Flex Start lifecycle, verifying initial zero TPU nodes, detecting 'TriggeredScaleUp' events, ensuring correct node labeling, capturing job logs, and confirming scale-down to zero nodes.
  • Local Debugging Guide: Added a comprehensive guide to assist developers in running and debugging integration tests on local workstations, aiming to significantly reduce iteration time.
  • GKE Node Pool Autoscaling Enhancements: Modified the GKE node pool module to support per-zone minimum and maximum node counts for autoscaling, specifically to accommodate the requirements of TPU DWS Flex Start.
  • New DWS Flex Start Blueprints and Documentation: Added new blueprint examples and detailed READMEs for deploying GKE clusters with TPU 7x and TPU v6e using DWS Flex Start, including Kueue configurations.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces new integration tests and documentation for GKE TPU v6e and 7x with DWS Flex Start. It also updates the gke-node-pool module to correctly handle per-zone autoscaling limits required for TPU Flex Start, including new variables and preconditions. The changes enhance the project's testing capabilities and documentation for these new features.
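
To make the per-zone limits concrete, here is a minimal sketch of how per-zone bounds translate into cluster-wide bounds and what a precondition on them might check. It does not reproduce the gke-node-pool module: the class, its field names, and the zone names are hypothetical. It relies only on the fact that GKE applies per-zone autoscaling limits to each zone of the node pool separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerZoneAutoscaling:
    """Hypothetical stand-in for per-zone autoscaling inputs of a node pool."""
    min_per_zone: int
    max_per_zone: int
    zones: tuple

    def validate(self):
        """Mimic a precondition: return a list of human-readable problems."""
        problems = []
        if self.min_per_zone < 0:
            problems.append("min_per_zone must be >= 0")
        if self.max_per_zone < self.min_per_zone:
            problems.append("max_per_zone must be >= min_per_zone")
        if not self.zones:
            problems.append("at least one zone is required")
        return problems

    def total_bounds(self):
        """Per-zone limits apply in every zone, so totals scale with zone count."""
        zone_count = len(self.zones)
        return self.min_per_zone * zone_count, self.max_per_zone * zone_count


# Scale-from-zero pool spread over two illustrative zones.
cfg = PerZoneAutoscaling(min_per_zone=0, max_per_zone=2,
                         zones=("example-zone-a", "example-zone-b"))
assert cfg.validate() == []
assert cfg.total_bounds() == (0, 4)

# An inverted range is reported instead of being silently accepted.
bad = PerZoneAutoscaling(min_per_zone=3, max_per_zone=1, zones=("example-zone-a",))
assert bad.validate() == ["max_per_zone must be >= min_per_zone"]
```

A per-zone minimum of 0 is what lets the pool start with no TPU nodes, which is the initial state the Flex Start test expects.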

@shubpal07 shubpal07 left a comment

pushed revision

Change-Id: I610b93194747bdfba9c58d3b489142f5b289af80

Change-Id: I4e0e25dbcaa97acaf6dab364b45c11d0f8c801e5

Change-Id: I8dd3c04ce61fd50a31c0d8085e100ce9fba8d45d

Change-Id: I89795626d0f7ff24bef420bdad9c7d8586d8bbb1

Change-Id: I529413e8a32c4f2982b9c02b31e6ae6eebcf5400

Change-Id: I92086bee858c216e2ff65acb70b5471c9a6f74a3

Change-Id: I1dfe968c8751307ec89cd6cfd18525d84662c4f3
@shubpal07 shubpal07 force-pushed the shubham/dws-tpuv6e-integ branch from c739da9 to 41d91fb Compare February 2, 2026 14:02
@shubpal07 shubpal07 merged commit df53797 into GoogleCloudPlatform:develop Feb 2, 2026
10 of 75 checks passed
