fix: Add tpu_topology conditional logic for TPU flex start #5655

Merged
agrawalkhushi18 merged 2 commits into GoogleCloudPlatform:develop from agrawalkhushi18:flex-tpu
May 14, 2026

@agrawalkhushi18 agrawalkhushi18 commented May 13, 2026

This PR resolves a validation failure in test_deployment_variable_not_used for the flex_start TPU blueprint.

The validator was incorrectly flagging tpu_topology as unused when placement_policy was injected via hardware.go logic for deployments with static node count > 1.

Key changes

  • Removed the injectCompactPlacementPolicy function, because the blueprint modules already handle placement policy injection.
  • Added conditional logic to skip the static_node_count calculation when flex_start is enabled, since flex_start supports autoscaling.
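As a rough illustration of the second change, the sketch below skips the static node count when flex start is enabled. It is not the code from hardware.go: the `poolSettings` type and the `applyStaticNodeCount` helper are hypothetical names, and the settings are modelled as a plain Go map rather than the Toolkit's own value types.

```go
package main

import "fmt"

// poolSettings is a hypothetical stand-in for a node pool's settings.
type poolSettings map[string]any

// applyStaticNodeCount (hypothetical helper) sets static_node_count
// unless enable_flex_start is present and true, in which case the pool
// is left untouched so that autoscaling can manage its size.
func applyStaticNodeCount(s poolSettings, nodeCount int) {
	if flex, ok := s["enable_flex_start"].(bool); ok && flex {
		return
	}
	s["static_node_count"] = nodeCount
}

func main() {
	flexPool := poolSettings{"enable_flex_start": true}
	applyStaticNodeCount(flexPool, 4)
	_, set := flexPool["static_node_count"]
	fmt.Println("flex pool has static_node_count:", set)

	staticPool := poolSettings{}
	applyStaticNodeCount(staticPool, 4)
	fmt.Println("static pool static_node_count:", staticPool["static_node_count"])
}
```

A missing or non-boolean `enable_flex_start` is treated as false here, so existing static pools keep their current behavior.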

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for TPU Flex Start configurations within the hardware settings. It ensures that the configuration logic respects existing topology expressions and correctly bypasses static node count calculations when Flex Start is enabled, improving compatibility with dynamic TPU provisioning.

Highlights

  • TPU Topology Preservation: Updated the placement policy logic to preserve existing tpu_topology expressions, preventing potential conflicts with validators.
  • Flex Start Support: Introduced conditional logic to skip static node count calculation when 'enable_flex_start' is set to true.
  • Unit Testing: Added a new test case to verify that static_node_count is correctly omitted for Flex Start pools.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the hardware configuration logic to preserve existing topology expressions in placement policies and introduces a skip for node calculations when 'enable_flex_start' is active. Feedback suggests implementing defensive checks for null or unknown values when evaluating boolean settings to prevent runtime panics. Additionally, it is recommended to use the attribute name 'accelerator_topology' instead of 'tpu_topology' to ensure consistency with the underlying Terraform provider.
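The defensive check suggested here could look something like the following sketch. It assumes a tri-state setting that may be null, unknown, or concrete. The `setting` type and `isTrue` helper are hypothetical and only illustrate the pattern, since the value type actually used in hardware.go is not shown in this conversation.

```go
package main

import "fmt"

// setting is a hypothetical tri-state value: it may be null, unknown
// (not yet resolved), or carry a concrete Go value.
type setting struct {
	null    bool
	unknown bool
	value   any
}

// isTrue reports whether s is a known, non-null boolean equal to true.
// Null, unknown, and non-boolean values all yield false instead of
// panicking on a failed conversion.
func isTrue(s setting) bool {
	if s.null || s.unknown {
		return false
	}
	b, ok := s.value.(bool)
	return ok && b
}

func main() {
	fmt.Println(isTrue(setting{value: true}))
	fmt.Println(isTrue(setting{null: true}))
	fmt.Println(isTrue(setting{unknown: true}))
	fmt.Println(isTrue(setting{value: "true"}))
}
```

Only the first call prints true. The other three inputs fall through to false, which is the safe default when deciding whether to skip the node count calculation.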

Comment thread pkg/config/hardware.go Outdated
Comment thread pkg/config/hardware_test.go
@agrawalkhushi18 agrawalkhushi18 marked this pull request as ready for review May 14, 2026 06:15
@agrawalkhushi18 agrawalkhushi18 requested a review from a team as a code owner May 14, 2026 06:15
@agrawalkhushi18 agrawalkhushi18 added the release-bugfix Added to release notes under the "Bug fixes" heading. label May 14, 2026
Comment thread pkg/config/hardware.go

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

LGTM!

@agrawalkhushi18

The PR-test-gke-tpu-v6e-flex run passed the test_deployment_variable_not_used validation, so the original error is resolved. The remaining failure is due to capacity constraints.

@agrawalkhushi18 agrawalkhushi18 merged commit b0d62a0 into GoogleCloudPlatform:develop May 14, 2026
25 of 94 checks passed
Thibaut-Nurit pushed a commit to Thibaut-Nurit/cluster-toolkit that referenced this pull request May 20, 2026
kadupoornima pushed a commit to kadupoornima/cluster-toolkit that referenced this pull request May 25, 2026
Labels

release-bugfix Added to release notes under the "Bug fixes" heading.

2 participants