fix: Update hardware.go for tpu_topology extraction through workload_policy by agrawalkhushi18 · Pull Request #5600 · GoogleCloudPlatform/cluster-toolkit

agrawalkhushi18 · 2026-05-05T07:42:57Z

The Issue
The automated static_node_count calculation for TPU 7x (introduced in PR #5386) failed when the tpu_topology was defined in a referenced workload_policy module rather than directly in the node pool's settings. Because the Go logic could not resolve module references at expansion time, it skipped the calculation, defaulted static_node_count to null, and fell back to invalid cluster autoscaling settings, causing deployment failures.
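To make the failure mode concrete, here is a minimal sketch of deriving a node count from a topology string. It is not the toolkit's code: the helper name `nodeCountFromTopology`, the `chipsPerNode` parameter, and the value 4 passed in `main` are assumptions chosen for illustration. The point is that when no topology string is available, the calculation has nothing to work with. That is the situation described above, where the topology lives in a referenced module.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeCountFromTopology is a hypothetical helper, not the toolkit's API.
// It multiplies the dimensions of a topology such as "2x2x4" to get the
// total chip count, then divides by chipsPerNode (an assumed value).
func nodeCountFromTopology(topology string, chipsPerNode int) (int, error) {
	if topology == "" {
		return 0, fmt.Errorf("topology is empty")
	}
	if chipsPerNode <= 0 {
		return 0, fmt.Errorf("chipsPerNode must be positive, got %d", chipsPerNode)
	}
	chips := 1
	for _, part := range strings.Split(topology, "x") {
		dim, err := strconv.Atoi(strings.TrimSpace(part))
		if err != nil || dim <= 0 {
			return 0, fmt.Errorf("invalid dimension %q in topology %q", part, topology)
		}
		chips *= dim
	}
	if chips%chipsPerNode != 0 {
		return 0, fmt.Errorf("%d chips is not divisible by %d chips per node", chips, chipsPerNode)
	}
	return chips / chipsPerNode, nil
}

func main() {
	// 4 chips per node is an assumption for this example only.
	for _, topo := range []string{"2x2x4", "4x4", ""} {
		n, err := nodeCountFromTopology(topo, 4)
		if err != nil {
			fmt.Printf("%q: calculation skipped (%v)\n", topo, err)
			continue
		}
		fmt.Printf("%q: %d nodes\n", topo, n)
	}
}
```

The sketch handles both 2D and 3D topologies, because it multiplies however many dimensions the string contains.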

Summary of Changes

Updated hardware.go - Added logic to extract tpu_topology from workload_policy modules
Refactored extraction logic - Created a new extractTopologyFromWorkloadPolicy() function to handle TPU topology discovery in used modules
Added TPU validation - Added IsTPU() check to ensure calculation only runs on valid TPU machine types
Enhanced test coverage - Added comprehensive unit tests for the new workload_policy extraction for 2D and 3D topologies
Removed hardcoded values - Cleaned up test files by removing hardcoded static_node_count values to validate automation

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

Fork your PR branch from the Toolkit "develop" branch (not main)
Test all changes with pre-commit in a local branch
Confirm that "make tests" passes all tests
Add or modify unit tests to cover code changes
Ensure that unit test coverage remains above 80%
Update all applicable documentation
Follow Cluster Toolkit Contribution guidelines


gemini-code-assist · 2026-05-05T07:43:07Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the handling of TPU configurations within the toolkit. By enabling the extraction of topology data from workload policies and implementing stricter validation for machine types, the system now more reliably automates the calculation of node counts. These changes simplify configuration requirements for users while ensuring that hardware-specific logic is applied only when appropriate.

Highlights

TPU Topology Extraction: Added functionality to extract TPU topology information directly from the workload_policy configuration.
Machine Type Validation: Introduced a check to ensure static_node_count is only calculated for valid TPU machine types.
Test Coverage: Added a unit test to verify that non-TPU machine types correctly skip the static_node_count calculation.
Configuration Cleanup: Removed static_node_count from the gke-tpu-7x test configuration as it is now handled automatically.
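The machine type guard mentioned in the highlights might look something like the sketch below. The names `isTPUMachineType` and `maybeSetStaticNodeCount` are hypothetical, and the prefix list is an assumed, non-exhaustive set of TPU machine type prefixes. The toolkit's actual IsTPU() check is not shown in this conversation and may use a different rule.

```go
package main

import (
	"fmt"
	"strings"
)

// assumedTPUPrefixes is an illustrative, non-exhaustive list of machine
// type prefixes treated as TPU in this sketch. It is an assumption.
var assumedTPUPrefixes = []string{"ct4p-", "ct5lp-", "ct5p-", "ct6e-", "tpu7x-"}

// isTPUMachineType is a hypothetical stand-in for the IsTPU() check.
func isTPUMachineType(machineType string) bool {
	for _, prefix := range assumedTPUPrefixes {
		if strings.HasPrefix(machineType, prefix) {
			return true
		}
	}
	return false
}

// maybeSetStaticNodeCount writes static_node_count only when the node
// pool's machine_type looks like a TPU type. It reports whether it wrote.
func maybeSetStaticNodeCount(settings map[string]any, nodeCount int) bool {
	machineType, _ := settings["machine_type"].(string)
	if !isTPUMachineType(machineType) {
		return false // non-TPU node pools are left untouched
	}
	settings["static_node_count"] = nodeCount
	return true
}

func main() {
	for _, mt := range []string{"tpu7x-standard-4t", "n2-standard-8"} {
		settings := map[string]any{"machine_type": mt}
		applied := maybeSetStaticNodeCount(settings, 4)
		fmt.Printf("%s: applied=%v settings=%v\n", mt, applied, settings)
	}
}
```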


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces logic to extract TPU topology from workload policies and ensures that hardware settings expansion, such as node count calculation, is only applied to TPU machine types. It also includes a unit test for this filtering logic and removes redundant hardcoded node counts in a daily test configuration. Feedback was provided to rename the accelerator_topology attribute to tpu_topology for consistency with existing configuration naming conventions.

Neelabh94

LGTM!

Please look into the suggestion of adding more unit tests.


Updating hardware.go to resolve tpu_topology propagation for static_n…

15b61b0

…ode_count

gemini-code-assist Bot reviewed May 5, 2026


Comment thread pkg/config/hardware.go

updating v6e file as well

fb04b46

agrawalkhushi18 marked this pull request as ready for review May 5, 2026 10:16

agrawalkhushi18 requested a review from a team as a code owner May 5, 2026 10:16

agrawalkhushi18 requested review from Neelabh94 and SwarnaBharathiMantena May 5, 2026 10:18

agrawalkhushi18 added the release-improvements label (added to release notes under the "Improvements" heading) May 5, 2026

shubpal07 reviewed May 5, 2026


Comment thread pkg/config/hardware_test.go

Merge branch 'develop' into tpu7x=test

c2737b8

Neelabh94 previously approved these changes May 6, 2026


Adding workload_policy test cases

a742afa

agrawalkhushi18 dismissed Neelabh94’s stale review via a742afa May 6, 2026 08:08

nit

8a233ce

agrawalkhushi18 requested a review from shubpal07 May 6, 2026 10:37

Neelabh94 approved these changes May 7, 2026


shubpal07 approved these changes May 7, 2026


Neelabh94 merged commit d6e60ef into GoogleCloudPlatform:develop May 7, 2026
17 of 81 checks passed

Neelabh94 pushed a commit to Neelabh94/cluster-toolkit that referenced this pull request May 7, 2026

fix: Update hardware.go for tpu_topology extraction through workload_…

5f3a9f7

…policy (GoogleCloudPlatform#5600)

Neelabh94 merged 5 commits into GoogleCloudPlatform:develop from agrawalkhushi18:tpu7x=test

4 participants
