fix: Update hardware.go for tpu_topology extraction through workload_policy #5600

Merged
Neelabh94 merged 5 commits into GoogleCloudPlatform:develop from agrawalkhushi18:tpu7x=test on May 7, 2026

agrawalkhushi18 commented May 5, 2026

Contributor

The Issue
The automated static_node_count calculation for TPU 7x (introduced in PR #5386) failed when the tpu_topology was defined in a referenced workload_policy module rather than directly in the node pool's settings. Because the Go logic could not resolve module references at expansion time, it skipped the calculation, defaulted static_node_count to null, and fell back to invalid cluster autoscaling settings, causing deployment failures.
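
As a rough illustration of the calculation at stake, the sketch below assumes that static_node_count is the product of the topology dimensions divided by the number of TPU chips attached to each node. That formula, the function names `chipsInTopology` and `staticNodeCount`, and the value of 4 chips per node are assumptions made for illustration; the actual logic in hardware.go is not shown in this conversation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// chipsInTopology multiplies the dimensions of a topology string such as
// "4x4" (2D) or "2x2x4" (3D). Hypothetical helper, not the Toolkit's code.
func chipsInTopology(topology string) (int, error) {
	parts := strings.Split(strings.ToLower(topology), "x")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, fmt.Errorf("topology %q must have 2 or 3 dimensions", topology)
	}
	chips := 1
	for _, p := range parts {
		n, err := strconv.Atoi(strings.TrimSpace(p))
		if err != nil || n <= 0 {
			return 0, fmt.Errorf("invalid dimension %q in topology %q", p, topology)
		}
		chips *= n
	}
	return chips, nil
}

// staticNodeCount divides the total chip count by the chips per node.
// chipsPerNode is supplied by the caller; this sketch does not know the
// real value for any machine type.
func staticNodeCount(topology string, chipsPerNode int) (int, error) {
	if chipsPerNode <= 0 {
		return 0, fmt.Errorf("chipsPerNode must be positive, got %d", chipsPerNode)
	}
	chips, err := chipsInTopology(topology)
	if err != nil {
		return 0, err
	}
	if chips%chipsPerNode != 0 {
		return 0, fmt.Errorf("%d chips do not divide evenly into nodes of %d", chips, chipsPerNode)
	}
	return chips / chipsPerNode, nil
}

func main() {
	// 4 chips per node is an assumed value for this example.
	for _, topology := range []string{"2x2x4", "4x4x4"} {
		n, err := staticNodeCount(topology, 4)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %d nodes\n", topology, n)
	}
}
```

When no topology string is available at all, as in the failure described above, a calculation of this kind has nothing to work with, which is why the value was left as null.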

Summary of Changes

  • Updated hardware.go - Added logic to extract tpu_topology from workload_policy modules

  • Refactored extraction logic - Created a new extractTopologyFromWorkloadPolicy() function to handle TPU topology discovery in used modules

  • Added TPU validation - Added IsTPU() check to ensure calculation only runs on valid TPU machine types

  • Enhanced test coverage - Added comprehensive unit tests for the new workload_policy extraction for 2D and 3D topologies

  • Removed hardcoded values - Cleaned up test files by removing hardcoded static_node_count values to validate automation
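
The lookup order implied by these changes can be sketched as follows: prefer a tpu_topology set directly on the node pool, and otherwise search the modules it uses for a workload_policy that defines one. The `Module` type, its `Kind` and `Use` fields, and `findTopology` are hypothetical simplifications for illustration; they are not the Toolkit's real types and not the body of extractTopologyFromWorkloadPolicy().

```go
package main

import "fmt"

// Module is a simplified, hypothetical stand-in for a blueprint module.
type Module struct {
	ID       string
	Kind     string            // e.g. "node_pool" or "workload_policy" (assumed labels)
	Settings map[string]string // flattened settings, for illustration only
	Use      []string          // IDs of modules this module references
}

// findTopology returns the tpu_topology for a node pool. A value set on the
// node pool itself wins; otherwise the first referenced workload_policy
// module that defines a non-empty tpu_topology is used.
func findTopology(pool Module, byID map[string]Module) (string, bool) {
	if t, ok := pool.Settings["tpu_topology"]; ok && t != "" {
		return t, true
	}
	for _, id := range pool.Use {
		m, ok := byID[id]
		if !ok || m.Kind != "workload_policy" {
			continue
		}
		if t, ok := m.Settings["tpu_topology"]; ok && t != "" {
			return t, true
		}
	}
	return "", false
}

func main() {
	modules := map[string]Module{
		"tpu-policy": {
			ID:       "tpu-policy",
			Kind:     "workload_policy",
			Settings: map[string]string{"tpu_topology": "2x2x4"},
		},
	}
	// The node pool does not set tpu_topology itself; it only references the policy.
	pool := Module{
		ID:   "tpu-pool",
		Kind: "node_pool",
		Use:  []string{"tpu-policy"},
	}
	topology, ok := findTopology(pool, modules)
	fmt.Println(topology, ok) // prints "2x2x4 true"
}
```

Checking the node pool's own settings first keeps existing blueprints, which set the topology inline, behaving as before.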

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the handling of TPU configurations within the toolkit. By enabling the extraction of topology data from workload policies and implementing stricter validation for machine types, the system now more reliably automates the calculation of node counts. These changes simplify configuration requirements for users while ensuring that hardware-specific logic is applied only when appropriate.

Highlights

  • TPU Topology Extraction: Added functionality to extract TPU topology information directly from the workload_policy configuration.
  • Machine Type Validation: Introduced a check to ensure static_node_count is only calculated for valid TPU machine types.
  • Test Coverage: Added a unit test to verify that non-TPU machine types correctly skip the static_node_count calculation.
  • Configuration Cleanup: Removed static_node_count from the gke-tpu-7x test configuration as it is now handled automatically.

gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces logic to extract TPU topology from workload policies and ensures that hardware settings expansion, such as node count calculation, is only applied to TPU machine types. It also includes a unit test for this filtering logic and removes redundant hardcoded node counts in a daily test configuration. Feedback was provided to rename the accelerator_topology attribute to tpu_topology for consistency with existing configuration naming conventions.

Comment thread pkg/config/hardware.go
agrawalkhushi18 marked this pull request as ready for review May 5, 2026 10:16
agrawalkhushi18 requested a review from a team as a code owner May 5, 2026 10:16
agrawalkhushi18 added the release-improvements label (Added to release notes under the "Improvements" heading.) May 5, 2026
Comment thread pkg/config/hardware_test.go
Neelabh94 previously approved these changes May 6, 2026

Neelabh94 left a comment

Contributor


LGTM!

Please look into the suggestion of adding more unit tests.

agrawalkhushi18 requested a review from shubpal07 May 6, 2026 10:37
Neelabh94 merged commit d6e60ef into GoogleCloudPlatform:develop May 7, 2026
17 of 81 checks passed
Neelabh94 pushed a commit to Neelabh94/cluster-toolkit that referenced this pull request May 7, 2026
4 participants