fix: Update hardware.go for tpu_topology extraction through workload_policy#5600
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the handling of TPU configurations within the toolkit. By enabling the extraction of topology data from workload policies and implementing stricter validation for machine types, the system now more reliably automates the calculation of node counts. These changes simplify configuration requirements for users while ensuring that hardware-specific logic is applied only when appropriate. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces logic to extract TPU topology from workload policies and ensures that hardware settings expansion, such as node count calculation, is only applied to TPU machine types. It also includes a unit test for this filtering logic and removes redundant hardcoded node counts in a daily test configuration. Feedback was provided to rename the accelerator_topology attribute to tpu_topology for consistency with existing configuration naming conventions.
Neelabh94
left a comment
There was a problem hiding this comment.
LGTM!
Please look into the suggestion of adding more unit tests.
The Issue
The automated
static_node_countcalculation for TPU 7x (introduced in PR #5386) failed when thetpu_topologywas defined in a referencedworkload_policymodule rather than directly in the node pool's settings. Because the Go logic could not resolve module references at expansion time, it skipped the calculation, defaultedstatic_node_countto null, and fell back to invalid cluster autoscaling settings, causing deployment failures.Summary of Changes
Updated hardware.go - Added logic to extract
tpu_topologyfromworkload_policymodulesRefactored extraction logic- Created a new
extractTopologyFromWorkloadPolicy()function to handle TPU topology discovery in used modulesAdded TPU validation - Added
IsTPU()check to ensure calculation only runs on valid TPU machine typesEnhanced test coverage - Added comprehensive unit tests for the new workload_policy extraction for 2D and 3D topologies
Removed hardcoded values - Cleaned up test files by removing hardcoded
static_node_countvalues to validate automationSubmission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.