fix: accelerator label auto resolution#5717
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refactors the GKE job orchestrator to improve how it interacts with Google Cloud services. By replacing CLI-based calls with native API client calls, the orchestrator gains better error handling and more reliable metadata retrieval. Additionally, the changes standardize how accelerator labels are resolved, ensuring more consistent topology discovery across different machine types. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors GKE cluster metadata fetching to use the official Google Cloud Go SDK instead of executing gcloud CLI commands, improves error handling for permission issues, and updates topology discovery to filter by accelerator labels. The review feedback highlights a potential bug in accelerator classification where checking only isTPU might misclassify a TPU as a GPU. Additionally, the feedback suggests failing fast on unknown accelerator labels, avoiding an inefficient JSON round-trip when parsing cluster metadata, and preferring persistent pre-run initialization over lazy initialization for the GKE client service.
045917f to
c7ac841
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors TPU accelerator detection, adds a user-friendly permission error check when describing GKE clusters, and updates GKE topology auto-discovery to filter by accelerator labels. The reviewer identified three key issues: a regression in TPU detection due to the removal of a fallback check, a case-sensitivity issue in the permission error string matching, and a recommendation to fail fast on unknown accelerator labels to prevent invalid GKE configurations.
c7ac841 to
5e3635c
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors TPU accelerator resolution and GKE topology discovery. It updates ResolveAcceleratorInfo to use IsTPU, adds helpful error handling for 403/permission denied errors when describing GKE clusters, and filters GKE node and Kueue resource flavor queries by the correct accelerator label. The reviewer recommended failing fast if the resolved accelerator label is unknown to prevent workloads from hanging or failing silently.
96ebfd9 to
df17284
Compare
|
/gemini review |
df17284 to
de0385d
Compare
There was a problem hiding this comment.
Code Review
This pull request refactors GKE accelerator and topology resolution, introducing auto-discovery of topologies filtered by accelerator labels and requested chip counts, refactoring CPU machine determination, and cleaning up node selector construction. The review feedback suggests making the parsing of requested chips from the compute type more robust to handle multiple hyphens, and improving the safety of the topology chip calculation function to prevent false matches on invalid or empty topologies.
de0385d to
89ea31c
Compare
This PR fixes some bugs in the auto-resolution logic and improves error handling for missing gcloud permissions.
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.