fix: accelerator label auto resolution #5717

Merged
Neelabh94 merged 1 commit into GoogleCloudPlatform:develop from Neelabh94:fix-v6e-16-submission on Jun 1, 2026

Conversation

@Neelabh94 commented on May 27, 2026

Contributor

This PR fixes bugs in the accelerator label auto-resolution logic and improves error handling for missing gcloud permissions.

  • GKE Client Integration: Introduced a GKEClient interface and a default implementation to fetch cluster metadata directly via the Google Cloud SDK instead of relying on gcloud CLI commands.
  • Accelerator Label Resolution: Updated accelerator label resolution logic to use GKE node selector labels, improving accuracy in auto-discovering topologies.
  • Error Handling: Added specific error handling for 403 Forbidden responses when accessing GKE clusters, providing clearer instructions to users regarding required permissions.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the GKE job orchestrator to improve how it interacts with Google Cloud services. By replacing CLI-based calls with native API client calls, the orchestrator gains better error handling and more reliable metadata retrieval. Additionally, the changes standardize how accelerator labels are resolved, ensuring more consistent topology discovery across different machine types.

Highlights

  • GKE Client Integration: Introduced a GKEClient interface and a default implementation to fetch cluster metadata directly via the Google Cloud SDK instead of relying on gcloud CLI commands.
  • Accelerator Label Resolution: Updated accelerator label resolution logic to use GKE node selector labels, improving accuracy in auto-discovering topologies.
  • Error Handling: Added specific error handling for 403 Forbidden responses when accessing GKE clusters, providing clearer instructions to users regarding required permissions.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Neelabh94 added the release-bugfix label (Added to release notes under the "Bug fixes" heading.) on May 27, 2026

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request refactors GKE cluster metadata fetching to use the official Google Cloud Go SDK instead of executing gcloud CLI commands, improves error handling for permission issues, and updates topology discovery to filter by accelerator labels. The review feedback highlights a potential bug in accelerator classification where checking only isTPU might misclassify a TPU as a GPU. Additionally, the feedback suggests failing fast on unknown accelerator labels, avoiding an inefficient JSON round-trip when parsing cluster metadata, and preferring persistent pre-run initialization over lazy initialization for the GKE client service.
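The fail-fast suggestion and the TPU/GPU misclassification concern can be sketched as follows. The two label keys are the standard GKE node selector labels; the prefix-based classification and the function name `nodeSelectorKey` are assumptions for illustration, not the code under review.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Standard GKE node selector label keys for TPU and GPU node pools.
	tpuLabelKey = "cloud.google.com/gke-tpu-accelerator"
	gpuLabelKey = "cloud.google.com/gke-accelerator"
)

// nodeSelectorKey picks the label key for an accelerator value.
// The prefix checks are an illustrative heuristic. The point is that
// "not a TPU" does not imply "GPU": anything unrecognized is an error.
func nodeSelectorKey(accelerator string) (string, error) {
	switch {
	case strings.HasPrefix(accelerator, "tpu-"):
		return tpuLabelKey, nil
	case strings.HasPrefix(accelerator, "nvidia-"):
		return gpuLabelKey, nil
	default:
		return "", fmt.Errorf("unknown accelerator %q: cannot choose a node selector label", accelerator)
	}
}

func main() {
	for _, acc := range []string{"tpu-v6e-slice", "nvidia-h100-80gb", "mystery-chip"} {
		key, err := nodeSelectorKey(acc)
		fmt.Println(acc, key, err)
	}
}
```

Returning an error for the default case surfaces a bad label at submission time instead of producing a workload whose node selector never matches any node.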

Comment thread pkg/config/machine_configs.go
Comment thread pkg/orchestrator/gke/resource_resolver.go Outdated
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Comment thread pkg/orchestrator/gke/types.go Outdated
@Neelabh94 force-pushed the fix-v6e-16-submission branch 3 times, most recently from 045917f to c7ac841 on May 27, 2026 08:19
@Neelabh94

Contributor Author

/gemini review

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request refactors TPU accelerator detection, adds a user-friendly permission error check when describing GKE clusters, and updates GKE topology auto-discovery to filter by accelerator labels. The reviewer identified three key issues: a regression in TPU detection due to the removal of a fallback check, a case-sensitivity issue in the permission error string matching, and a recommendation to fail fast on unknown accelerator labels to prevent invalid GKE configurations.

Comment thread pkg/config/machine_configs.go
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go
@Neelabh94 force-pushed the fix-v6e-16-submission branch from c7ac841 to 5e3635c on May 27, 2026 08:29
@Neelabh94

Contributor Author

/gemini review

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request refactors TPU accelerator resolution and GKE topology discovery. It updates ResolveAcceleratorInfo to use IsTPU, adds helpful error handling for 403/permission denied errors when describing GKE clusters, and filters GKE node and Kueue resource flavor queries by the correct accelerator label. The reviewer recommended failing fast if the resolved accelerator label is unknown to prevent workloads from hanging or failing silently.
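Filtering node queries by the accelerator label might look roughly like the sketch below. It works on an in-memory node list rather than a Kubernetes client; the `node` type and `discoverTopologies` are hypothetical, while the two label keys are the standard GKE TPU node labels.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical stand-in for a Kubernetes node object.
type node struct {
	Name   string
	Labels map[string]string
}

const (
	tpuAcceleratorLabel = "cloud.google.com/gke-tpu-accelerator"
	tpuTopologyLabel    = "cloud.google.com/gke-tpu-topology"
)

// discoverTopologies returns the distinct topologies advertised by nodes
// whose accelerator label matches. An unresolved label is rejected up front,
// following the reviewer's fail-fast recommendation.
func discoverTopologies(nodes []node, labelKey, labelValue string) ([]string, error) {
	if labelKey == "" || labelValue == "" {
		return nil, fmt.Errorf("accelerator label is unresolved (key=%q, value=%q)", labelKey, labelValue)
	}
	seen := map[string]bool{}
	for _, n := range nodes {
		if n.Labels[labelKey] != labelValue {
			continue // different accelerator: ignore its topology
		}
		if t := n.Labels[tpuTopologyLabel]; t != "" {
			seen[t] = true
		}
	}
	out := make([]string, 0, len(seen))
	for t := range seen {
		out = append(out, t)
	}
	sort.Strings(out) // deterministic order for callers and tests
	return out, nil
}

func main() {
	nodes := []node{
		{Name: "a", Labels: map[string]string{tpuAcceleratorLabel: "tpu-v6e-slice", tpuTopologyLabel: "4x4"}},
		{Name: "b", Labels: map[string]string{tpuAcceleratorLabel: "tpu-v6e-slice", tpuTopologyLabel: "2x4"}},
		{Name: "c", Labels: map[string]string{tpuAcceleratorLabel: "tpu-v5-lite-podslice", tpuTopologyLabel: "2x2"}},
	}
	fmt.Println(discoverTopologies(nodes, tpuAcceleratorLabel, "tpu-v6e-slice"))
}
```

Without the label filter, a cluster with mixed node pools could report a topology that belongs to a different accelerator than the one requested.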

Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go
@Neelabh94 force-pushed the fix-v6e-16-submission branch 2 times, most recently from 96ebfd9 to df17284 on May 27, 2026 11:10
@Neelabh94

Contributor Author

/gemini review

@Neelabh94 force-pushed the fix-v6e-16-submission branch from df17284 to de0385d on May 27, 2026 11:14

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request refactors GKE accelerator and topology resolution, introducing auto-discovery of topologies filtered by accelerator labels and requested chip counts, refactoring CPU machine determination, and cleaning up node selector construction. The review feedback suggests making the parsing of requested chips from the compute type more robust to handle multiple hyphens, and improving the safety of the topology chip calculation function to prevent false matches on invalid or empty topologies.
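Both suggestions in this review can be sketched under stated assumptions: that a compute type looks like `v6e-16` with the chip count after the last hyphen (inferred from the branch name, not confirmed by the PR), and that a topology is a string such as `4x4` whose chip count is the product of its dimensions. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// requestedChips reads the chip count after the LAST hyphen, so names that
// contain several hyphens still parse. Assumes the suffix is a chip count.
func requestedChips(computeType string) (int, error) {
	i := strings.LastIndex(computeType, "-")
	if i < 0 || i == len(computeType)-1 {
		return 0, fmt.Errorf("compute type %q has no chip-count suffix", computeType)
	}
	n, err := strconv.Atoi(computeType[i+1:])
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("compute type %q has an invalid chip count", computeType)
	}
	return n, nil
}

// topologyChips multiplies the dimensions of a topology such as "2x2x4".
// Empty or malformed input returns an error rather than a default value,
// so it can never accidentally equal a requested chip count.
func topologyChips(topology string) (int, error) {
	if topology == "" {
		return 0, fmt.Errorf("empty topology")
	}
	total := 1
	for _, part := range strings.Split(topology, "x") {
		n, err := strconv.Atoi(part)
		if err != nil || n <= 0 {
			return 0, fmt.Errorf("invalid topology %q", topology)
		}
		total *= n
	}
	return total, nil
}

func main() {
	want, err := requestedChips("v6e-16")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, t := range []string{"2x4", "4x4", "", "4x"} {
		chips, err := topologyChips(t)
		fmt.Printf("%q matches=%v err=%v\n", t, err == nil && chips == want, err)
	}
}
```

Splitting on the first hyphen would read `slice` from a name like `tpu-v6e-slice-8`, and a chip calculation that returns 0 or 1 for bad input could match a request by accident; both functions above return errors in those cases.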

Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go
@Neelabh94 force-pushed the fix-v6e-16-submission branch from de0385d to 89ea31c on May 27, 2026 11:24
@Neelabh94 marked this pull request as ready for review on May 27, 2026 11:42
@Neelabh94 requested a review from a team as a code owner on May 27, 2026 11:42
@Neelabh94 merged commit 361e73c into GoogleCloudPlatform:develop on Jun 1, 2026
16 of 79 checks passed
@Neelabh94 deleted the fix-v6e-16-submission branch on June 1, 2026 07:39
Labels

release-bugfix Added to release notes under the "Bug fixes" heading.

3 participants