feat(job submission): Dynamic topology routing for gke jobs#5664
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces dynamic topology routing for gcluster job submissions. By enabling pipe-separated values in node constraints, users can now define fallback options for their workloads, improving cluster utilization and reducing job starvation. The changes include updates to the scheduling logic to handle these constraints via nodeAffinity and comprehensive unit tests to ensure correct behavior. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for dynamic routing and multiple topologies in GKE jobs by allowing pipe-separated values in the --node-constraint flag. The implementation shifts these constraints from nodeSelector to nodeAffinity and includes 'smart merging' logic for TPU topologies to ensure baseline configurations are preserved. Documentation and unit tests have been updated accordingly. Review feedback suggests improving the readability of the affinity parsing logic, aligning naming conventions with Terraform providers, adding fail-fast validation for accelerator labels, and sanitizing input strings.
9b01f11 to
04f9b40
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for dynamic routing and multiple topologies in GKE jobs by allowing pipe-separated values in the --node-constraint flag. This change transitions the implementation from a strict nodeSelector to a more flexible nodeAffinity block when multiple values are provided. The update includes documentation for the new functionality, logic in the GKE orchestrator to handle smart merging of TPU topologies, and comprehensive unit tests. Feedback focuses on improving maintainability by defining common GKE labels as constants and simplifying the conditional logic used for merging affinity labels.
e8c05e9 to
edf1b87
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for dynamic routing and multiple topologies in GKE jobs by allowing pipe-separated values in the --node-constraint flag. This change shifts specific constraints from nodeSelector to nodeAffinity to support fallback mechanisms. Key updates include logic in GetNodeSelector and GetAffinity to handle these multi-value constraints and "smart merge" TPU topologies, along with corresponding documentation and unit tests. Feedback suggests simplifying the parsing logic to prevent invalid manifests and renaming the topology label constant to acceleratorTopologyLabel to align with Google Cloud Terraform provider naming conventions.
edf1b87 to
b25aace
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request enables dynamic routing and multiple topology support for GKE jobs by allowing pipe-separated values in the --node-constraint flag. The implementation updates the GKE orchestrator to move these constraints from nodeSelector to nodeAffinity and merges them with baseline topologies. Feedback suggests that the merging logic should be adjusted for dynamic slicing workloads to use annotations instead of strict affinity and that the attribute name accelerator_topology should be used for alignment with the Google Cloud Terraform provider. A correction is also required for duplicate example numbering in the documentation.
d273fa1 to
02eb3a0
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for multiple topologies and dynamic routing in GKE jobs by allowing pipe-separated values in the --node-constraint flag. It updates the documentation with examples and a caution regarding dynamic slicing interactions. The underlying implementation in the GKE orchestrator was modified to handle these constraints via nodeAffinity instead of strict nodeSelector, including logic for smart merging of topology labels. Unit tests were added to verify the new scheduling and affinity logic. I have no feedback to provide as no review comments were present.
02eb3a0 to
a888f42
Compare
a888f42 to
e6f8b97
Compare
35fbda5 to
97f9cfe
Compare
97f9cfe to
83d17fd
Compare
shubpal07
left a comment
There was a problem hiding this comment.
LGTM.
A quick note for our follow-up FR on TAS:
Currently, IsDynamicSlicing is resolved inside the gcluster manager strictly by querying GKE's API for the PROVISION_ONLY placement policy (autoscaling).
-
The Limitation: Pre-provisioned static GCE reservations may NOT carry the
PROVISION_ONLYpolicy. If a user tries to logically partition/sub-slice a static reservation pool,IsDynamicSlicingevaluates to false, and GCluster will hardcode the strict topology selector back into nodeSelector, triggering a scheduling deadlock. -
Proposed Solution for Follow-up: In the next follow-up, we can refactor this to use a Topological Intent Check
- Instead of relying strictly on GKE's
PROVISION_ONLYplacement policy API check, GCluster's compiler should automatically evaluate the user's topological intent. The Intent Check: If the requested workload topology (e.g., 2x4 requested via --topology) is a proper subset / smaller than the GKE cluster's physically mapped node pool topology (e.g. 4x4),gclustershould automatically evaluate IsDynamicSlicing: true.
Implements Dynamic Topology Routing in gcluster job submit to allow workloads to fall back to larger idle TPU/GPU pools when their preferred pool is full. This maximizes cluster utilization and reduces job starvation by providing more flexible scheduling constraints.
Key Changes
Tests:
Verification Results
Verified via gcluster job submit --dry-run-out that the generated manifest correctly emits nodeAffinity with In operator for fallback values and keeps nodeSelector clean of strict topology requirements when fallbacks are present.