
feat(job submission): Dynamic topology routing for gke jobs #5664

Merged
Neelabh94 merged 2 commits into GoogleCloudPlatform:develop from Neelabh94:dynamic_topology
Jun 1, 2026


@Neelabh94 Neelabh94 commented May 14, 2026

Contributor

Implements Dynamic Topology Routing in gcluster job submit to allow workloads to fall back to larger idle TPU/GPU pools when their preferred pool is full. This maximizes cluster utilization and reduces job starvation by providing more flexible scheduling constraints.

Key Changes

  • Support for Pipe Separator in Constraints: Users can now specify multiple allowed values for any node constraint separated by a pipe (e.g., --node-constraint "cloud.google.com/gke-tpu-topology=2x2|4x8").
  • Dynamic NodeAffinity Generation: When multiple values are detected for a constraint, gcluster now generates a Kubernetes nodeAffinity block with the In operator instead of a strict nodeSelector.
  • Smart Merging for Topology: If both a baseline --topology and fallback topologies via --node-constraint are specified, they are automatically merged into a single nodeAffinity list to avoid scheduling conflicts.
  • Documentation: Added examples and updated flag references in docs/gcluster_job_guide.md.

Tests:

  • Added unit tests in scheduling_test.go covering the new parsing and merging logic. All tests passed.

Verification Results

Verified via gcluster job submit --dry-run-out that the generated manifest correctly emits nodeAffinity with In operator for fallback values and keeps nodeSelector clean of strict topology requirements when fallbacks are present.

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces dynamic topology routing for gcluster job submissions. By enabling pipe-separated values in node constraints, users can now define fallback options for their workloads, improving cluster utilization and reducing job starvation. The changes include updates to the scheduling logic to handle these constraints via nodeAffinity and comprehensive unit tests to ensure correct behavior.

Highlights

  • Dynamic Topology Routing: Implemented support for pipe-separated values in node constraints, allowing workloads to fall back to alternative TPU/GPU pools when the preferred pool is unavailable.
  • Smart Affinity Generation: Updated scheduling logic to automatically generate Kubernetes nodeAffinity blocks with the 'In' operator for flexible scheduling, replacing strict nodeSelectors.
  • Topology Merging: Added intelligent merging of baseline topology and fallback constraints to prevent scheduling conflicts.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Neelabh94 Neelabh94 changed the title from "implement dynamic topology" to "feat(job submission): Dynamic topology routing for gke jobs" May 14, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for dynamic routing and multiple topologies in GKE jobs by allowing pipe-separated values in the --node-constraint flag. The implementation shifts these constraints from nodeSelector to nodeAffinity and includes 'smart merging' logic for TPU topologies to ensure baseline configurations are preserved. Documentation and unit tests have been updated accordingly. Review feedback suggests improving the readability of the affinity parsing logic, aligning naming conventions with Terraform providers, adding fail-fast validation for accelerator labels, and sanitizing input strings.

Comment thread pkg/orchestrator/gke/scheduling.go
@Neelabh94 Neelabh94 force-pushed the dynamic_topology branch 2 times, most recently from 9b01f11 to 04f9b40 May 14, 2026 06:52
@Neelabh94

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for dynamic routing and multiple topologies in GKE jobs by allowing pipe-separated values in the --node-constraint flag. This change transitions the implementation from a strict nodeSelector to a more flexible nodeAffinity block when multiple values are provided. The update includes documentation for the new functionality, logic in the GKE orchestrator to handle smart merging of TPU topologies, and comprehensive unit tests. Feedback focuses on improving maintainability by defining common GKE labels as constants and simplifying the conditional logic used for merging affinity labels.

Comment thread pkg/orchestrator/gke/scheduling.go
Comment thread pkg/orchestrator/gke/scheduling.go Outdated
Comment thread pkg/orchestrator/gke/scheduling.go Outdated
Comment thread pkg/orchestrator/gke/scheduling.go
Comment thread pkg/orchestrator/gke/gke_job_orchestrator.go Outdated
@Neelabh94 Neelabh94 force-pushed the dynamic_topology branch 2 times, most recently from e8c05e9 to edf1b87 May 14, 2026 07:41
@Neelabh94

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for dynamic routing and multiple topologies in GKE jobs by allowing pipe-separated values in the --node-constraint flag. This change shifts specific constraints from nodeSelector to nodeAffinity to support fallback mechanisms. Key updates include logic in GetNodeSelector and GetAffinity to handle these multi-value constraints and "smart merge" TPU topologies, along with corresponding documentation and unit tests. Feedback suggests simplifying the parsing logic to prevent invalid manifests and renaming the topology label constant to acceleratorTopologyLabel to align with Google Cloud Terraform provider naming conventions.

Comment thread pkg/orchestrator/gke/scheduling.go
@Neelabh94

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request enables dynamic routing and multiple topology support for GKE jobs by allowing pipe-separated values in the --node-constraint flag. The implementation updates the GKE orchestrator to move these constraints from nodeSelector to nodeAffinity and merges them with baseline topologies. Feedback suggests that the merging logic should be adjusted for dynamic slicing workloads to use annotations instead of strict affinity and that the attribute name accelerator_topology should be used for alignment with the Google Cloud Terraform provider. A correction is also required for duplicate example numbering in the documentation.

Comment thread pkg/orchestrator/gke/scheduling.go Outdated
Comment thread docs/gcluster_job_guide.md
@Neelabh94 Neelabh94 force-pushed the dynamic_topology branch 3 times, most recently from d273fa1 to 02eb3a0 May 14, 2026 10:41
@Neelabh94

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for multiple topologies and dynamic routing in GKE jobs by allowing pipe-separated values in the --node-constraint flag. It updates the documentation with examples and a caution regarding dynamic slicing interactions. The underlying implementation in the GKE orchestrator was modified to handle these constraints via nodeAffinity instead of strict nodeSelector, including logic for smart merging of topology labels. Unit tests were added to verify the new scheduling and affinity logic. I have no feedback to provide as no review comments were present.

@Neelabh94 Neelabh94 added the release-improvements label (Added to release notes under the "Improvements" heading.) May 14, 2026
@Neelabh94 Neelabh94 marked this pull request as ready for review May 26, 2026 04:31
@Neelabh94 Neelabh94 requested a review from a team as a code owner May 26, 2026 04:31
Comment thread pkg/orchestrator/gke/scheduling.go Outdated
@Neelabh94 Neelabh94 force-pushed the dynamic_topology branch 2 times, most recently from 35fbda5 to 97f9cfe June 1, 2026 07:28

@shubpal07 shubpal07 left a comment

Contributor

LGTM.

A quick note for our follow-up FR on TAS:

Currently, IsDynamicSlicing is resolved inside the gcluster manager strictly by querying GKE's API for the PROVISION_ONLY placement policy (autoscaling).

  • The Limitation: Pre-provisioned static GCE reservations may NOT carry the PROVISION_ONLY policy. If a user tries to logically partition/sub-slice a static reservation pool, IsDynamicSlicing evaluates to false, and GCluster will hardcode the strict topology selector back into nodeSelector, triggering a scheduling deadlock.

  • Proposed Solution for Follow-up: In the next follow-up, we can refactor this to use a Topological Intent Check. Instead of relying strictly on GKE's PROVISION_ONLY placement policy API check, gcluster's compiler should evaluate the user's topological intent automatically. The Intent Check: if the requested workload topology (e.g., 2x4 requested via --topology) is strictly smaller than the physical topology of the GKE cluster's node pool (e.g., 4x4), gcluster should automatically set IsDynamicSlicing to true.
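
One possible reading of the proposed intent check, as a rough sketch only. It assumes "smaller" means the requested topology has the same number of dimensions as the pool, exceeds it in none of them, and is smaller in at least one, compared position by position. `parseTopology` and `isSubSlice` are hypothetical helpers, not existing gcluster functions, and the follow-up may define the comparison differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTopology converts a string such as "2x4" into its integer dimensions.
func parseTopology(s string) ([]int, error) {
	parts := strings.Split(s, "x")
	dims := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n <= 0 {
			return nil, fmt.Errorf("invalid topology %q", s)
		}
		dims = append(dims, n)
	}
	return dims, nil
}

// isSubSlice reports whether requested fits inside pool and is strictly
// smaller: same rank, no dimension larger, at least one dimension smaller.
func isSubSlice(requested, pool string) (bool, error) {
	r, err := parseTopology(requested)
	if err != nil {
		return false, err
	}
	p, err := parseTopology(pool)
	if err != nil {
		return false, err
	}
	if len(r) != len(p) {
		return false, nil
	}
	smaller := false
	for i := range r {
		if r[i] > p[i] {
			return false, nil
		}
		if r[i] < p[i] {
			smaller = true
		}
	}
	return smaller, nil
}

func main() {
	ok, err := isSubSlice("2x4", "4x4")
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // prints true
}
```

Under these assumptions a 2x4 request against a 4x4 pool counts as sub-slicing, while a 4x4 request against the same pool does not, so the strict selector would only be dropped when the request is genuinely smaller than the pool.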

@Neelabh94 Neelabh94 merged commit 7c744a6 into GoogleCloudPlatform:develop Jun 1, 2026
14 of 76 checks passed
@Neelabh94 Neelabh94 deleted the dynamic_topology branch June 9, 2026 13:05

Labels

release-improvements Added to release notes under the "Improvements" heading.

3 participants