feat(job submission): Dynamic topology routing for gke jobs by Neelabh94 · Pull Request #5664 · GoogleCloudPlatform/cluster-toolkit

Neelabh94 · 2026-05-14T06:36:50Z

Implements Dynamic Topology Routing in gcluster job submit to allow workloads to fall back to larger idle TPU/GPU pools when their preferred pool is full. This maximizes cluster utilization and reduces job starvation by providing more flexible scheduling constraints.

Key Changes

Support for Pipe Separator in Constraints: Users can now specify multiple allowed values for any node constraint separated by a pipe (e.g., --node-constraint "cloud.google.com/gke-tpu-topology=2x2|4x8").
Dynamic NodeAffinity Generation: When multiple values are detected for a constraint, gcluster now generates a Kubernetes nodeAffinity block with the In operator instead of a strict nodeSelector.
Smart Merging for Topology: If both a baseline --topology and fallback topologies via --node-constraint are specified, they are automatically merged into a single nodeAffinity list to avoid scheduling conflicts.
Documentation: Added examples and updated flag references in docs/gcluster_job_guide.md.
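
As a rough illustration of the parsing and routing described above, the sketch below splits each constraint on the pipe and sends single-valued constraints to a selector map and multi-valued ones to `In` requirements. This is not the code from this PR: `splitConstraints` and `nodeRequirement` are hypothetical names, the struct only mirrors the key/operator/values shape of a Kubernetes node selector requirement, and the node pool name `pool-a` is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeRequirement mirrors the key/operator/values shape of a Kubernetes
// node selector requirement. It is a stand-in type, not the k8s.io/api one.
type nodeRequirement struct {
	Key      string
	Operator string
	Values   []string
}

// splitConstraints is a hypothetical helper. Constraints with one value go
// into a nodeSelector-style map; constraints with several pipe-separated
// values become "In" requirements intended for a nodeAffinity block.
func splitConstraints(constraints []string) (map[string]string, []nodeRequirement, error) {
	selector := map[string]string{}
	var reqs []nodeRequirement
	for _, c := range constraints {
		key, raw, ok := strings.Cut(c, "=")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			return nil, nil, fmt.Errorf("invalid constraint %q: want key=value", c)
		}
		// Trim each alternative and drop empty ones such as a trailing "|".
		var values []string
		for _, v := range strings.Split(raw, "|") {
			if v = strings.TrimSpace(v); v != "" {
				values = append(values, v)
			}
		}
		switch len(values) {
		case 0:
			return nil, nil, fmt.Errorf("constraint %q has no value", c)
		case 1:
			selector[key] = values[0]
		default:
			reqs = append(reqs, nodeRequirement{Key: key, Operator: "In", Values: values})
		}
	}
	return selector, reqs, nil
}

func main() {
	sel, reqs, err := splitConstraints([]string{
		"cloud.google.com/gke-tpu-topology=2x2|4x8",
		"cloud.google.com/gke-nodepool=pool-a", // assumed pool name
	})
	fmt.Println(sel, reqs, err)
}
```

Rejecting empty keys and empty values up front is one way to avoid emitting an invalid manifest; whether the PR validates at this point is an assumption.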

Tests:

Added unit tests in scheduling_test.go covering the new parsing and merging logic. All tests passed.

Verification Results

Verified via gcluster job submit --dry-run-out that the generated manifest correctly emits nodeAffinity with In operator for fallback values and keeps nodeSelector clean of strict topology requirements when fallbacks are present.

gemini-code-assist · 2026-05-14T06:37:17Z

Summary of Changes

This pull request introduces dynamic topology routing for gcluster job submissions. By enabling pipe-separated values in node constraints, users can now define fallback options for their workloads, improving cluster utilization and reducing job starvation. The changes include updates to the scheduling logic to handle these constraints via nodeAffinity and comprehensive unit tests to ensure correct behavior.

Highlights

Dynamic Topology Routing: Implemented support for pipe-separated values in node constraints, allowing workloads to fall back to alternative TPU/GPU pools when the preferred pool is unavailable.
Smart Affinity Generation: Updated scheduling logic to automatically generate Kubernetes nodeAffinity blocks with the 'In' operator for flexible scheduling, replacing strict nodeSelectors.
Topology Merging: Added intelligent merging of baseline topology and fallback constraints to prevent scheduling conflicts.
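
One way to picture the topology merging mentioned above is a small de-duplicating merge that keeps the baseline first. The sketch below is an assumption about the behavior, not the PR's implementation; `mergeTopologies` is a hypothetical name, and the ordering (baseline, then fallbacks) is chosen only for readability, since an `In` requirement does not express preference order.

```go
package main

import "fmt"

// mergeTopologies is a hypothetical helper. It places the baseline topology
// first, appends the fallback topologies, and skips duplicates and empty
// entries so the resulting value list contains each topology once.
func mergeTopologies(baseline string, fallbacks []string) []string {
	seen := map[string]bool{}
	var merged []string
	for _, t := range append([]string{baseline}, fallbacks...) {
		if t == "" || seen[t] {
			continue
		}
		seen[t] = true
		merged = append(merged, t)
	}
	return merged
}

func main() {
	// Baseline from --topology, fallbacks from --node-constraint.
	fmt.Println(mergeTopologies("2x2", []string{"4x8", "2x2"})) // prints [2x2 4x8]
}
```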

gemini-code-assist

Code Review

This pull request introduces support for dynamic routing and multiple topologies in GKE jobs by allowing pipe-separated values in the --node-constraint flag. The implementation shifts these constraints from nodeSelector to nodeAffinity and includes 'smart merging' logic for TPU topologies to ensure baseline configurations are preserved. Documentation and unit tests have been updated accordingly. Review feedback suggests improving the readability of the affinity parsing logic, aligning naming conventions with Terraform providers, adding fail-fast validation for accelerator labels, and sanitizing input strings.

Neelabh94 · 2026-05-14T06:55:29Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces support for dynamic routing and multiple topologies in GKE jobs by allowing pipe-separated values in the --node-constraint flag. This change transitions the implementation from a strict nodeSelector to a more flexible nodeAffinity block when multiple values are provided. The update includes documentation for the new functionality, logic in the GKE orchestrator to handle smart merging of TPU topologies, and comprehensive unit tests. Feedback focuses on improving maintainability by defining common GKE labels as constants and simplifying the conditional logic used for merging affinity labels.

Neelabh94 · 2026-05-14T07:42:38Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces support for dynamic routing and multiple topologies in GKE jobs by allowing pipe-separated values in the --node-constraint flag. This change shifts specific constraints from nodeSelector to nodeAffinity to support fallback mechanisms. Key updates include logic in GetNodeSelector and GetAffinity to handle these multi-value constraints and "smart merge" TPU topologies, along with corresponding documentation and unit tests. Feedback suggests simplifying the parsing logic to prevent invalid manifests and renaming the topology label constant to acceleratorTopologyLabel to align with Google Cloud Terraform provider naming conventions.

Neelabh94 · 2026-05-14T09:51:08Z

/gemini review

gemini-code-assist

Code Review

This pull request enables dynamic routing and multiple topology support for GKE jobs by allowing pipe-separated values in the --node-constraint flag. The implementation updates the GKE orchestrator to move these constraints from nodeSelector to nodeAffinity and merges them with baseline topologies. Feedback suggests that the merging logic should be adjusted for dynamic slicing workloads to use annotations instead of strict affinity and that the attribute name accelerator_topology should be used for alignment with the Google Cloud Terraform provider. A correction is also required for duplicate example numbering in the documentation.

Neelabh94 · 2026-05-14T10:43:48Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces support for multiple topologies and dynamic routing in GKE jobs by allowing pipe-separated values in the --node-constraint flag. It updates the documentation with examples and a caution regarding dynamic slicing interactions. The underlying implementation in the GKE orchestrator was modified to handle these constraints via nodeAffinity instead of strict nodeSelector, including logic for smart merging of topology labels. Unit tests were added to verify the new scheduling and affinity logic. I have no feedback to provide as no review comments were present.

shubpal07

LGTM.

A quick note for our follow-up FR on TAS:

Currently, IsDynamicSlicing is resolved inside the gcluster manager strictly by querying GKE's API for the PROVISION_ONLY placement policy (autoscaling).

The Limitation: Pre-provisioned static GCE reservations may not carry the PROVISION_ONLY policy. If a user tries to logically partition (sub-slice) a static reservation pool, IsDynamicSlicing evaluates to false, and gcluster hardcodes the strict topology selector back into nodeSelector, triggering a scheduling deadlock.
Proposed Solution for Follow-up: In the next follow-up, we can refactor this to use a Topological Intent Check.

Instead of relying strictly on the PROVISION_ONLY placement policy check against GKE's API, gcluster's compiler should evaluate the user's topological intent automatically. The Intent Check: If the requested workload topology (e.g., 2x4 requested via --topology) is strictly smaller than, i.e. a proper sub-slice of, the GKE cluster's physically mapped node pool topology (e.g., 4x4), gcluster should automatically set IsDynamicSlicing to true.
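
A minimal sketch of what such an intent check might look like follows. It assumes that "smaller" means the requested topology has the same number of dimensions as the pool, fits within it in every dimension, and is strictly smaller in at least one, with no reordering of dimensions. `parseTopology` and `isSubSlice` are hypothetical names, and nothing here reflects how gcluster currently resolves IsDynamicSlicing.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTopology turns a string such as "2x4" into []int{2, 4}.
func parseTopology(s string) ([]int, error) {
	parts := strings.Split(strings.ToLower(strings.TrimSpace(s)), "x")
	dims := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n <= 0 {
			return nil, fmt.Errorf("invalid topology %q", s)
		}
		dims = append(dims, n)
	}
	return dims, nil
}

// isSubSlice reports whether the requested topology fits inside the pool
// topology dimension by dimension and is strictly smaller in at least one
// dimension. Topologies of different rank are treated as not comparable.
func isSubSlice(requested, pool string) (bool, error) {
	r, err := parseTopology(requested)
	if err != nil {
		return false, err
	}
	p, err := parseTopology(pool)
	if err != nil {
		return false, err
	}
	if len(r) != len(p) {
		return false, nil
	}
	smaller := false
	for i := range r {
		if r[i] > p[i] {
			return false, nil
		}
		if r[i] < p[i] {
			smaller = true
		}
	}
	return smaller, nil
}

func main() {
	ok, err := isSubSlice("2x4", "4x4")
	fmt.Println(ok, err)
}
```

Under these assumptions, a 2x4 request against a 4x4 pool counts as a sub-slice, while a 4x4 request against the same pool does not, so only the former would flip the hypothetical dynamic-slicing flag.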

Neelabh94 changed the title ~~implement dynamic topology~~ feat(job submission): Dynamic topology routing for gke jobs May 14, 2026