fix(gke): Missing Pathways Quotas in Kueue #5645

Merged
Neelabh94 merged 5 commits into GoogleCloudPlatform:develop from Neelabh94:pathways_template
May 12, 2026
@Neelabh94 Neelabh94 commented May 12, 2026

Problem Summary:

The Issue: When deploying GKE TPU clusters with Pathways enabled (enable_pathways_for_tpus: true), jobs would fail to schedule and remain in a Pending or Suspended state. While the cluster successfully provisioned the necessary cpu-np node pool for Pathways, the Kueue ClusterQueue was only configured to cover google.com/tpu resources, lacking coverage for the cpu and memory requested by the Pathways head pods.
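
The gap described above can be illustrated with a minimal Python sketch. It assumes that Kueue only admits a workload when every resource it requests appears under some `coveredResources` entry of the ClusterQueue. The function name `uncovered_resources`, the flavor name, and the quota and request values are hypothetical; only the resource names (`google.com/tpu`, `cpu`, `memory`) come from this PR.

```python
def uncovered_resources(cluster_queue: dict, requests: dict) -> list:
    """Return the requested resource names that no resourceGroup covers."""
    covered = set()
    for group in cluster_queue.get("spec", {}).get("resourceGroups", []):
        covered.update(group.get("coveredResources", []))
    return sorted(name for name in requests if name not in covered)


# ClusterQueue with TPU quota only, as before the fix (values are illustrative).
tpu_only_queue = {
    "kind": "ClusterQueue",
    "metadata": {"name": "cluster-queue"},
    "spec": {
        "resourceGroups": [
            {
                "coveredResources": ["google.com/tpu"],
                "flavors": [
                    {
                        "name": "tpu-flavor",  # hypothetical flavor name
                        "resources": [
                            {"name": "google.com/tpu", "nominalQuota": 16}
                        ],
                    }
                ],
            }
        ]
    },
}

# A Pathways head pod requests CPU and memory, not TPUs (illustrative values).
head_pod_requests = {"cpu": "8", "memory": "32Gi"}

print(uncovered_resources(tpu_only_queue, head_pod_requests))
```

Any non-empty result here corresponds to a request the queue cannot account for, which matches the Pending or Suspended symptom reported above.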

Root Cause: The problem stemmed from missing variable plumbing between blueprint modules and limitations in the ClusterQueue merging logic in the kubectl-apply module:

  • Missing Flag Propagation: The enable_pathways_for_tpus flag was set on the gke-cluster module (triggering node pool creation) but was not visible to the kubectl-apply module (responsible for Kueue config), so the internal pathways template with CPU/Memory quotas was never loaded.
  • Destructive Merging: Even if the template had been loaded, the previous merging logic in kubectl-apply would have overwritten user-defined fields in the ClusterQueue (such as cohort or namespace selectors) with the pathways template defaults instead of combining them.
  • Name Mismatch: The DWS-specific templates used a different ClusterQueue name (dws-cluster-queue), so the auto-merge logic, which relied on the standard name cluster-queue, never matched them.
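
The last two causes can be sketched in Python. This only illustrates the behaviour described in the list above and is not the module's actual Terraform: `overwrite_merge` is a hypothetical stand-in for the earlier logic, and the cohort name and `coveredResources` values are made up.

```python
PATHWAYS_QUEUE_NAME = "cluster-queue"  # standard name the auto-merge relied on


def overwrite_merge(user_doc: dict, pathways_doc: dict) -> dict:
    """Hypothetical stand-in for the earlier behaviour.

    Documents are matched by name only; on a match, the template's spec
    replaces the user's spec wholesale.
    """
    if user_doc["metadata"]["name"] != pathways_doc["metadata"]["name"]:
        return user_doc  # no match: Pathways quotas are never injected
    return {**user_doc, "spec": pathways_doc["spec"]}


pathways_template = {
    "kind": "ClusterQueue",
    "metadata": {"name": PATHWAYS_QUEUE_NAME},
    "spec": {"resourceGroups": [{"coveredResources": ["cpu", "memory"]}]},
}

user_queue = {
    "kind": "ClusterQueue",
    "metadata": {"name": "cluster-queue"},
    "spec": {
        "cohort": "team-a",  # illustrative user-defined field
        "resourceGroups": [{"coveredResources": ["google.com/tpu"]}],
    },
}

dws_queue = {**user_queue, "metadata": {"name": "dws-cluster-queue"}}

# Destructive merging: the user's cohort and TPU resource group are lost.
clobbered = overwrite_merge(user_queue, pathways_template)
print("cohort" in clobbered["spec"])

# Name mismatch: the DWS queue is returned untouched, without CPU/memory quotas.
print(overwrite_merge(dws_queue, pathways_template) is dws_queue)
```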

Solution: To resolve this without hardcoding values in user templates:

  • Seamless Variable Plumbing:

    • Declared enable_pathways_for_tpus as a top-level variable in the vars section of the blueprints and passed it to the gke-cluster module.
    • Added output "enable_pathways_for_tpus" to the gke-cluster module and a matching top-level variable "enable_pathways_for_tpus" to the kubectl-apply module.
      This allows Cluster Toolkit to automatically wire the flag from gke-cluster to kubectl-apply via the use mechanism, removing the need to manually pass it in the blueprint for kubectl-apply.
  • Robust ClusterQueue Merging:

    • Updated kubectl-apply/main.tf to perform a merge of the spec and metadata maps of matching ClusterQueue resources. This ensures that user-defined fields (like cohort, labels, and annotations) are preserved while still combining the resourceGroups arrays.
    • Added null-safety checks to the merging logic so that YAML documents whose metadata or spec is empty or null are handled.
  • Template Alignment:

    • Aligned the DWS template (tpu-dws-queues.yaml.tftpl) to use the standard cluster-queue name, enabling successful auto-merging with the Pathways quotas.
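
The merge semantics described under "Robust ClusterQueue Merging" can be sketched in Python. This is an illustration, not the Terraform in kubectl-apply/main.tf. The function name `merge_cluster_queues` is hypothetical, and two details are assumptions that the description does not state: user values win when both documents set the same key, and the user's resourceGroups come first in the combined list.

```python
def merge_cluster_queues(user_doc: dict, template_doc: dict) -> dict:
    """Merge two ClusterQueue documents that share a name.

    metadata and spec are merged key by key (assumption: user values win
    on conflict), and resourceGroups from both documents are concatenated.
    """
    # A bare `metadata:` or `spec:` key in YAML parses to None, so use {}.
    user_meta = user_doc.get("metadata") or {}
    tmpl_meta = template_doc.get("metadata") or {}
    user_spec = user_doc.get("spec") or {}
    tmpl_spec = template_doc.get("spec") or {}

    merged_spec = {**tmpl_spec, **user_spec}
    merged_spec["resourceGroups"] = (
        (user_spec.get("resourceGroups") or [])
        + (tmpl_spec.get("resourceGroups") or [])
    )
    return {
        **template_doc,
        **user_doc,
        "metadata": {**tmpl_meta, **user_meta},
        "spec": merged_spec,
    }


user_queue = {
    "kind": "ClusterQueue",
    "metadata": {"name": "cluster-queue", "labels": {"team": "a"}},
    "spec": {
        "cohort": "team-a",  # illustrative user-defined field
        "resourceGroups": [{"coveredResources": ["google.com/tpu"]}],
    },
}
pathways_template = {
    "kind": "ClusterQueue",
    "metadata": {"name": "cluster-queue"},
    "spec": {"resourceGroups": [{"coveredResources": ["cpu", "memory"]}]},
}

merged = merge_cluster_queues(user_queue, pathways_template)
print(merged["spec"]["cohort"], len(merged["spec"]["resourceGroups"]))

# Null safety: a document with empty metadata and spec still merges cleanly.
sparse = {"kind": "ClusterQueue", "metadata": None, "spec": None}
print(merge_cluster_queues(sparse, pathways_template)["metadata"]["name"])
```

Falling back to an empty map or list with `or` is the Python counterpart of the null-safety checks mentioned above.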

This fix ensures that Pathways quotas are automatically and safely injected into the user's ClusterQueue across all 8 relevant TPU blueprints.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@Neelabh94 Neelabh94 changed the base branch from main to develop May 12, 2026 07:06
@Neelabh94 Neelabh94 marked this pull request as ready for review May 12, 2026 07:07
@Neelabh94 Neelabh94 requested a review from a team as a code owner May 12, 2026 07:07
@Neelabh94 Neelabh94 added the release-bugfix Added to release notes under the "Bug fixes" heading. label May 12, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request parameterizes the enable_pathways_for_tpus setting across multiple GKE TPU blueprints by introducing a global variable. However, feedback indicates that in DWS-based blueprints, a naming mismatch between the defined dws-cluster-queue and the hardcoded cluster-queue in the kubectl-apply module will cause scheduling failures. Additionally, the current shallow merging logic in the kubectl-apply module risks dropping critical configuration fields such as cohort or preemption, suggesting a need for a deep merge implementation.

Comment thread examples/gke-tpu-v6e/gke-tpu-v6e.yaml Outdated
@Neelabh94

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request standardizes the enable_pathways_for_tpus configuration across several GKE TPU blueprints by introducing a top-level variable and linking it to the relevant modules. It also renames a ClusterQueue in a template and updates the kubectl-apply module to merge all spec fields for ClusterQueue resources. Feedback was provided to extend this merging logic to the metadata field to ensure user-defined labels and annotations are preserved.

Comment thread modules/management/kubectl-apply/main.tf Outdated
@Neelabh94

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the enable_pathways_for_tpus variable across several GKE TPU blueprints to automate CPU node pool provisioning and Kueue quota configuration. It also updates the kubectl-apply module to support merging ClusterQueue metadata and specifications. Review feedback identifies a critical need to ensure all template files are updated to the new cluster-queue name to avoid scheduling failures. Additionally, the Terraform logic in kubectl-apply should be hardened to handle null values during the merge process.

Comment thread examples/gke-tpu-7x/gke-tpu-7x.yaml
Comment thread modules/management/kubectl-apply/main.tf Outdated
@Neelabh94

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request standardizes the configuration of Pathways for TPUs across multiple GKE blueprints by introducing a top-level enable_pathways_for_tpus variable and wiring it through the gke-cluster and kubectl-apply modules. It also includes a rename of the cluster queue in the DWS flex-start template and updates the kubectl-apply module to better merge ClusterQueue resources. Review feedback highlights a missing variable wiring in the gke-tpu-v6e example and recommends flattening admissionChecks during the merge process to prevent configuration overwrites.
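
One way to read the admissionChecks recommendation is to combine the lists from both documents into a single list rather than letting one replace the other. The Python sketch below shows that reading only; it is not taken from the PR's Terraform. The helper name, the check names, and the de-duplication step are all assumptions.

```python
def merge_admission_checks(*specs) -> list:
    """Combine spec.admissionChecks from several specs into one flat list.

    Order of first appearance is kept; duplicates are dropped (an assumption).
    """
    combined = []
    for spec in specs:
        # Tolerate a missing or null spec and a missing or null list.
        for check in (spec or {}).get("admissionChecks") or []:
            if check not in combined:
                combined.append(check)
    return combined


user_spec = {"admissionChecks": ["dws-prov"]}  # hypothetical check name
template_spec = {"admissionChecks": ["dws-prov", "extra-check"]}  # hypothetical

print(merge_admission_checks(user_spec, template_spec))
print(merge_admission_checks(user_spec, None, {"admissionChecks": None}))
```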

Comment thread examples/gke-tpu-v6e/gke-tpu-v6e.yaml
Comment thread modules/management/kubectl-apply/main.tf
@Neelabh94 Neelabh94 enabled auto-merge (squash) May 12, 2026 08:52

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

LGTM!

@Neelabh94 Neelabh94 merged commit a879901 into GoogleCloudPlatform:develop May 12, 2026
22 of 87 checks passed
@Neelabh94 Neelabh94 deleted the pathways_template branch May 12, 2026 10:54

@shubpal07 shubpal07 left a comment

We may need to audit all templates to ensure that kueue-configuration.yaml.tftpl and all other blueprint-specific templates are updated to the new cluster-queue name. For example, https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-consumption-options/dws-flex-start-queued-provisioning/nccl-jobset-example.yaml#L21 references dws-local-queue.

We may also need to update any READMEs that reference dws-local-queue, for example https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/gke-consumption-options/dws-flex-start-queued-provisioning/README.md

@Neelabh94

Thanks for the review, Shubham. To clarify, the name of the LocalQueue remains dws-local-queue across all templates and documentation. This PR specifically standardizes the name of the underlying ClusterQueue to cluster-queue (the resource referenced by the LocalQueue) to enable the seamless merging of Pathways quota. Since job templates and user workloads interact with the LocalQueue name, existing workflows will continue to function without any modifications.
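
The indirection described in this reply can be sketched in Python. It assumes the usual Kueue convention that a job selects its LocalQueue through the `kueue.x-k8s.io/queue-name` label and that the LocalQueue names its ClusterQueue in `spec.clusterQueue`. The function name and the `default` namespace are hypothetical.

```python
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def resolve_cluster_queue(job: dict, local_queues: list) -> str:
    """Follow a job's queue-name label through its LocalQueue to a ClusterQueue."""
    queue_name = job["metadata"]["labels"][QUEUE_LABEL]
    namespace = job["metadata"].get("namespace", "default")
    for lq in local_queues:
        meta = lq["metadata"]
        if meta["name"] == queue_name and meta.get("namespace", "default") == namespace:
            return lq["spec"]["clusterQueue"]
    raise LookupError(f"no LocalQueue {queue_name!r} in namespace {namespace!r}")


def local_queue(cluster_queue_name: str) -> dict:
    """Build a dws-local-queue object pointing at the given ClusterQueue."""
    return {
        "kind": "LocalQueue",
        "metadata": {"name": "dws-local-queue", "namespace": "default"},
        "spec": {"clusterQueue": cluster_queue_name},
    }


# The job manifest is identical before and after the rename.
job = {"metadata": {"namespace": "default", "labels": {QUEUE_LABEL: "dws-local-queue"}}}

print(resolve_cluster_queue(job, [local_queue("dws-cluster-queue")]))
print(resolve_cluster_queue(job, [local_queue("cluster-queue")]))
```

Only the LocalQueue's `spec.clusterQueue` value differs between the two calls, which is why job templates that reference dws-local-queue need no edits.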
