Add support for Kueue 0.11.1 #3830

Merged
ighosh98 merged 12 commits into GoogleCloudPlatform:develop from mwysokin:add-kueue-0.11.1
Mar 26, 2025

Conversation

@mwysokin

Contributor

This PR makes the following changes:

  • Adds support for Kueue 0.11.1.
  • Consolidates the non-TAS and TAS (Topology Aware Scheduling) ResourceFlavors into a single ResourceFlavor, which now supports both types of workloads.
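
As a rough illustration of the consolidation, the sketch below builds one ResourceFlavor that references a Topology and one ClusterQueue that defines quota for that flavor only. It is not the manifest shipped in this PR: the object names, the node label value, and the GPU quota are hypothetical, and only the Kueue field names (`topologyName`, `nodeLabels`, `resourceGroups`, `nominalQuota`) are taken from the Kueue API.

```python
import json


def resource_flavor(name, topology_name, node_labels):
    """Build a single ResourceFlavor manifest that references a Topology."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ResourceFlavor",
        "metadata": {"name": name},
        "spec": {
            "nodeLabels": dict(node_labels),
            "topologyName": topology_name,
        },
    }


def cluster_queue(name, flavor_name, quotas):
    """Build a ClusterQueue manifest with exactly one flavor and one quota set."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "namespaceSelector": {},  # empty selector: admit from all namespaces
            "resourceGroups": [
                {
                    "coveredResources": sorted(quotas),
                    "flavors": [
                        {
                            "name": flavor_name,
                            "resources": [
                                {"name": res, "nominalQuota": qty}
                                for res, qty in sorted(quotas.items())
                            ],
                        }
                    ],
                }
            ],
        },
    }


# All names and values below are hypothetical, for illustration only.
flavor = resource_flavor(
    "example-flavor",
    "example-topology",
    {"cloud.google.com/gke-nodepool": "example-pool"},
)
queue = cluster_queue(
    "example-cluster-queue",
    flavor["metadata"]["name"],
    {"nvidia.com/gpu": 16},
)

print(json.dumps([flavor, queue], indent=2))
```

Because both kinds of workload share the one flavor, the quota is stated once instead of being split between a TAS flavor and a non-TAS flavor.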

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

…vors to a single ResourceFlavor which supports both.
@mwysokin requested review from a team and samskillman as code owners March 24, 2025 16:23

@ighosh98 left a comment

Contributor

Do you recommend customers use v0.11.1 as the default version?

Comment thread examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml Outdated
@mwysokin

Contributor Author

Do you recommend customers use v0.11.1 as the default version?

Yes, it actually makes things simpler for users. Some of the improvements are:

  • A single ResourceFlavor can now support both TAS and non-TAS workloads, which means that only a single quota is defined in the ClusterQueue.
  • There's a new BestFit placement policy, which increases performance for dense jobs.
  • Workloads without either TAS annotation (kueue.x-k8s.io/podset-preferred-topology or kueue.x-k8s.io/podset-required-topology) will still be scheduled with TAS, but in a new unconstrained/auto mode and without rank ordering. This is needed to fight fragmentation in clusters.
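
The last point can be read as a simple decision rule on the pod template annotations. The sketch below is a hypothetical helper written for illustration, not Kueue's implementation: the function name and the returned mode strings are made up, and rejecting a pod set that carries both annotations is an assumption of this sketch.

```python
PREFERRED = "kueue.x-k8s.io/podset-preferred-topology"
REQUIRED = "kueue.x-k8s.io/podset-required-topology"


def tas_mode(annotations):
    """Return (mode, topology_level) for a pod template's annotations.

    mode is "required", "preferred", or "unconstrained"; the last one
    stands for the new unconstrained/auto behaviour described above.
    """
    has_required = REQUIRED in annotations
    has_preferred = PREFERRED in annotations

    # Assumption of this sketch: the two annotations are mutually exclusive.
    if has_required and has_preferred:
        raise ValueError("set only one of the required/preferred annotations")

    if has_required:
        return ("required", annotations[REQUIRED])
    if has_preferred:
        return ("preferred", annotations[PREFERRED])
    # No TAS annotation: still scheduled with TAS, without rank ordering.
    return ("unconstrained", None)


# Example inputs; the topology level label is only an example value.
print(tas_mode({}))
print(tas_mode({PREFERRED: "kubernetes.io/hostname"}))
```

A workload that sets neither annotation falls through to the unconstrained case, which is why the same ResourceFlavor can serve it alongside explicitly TAS-annotated workloads.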

@ighosh98

Contributor

Thanks for confirming. Let's update the default version. We would also have to merge the Kueue v0.11.1 manifest in the manifest folder. I will raise a PR and add you as a reviewer. Let's merge this PR after the manifest is merged.

@ighosh98 added the release-key-new-features label (Added to release notes under the "Key New Features" heading.) Mar 24, 2025
@ighosh98

Contributor

Raised #3833 to support this PR. Please fix the errors highlighted by the test suite.

@ighosh98

Contributor

/gcbrun

@ighosh98 self-requested a review March 25, 2025 17:59
ighosh98 previously approved these changes Mar 25, 2025
annuay-google previously approved these changes Mar 25, 2025
@ighosh98

Contributor

@mwysokin can you fix the build errors caused by the documentation?

@ighosh98

ighosh98 commented Mar 25, 2025

Contributor

Please update this doc: https://cloud.google.com/ai-hypercomputer/docs/workloads/schedule-gke-workloads-tas with the latest Kueue version and resolve the merge conflicts.

@mwysokin dismissed stale reviews from annuay-google and ighosh98 via a8abd8b March 26, 2025 18:47
@ighosh98 self-requested a review March 26, 2025 18:52
@ighosh98

Contributor

/gcbrun

@ighosh98 enabled auto-merge March 26, 2025 22:08
@ighosh98

Contributor

/gcbrun

@ighosh98 merged commit 4af21e9 into GoogleCloudPlatform:develop Mar 26, 2025
@mwysokin deleted the add-kueue-0.11.1 branch March 27, 2025 08:07
Labels

release-key-new-features Added to release notes under the "Key New Features" heading.

4 participants