@olekzabl
Contributor

@olekzabl olekzabl commented Nov 14, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #5590

Special notes for your reviewer:

This PR is currently split into 2 commits:

  • The first commit is small, and fixes the issue.
    (In short, @mimowo and I roughly agree that it does.
    For details, see the discussion in "Kueue Scheduler getting stuck if preemption request fails" (#5590), starting from this comment.)

  • The second commit is my attempt to give this fix reasonable test coverage.
    I chose integration tests (to verify the whole path of "preemption error -> retry -> desired outcome").
    For that, however, I needed to selectively inject fake errors into the K8s client in integration tests, which AFAICS has never been done in Kueue.
    Hence, I needed some custom tweaks to the test setup to achieve that.
    I'm curious whether reviewers think this approach is clean enough.

The test coverage may not yet be as complete as I'd like (for example, I haven't tested the interaction between "preemption failed" and "sticky workloads"). However, since I hear this issue has gained some urgency, I'm un-drafting this PR now, hoping that it is good to go as it is. Further tests may then come in a follow-up PR (tracked in issue #7806).

Does this PR introduce a user-facing change?

Fix a bug where an error during workload preemption could leave the scheduler stuck without retrying.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 14, 2025
@netlify

netlify bot commented Nov 14, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit caf3b8b
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69204ad5c33d9d0008be211f

@k8s-ci-robot
Contributor

Hi @olekzabl. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 14, 2025
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 14, 2025
@mimowo
Contributor

mimowo commented Nov 14, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 14, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 18, 2025
@olekzabl
Contributor Author

@gabesaba @mbobrovskyi May I ask for a review at this point?

While this is still technically a draft (due to imperfect test coverage), I'd like to verify my testing approach with you before I continue working on it. I've already tried to polish the code written so far to a reasonable degree.

Please see the PR description for more context.

@olekzabl
Contributor Author

/test

@k8s-ci-robot
Contributor

@olekzabl: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test pull-kueue-build-image-main
/test pull-kueue-test-e2e-certmanager-main
/test pull-kueue-test-e2e-customconfigs-main
/test pull-kueue-test-e2e-kueueviz-main
/test pull-kueue-test-e2e-main-1-32
/test pull-kueue-test-e2e-main-1-33
/test pull-kueue-test-e2e-main-1-34
/test pull-kueue-test-e2e-multikueue-main
/test pull-kueue-test-e2e-tas-main
/test pull-kueue-test-integration-baseline-main
/test pull-kueue-test-integration-extended-main
/test pull-kueue-test-integration-multikueue-main
/test pull-kueue-test-scheduling-perf-main
/test pull-kueue-test-unit-main
/test pull-kueue-verify-main

Use /test all to run all jobs.


@olekzabl
Contributor Author

/test all

@olekzabl
Contributor Author

/retest

@olekzabl
Contributor Author

/retest

@olekzabl olekzabl marked this pull request as ready for review November 20, 2025 09:10
@olekzabl olekzabl requested a review from pajakd November 21, 2025 09:11
Comment on lines +4154 to +4156
if failed != 0 {
t.Errorf("Reported %d failed preemptions, want 0", failed)
}
Contributor

Have you considered adding unit test(s) for this change (here or in pkg/scheduler/scheduler_test.go)?

Contributor Author

I considered it, but (as mentioned in the description of this PR) I started with smaller coverage due to the reported urgency of this PR.

Since you ask about it, I opened a follow-up issue: #7806

return fallThrough, nil
}
// Ignore patches triggered by util.FinishEvictionForWorkloads()
if cond := apimeta.FindStatusCondition(wl.Status.Conditions, kueue.WorkloadQuotaReserved); cond != nil && cond.Status == metav1.ConditionFalse && cond.Message == "By test" {
Contributor

nit: if someone changes that "By test" message, we might start getting failures here? Perhaps add a constant to test/util/constants.go with this message.

Contributor Author

I considered it, but then I noticed lots of "by test" literals spread across our tests.
So I concluded that adding such a constant would go somewhat against our current conventions.
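
For context, the reviewer's suggestion could look roughly like the sketch below. The constant name `EvictionByTestMessage`, the simplified `condition` type, and the string values for the condition type and status are assumptions made for illustration; the thread does not say what such a constant would be called, and the PR kept the literal.

```go
package main

import "fmt"

// EvictionByTestMessage is a hypothetical shared constant for the message
// that test helpers put on conditions they set.
const EvictionByTestMessage = "By test"

// condition is a simplified stand-in for a Kubernetes status condition.
type condition struct {
	Type    string
	Status  string
	Message string
}

// isTestEviction reports whether conds contains a QuotaReserved=False
// condition whose message marks it as having been set by a test helper.
func isTestEviction(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "QuotaReserved" && c.Status == "False" && c.Message == EvictionByTestMessage {
			return true
		}
	}
	return false
}

func main() {
	conds := []condition{{Type: "QuotaReserved", Status: "False", Message: EvictionByTestMessage}}
	fmt.Println(isTestEviction(conds))
}
```

With a shared constant, the helper that writes the message and the check that reads it cannot drift apart silently. The trade-off raised in the reply still applies: it only helps if the other literals are migrated too.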

@pajakd
Contributor

pajakd commented Nov 21, 2025

Left a few non-blocking comments but overall lgtm

}

-func (f *Framework) StartManager(ctx context.Context, cfg *rest.Config, managerSetup ManagerSetup) {
+func (f *Framework) StartManager(ctx context.Context, cfg *rest.Config, managerSetup ManagerSetup, opts ...ManagerSetupOption) {
Contributor

OK, I see the issue.

As I stated, I think updating mgrOptions directly instead of using managerSetupOptions removes unnecessary busywork, and the cost of direct access to the options seems negligible given that we are talking about tests.

But leaving it as is is also okay; I'll leave it up to you to decide.
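
The two alternatives weighed in this thread can be compared in a small sketch. The names below (`managerOptions`, `WithErrorInjection`, `startManager`, `startManagerDirect`) are hypothetical simplifications, not the actual types of Kueue's test framework; only the idea of a variadic `opts ...ManagerSetupOption` parameter comes from the diff above.

```go
package main

import "fmt"

// managerOptions is a hypothetical stand-in for the manager's settings.
type managerOptions struct {
	name         string
	injectErrors bool
}

// ManagerSetupOption mutates the options; this is the functional-options style.
type ManagerSetupOption func(*managerOptions)

// WithErrorInjection is a hypothetical option that turns on fault injection.
func WithErrorInjection() ManagerSetupOption {
	return func(o *managerOptions) { o.injectErrors = true }
}

// startManager is alternative A: callers pass zero or more options, so
// existing call sites keep compiling unchanged.
func startManager(name string, opts ...ManagerSetupOption) managerOptions {
	o := managerOptions{name: name}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

// startManagerDirect is alternative B: callers fill in the options struct
// themselves, with no extra option type or constructor functions.
func startManagerDirect(o managerOptions) managerOptions {
	return o
}

func main() {
	a := startManager("test", WithErrorInjection())
	b := startManagerDirect(managerOptions{name: "test", injectErrors: true})
	fmt.Println(a == b)
}
```

Both calls yield the same configuration. The variadic form costs one small function per setting but leaves old callers untouched, while direct access is shorter to write and exposes the struct's fields to every test.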

@PBundyra
Contributor

Nice, thanks!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: d3284f65ff732b0c8d8308bb31a8f4bf98bf36ad

Contributor

@mimowo mimowo left a comment

Awesome, looks quite good 👍
/approve
/cherrypick release-0.14
/cherrypick release-0.13

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, olekzabl, Singularity23x0

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 21, 2025
@mimowo
Contributor

mimowo commented Nov 21, 2025

@olekzabl please prepare CPs manually in case the automated cherrypick fails

@k8s-ci-robot k8s-ci-robot merged commit ea52fbe into kubernetes-sigs:main Nov 21, 2025
22 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.15 milestone Nov 21, 2025
@olekzabl
Contributor Author

> @olekzabl please prepare CPs manually in case the automated cherrypick fails

Hmm, I don't see any activity from k8s-infra-cherrypick-robot following your earlier cherrypick slash-commands. No idea why.

I'll retry these commands below, though I'm not sure whether I have permissions for this.

@olekzabl
Contributor Author

/cherrypick release-0.14

@k8s-infra-cherrypick-robot
Contributor

@olekzabl: #7665 failed to apply on top of branch "release-0.14":

Applying: Add `RequeueReasonPreemptionFailed`
Using index info to reconstruct a base tree...
M	pkg/cache/queue/cluster_queue.go
M	pkg/scheduler/preemption/preemption.go
M	pkg/scheduler/preemption/preemption_hierarchical_test.go
M	pkg/scheduler/preemption/preemption_test.go
M	pkg/scheduler/scheduler.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/scheduler/scheduler.go
CONFLICT (content): Merge conflict in pkg/scheduler/scheduler.go
Auto-merging pkg/scheduler/preemption/preemption_test.go
CONFLICT (content): Merge conflict in pkg/scheduler/preemption/preemption_test.go
Auto-merging pkg/scheduler/preemption/preemption_hierarchical_test.go
CONFLICT (content): Merge conflict in pkg/scheduler/preemption/preemption_hierarchical_test.go
Auto-merging pkg/scheduler/preemption/preemption.go
CONFLICT (content): Merge conflict in pkg/scheduler/preemption/preemption.go
Auto-merging pkg/cache/queue/cluster_queue.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Add `RequeueReasonPreemptionFailed`


@olekzabl
Contributor Author

/cherrypick release-0.13

@k8s-infra-cherrypick-robot
Contributor

@olekzabl: #7665 failed to apply on top of branch "release-0.13":

Applying: Add `RequeueReasonPreemptionFailed`
Using index info to reconstruct a base tree...
A	pkg/cache/queue/cluster_queue.go
M	pkg/scheduler/preemption/preemption.go
M	pkg/scheduler/preemption/preemption_hierarchical_test.go
M	pkg/scheduler/preemption/preemption_test.go
M	pkg/scheduler/scheduler.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/scheduler/scheduler.go
CONFLICT (content): Merge conflict in pkg/scheduler/scheduler.go
Auto-merging pkg/scheduler/preemption/preemption_test.go
CONFLICT (content): Merge conflict in pkg/scheduler/preemption/preemption_test.go
Auto-merging pkg/scheduler/preemption/preemption_hierarchical_test.go
CONFLICT (content): Merge conflict in pkg/scheduler/preemption/preemption_hierarchical_test.go
Auto-merging pkg/scheduler/preemption/preemption.go
CONFLICT (content): Merge conflict in pkg/scheduler/preemption/preemption.go
Auto-merging pkg/queue/cluster_queue.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Add `RequeueReasonPreemptionFailed`


k8s-ci-robot pushed a commit that referenced this pull request Nov 24, 2025
…tion(s) (#7817)

* Add `RequeueReasonPreemptionFailed`

* Add integration test + a way to inject failures

* Address a comment

* Adjust function signature

* Only set "failed" reason when sth certainly failed

* Remove one layer of options

* Fix import mismatch
k8s-ci-robot pushed a commit that referenced this pull request Nov 24, 2025
…tion(s) (#7818)

* Add `RequeueReasonPreemptionFailed`

* Add integration test + a way to inject failures

* Address a comment

* Adjust function signature

* Only set "failed" reason when sth certainly failed

* Remove one layer of options

* Fix import mismatch
gomega.Expect(err).NotTo(gomega.HaveOccurred())
}

func setupInterceptedClient() (context.Context, client.Client) {
Contributor

By the way, it is nice to see that this PR paves the way for simulating API server errors. We have some older code whose error handling is untested because we never added such a mechanism. One example is #7364, but there have been many more in the past.

Now I can refer to this PR as an example for covering error cases in integration tests. Thanks!

cc @PBundyra @pajakd

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Kueue Scheduler getting stuck if preemption request fails

7 participants