-
Notifications
You must be signed in to change notification settings - Fork 507
Unblock scheduler after failed preemption(s) #7665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
|
Hi @olekzabl. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
|
@gabesaba @mbobrovskyi May I ask for a review at this point? While this is still technically a draft (due to non-perfect test coverage), I'd like to verify my testing approach with you before I continue work on that. And the code written so far I've already tried to "polish" reasonably. Please see the PR description for more context. |
|
/test |
|
@olekzabl: The Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test all |
|
/retest |
|
/retest |
| if failed != 0 { | ||
| t.Errorf("Reported %d failed preemptions, want 0", failed) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered adding unit test(s) for this change (here or in pkg/scheduler/scheduler_test.go)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered - but (as mentioned in description of this PR) I started with smaller coverage due to reported urgency of this PR.
Since you ask about it, I opened a follow-up issue: #7806
| return fallThrough, nil | ||
| } | ||
| // Ignore patches triggered by util.FinishEvictionForWorkloads() | ||
| if cond := apimeta.FindStatusCondition(wl.Status.Conditions, kueue.WorkloadQuotaReserved); cond != nil && cond.Status == metav1.ConditionFalse && cond.Message == "By test" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if someone changes that "By test" message, we might start getting failures here? Perhaps add a constant to test/util/constants.go with this message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered it - but then I noticed lots of "by test" literals spread across our tests.
So I concluded adding such constant would go a bit against our current conventions.
|
Left a few non-blocking comments but overall lgtm |
| } | ||
|
|
||
| func (f *Framework) StartManager(ctx context.Context, cfg *rest.Config, managerSetup ManagerSetup) { | ||
| func (f *Framework) StartManager(ctx context.Context, cfg *rest.Config, managerSetup ManagerSetup, opts ...ManagerSetupOption) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, i see the issue.
Well then as I stated - I think updating mgrOptions directly instead of using managerSetupOptions removes unnecessary busywork and the cost of direct access to options seems negligible given we are talking about tests.
But leaving as is is also okay - leaving it up to you to decide.
|
Nice, thanks! |
|
LGTM label has been added. DetailsGit tree hash: d3284f65ff732b0c8d8308bb31a8f4bf98bf36ad |
mimowo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, looks quite good 👍
/approve
/cherrypick release-0.14
/cherrypick release-0.13
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, olekzabl, Singularity23x0 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@olekzabl please prepare CPs manually in case the automated cherrypick fails |
Hmm, I don't see any activity from k8s-infra-cherrypick-robot following your earlier cherrypick slash-commands. No idea why. I'll retry these commands below, though not sure if I have permissions for this. |
|
/cherrypick release-0.14 |
|
@olekzabl: #7665 failed to apply on top of branch "release-0.14": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherrypick release-0.13 |
|
@olekzabl: #7665 failed to apply on top of branch "release-0.13": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
…tion(s) (#7817) * Add `RequeueReasonPreemptionFailed` * Add integration test + a way to inject failures * Address a comment * Adjust function signature * Only set "failed" reason when sth certainly failed * Remove one layer of options * Fix import mismatch
…tion(s) (#7818) * Add `RequeueReasonPreemptionFailed` * Add integration test + a way to inject failures * Address a comment * Adjust function signature * Only set "failed" reason when sth certainly failed * Remove one layer of options * Fix import mismatch
| gomega.Expect(err).NotTo(gomega.HaveOccurred()) | ||
| } | ||
|
|
||
| func setupInterceptedClient() (context.Context, client.Client) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, I think it is nice to see this PR paves the way for simulating API server errors. We have some older code which is not tested wrt error handling because we never added the mechanism. For example #7364, but many more in the past.
Now I can refer to this PR as an example for covering error cases in integration tests. Thanks !
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #5590
Special notes for your reviewer:
This PR is currently split into 2 commits:
The first commit is small, and fixes the issue.
(In short, we're roughly aligned with @mimowo that it does.
For details, see the discussion in Kueue Scheduler getting stuck if preemption request fails #5590, starting from this comment)
The second commit is my attempt to give this fix any reasonable test coverage.
I chose integration tests (to verify the whole path of "preemption error -> retry -> desired outcome").
However, for that, I needed selectively injecting fake errors into K8s client, in integration tests, which AFAICS has been never done in Kueue.
Hence, I needed some custom tweaking of the test setup to achieve that.
I'm curious about reviewers' opinions whether this is nice enough.
The test coverage may be not yet as full as I'd wish (for example, I haven't cared for testing the "preemption failed" vs. "sticky workloads" interaction); yet, as I'm hearing this issue gained some urgency, I'm un-drafting this PR now, hoping that it may be good to go as it is. Then, further tests may come in a follow-up PR. (Tracked in issue #7806).
Does this PR introduce a user-facing change?