Prioritized Alternatives in Device Requests#128586
Prioritized Alternatives in Device Requests#128586k8s-ci-robot merged 10 commits intokubernetes:masterfrom
Conversation
johnbelamaric
left a comment
There was a problem hiding this comment.
super cursory quick look...
31aaf8f to
19ab559
Compare
|
/assign |
19ab559 to
e49122d
Compare
e49122d to
fa17f45
Compare
Looks good to me.
What I tend to do in such cases is fix up my branch, force-push it, then rebase and force-push again. I then point reviewers at the first force-push diff to show what I changed. |
@pohly You still have a "request changes" status on the PR. Can you update it if this looks good to you?
Thanks, that is a good suggestion. I'll try to follow that pattern in the future. |
I don't find where that "request changes" is shown. I remember that a GitHub review has a button to resolve that, but I just don't find it now. Anyway, for merging a PR it doesn't matter that much. What matters is getting at least an informal LGTM from all relevant reviewers and then someone must do the formal one and approvals. So regarding that, I think this PR is ready for merging as far as the feature implementation goes. API definitions and their implementation still need to be reviewed. |
|
/assign @dom4ha |
|
/approve |
|
/approve (trying to cast approvals on thockin's behalf) |
|
@johnbelamaric we'll still need @thockin 's |
|
also hold for enough folks to chime in. /hold |
dom4ha
left a comment
There was a problem hiding this comment.
/lgtm from the scheduler perspective
| slices: slices, | ||
| celCache: celCache, | ||
| adminAccessEnabled: adminAccessEnabled, | ||
| prioritizedListEnabled: prioritizedListEnabled, |
There was a problem hiding this comment.
Have you considered creating features structure similar to the resourceclaim.Features?
There was a problem hiding this comment.
Yeah, I am actually making this change in the PR for Partitionable Devices. I suggest we address this in that PR so we can get this one merged.
|
/cc @macsko @sanposhiho for sig-scheduling approval |
|
@dom4ha: GitHub didn't allow me to request PR reviews from the following users: for, approval. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Where's the integration test for this feature? This file got the change to enable it, but don't we need any new test case specifically?
There was a problem hiding this comment.
I'm not sure how useful such a test would be. As agreed upon in kubernetes/enhancements#5077, not everything needs all kinds of tests.
There is an E2E test, which covers the full flow (API, scheduler, etc.).
There was a problem hiding this comment.
Might be Yes for some features, but this feature could be achieved with different components. Why don't we need an integration test for it?
There was a problem hiding this comment.
Which additional code paths or scenarios would an integration test cover that isn't already covered?
There was a problem hiding this comment.
How do we make sure the full scenario works from one component to others (or even more specifically, from one struct to others within the same component)? How do we make sure one component's behavior/output matches the expectations from others?
The integration test is not for the test coverage. It's not that once unit tests cover all code path, you can skip the integration tests.
There was a problem hiding this comment.
The firstAvailable is mostly internal to the plugin. It doesn't change how the plugin interacts with the scheduler (but see below). We still need to test with an apiserver to ensure that the fields flow properly from user to apiserver to scheduler to plugin, which is covered by the E2E test.
The only reason that I can think of for an integration test is the "feature disabled, field is set, PreFilter returns error". That is better tested through an integration test. But IMHO it's not important.
That shouldn't be in scheduler_perf. We can probably add it to test/integration/scheduler/plugins/plugins_test.go.
The happy path could be in scheduler_perf.
Morten is taking some time off right now. To unblock this PR, let me see whether I can do something and address #128586 (comment).
There was a problem hiding this comment.
The only reason that I can think of for an integration test is the "feature disabled, field is set, PreFilter returns error". That is better tested through an integration test. But IMHO it's not important.
Unfortunately that is exactly the scenario that currently cannot be handled in an integration test: apiserver and scheduler have to be brought up in the same process and share the global default feature gate. Without modifying both components it's impossible to bring up the apiserver with the feature enabled and the scheduler without it. It's not supported to change the feature gate during a test run (explicitly prevented by featuregatetesting.SetFeatureGateDuringTest). Perhaps some custom code could do it.
I think this falls under "version skew testing" which is not required for alpha. @sanposhiho: okay to do it later?
Instead, I added some test cases to scheduler_perf and to test/integration/dra, see #130622.
There was a problem hiding this comment.
okay to do it later?
🙆♂️
Instead, I added some test cases to scheduler_perf and to test/integration/dra, see #130622.
Thanks.
| return nil, status | ||
| } | ||
| } else { | ||
| for _, subRequest := range request.FirstAvailable { |
There was a problem hiding this comment.
do we need to feature-gate this path?
There was a problem hiding this comment.
Yes. As in the allocator it should refuse to schedule pods which depend on firstAvailable.
This then raises the question: does the allocator then still need that feature gate check? It should never get called for claims involving firstAvailable when the feature is off.
In other words, move the check up one level?
There was a problem hiding this comment.
Perhaps it's better to do it in both places. The allocator might also be used outside of the scheduler.
There was a problem hiding this comment.
This is the validation path only, so the only consequence of not having the feature-gate here is that PreFilter won't fail with the complain that request.DeviceClassName is missing (when firstAvailable was present) in PreFilter. The further attempt to use firstAvailable should still fail in allocator with a different error (most likely in Filter).
There was a problem hiding this comment.
True. But it's still better to fail earlier. I've implemented that, just need to finish the integration test updates.
| return nil, fmt.Errorf("claim %s, request %s: admin access is requested, but the feature is disabled", klog.KObj(claim), request.Name) | ||
| // Error out if the prioritizedList feature is not enabled and the request | ||
| // has subrequests. This is to avoid surprising behavior for users. | ||
| if !a.prioritizedListEnabled && hasSubRequests { |
There was a problem hiding this comment.
What if users followed enable -> disable path? i.e., a user created it with FirstAvailable with the feature gate enabled, and then disable the feature gate for some reason. It looks like they must update all claims not to have FirstAvailable when they need to disable the gate.
Is there any existing behavior that FirstAvailable requests can fall back to?
There was a problem hiding this comment.
Is there any existing behavior that FirstAvailable requests can fall back to?
No. All that Kubernetes can do is to not behave badly. Ignoring the sub-request and scheduling the pod without the devices that it needs is worse than refusing to schedule with an explanation why.
As you said, the user then needs to take action and either define their workload differently or accept that the cluster cannot run it.
There was a problem hiding this comment.
Then, is there any way to make it visible to users? (instead of logs)
There was a problem hiding this comment.
Refusal to deal with the request would cause PreFilter to fail and thus would get reported to the user as a scheduling failure.
There was a problem hiding this comment.
But only assuming the comment #128586 (comment) is addressed. Without it the validation will pass in PreFilter and a pod will most likely remain unschedulable (fail in Filter), which might be acceptable as well.
There was a problem hiding this comment.
#130622 adds that, so with that PR it'll fail already in PreFilter.
There was a problem hiding this comment.
I don't know whether @mortent is available to take my commits from #130622 into his branch for this PR. I therefore prefer merging that as a follow-up.
@sanposhiho: are you okay with that? I think you can do the formal LGTM.
I believe everything else has been reviewed (@dom4ha in #128586 (review) for scheduler, Tim in #128586 (review) for the API). It also looks good to me.
| return nil, status | ||
| } | ||
| } else { | ||
| for _, subRequest := range request.FirstAvailable { |
| return nil, fmt.Errorf("claim %s, request %s: admin access is requested, but the feature is disabled", klog.KObj(claim), request.Name) | ||
| // Error out if the prioritizedList feature is not enabled and the request | ||
| // has subrequests. This is to avoid surprising behavior for users. | ||
| if !a.prioritizedListEnabled && hasSubRequests { |
There was a problem hiding this comment.
#130622 adds that, so with that PR it'll fail already in PreFilter.
There was a problem hiding this comment.
The only reason that I can think of for an integration test is the "feature disabled, field is set, PreFilter returns error". That is better tested through an integration test. But IMHO it's not important.
Unfortunately that is exactly the scenario that currently cannot be handled in an integration test: apiserver and scheduler have to be brought up in the same process and share the global default feature gate. Without modifying both components it's impossible to bring up the apiserver with the feature enabled and the scheduler without it. It's not supported to change the feature gate during a test run (explicitly prevented by featuregatetesting.SetFeatureGateDuringTest). Perhaps some custom code could do it.
I think this falls under "version skew testing" which is not required for alpha. @sanposhiho: okay to do it later?
Instead, I added some test cases to scheduler_perf and to test/integration/dra, see #130622.
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dims, johnbelamaric, mortent, thockin The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
sanposhiho
left a comment
There was a problem hiding this comment.
/lgtm
/approve
sig-scheduling.
note: Some of my comments aren't addressed within this PR, but #130622 will address those. Also, those points are, either way, not super critical ones that we must include in the same PR.
|
LGTM label has been added. DetailsGit tree hash: 8942038fe5779e85d2db34003055e21edc7d9c1e |
|
/hold cancel |
Thanks @sanposhiho for your review! I've rebased #130622, let's continue there. |
What type of PR is this?
/kind feature
/kind api-change
/kind deprecation
What this PR does / why we need it:
This implements https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4816-dra-prioritized-list
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: