device taints and tolerations (KEP 5055)#130447
device taints and tolerations (KEP 5055)#130447k8s-ci-robot merged 11 commits intokubernetes:masterfrom
Conversation
|
Skipping CI for Draft Pull Request. |
72b0fae to
177e3b5
Compare
|
/approve for API |
| // allocatedClaims holds all currently known allocated claims. | ||
| allocatedClaims map[types.NamespacedName]allocatedClaim // A value is slightly more efficient in BenchmarkTaintUntaint (less allocations!). |
There was a problem hiding this comment.
Cool. I think it's clever to hold evictionTime here and it can make the eviction logic simpler.
The controller is derived from the node taint eviction controller. In contrast to that controller it tracks the UID of pods to prevent deleting the wrong pod when it got replaced.
Thanks to the tracker, the plugin sees all taints directly in the device definition and can compare it against the tolerations of a request while trying to find a device for the request. When the feature is turnedd off, taints are ignored during scheduling.
Both the new DeviceTaint.TimeAdded and dropped fields when the DRADeviceTaints feature is disabled confused the ResourceSlice controller because what is stored and sent back can be different from what the controller wants to store. It's now more lenient regarding TimeAdded (doesn't need to be exact because of rounding during serialization, only having a value on the server is okay) and dropped fields (doesn't try to store them again). It also preserves a server-side TimeAdded when updating slices.
In tests it is sometimes unavoidable to use the Prometheus types directly, for example when writing a custom gatherer which needs to normalize data before testing it. device_taint_eviction_test.go does this to strip out unpredictable data in a histogram. With type aliases in a package that is explicitly meant for tests we can avoid adding exceptions for such tests to the global exception list.
a0b3bbd to
9f16159
Compare
|
/assign @MaciekPytel @gjtempleton @x13n For SIG autoscaler contract change approval. This is the last missing approval. The change in CA should be small (replace ResourceSliceInformer with k8s.io/dynamic-resource-controller/resourceslice/tracker, which intentionally has a similar API), but it is a change that is needed to support taints also in CA. |
Well, only technically. An explicit approval from SIG Scheduling would be nice. @dom4ha reviewed. /assign @sanposhiho |
|
I see the only change in CA is a function rename, which is fine. Approving from CA point of view. /approve |
sanposhiho
left a comment
There was a problem hiding this comment.
/lgtm
/approve
for the sig-scheduling area. Leave one question, but not blocking one
| // Might be tainted, in which case the taint has to be tolerated. | ||
| // The check is skipped if the feature is disabled. | ||
| if alloc.deviceTaintsEnabled && !allTaintsTolerated(device.basic, request) { | ||
| return false, nil, nil |
There was a problem hiding this comment.
Not specifically about this feature though, are we not providing users with a way of knowing why each device is rejected? i.e., if a device is rejected by this feature, how could users notice that?
There was a problem hiding this comment.
I think it makes sense. That can be a future improvement.
There was a problem hiding this comment.
This is indeed problematic and also applies to other device selection criteria and more generally to all scheduling decisions. I discussed some ideas earlier in this PR.
There was a problem hiding this comment.
Scheduling decisions are somewhat visible to users via FailedScheduling events (ref). Although we know it's not detailed enough in some cases; it doesn't show full details like which plugin returns what result/score for which node at which extension point.
The only recommendation for users who want to see such details is utilizing https://github.com/kubernetes-sigs/kube-scheduler-simulator, for now.
There was a problem hiding this comment.
I think the device allocation is tied to ResourceClaim resource. So, How about exposing allocation failure (I imagined "can't tolerate" case only) information on ResourceClaim side by events or status??
There was a problem hiding this comment.
There may be hundreds of devices that can't be used to allocate a ResourceClaim. We cannot list all of them and the reason in each case.
A single "cannot allocate" event would be possible, but also not very informative.
|
LGTM label has been added. DetailsGit tree hash: d39e2ab48e1017aa2dcb6b510d1a405a76ba3c52 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: johnbelamaric, pohly, sanposhiho, thockin, x13n The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold I think we have all the necessary approval and reviews, but let's give everyone involved so far a chance to weigh in, in case that something was missed. |
| taintInformer.Informer().HasSynced, | ||
| classInformer.Informer().HasSynced, | ||
| }, | ||
| metrics: metrics.Global, |
There was a problem hiding this comment.
What's the reason why you need to use a global? Couldn't you call metrics.New() here and then modify the Register to take Metrics as receiver?
There was a problem hiding this comment.
This is the constructor called by kube-controller-manager. kube-controller-manager expects all controllers to use global metrics. I didn't want to break that contract.
There was a problem hiding this comment.
For example, while unusual it wouldn't be wrong to call New multiple times for the same name as long as only one of them gets to run or the running instance is stopped.
There was a problem hiding this comment.
I tried, but gave up because it also broke unit testing where New is called multiple times and the metrics names are expected to use the same name for all test cases.
If I don't hear otherwise, then I'll lift the hold in an hour. |
| } | ||
| } | ||
|
|
||
| // TODO: add tests after partitionable devices is merged (code conflict!) |
There was a problem hiding this comment.
@pohly should we sequence this PR after partitionable merges? There are also changes in partitionable that add a features struct for the feature gates in allocator.
There was a problem hiding this comment.
@cici37: you can remove this comment as part of your rebase. I already added those tests, as you will see while resolving conflicts.
There was a problem hiding this comment.
I copy-and-pasted as much of your test changes as possible, you should be able to simple add your new test cases.
There was a problem hiding this comment.
Thanks. Will review while rebasing^^
|
/hold cancel |
|
/retest |
|
@pohly: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
Device taints enable DRA drivers or admins to mark device as unusable, which prevents allocating them.
Pods may also get evicted at runtime if a device becomes unusable, depending on the severity of the taint
and whether the claim tolerates the taint.
Which issue(s) this PR fixes:
Related-to: kubernetes/enhancements#5055 (initial implementation)
Special notes for your reviewer:
Based on #130120.
Does this PR introduce a user-facing change?