Skip to content

device taints and tolerations (KEP 5055)#130447

Merged
k8s-ci-robot merged 11 commits intokubernetes:masterfrom
pohly:dra-device-taints
Mar 19, 2025
Merged

device taints and tolerations (KEP 5055)#130447
k8s-ci-robot merged 11 commits intokubernetes:masterfrom
pohly:dra-device-taints

Conversation

@pohly
Copy link
Copy Markdown
Contributor

@pohly pohly commented Feb 26, 2025

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Device taints enable DRA drivers or admins to mark device as unusable, which prevents allocating them.
Pods may also get evicted at runtime if a device becomes unusable, depending on the severity of the taint
and whether the claim tolerates the taint.

Which issue(s) this PR fixes:

Related-to: kubernetes/enhancements#5055 (initial implementation)

Special notes for your reviewer:

Based on #130120.

Does this PR introduce a user-facing change?

DRA: Device taints enable DRA drivers or admins to mark device as unusable, which prevents allocating them. Pods may also get evicted at runtime if a device becomes unusable, depending on the severity of the taint and whether the claim tolerates the taint.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 26, 2025
@k8s-ci-robot k8s-ci-robot added area/code-generation area/kubelet area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 26, 2025
@pohly pohly moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Feb 26, 2025
@haircommander haircommander moved this from Triage to Archive-it in SIG Node CI/Test Board Feb 26, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2025
@pohly pohly force-pushed the dra-device-taints branch from 72b0fae to 177e3b5 Compare February 28, 2025 18:40
Copy link
Copy Markdown
Member

@dom4ha dom4ha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't not done yet, but from scheduler POV it looks good.

@dom4ha, @x13n: can you check the scheduler integration? There's a small impact on the autoscaler.

@x13n isn't a change required on the CA side as well then?

Comment thread pkg/controller/tainteviction/namespacedobject.go Outdated
@thockin
Copy link
Copy Markdown
Member

thockin commented Mar 18, 2025

/approve for API

Comment on lines +104 to +105
// allocatedClaims holds all currently known allocated claims.
allocatedClaims map[types.NamespacedName]allocatedClaim // A value is slightly more efficient in BenchmarkTaintUntaint (less allocations!).
Copy link
Copy Markdown
Contributor

@everpeace everpeace Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I think it's clever to hold evictionTime here and it can make the eviction logic simpler.

pohly and others added 5 commits March 19, 2025 09:18
The controller is derived from the node taint eviction controller.
In contrast to that controller it tracks the UID of pods to prevent
deleting the wrong pod when it got replaced.
Thanks to the tracker, the plugin sees all taints directly in the device
definition and can compare it against the tolerations of a request while
trying to find a device for the request.

When the feature is turnedd off, taints are ignored during scheduling.
Both the new DeviceTaint.TimeAdded and dropped fields when
the DRADeviceTaints feature is disabled confused the ResourceSlice
controller because what is stored and sent back can be different
from what the controller wants to store.

It's now more lenient regarding TimeAdded (doesn't need to be exact because of
rounding during serialization, only having a value on the server is okay)
and dropped fields (doesn't try to store them again). It also preserves
a server-side TimeAdded when updating slices.
In tests it is sometimes unavoidable to use the Prometheus types directly,
for example when writing a custom gatherer which needs to normalize data
before testing it. device_taint_eviction_test.go does this to strip
out unpredictable data in a histogram.

With type aliases in a package that is explicitly meant for tests we
can avoid adding exceptions for such tests to the global exception list.
@pohly pohly force-pushed the dra-device-taints branch from a0b3bbd to 9f16159 Compare March 19, 2025 08:18
@pohly
Copy link
Copy Markdown
Contributor Author

pohly commented Mar 19, 2025

/assign @MaciekPytel @gjtempleton @x13n

For SIG autoscaler contract change approval. This is the last missing approval.

The change in CA should be small (replace ResourceSliceInformer with k8s.io/dynamic-resource-controller/resourceslice/tracker, which intentionally has a similar API), but it is a change that is needed to support taints also in CA.

@pohly
Copy link
Copy Markdown
Contributor Author

pohly commented Mar 19, 2025

This is the last missing approval.

Well, only technically. An explicit approval from SIG Scheduling would be nice. @dom4ha reviewed.

/assign @sanposhiho

@x13n
Copy link
Copy Markdown
Member

x13n commented Mar 19, 2025

I see the only change in CA is a function rename, which is fine. Approving from CA point of view.

/approve

Copy link
Copy Markdown
Member

@sanposhiho sanposhiho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

for the sig-scheduling area. Leave one question, but not blocking one

// Might be tainted, in which case the taint has to be tolerated.
// The check is skipped if the feature is disabled.
if alloc.deviceTaintsEnabled && !allTaintsTolerated(device.basic, request) {
return false, nil, nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not specifically about this feature though, are we not providing users with a way of knowing why each device is rejected? i.e., if a device is rejected by this feature, how could users notice that?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense. That can be a future improvement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed problematic and also applies to other device selection criteria and more generally to all scheduling decisions. I discussed some ideas earlier in this PR.

Copy link
Copy Markdown
Member

@sanposhiho sanposhiho Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scheduling decisions are somewhat visible to users via FailedScheduling events (ref). Although we know it's not detailed enough in some cases; it doesn't show full details like which plugin returns what result/score for which node at which extension point.
The only recommendation for users who want to see such details is utilizing https://github.com/kubernetes-sigs/kube-scheduler-simulator, for now.

Copy link
Copy Markdown
Contributor

@everpeace everpeace Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the device allocation is tied to ResourceClaim resource. So, How about exposing allocation failure (I imagined "can't tolerate" case only) information on ResourceClaim side by events or status??

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be hundreds of devices that can't be used to allocate a ResourceClaim. We cannot list all of them and the reason in each case.

A single "cannot allocate" event would be possible, but also not very informative.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: d39e2ab48e1017aa2dcb6b510d1a405a76ba3c52

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, pohly, sanposhiho, thockin, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Member

@dom4ha dom4ha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@pohly
Copy link
Copy Markdown
Contributor Author

pohly commented Mar 19, 2025

/hold

I think we have all the necessary approval and reviews, but let's give everyone involved so far a chance to weigh in, in case that something was missed.

taintInformer.Informer().HasSynced,
classInformer.Informer().HasSynced,
},
metrics: metrics.Global,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason why you need to use a global? Couldn't you call metrics.New() here and then modify the Register to take Metrics as receiver?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the constructor called by kube-controller-manager. kube-controller-manager expects all controllers to use global metrics. I didn't want to break that contract.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, while unusual it wouldn't be wrong to call New multiple times for the same name as long as only one of them gets to run or the running instance is stopped.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, but gave up because it also broke unit testing where New is called multiple times and the metrics names are expected to use the same name for all test cases.

@pohly
Copy link
Copy Markdown
Contributor Author

pohly commented Mar 19, 2025

I think we have all the necessary approval and reviews, but let's give everyone involved so far a chance to weigh in, in case that something was missed.

If I don't hear otherwise, then I'll lift the hold in an hour.

}
}

// TODO: add tests after partitionable devices is merged (code conflict!)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pohly should we sequence this PR after partitionable merges? There are also changes in partitionable that add a features struct for the feature gates in allocator.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cici37: you can remove this comment as part of your rebase. I already added those tests, as you will see while resolving conflicts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copy-and-pasted as much of your test changes as possible, you should be able to simple add your new test cases.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Will review while rebasing^^

@johnbelamaric
Copy link
Copy Markdown
Member

/hold cancel

@cici37
Copy link
Copy Markdown
Contributor

cici37 commented Mar 19, 2025

/retest

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-apidiff-client-go 9f16159 link false /test pull-kubernetes-apidiff-client-go

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/code-generation area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status: API review completed, 1.33
Status: ✅ Done
Archived in project
Archived in project
Archived in project

Development

Successfully merging this pull request may close these issues.