device taints and tolerations (KEP 5055) by pohly · Pull Request #130447 · kubernetes/kubernetes

pohly · 2025-02-26T13:55:06Z

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Device taints enable DRA drivers or admins to mark devices as unusable, which prevents allocating them.
Pods may also get evicted at runtime if a device becomes unusable, depending on the severity of the taint
and whether the claim tolerates the taint.

Which issue(s) this PR fixes:

Related-to: kubernetes/enhancements#5055 (initial implementation)

Special notes for your reviewer:

Based on #130120.

Does this PR introduce a user-facing change?

DRA: Device taints enable DRA drivers or admins to mark devices as unusable, which prevents allocating them. Pods may also get evicted at runtime if a device becomes unusable, depending on the severity of the taint and whether the claim tolerates the taint.

k8s-ci-robot · 2025-02-26T13:55:09Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

dom4ha

I'm not done yet, but from the scheduler POV it looks good.

@dom4ha, @x13n: can you check the scheduler integration? There's a small impact on the autoscaler.

@x13n isn't a change required on the CA side as well then?

thockin · 2025-03-18T23:24:26Z

/approve for API

everpeace · 2025-03-19T07:34:27Z

+	// allocatedClaims holds all currently known allocated claims.
+	allocatedClaims map[types.NamespacedName]allocatedClaim // A value is slightly more efficient in BenchmarkTaintUntaint (less allocations!).


Cool. I think it's clever to hold evictionTime here and it can make the eviction logic simpler.

The controller is derived from the node taint eviction controller. In contrast to that controller it tracks the UID of pods to prevent deleting the wrong pod when it got replaced.
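The UID tracking can be pictured with a small in-memory sketch. Everything here (`pod`, `podStore`, `deleteWithUID`, `errUIDMismatch`) is hypothetical and only simulates the behavior; a real client would more likely express the same guard through a UID precondition in `metav1.DeleteOptions`, which is an assumption about the implementation rather than something this thread confirms.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a minimal stand-in for a Pod object.
type pod struct {
	Namespace, Name, UID string
}

// podStore simulates the API server's view, keyed by "namespace/name".
type podStore struct {
	pods map[string]pod
}

var errUIDMismatch = errors.New("pod was replaced: UID does not match")

// deleteWithUID deletes the pod only if the stored UID matches the UID
// that the caller recorded when it decided to evict.
func (s *podStore) deleteWithUID(namespace, name, uid string) error {
	key := namespace + "/" + name
	current, ok := s.pods[key]
	if !ok {
		return nil // already gone, nothing to do
	}
	if current.UID != uid {
		return errUIDMismatch // same name, different pod: leave it alone
	}
	delete(s.pods, key)
	return nil
}

func main() {
	// The pod "worker" was recreated under the same name with a new UID
	// after the controller recorded "old-uid".
	store := &podStore{pods: map[string]pod{
		"default/worker": {Namespace: "default", Name: "worker", UID: "new-uid"},
	}}

	err := store.deleteWithUID("default", "worker", "old-uid")
	fmt.Println("delete with stale UID:", err)

	err = store.deleteWithUID("default", "worker", "new-uid")
	fmt.Println("delete with current UID:", err)
}
```

Matching on name alone would have deleted the replacement pod in the first call; the UID check turns that case into a no-op with an error the caller can ignore or log.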

Thanks to the tracker, the plugin sees all taints directly in the device definition and can compare them against the tolerations of a request while trying to find a device for the request. When the feature is turned off, taints are ignored during scheduling.
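A rough sketch of that comparison follows. The `taint` and `toleration` structs are simplified local types, and the matching rules are assumed to mirror node taints and tolerations (an empty key or effect in a toleration acts as a wildcard, `Exists` ignores the value, an empty operator behaves like `Equal`). The function names `tolerates` and `deviceUsable` are made up for this example and are not the allocator's real helpers.

```go
package main

import "fmt"

// taint and toleration are simplified stand-ins for the API types.
type taint struct {
	Key, Value, Effect string
}

type toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether a single toleration covers a single taint.
// Assumption: same semantics as node tolerations.
func tolerates(tol toleration, t taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "Equal", "":
		return tol.Value == t.Value
	default:
		return false
	}
}

// deviceUsable returns true if every taint is covered by at least one
// toleration. With the feature disabled, taints are ignored entirely.
func deviceUsable(featureEnabled bool, taints []taint, tolerations []toleration) bool {
	if !featureEnabled {
		return true
	}
	for _, t := range taints {
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	taints := []taint{{Key: "example.com/unhealthy", Value: "true", Effect: "NoSchedule"}}
	tolerations := []toleration{{Key: "example.com/unhealthy", Operator: "Exists"}}

	fmt.Println("no tolerations:", deviceUsable(true, taints, nil))
	fmt.Println("with toleration:", deviceUsable(true, taints, tolerations))
	fmt.Println("feature disabled:", deviceUsable(false, taints, nil))
}
```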

Both the new DeviceTaint.TimeAdded field and the fields that are dropped when the DRADeviceTaints feature is disabled confused the ResourceSlice controller, because what is stored and sent back can differ from what the controller wants to store. The controller is now more lenient regarding TimeAdded (it doesn't need to match exactly because of rounding during serialization, and a value that exists only on the server is okay) and regarding dropped fields (it doesn't try to store them again). It also preserves a server-side TimeAdded when updating slices.

In tests it is sometimes unavoidable to use the Prometheus types directly, for example when writing a custom gatherer which needs to normalize data before testing it. device_taint_eviction_test.go does this to strip out unpredictable data in a histogram. With type aliases in a package that is explicitly meant for tests we can avoid adding exceptions for such tests to the global exception list.

pohly · 2025-03-19T08:30:29Z

/assign @MaciekPytel @gjtempleton @x13n

For SIG autoscaler contract change approval. This is the last missing approval.

The change in CA should be small (replace ResourceSliceInformer with k8s.io/dynamic-resource-controller/resourceslice/tracker, which intentionally has a similar API), but it is a change that is needed to support taints also in CA.

pohly · 2025-03-19T08:38:44Z

This is the last missing approval.

Well, only technically. An explicit approval from SIG Scheduling would be nice. @dom4ha reviewed.

/assign @sanposhiho

x13n · 2025-03-19T09:08:26Z

I see the only change in CA is a function rename, which is fine. Approving from CA point of view.

/approve

sanposhiho

/lgtm
/approve

for the sig-scheduling area. I left one question, but it's not a blocking one.

sanposhiho · 2025-03-19T09:07:57Z

+	// Might be tainted, in which case the taint has to be tolerated.
+	// The check is skipped if the feature is disabled.
+	if alloc.deviceTaintsEnabled && !allTaintsTolerated(device.basic, request) {
+		return false, nil, nil


Not specifically about this feature, but are we not providing users with a way of knowing why each device was rejected? I.e., if a device is rejected by this feature, how would users notice that?

I think it makes sense. That can be a future improvement.

This is indeed problematic and also applies to other device selection criteria and more generally to all scheduling decisions. I discussed some ideas earlier in this PR.

Scheduling decisions are somewhat visible to users via FailedScheduling events (ref), although we know that's not detailed enough in some cases: it doesn't show full details like which plugin returns what result/score for which node at which extension point.
The only recommendation for users who want to see such details is utilizing https://github.com/kubernetes-sigs/kube-scheduler-simulator, for now.

I think device allocation is tied to the ResourceClaim resource. So how about exposing allocation failure information (I imagined the "can't tolerate" case only) on the ResourceClaim side via events or status?

There may be hundreds of devices that can't be used to allocate a ResourceClaim. We cannot list all of them and the reason in each case.

A single "cannot allocate" event would be possible, but also not very informative.

k8s-ci-robot · 2025-03-19T09:11:23Z

LGTM label has been added.

Git tree hash: d39e2ab48e1017aa2dcb6b510d1a405a76ba3c52

k8s-ci-robot · 2025-03-19T09:11:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, pohly, sanposhiho, thockin, x13n

Needs approval from an approver in each of these files:

~~api/OWNERS~~ [thockin]
~~cmd/kube-controller-manager/OWNERS~~ [thockin]
~~pkg/api/OWNERS~~ [thockin]
~~pkg/apis/OWNERS~~ [thockin]
~~pkg/controller/OWNERS~~ [thockin]
~~pkg/features/OWNERS~~ [thockin]
~~pkg/generated/openapi/OWNERS~~ [thockin]
~~pkg/kubeapiserver/OWNERS~~ [thockin]
~~pkg/printers/OWNERS~~ [thockin]
~~pkg/registry/OWNERS~~ [thockin]
~~pkg/scheduler/OWNERS~~ [sanposhiho,thockin]
~~pkg/scheduler/framework/autoscaler_contract/OWNERS~~ [x13n]
~~plugin/pkg/auth/authorizer/OWNERS~~ [thockin]
~~staging/src/k8s.io/api/OWNERS~~ [thockin]
~~staging/src/k8s.io/client-go/OWNERS~~ [thockin]
~~staging/src/k8s.io/component-base/metrics/OWNERS~~ [johnbelamaric,thockin]
~~staging/src/k8s.io/dynamic-resource-allocation/OWNERS~~ [pohly,thockin]
~~test/OWNERS~~ [pohly,thockin]
~~test/compatibility_lifecycle/reference/OWNERS~~ [thockin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dom4ha

/lgtm

pohly · 2025-03-19T09:25:24Z

/hold

I think we have all the necessary approvals and reviews, but let's give everyone involved so far a chance to weigh in, in case something was missed.

dgrisonnet · 2025-03-19T09:57:20Z

+			taintInformer.Informer().HasSynced,
+			classInformer.Informer().HasSynced,
+		},
+		metrics: metrics.Global,


What's the reason why you need to use a global? Couldn't you call metrics.New() here and then modify the Register to take Metrics as receiver?

This is the constructor called by kube-controller-manager. kube-controller-manager expects all controllers to use global metrics. I didn't want to break that contract.

For example, while unusual, it wouldn't be wrong to call New multiple times for the same name, as long as only one of them gets to run or the running instance is stopped.

I tried, but gave up because it also broke unit testing, where New is called multiple times and the metrics are expected to use the same names for all test cases.

pohly · 2025-03-19T17:21:33Z

I think we have all the necessary approvals and reviews, but let's give everyone involved so far a chance to weigh in, in case something was missed.

If I don't hear otherwise, then I'll lift the hold in an hour.

johnbelamaric · 2025-03-19T17:33:19Z

 	}
 }
+
+// TODO: add tests after partitionable devices is merged (code conflict!)


@pohly should we sequence this PR after partitionable merges? There are also changes in partitionable that add a features struct for the feature gates in allocator.

@cici37: you can remove this comment as part of your rebase. I already added those tests, as you will see while resolving conflicts.

I copy-and-pasted as much of your test changes as possible; you should be able to simply add your new test cases.

Thanks. Will review while rebasing^^

johnbelamaric · 2025-03-19T17:46:03Z

/hold cancel

cici37 · 2025-03-19T18:15:59Z

/retest

k8s-ci-robot · 2025-03-19T18:19:00Z

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-apidiff-client-go | `9f16159` | link | false | `/test pull-kubernetes-apidiff-client-go` |

k8s-ci-robot requested review from andrewsykim and ardaguclu February 26, 2025 13:55

k8s-ci-robot added the area/apiserver label Feb 26, 2025

pohly mentioned this pull request Feb 26, 2025

DRA: device taints and tolerations kubernetes/enhancements#5055

Open

19 tasks

pohly moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Feb 26, 2025

haircommander moved this from Triage to Archive-it in SIG Node CI/Test Board Feb 26, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2025

pohly force-pushed the dra-device-taints branch from 72b0fae to 177e3b5 Compare February 28, 2025 18:40

dom4ha reviewed Mar 18, 2025

Comment thread pkg/controller/tainteviction/namespacedobject.go Outdated

everpeace reviewed Mar 19, 2025

pohly and others added 5 commits March 19, 2025 09:18

DRA: add device taint eviction controller

a027b43

DRA E2E: tests for device taints

2499663

pohly force-pushed the dra-device-taints branch from a0b3bbd to 9f16159 Compare March 19, 2025 08:18

k8s-ci-robot assigned gjtempleton and MaciekPytel Mar 19, 2025

sanposhiho approved these changes Mar 19, 2025

dom4ha reviewed Mar 19, 2025

x13n mentioned this pull request Mar 19, 2025

CA DRA: handle device taints and tolerations (KEP-5055) kubernetes/autoscaler#7947

Open

dgrisonnet reviewed Mar 19, 2025

pohly mentioned this pull request Mar 19, 2025

Implement DRA Device Binding Conditions (KEP-5007) #130160

Merged

johnbelamaric reviewed Mar 19, 2025

cici37 mentioned this pull request Mar 19, 2025

[KEP-4815]DRA Partitionable device #130764

Merged

MenD32 mentioned this pull request May 2, 2025

Chore: Bump CA's kubernetes dependencies to v1.33.0 kubernetes/autoscaler#8086

Closed

liggitt mentioned this pull request Aug 19, 2025

DRA API: implement ResourceClaim strategy for DRADeviceTaints #132927

Merged
