WIP: Implement DRA admin-controlled attributes (KEP-5027) #130120
nojnhuh wants to merge 8 commits into kubernetes:master
Conversation
I think this PR technically ticks all the boxes in the KEP, but it could use a fair bit of polish around performance optimization, error handling, and documentation. I think I'll need help with some of that, so I'll make a pass at adding some of my own commentary inline.
// tracker.
//
// The return value can be used to wait for cache synchronization.
func (t *Tracker) AddEventHandler(handler cache.ResourceEventHandler) cache.ResourceEventHandlerRegistration {
All of the AddEventHandler-adjacent things here are mostly copy-paste from the scheduler's AssumeCache:
for _, sliceName := range t.resourceSlices.GetIndexer().ListKeys() {
	t.syncSlice(ctx, sliceName)
}
@pohly has suggested that the biggest opportunity for performance optimization here is probably to be smarter about recalculating only what might be out of date for any given event. From our discussion in Slack:
- When adding or updating a ResourceSlicePatch, check against all existing slices. Only check the effect of that patch, don't re-evaluate whether other patches now apply to the slice.
- When removing a ResourceSlicePatch, only check those slices where patching involved that patch.
Other hints and ideas are very much welcome!
Is the best place to keep a mapping between ResourceSlicePatches and affected ResourceSlices some in-memory data structure? @jackfrancis also brought up that it might be useful to ResourceSlicePatch authors for that mapping to exist in the status of a ResourceSlicePatch so they can more easily determine exactly where a given patch applies.
I think this needs to be in memory, in a struct defined locally in this helper.
If we were to put that into the status, we run the risk that either the status becomes very large (imagine a patch that applies to all devices of a driver in a large cluster) or incomplete (when we truncate). It also would cause repeated writes.
I think a CLI tool (kubectl describe ?) with client-side logic is a better way to inform users.
		}
	},
},
"delete-slices": {
Overall, these tests have turned into a bit of a mess, so ideas for cleaning these up or making them more expressive are greatly appreciated!
One goal I've been aiming for is for each test case to do some setup (creating objects, etc.), start a Tracker, wait for the informers to sync initially, then check the results. That keeps things almost entirely synchronous so we can avoid potentially flaky Eventually assertions. That model doesn't quite fit this test case though (which is currently failing), since I don't think there's a way for an initial informer sync to notice a deletion. Are there existing tests elsewhere that might offer ideas to improve that?
Many controllers don't exercise their event handlers, which is a gap. Instead, there are tests for the "sync" function because that is synchronous. But that doesn't help here.
I am facing the same issue with device taints which also has potentially complex event handlers where there are many different scenarios which only differ in the order in which certain events are received (e.g. "new slice + new slice patch" has the same outcome as "new slice patch" + "new slice"). My current thinking is to define the set of events that are independent (i.e. no ordering between them, as it exists between "add slice" + "update slice") and the outcome, then automatically run sub-tests with all possible permutations. I intend to call my event handlers directly from the test, without involving the actual informer, to make it synchronous and predictable.
I don't know yet how that will turn out.
/milestone v1.34
/milestone v1.33
371f13c to ead0403
return sliceHandlerReg.HasSynced() &&
	patchHandlerReg.HasSynced() &&
	classHandlerReg.HasSynced()
})
The scheduler waits for all registered event handlers to sync before scheduling pods, doesn't it?
Have you compared startup latency with and without this feature in a scenario where there are many ResourceSlices? There shouldn't be any ResourceSlicePatches, i.e. the feature doesn't actually have to be used, to make that a fair comparison.
I haven't done that comparison yet, but I'll try that and let you know what I find.
I just pushed another benchmark here to show what I tried. It compares how long initial sync and event handler registration take with a plain informer (like the scheduler uses today) versus the new Tracker added in this PR. The initial state is 500 "fully loaded" ResourceSlices with 128 devices each. A plain informer syncs in about 5ms and a tracker takes about 105ms. About 99% of that extra time is spent building up the driver/pool/device index that maps a ResourceSlicePatch to a set of ResourceSlices.
These are the results from one run of the benchmark:
% go test ./staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker -run '^$' -benchmem -bench InitialSync -count 10
goos: linux
goarch: amd64
pkg: k8s.io/dynamic-resource-allocation/resourceslice/tracker
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
BenchmarkInitialSync/plain-informer-16 218 5342191 ns/op 3937348 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 217 5495781 ns/op 3937205 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 223 5479291 ns/op 3937262 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 223 5302500 ns/op 3937081 B/op 71474 allocs/op
BenchmarkInitialSync/plain-informer-16 219 5434150 ns/op 3936994 B/op 71474 allocs/op
BenchmarkInitialSync/plain-informer-16 211 5323269 ns/op 3936933 B/op 71474 allocs/op
BenchmarkInitialSync/plain-informer-16 217 5553406 ns/op 3936938 B/op 71474 allocs/op
BenchmarkInitialSync/plain-informer-16 214 5570633 ns/op 3936959 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 212 5469078 ns/op 3936995 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 216 5409220 ns/op 3936990 B/op 71474 allocs/op
BenchmarkInitialSync/tracker-16 10 103453848 ns/op 75046796 B/op 607840 allocs/op
BenchmarkInitialSync/tracker-16 10 100953524 ns/op 75046732 B/op 607837 allocs/op
BenchmarkInitialSync/tracker-16 10 104236449 ns/op 75046588 B/op 607839 allocs/op
BenchmarkInitialSync/tracker-16 10 105421754 ns/op 75046785 B/op 607841 allocs/op
BenchmarkInitialSync/tracker-16 10 103890954 ns/op 75047152 B/op 607842 allocs/op
BenchmarkInitialSync/tracker-16 10 106924772 ns/op 75046926 B/op 607842 allocs/op
BenchmarkInitialSync/tracker-16 10 110341134 ns/op 75047332 B/op 607849 allocs/op
BenchmarkInitialSync/tracker-16 10 104523588 ns/op 75047040 B/op 607844 allocs/op
BenchmarkInitialSync/tracker-16 10 104617215 ns/op 75046653 B/op 607839 allocs/op
BenchmarkInitialSync/tracker-16 10 106133902 ns/op 75046691 B/op 607840 allocs/op
PASS
ok k8s.io/dynamic-resource-allocation/resourceslice/tracker 59.105s
Are those the kind of comparison you were looking for?
Yes, thanks. Performance also looks good to me.
case framework.ResourceSlice:
	if utilfeature.DefaultFeatureGate.Enabled(features.DynamicResourceAllocation) {
		if handlerRegistration, err = informerFactory.Resource().V1beta1().ResourceSlices().Informer().AddEventHandler(
		handlerRegistration = resourceSliceTracker.AddEventHandler(
We need another feature gate check here: if DRAAdminControlledDeviceAttributes is enabled, we need to use a resource slice tracker. Otherwise we need to use the normal informer.
This ensures that the feature has no negative effect when disabled because the new code isn't active.
This may affect some of the registration unit tests. We need them in flavors with and without this feature.
The alternative is to make the tracker a thin wrapper around the resource slice informer if the feature is disabled. This may be simpler for the scheduler.
I think I prefer this approach, but it depends a bit on how obviously the tracker is not doing any extra work when just passing through to the informer.
The alternative is to make the tracker a thin wrapper around the resource slice informer if the feature is disabled. This may be simpler for the scheduler.
I think I prefer this approach, but it depends a bit on how obviously the tracker is not doing any extra work when just passing through to the informer.
This is overall what I was aiming for, but I didn't try to completely minimize the amount of extra work that's done. The tracker will still loop through all of the ResourceSlicePatches and figure out which ones apply to a given ResourceSlice even if DRAAdminControlledDeviceAttributes is disabled. That at least saves us from having one place that enumerates all of the feature gates that could modify devices in a ResourceSlice, in addition to the checks for each individual feature. Some of the added logging may also have a non-negligible performance impact.
We should aim for the absolute minimum amount of extra code when the feature is off. That's not so much about performance (although that is also a concern) but mostly to ensure that we are not breaking anything when the feature is off.
We also have to ensure that no informer for ResourceSlicePatch gets instantiated when the feature is off, otherwise the scheduler fails to start in a version skew scenario where the alpha API is disabled because the informer wouldn't sync.
Add initial sync latency benchmarks
@nojnhuh: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
thockin left a comment
I don't recall reading this KEP for API -- who did that?
The KEP doesn't explain why we should do this -- it feels like a LOT of complexity, and I can't see WHY we need it, and the exploration of alternatives doesn't illuminate much either.
I stopped my review for now because I am having a hard time rationalizing this.
//
// +optional
// +oneOf=ValueType
NullValue *NullValue `json:"null,omitempty" protobuf:"bytes,2,opt,name=null"`
It seems like it might be a problem to have a field named "null" in YAML? At the very least it's confusing as heck.
Is this really the best we can do?
This goes back to kubernetes/enhancements#5034 (comment).
We could call this field Remove, either with a bool or an empty struct as in the code right now. That was my initial idea. We then changed to "Null" to make it a bit less specific to overlays.
Just giving an empty DeviceAttribute doesn't work, because "empty" might also mean "unknown value type".
If this looks too weird and we can't find some better alternative, then let's take it out entirely. It's mostly there for the sake of completeness, not because we have specific use cases for "admin must remove some attribute published by a driver".
Let me answer your other questions here:
I don't recall reading this KEP for API -- who did that?
You can blame me for it... I'm pretty sure I called it out a few times, but in the end we had to merge it without an API review.
The KEP doesn't explain why we should do this
I'm assuming "this" is mostly about the concept of ResourceSlicePatch and on-the-fly combining that with ResourceSlice.
What you see here is one application of that concept, changing device attributes. The alternative for that is to ask all DRA developers to add their own custom APIs or configuration parameters for that - doable, but not nice. Regarding what this could be used for, @johnbelamaric had some ideas about attributes that a vendor driver doesn't know about. They depend on the cloud they run in and thus have to be added by the cloud provider or admin.
But the more interesting application of the concept is in KEP #5055 - device taints and tolerations. Tainting a device is something that admins want to do through the apiserver, for example to take a device temporarily offline or to reserve it for specific workloads.
We can't do that through fields in a ResourceSlice, because those objects are owned by the DRA driver and are not guaranteed to remain around while doing driver maintenance. A driver might decide to replace the existing ones when it starts up again, or kubelet removes them when the driver goes away. In both cases the admin intent would get lost.
If we need to prioritize, then we could also leave out "admin-controlled device attributes" and only implement "device taints and tolerations" and/or reorganize the PRs so that ResourceSlicePatch gets added first together with taints, then attributes get added on top of that.
// It has the exact same fields as a DeviceAttribute plus `null` as
// an additional alternative.
type NullableDeviceAttribute struct {
	DeviceAttribute `json:",inline" protobuf:"bytes,1,opt,name=deviceAttribute"`
We should put a note in DeviceAttribute that it can never add a NullValue field because this will break.
}

if loggerV := logger.V(6); loggerV.Enabled() {
	loggerV.Info("ResourceSlice synced", "diff", cmp.Diff(oldPatchedObj, patchedSlice))
The diff is useful when old and new slice both exist:
tracker.go:548: I0306 10:10:45.665459] ResourceSlice synced resourceslice="worker-some-driver" diff=<
&v1beta1.ResourceSlice{
TypeMeta: {},
ObjectMeta: {Name: "worker-some-driver"},
Spec: v1beta1.ResourceSliceSpec{
... // 3 identical fields
NodeSelector: nil,
AllNodes: false,
Devices: []v1beta1.Device{
{
Name: "instance",
Basic: &v1beta1.BasicDevice{
Attributes: nil,
Capacity: nil,
- Taints: nil,
+ Taints: []v1beta1.DeviceTaint{s"example.com/taint=something:NoExecute"},
},
},
{Name: "instance-no-schedule", Basic: &{Taints: {{Key: "example.com/taint", Effect: "NoSchedule"}}}},
{Name: "instance-no-execute", Basic: &{Taints: {{Key: "example.com/taint", Effect: "NoExecute", TimeAdded: s"2025-03-06 10:10:45.551070113 +0100 CET m=+0.036909841"}}}},
},
},
}
>
But when adding a slice it's less informative:
tracker.go:548: I0306 10:10:50.779806] ResourceSlice synced resourceslice="worker-some-driver" diff=<
any(
+ s"&ResourceSlice{ObjectMeta:{worker-some-driver 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:ResourceSliceSpec{Driver:some-driver,Pool:ResourcePool{Name:worker,Generation:0,ResourceSliceCount:0,},NodeName:worker,NodeSelector:ni"...,
)
>
In my device taints event handlers I am doing this:
import "github.com/google/go-cmp/cmp" //nolint:depguard // Discouraged for production use (https://github.com/kubernetes/kubernetes/issues/104821) but has no good alternative for logging.
...
// eventLogger is specifically for logging during event handling. It may be nil
// if no such logging is desired.
eventLogger *klog.Logger
...
// Doing debug logging?
if loggerV := logger.V(6); loggerV.Enabled() {
tc.eventLogger = &loggerV
}
....
if tc.eventLogger != nil {
// This is intentionally very verbose for debugging.
tc.eventLogger.Info("ResourceClaim changed", "claimObject", name, "oldClaim", klog.Format(oldClaim), "newClaim", klog.Format(newClaim), "diff", cmp.Diff(oldClaim, newClaim))
}
klog.Format hides the pretty unreadable protobuf-generated String method and instead dumps as JSON:
device_taint_eviction.go:545: I0306 10:10:45.561039] ResourceClaim changed claimObject="default/my-pod-my-resource" oldClaim=<
null
> newClaim=<
{
"metadata": {
"name": "my-pod-my-resource",
"namespace": "default",
"creationTimestamp": null
},
"spec": {
"devices": {
"requests": [
{
"name": "req-1",
"deviceClassName": "my-resource-class",
"allocationMode": "ExactCount",
"count": 1
}
]
}
},
"status": {}
}
> diff=<
(*v1beta1.ResourceClaim)(
- nil,
+ s"&ResourceClaim{ObjectMeta:{my-pod-my-resource default 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:ResourceClaimSpec{Devices:DeviceClaim{Requests:[]DeviceRequest{DeviceRequest{Name:req-1,DeviceClassName:my-resource-class,Selec"...,
)
>
It is debatable whether new code should use go-cmp. I don't see a good alternative for logging and thus think the answer needs to be "yes" until we have such an alternative. Others may disagree...
cc @dims
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I'm not sure... maybe kubernetes/enhancements#5034 (comment) was the only interaction? I'm going to route this to @thockin in the API review board but as backlog (and add @soltysh as shadow since he's stepping into doing API reviews and left some API-directed comments on the KEP PR - https://github.com/kubernetes/enhancements/pull/5034/files#r1952796598)... I'll leave this to @pohly @johnbelamaric @thockin to untangle how to proceed with the API review bit.
@pohly we are no longer planning on this in 1.33, correct? Instead we are updating the device taints KEP to no longer depend on it?
We already agreed to not pursue this for 1.33 - let's simply close it to avoid further confusion. /close
@pohly: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Yes, #130447 is alive, kicking, and most importantly, no longer depends on this one here.
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
This PR implements KEP-5027 DRA: admin-controlled device attributes
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: