WIP: Implement DRA admin-controlled attributes (KEP-5027) #130120
nojnhuh wants to merge 8 commits into kubernetes:master
Conversation
I think this PR technically ticks all the boxes in the KEP, but it could use a fair bit of polish around performance optimization, error handling, and documentation. I think I'll need help with some of that, so I'll make a pass at adding some of my own commentary inline.
// tracker.
//
// The return value can be used to wait for cache synchronization.
func (t *Tracker) AddEventHandler(handler cache.ResourceEventHandler) cache.ResourceEventHandlerRegistration {
All of the AddEventHandler-adjacent things here are mostly copy-paste from the scheduler's AssumeCache:
for _, sliceName := range t.resourceSlices.GetIndexer().ListKeys() {
	t.syncSlice(ctx, sliceName)
}
@pohly has suggested that the biggest opportunity for performance optimization here is probably to be smarter about recalculating only what might be out of date for any given event. From our discussion in Slack:
- When adding or updating a ResourceSlicePatch, check against all existing slices. Only check the effect of that patch, don't re-evaluate whether other patches now apply to the slice.
- When removing a ResourceSlicePatch, only check those slices where patching involved that patch.
Other hints and ideas are very much welcome!
Is the best place to keep a mapping between ResourceSlicePatches and affected ResourceSlices some in-memory data structure? @jackfrancis also brought up that it might be useful to ResourceSlicePatch authors for that mapping to exist in the status of a ResourceSlicePatch so they can more easily determine exactly where a given patch applies.
I think this needs to be in memory, in a struct defined locally in this helper.
If we were to put that into the status, we run the risk that either the status becomes very large (imagine a patch that applies to all devices of a driver in a large cluster) or incomplete (when we truncate). It also would cause repeated writes.
I think a CLI tool (kubectl describe ?) with client-side logic is a better way to inform users.
		}
	},
},
"delete-slices": {
Overall, these tests have turned into a bit of a mess, so ideas for cleaning these up or making them more expressive are greatly appreciated!
One goal I've been aiming for is for each test case to do some setup (creating objects, etc.), start a Tracker, wait for the informers to sync initially, then check the results. That keeps things almost entirely synchronous so we can avoid potentially flaky Eventually assertions. That model doesn't quite fit this test case though (which is currently failing), since I don't think there's a way for an initial informer sync to notice a deletion. Are there existing tests elsewhere that might offer ideas to improve that?
Many controllers don't exercise their event handlers, which is a gap. Instead, there are tests for the "sync" function because that is synchronous. But that doesn't help here.
I am facing the same issue with device taints which also has potentially complex event handlers where there are many different scenarios which only differ in the order in which certain events are received (e.g. "new slice + new slice patch" has the same outcome as "new slice patch" + "new slice"). My current thinking is to define the set of events that are independent (i.e. no ordering between them, as it exists between "add slice" + "update slice") and the outcome, then automatically run sub-tests with all possible permutations. I intend to call my event handlers directly from the test, without involving the actual informer, to make it synchronous and predictable.
I don't know yet how that will turn out.
/milestone v1.34
/milestone v1.33
371f13c to ead0403
return sliceHandlerReg.HasSynced() &&
	patchHandlerReg.HasSynced() &&
	classHandlerReg.HasSynced()
})
The scheduler waits for all registered event handlers to sync before scheduling pods, doesn't it?
Have you compared startup latency with and without this feature in a scenario where there are many ResourceSlices? There shouldn't be any ResourceSlicePatches, i.e. the feature doesn't actually have to be used, to make that a fair comparison.
I haven't done that comparison yet, but I'll try that and let you know what I find.
I just pushed another benchmark here to show what I tried. It compares how long initial sync and event handler registration take with a plain informer (like the scheduler uses today) versus the new Tracker added in this PR. The initial state is 500 "fully loaded" ResourceSlices with 128 devices each. A plain informer syncs in about 5ms and a tracker takes about 105ms. About 99% of that extra time is spent building up the driver/pool/device index that maps a ResourceSlicePatch to a set of ResourceSlices.
These are the results from one run of the benchmark:
% go test ./staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker -run '^$' -benchmem -bench InitialSync -count 10
goos: linux
goarch: amd64
pkg: k8s.io/dynamic-resource-allocation/resourceslice/tracker
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
BenchmarkInitialSync/plain-informer-16 218 5342191 ns/op 3937348 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 217 5495781 ns/op 3937205 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 223 5479291 ns/op 3937262 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 223 5302500 ns/op 3937081 B/op 71474 allocs/op
BenchmarkInitialSync/plain-informer-16 219 5434150 ns/op 3936994 B/op 71474 allocs/op
BenchmarkInitialSync/plain-informer-16 211 5323269 ns/op 3936933 B/op 71474 allocs/op
BenchmarkInitialSync/plain-informer-16 217 5553406 ns/op 3936938 B/op 71474 allocs/op
BenchmarkInitialSync/plain-informer-16 214 5570633 ns/op 3936959 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 212 5469078 ns/op 3936995 B/op 71475 allocs/op
BenchmarkInitialSync/plain-informer-16 216 5409220 ns/op 3936990 B/op 71474 allocs/op
BenchmarkInitialSync/tracker-16 10 103453848 ns/op 75046796 B/op 607840 allocs/op
BenchmarkInitialSync/tracker-16 10 100953524 ns/op 75046732 B/op 607837 allocs/op
BenchmarkInitialSync/tracker-16 10 104236449 ns/op 75046588 B/op 607839 allocs/op
BenchmarkInitialSync/tracker-16 10 105421754 ns/op 75046785 B/op 607841 allocs/op
BenchmarkInitialSync/tracker-16 10 103890954 ns/op 75047152 B/op 607842 allocs/op
BenchmarkInitialSync/tracker-16 10 106924772 ns/op 75046926 B/op 607842 allocs/op
BenchmarkInitialSync/tracker-16 10 110341134 ns/op 75047332 B/op 607849 allocs/op
BenchmarkInitialSync/tracker-16 10 104523588 ns/op 75047040 B/op 607844 allocs/op
BenchmarkInitialSync/tracker-16 10 104617215 ns/op 75046653 B/op 607839 allocs/op
BenchmarkInitialSync/tracker-16 10 106133902 ns/op 75046691 B/op 607840 allocs/op
PASS
ok k8s.io/dynamic-resource-allocation/resourceslice/tracker 59.105s
Are those the kind of comparison you were looking for?
Yes, thanks. Performance also looks good to me.
case framework.ResourceSlice:
	if utilfeature.DefaultFeatureGate.Enabled(features.DynamicResourceAllocation) {
		if handlerRegistration, err = informerFactory.Resource().V1beta1().ResourceSlices().Informer().AddEventHandler(
		handlerRegistration = resourceSliceTracker.AddEventHandler(
We need another feature gate check here: if DRAAdminControlledDeviceAttributes is enabled, we need to use a resource slice tracker. Otherwise we need to use the normal informer.
This ensures that the feature has no negative effect when disabled because the new code isn't active.
This may affect some of the registration unit tests. We need them in flavors with and without this feature.
The alternative is to make the tracker a thin wrapper around the resource slice informer if the feature is disabled. This may be simpler for the scheduler.
I think I prefer this approach, but it depends a bit on how obviously the tracker is not doing any extra work when just passing through to the informer.
The alternative is to make the tracker a thin wrapper around the resource slice informer if the feature is disabled. This may be simpler for the scheduler.
I think I prefer this approach, but it depends a bit on how obviously the tracker is not doing any extra work when just passing through to the informer.
This is overall what I was aiming for, but I didn't try to completely minimize the amount of extra work that's done. The tracker will still loop through all of the ResourceSlicePatches and figure out which ones apply to a given ResourceSlice even if DRAAdminControlledDeviceAttributes is disabled. That at least saves us from having one place that enumerates all of the feature gates that could modify devices in a ResourceSlice, in addition to the checks for each individual feature. Some of the added logging may also have a non-negligible performance impact.
We should aim for the absolute minimum amount of extra code when the feature is off. That's not so much about performance (although that is also a concern) but mostly to ensure that we are not breaking anything when the feature is off.
We also have to ensure that no informer for ResourceSlicePatch gets instantiated when the feature is off, otherwise the scheduler fails to start in a version skew scenario where the alpha API is disabled because the informer wouldn't sync.
Add initial sync latency benchmarks
@nojnhuh: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
thockin left a comment
I don't recall reading this KEP for API -- who did that?
The KEP doesn't explain why we should do this -- it feels like a LOT of complexity, and I can't see WHY we need it, and the exploration of alternatives doesn't illuminate much either.
I stopped my review for now because I am having a hard time rationalizing this.
//
// +optional
// +oneOf=ValueType
NullValue *NullValue `json:"null,omitempty" protobuf:"bytes,2,opt,name=null"`
It seems like it might be a problem to have a field named "null" in YAML? At the very least it's confusing as heck.
Is this really the best we can do?
This goes back to kubernetes/enhancements#5034 (comment).
We could call this field Remove, either with a bool or an empty struct as in the code right now. That was my initial idea. We then changed to "Null" to make it a bit less specific to overlays.
Just giving an empty DeviceAttribute doesn't work, because "empty" might also mean "unknown value type".
If this looks too weird and we can't find some better alternative, then let's take it out entirely. It's mostly there for the sake of completeness, not because we have specific use cases for "admin must remove some attribute published by a driver".
Let me answer your other questions here:
I don't recall reading this KEP for API -- who did that?
You can blame me for it... I'm pretty sure I called it out a few times, but in the end we had to merge it without an API review.
The KEP doesn't explain why we should do this
I'm assuming "this" is mostly about the concept of ResourceSlicePatch and on-the-fly combining that with ResourceSlice.
What you see here is one application of that concept, changing device attributes. The alternative for that is to ask all DRA developers to add their own custom APIs or configuration parameters for that - doable, but not nice. Regarding what this could be used for, @johnbelamaric had some ideas about attributes that a vendor driver doesn't know about. They depend on the cloud they run in and thus have to be added by the cloud provider or admin.
But the more interesting application of the concept is in KEP #5055 - device taints and tolerations. Tainting a device is something that admins want to do through the apiserver, for example to take a device temporarily offline or to reserve it for specific workloads.
We can't do that through fields in a ResourceSlice, because those objects are owned by the DRA driver and are not guaranteed to remain around while doing driver maintenance. A driver might decide to replace the existing ones when it starts up again, or kubelet removes them when the driver goes away. In both cases the admin intent would get lost.
If we need to prioritize, then we could also leave out "admin-controlled device attributes" and only implement "device taints and tolerations" and/or reorganize the PRs so that ResourceSlicePatch gets added first together with taints, then attributes get added on top of that.
// It has the exact same fields as a DeviceAttribute plus `null` as
// an additional alternative.
type NullableDeviceAttribute struct {
	DeviceAttribute `json:",inline" protobuf:"bytes,1,opt,name=deviceAttribute"`
We should put a note in DeviceAttribute that it can never add a NullValue field because this will break.
}

if loggerV := logger.V(6); loggerV.Enabled() {
	loggerV.Info("ResourceSlice synced", "diff", cmp.Diff(oldPatchedObj, patchedSlice))
The diff is useful when old and new slice both exist:
tracker.go:548: I0306 10:10:45.665459] ResourceSlice synced resourceslice="worker-some-driver" diff=<
&v1beta1.ResourceSlice{
TypeMeta: {},
ObjectMeta: {Name: "worker-some-driver"},
Spec: v1beta1.ResourceSliceSpec{
... // 3 identical fields
NodeSelector: nil,
AllNodes: false,
Devices: []v1beta1.Device{
{
Name: "instance",
Basic: &v1beta1.BasicDevice{
Attributes: nil,
Capacity: nil,
- Taints: nil,
+ Taints: []v1beta1.DeviceTaint{s"example.com/taint=something:NoExecute"},
},
},
{Name: "instance-no-schedule", Basic: &{Taints: {{Key: "example.com/taint", Effect: "NoSchedule"}}}},
{Name: "instance-no-execute", Basic: &{Taints: {{Key: "example.com/taint", Effect: "NoExecute", TimeAdded: s"2025-03-06 10:10:45.551070113 +0100 CET m=+0.036909841"}}}},
},
},
}
>
But when adding a slice it's less informative:
tracker.go:548: I0306 10:10:50.779806] ResourceSlice synced resourceslice="worker-some-driver" diff=<
any(
+ s"&ResourceSlice{ObjectMeta:{worker-some-driver 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:ResourceSliceSpec{Driver:some-driver,Pool:ResourcePool{Name:worker,Generation:0,ResourceSliceCount:0,},NodeName:worker,NodeSelector:ni"...,
)
>
In my device taints event handlers I am doing this:
import "github.com/google/go-cmp/cmp" //nolint:depguard // Discouraged for production use (https://github.com/kubernetes/kubernetes/issues/104821) but has no good alternative for logging.
...
// eventLogger is specifically for logging during event handling. It may be nil
// if no such logging is desired.
eventLogger *klog.Logger
...
// Doing debug logging?
if loggerV := logger.V(6); loggerV.Enabled() {
tc.eventLogger = &loggerV
}
....
if tc.eventLogger != nil {
// This is intentionally very verbose for debugging.
tc.eventLogger.Info("ResourceClaim changed", "claimObject", name, "oldClaim", klog.Format(oldClaim), "newClaim", klog.Format(newClaim), "diff", cmp.Diff(oldClaim, newClaim))
}
klog.Format hides the pretty unreadable protobuf-generated String method and instead dumps as JSON:
device_taint_eviction.go:545: I0306 10:10:45.561039] ResourceClaim changed claimObject="default/my-pod-my-resource" oldClaim=<
null
> newClaim=<
{
"metadata": {
"name": "my-pod-my-resource",
"namespace": "default",
"creationTimestamp": null
},
"spec": {
"devices": {
"requests": [
{
"name": "req-1",
"deviceClassName": "my-resource-class",
"allocationMode": "ExactCount",
"count": 1
}
]
}
},
"status": {}
}
> diff=<
(*v1beta1.ResourceClaim)(
- nil,
+ s"&ResourceClaim{ObjectMeta:{my-pod-my-resource default 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:ResourceClaimSpec{Devices:DeviceClaim{Requests:[]DeviceRequest{DeviceRequest{Name:req-1,DeviceClassName:my-resource-class,Selec"...,
)
>
It is debatable whether new code should use go-cmp. I don't see a good alternative for logging and thus think the answer needs to be "yes" until we have such an alternative. Others may disagree...
cc @dims
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I'm not sure... maybe kubernetes/enhancements#5034 (comment) was the only interaction? I'm going to route this to @thockin in the API review board but as backlog (and add @soltysh as shadow since he's stepping into doing API reviews and left some API-directed comments on the KEP PR - https://github.com/kubernetes/enhancements/pull/5034/files#r1952796598)... I'll leave this to @pohly @johnbelamaric @thockin to untangle how to proceed with the API review bit.
@pohly we are no longer planning on this in 1.33, correct? Instead we are updating the device taints KEP to no longer depend on it?
We already agreed to not pursue this for 1.33 - let's simply close it to avoid further confusion. /close
@pohly: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Yes, #130447 is alive, kicking, and most importantly, no longer depends on this one here.
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
This PR implements KEP-5027 DRA: admin-controlled device attributes
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: