Simplify device manager: make endpoint stateless by figo · Pull Request #65948 · kubernetes/kubernetes

figo · 2018-07-08T02:44:15Z

While reviewing the devicemanager code, I found that the caching layer on endpoint is redundant.
There are three related objects in the picture:

devicemanager <-> endpoint <-> plugin

plugin is the source of truth for devices and device health status.
devicemanager maintains healthyDevices, unhealthyDevices, and allocatedDevices based on updates
from plugin.

So there is no point in endpoint caching devices. This patch removes the cache layer so that
endpoint becomes stateless, which I believe should be the case (but I welcome review
in case I missed something here).

This patch also removes Manager.Devices(), since I didn't find any caller of it other than tests.

If we need to get all devices from the manager in the future, it just needs to return healthyDevices + unhealthyDevices, so it doesn't have to call endpoint at all.
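
To illustrate the point about healthyDevices + unhealthyDevices, here is a minimal sketch, not the actual kubelet code. The `manager` and `deviceSet` types, the `allDevices` method, and the `example.com/gpu` resource name are all hypothetical stand-ins, and the sketch assumes a device ID never appears in both sets at once.

```go
package main

import (
	"fmt"
	"sort"
)

// deviceSet is a hypothetical stand-in for the per-resource sets of device
// IDs kept by the manager; the real kubelet types differ.
type deviceSet map[string]struct{}

// manager is a simplified, hypothetical model of the state described above:
// resource name -> set of device IDs, split by health.
type manager struct {
	healthyDevices   map[string]deviceSet
	unhealthyDevices map[string]deviceSet
}

// allDevices returns the sorted union of healthy and unhealthy device IDs for
// one resource. It reads only manager state and never calls an endpoint.
// Assumption: the two sets are disjoint, so no de-duplication is done.
func (m *manager) allDevices(resource string) []string {
	healthy := m.healthyDevices[resource]
	unhealthy := m.unhealthyDevices[resource]
	ids := make([]string, 0, len(healthy)+len(unhealthy))
	for id := range healthy {
		ids = append(ids, id)
	}
	for id := range unhealthy {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	m := &manager{
		healthyDevices:   map[string]deviceSet{"example.com/gpu": {"dev0": {}, "dev2": {}}},
		unhealthyDevices: map[string]deviceSet{"example.com/gpu": {"dev1": {}}},
	}
	fmt.Println(m.allDevices("example.com/gpu")) // prints [dev0 dev1 dev2]
}
```

An unknown resource name yields an empty slice, because indexing a missing key returns a nil set.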

This patch makes the code more readable and simplifies the data model.

What this PR does / why we need it:
This patch simplifies the device manager code and makes it more maintainable.

Which issue(s) this PR fixes *:
None; this is a refactor of the device manager code.

Special notes for your reviewer:
Will need to rebase the code if #58755 gets checked in first.

Release note:

None

/sig node
/cc @jiayingz @RenaudWasTaken @vishh @saad-ali @vikaschoudhary16 @vladimirvivien @anfernee

RenaudWasTaken · 2018-07-08T10:59:18Z

/ok-to-test


neolit123 · 2018-07-08T20:13:59Z

@kubernetes/sig-node-pr-reviews

@figo
does this need a release note and/or docs updates?

figo · 2018-07-08T22:34:11Z

@neolit123 thanks, this change is transparent to users, so I guess it doesn't need a release note.

vikaschoudhary16 · 2018-07-09T10:05:29Z

We might need cached devices for resource classes/compute resources. I would suggest holding off on removing cached devices from endpoint for some time.

figo · 2018-07-09T14:13:39Z

@vikaschoudhary16 thanks for jumping in. Could you clarify which resource class/compute resource needs the endpoint cache? I cannot find any caller of it.

Another fact to consider: DeviceManager already caches devices through the healthyDevices and unhealthyDevices lists. We can discuss this as a follow-up once you have clarified the first question, thank you!

vikaschoudhary16 · 2018-07-10T04:58:16Z

https://github.com/kubernetes/community/pull/2265/files

figo · 2018-07-10T16:14:05Z

@vikaschoudhary16 thanks for the link. I read through the ResourceClass KEP carefully. Could you clarify further where the endpoint cache is needed for ResourceClass?

As I see it, the simplified model could potentially help the ResourceClass implementation, thanks.

RenaudWasTaken · 2018-07-10T23:44:11Z

@vikaschoudhary16 do you mind explaining your logic here?

jiayingz · 2018-07-13T00:32:42Z

@vikaschoudhary16 even with device attributes, I guess it is still simpler to cache device attribute info in manager.go instead of endpoint.go, especially given that our current model is that device attribute changes require a node drain. Even if we relax that requirement in the future, I think we still expect device attribute changes to be a rare event, so an endpoint can just make a blind callback with the full list to update the device manager. Given that this PR does simplify the logic quite a bit, should we consider merging it earlier?
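
The "blind callback with the full list" idea could look roughly like the sketch below. It is only an illustration under assumptions, not the real manager.go or endpoint.go: the `device`, `endpoint`, `manager`, and `updateFunc` types and their methods are hypothetical, and the plugin's update stream is simulated by direct calls.

```go
package main

import "fmt"

// device is a hypothetical, trimmed-down device record.
type device struct {
	ID      string
	Healthy bool
}

// updateFunc is the callback an endpoint invokes with the plugin's full list.
type updateFunc func(resource string, devices []device)

// endpoint keeps no device state; it only forwards what the plugin reports.
type endpoint struct {
	resource string
	callback updateFunc
}

// onPluginUpdate stands in for handling one message from the plugin's update
// stream: the full list is passed through without being cached or diffed.
func (e *endpoint) onPluginUpdate(devices []device) {
	e.callback(e.resource, devices)
}

// manager holds the only cached view of devices, split by health.
type manager struct {
	healthy   map[string]map[string]bool
	unhealthy map[string]map[string]bool
}

func newManager() *manager {
	return &manager{
		healthy:   map[string]map[string]bool{},
		unhealthy: map[string]map[string]bool{},
	}
}

// replaceDevices discards the previous view for the resource and rebuilds it
// from the full list, so the endpoint never needs to remember earlier state.
func (m *manager) replaceDevices(resource string, devices []device) {
	h, u := map[string]bool{}, map[string]bool{}
	for _, d := range devices {
		if d.Healthy {
			h[d.ID] = true
		} else {
			u[d.ID] = true
		}
	}
	m.healthy[resource] = h
	m.unhealthy[resource] = u
}

func main() {
	m := newManager()
	e := &endpoint{resource: "example.com/gpu", callback: m.replaceDevices}
	e.onPluginUpdate([]device{{"dev0", true}, {"dev1", true}})
	e.onPluginUpdate([]device{{"dev0", true}, {"dev1", false}}) // dev1 turns unhealthy
	fmt.Println(len(m.healthy["example.com/gpu"]), len(m.unhealthy["example.com/gpu"])) // prints 1 1
}
```

Since every update replaces the manager's view wholesale, a device missing from a later list simply drops out of both sets. That is the trade-off of sending the full list rather than a diff.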

vikaschoudhary16 · 2018-07-13T03:47:20Z