Simplify device manager: make endpoint stateless#65948
Simplify device manager: make endpoint stateless#65948k8s-github-robot merged 1 commit intokubernetes:masterfrom
Conversation
|
/ok-to-test
…On Sun, Jul 8, 2018, 04:46 k8s-ci-robot ***@***.***> wrote:
[APPROVALNOTIFIER] This PR is *NOT APPROVED*
This pull-request has been approved by: *figo
<#65948#>*
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: *vishh*
Assign the PR to them by writing /assign @vishh in a comment when ready.
The full list of commands accepted by this bot can be found here
<https://go.k8s.io/bot-commands>.
The pull request process is described here
<https://git.k8s.io/community/contributors/guide/owners.md#the-code-review-process>
Needs approval from an approver in each of these files:
- *pkg/kubelet/cm/devicemanager/OWNERS
<https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/devicemanager/OWNERS>*
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#65948 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACZyE35j_fde7CLXDWysAIbmdfPe8zKBks5uEXKPgaJpZM4VGkOR>
.
|
|
@kubernetes/sig-node-pr-reviews @figo |
|
@neolit123 thanks, this is something transparent to user, so i guess don't have to add to release note |
|
we might need cached devices for resource classes/compute resources. I would suggest to hold on removing cached devices from endpoint for some time. |
|
@vikaschoudhary16 thanks for jump in, could you clarify which resource class/compute resource need endpoint cache? i can not find any caller of it. another fact need be considered is: DeviceManager already cached devices through healthyDevices and unhealthyDevices lists, we can discuss this in a follow up question after you help to clarify the first question, thank you! |
|
@vikaschoudhary16 thanks for the link, i went through the ResourceClass KEP properly, could you help to clarify further on where endpoint cache is needed for ResourceClass? i see the simplified model can help ResourceClass implementation potentially, thanks |
|
@vikaschoudhary16 do you mind explaining your logic here? |
|
@vikaschoudhary16 even with device attributes, I guess it is perhaps still simpler to cache device attribute info in manager.go instead of endpoint.go, especially given that our current model is that device attribute changes requires a node drain? Even if we relax that requirement in the future, I think we still expect device attribute changes as a rare event that an endpoint can just make a blind callback with the full list to update device manager. Given that this PR does simplify the logic quite a bit, should we consider to merge it earlier? |
There was a problem hiding this comment.
i will keep this extra line just to separate the 3 test steps/sections.
There was a problem hiding this comment.
s/timeout while waiting
There was a problem hiding this comment.
how about checking capacity as well?
There was a problem hiding this comment.
will it add value if we check capacity as well?
There was a problem hiding this comment.
you can move lock from L344 to here and remove the if block at L351?
+1 @RenaudWasTaken i was thinking that we might need to cache device details, but as jiaying said, that can be cached in manager cache as well. Anyways we are also caching options there so it is even more aligned. @figo just a few nits and suggestions, mostly lgtm. this has really simplified logic of e.run(). Thanks for the change !! |
RenaudWasTaken
left a comment
There was a problem hiding this comment.
Mostly looks good to me
There was a problem hiding this comment.
This isn't used anywhere.
There was a problem hiding this comment.
thanks for review, this channel is added to help testing, test need to make sure callback been processed and then check stats, please refer to L109 and L113 at manager_test.go
There was a problem hiding this comment.
Ah I'm not sure that's a pattern we should use then. I'd rather you wrap that function in another callback in the test.
e.g:
originalCallback := m.callback
m.callback = func(...) ... {
originalCallback(...)
m.updateChan <- new(interface{})
}There was a problem hiding this comment.
good idea, done, thanks
There was a problem hiding this comment.
We're loosing the ability to react to a device appearing / disappearing or changing state.
That's not a problem since we're not doing anything now but it probably needs some discussion in a separate issue.
There was a problem hiding this comment.
thanks, you are right, we are not doing anything now, we can achieve this capability without endpoint cache, just need to compare the updated devices list with healthyDevices + unhealthyDevices, do you think we need to create an issue to track this?
|
/lgtm |
|
/assign @jiayingz |
|
/lgtm |
the caching layer on endpoint is redundant. Here are the 3 related objects in picture: devicemanager <-> endpoint <-> plugin Plugin is the source of truth for devices and device health status. devicemanager maintain healthyDevices, unhealthyDevices, allocatedDevices based on updates from plugin. So there is no point for endpoint caching devices, this patch is removing this caching layer on endpoint, Also removing the Manager.Devices() since i didn't find any caller of this other than test, i am adding a notification channel to facilitate testing, If we need to get all devices from manager in future, it just need to return healthyDevices + unhealthyDevices, we don't have to call endpoint after all. This patch makes code more readable, data model been simplified.
|
@jiayingz @vikaschoudhary16 the code been rebased after #58755 merged, nothing really changed after the last round of review, please review and approve, thanks |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: figo, frapposelli, jiayingz, RenaudWasTaken, vikaschoudhary16 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/test all [submit-queue is verifying that this PR is safe to merge] |
|
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
While reviewing devicemanager code, found the caching layer on endpoint is redundant.
Here are the 3 related objects in picture:
devicemanager <-> endpoint <-> plugin
plugin is the source of truth for devices and device health status.
devicemanager maintain healthyDevices, unhealthyDevices, allocatedDevices based on updates
from plugin.
So there is no point for endpoint to cache devices, this patch is removing the cache layer,
endpoint becomes stateless, which i believe should be the case (but i do welcome review
if i missed something here).
also removing the Manager.Devices() since i didn't find any caller of this other than test.
if we need to get all devices from manager in future, it just need to return healthyDevices + unhealthyDevices, so don't have to call endpoint after all.
This patch makes code more readable, data model been simplified.
What this PR does / why we need it:
this patch simplify the device manager code, make it more maintainable.
Which issue(s) this PR fixes *:
this is a refactor of device manager code
Special notes for your reviewer:
will need to rebase the code if #58755 get checked-in first.
Release note:
/sig node
/cc @jiayingz @RenaudWasTaken @vishh @saad-ali @vikaschoudhary16 @vladimirvivien @anfernee