Skip to content

Simplify device manager: make endpoint stateless#65948

Merged
k8s-github-robot merged 1 commit intokubernetes:masterfrom
figo:stateless
Jul 30, 2018
Merged

Simplify device manager: make endpoint stateless#65948
k8s-github-robot merged 1 commit intokubernetes:masterfrom
figo:stateless

Conversation

@figo
Copy link
Copy Markdown
Contributor

@figo figo commented Jul 8, 2018

While reviewing devicemanager code, found the caching layer on endpoint is redundant.
Here are the 3 related objects in picture:

devicemanager <-> endpoint <-> plugin

plugin is the source of truth for devices and device health status.
devicemanager maintain healthyDevices, unhealthyDevices, allocatedDevices based on updates
from plugin.

So there is no point for endpoint to cache devices, this patch is removing the cache layer,
endpoint becomes stateless, which i believe should be the case (but i do welcome review
if i missed something here).

also removing the Manager.Devices() since i didn't find any caller of this other than test.

if we need to get all devices from manager in future, it just need to return healthyDevices + unhealthyDevices, so don't have to call endpoint after all.

This patch makes code more readable, data model been simplified.

What this PR does / why we need it:
this patch simplify the device manager code, make it more maintainable.

Which issue(s) this PR fixes *:
this is a refactor of device manager code

Special notes for your reviewer:
will need to rebase the code if #58755 get checked-in first.

Release note:

None

/sig node
/cc @jiayingz @RenaudWasTaken @vishh @saad-ali @vikaschoudhary16 @vladimirvivien @anfernee

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Jul 8, 2018
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 8, 2018
@RenaudWasTaken
Copy link
Copy Markdown
Contributor

RenaudWasTaken commented Jul 8, 2018 via email

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 8, 2018
@neolit123
Copy link
Copy Markdown
Member

@kubernetes/sig-node-pr-reviews

@figo
does this need a release note and/or docs updates?

@figo
Copy link
Copy Markdown
Contributor Author

figo commented Jul 8, 2018

@neolit123 thanks, this is something transparent to user, so i guess don't have to add to release note

@vikaschoudhary16
Copy link
Copy Markdown
Contributor

we might need cached devices for resource classes/compute resources. I would suggest to hold on removing cached devices from endpoint for some time.

@figo
Copy link
Copy Markdown
Contributor Author

figo commented Jul 9, 2018

@vikaschoudhary16 thanks for jump in, could you clarify which resource class/compute resource need endpoint cache? i can not find any caller of it.

another fact need be considered is: DeviceManager already cached devices through healthyDevices and unhealthyDevices lists, we can discuss this in a follow up question after you help to clarify the first question, thank you!

@vikaschoudhary16
Copy link
Copy Markdown
Contributor

@figo
Copy link
Copy Markdown
Contributor Author

figo commented Jul 10, 2018

@vikaschoudhary16 thanks for the link, i went through the ResourceClass KEP properly, could you help to clarify further on where endpoint cache is needed for ResourceClass?

i see the simplified model can help ResourceClass implementation potentially, thanks

@RenaudWasTaken
Copy link
Copy Markdown
Contributor

@vikaschoudhary16 do you mind explaining your logic here?

@jiayingz
Copy link
Copy Markdown
Contributor

@vikaschoudhary16 even with device attributes, I guess it is perhaps still simpler to cache device attribute info in manager.go instead of endpoint.go, especially given that our current model is that device attribute changes requires a node drain? Even if we relax that requirement in the future, I think we still expect device attribute changes as a rare event that an endpoint can just make a blind callback with the full list to update device manager. Given that this PR does simplify the logic quite a bit, should we consider to merge it earlier?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra line?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra line?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will keep this extra line just to separate the 3 test steps/sections.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/timeout while waiting

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about checking capacity as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it add value if we check capacity as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move lock from L344 to here and remove the if block at L351?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks

@vikaschoudhary16
Copy link
Copy Markdown
Contributor

@jiayingz

I guess it is perhaps still simpler to cache device attribute info in manager.go instead of endpoint.go

+1

@RenaudWasTaken i was thinking that we might need to cache device details, but as jiaying said, that can be cached in manager cache as well. Anyways we are also caching options there so it is even more aligned.

@figo just a few nits and suggestions, mostly lgtm. this has really simplified logic of e.run(). Thanks for the change !!

Copy link
Copy Markdown
Contributor

@RenaudWasTaken RenaudWasTaken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't used anywhere.

Copy link
Copy Markdown
Contributor Author

@figo figo Jul 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for review, this channel is added to help testing, test need to make sure callback been processed and then check stats, please refer to L109 and L113 at manager_test.go

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I'm not sure that's a pattern we should use then. I'd rather you wrap that function in another callback in the test.
e.g:

originalCallback := m.callback
m.callback = func(...) ... {
     originalCallback(...)
     m.updateChan <- new(interface{})
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, done, thanks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're loosing the ability to react to a device appearing / disappearing or changing state.
That's not a problem since we're not doing anything now but it probably needs some discussion in a separate issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, you are right, we are not doing anything now, we can achieve this capability without endpoint cache, just need to compare the updated devices list with healthyDevices + unhealthyDevices, do you think we need to create an issue to track this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@figo +1

@vikaschoudhary16
Copy link
Copy Markdown
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2018
@figo
Copy link
Copy Markdown
Contributor Author

figo commented Jul 17, 2018

/assign @jiayingz

@frapposelli
Copy link
Copy Markdown

/lgtm

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2018
the caching layer on endpoint is redundant.

Here are the 3 related objects in picture:
devicemanager <-> endpoint <-> plugin

Plugin is the source of truth for devices
and device health status.

devicemanager maintain healthyDevices,
unhealthyDevices, allocatedDevices based on updates
from plugin.

So there is no point for endpoint caching devices,
this patch is removing this caching layer on endpoint,

Also removing the Manager.Devices() since i didn't
find any caller of this other than test, i am adding a
notification channel to facilitate testing,

If we need to get all devices from manager in future,
it just need to return healthyDevices + unhealthyDevices,
we don't have to call endpoint after all.

This patch makes code more readable, data model been simplified.
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 30, 2018
@figo
Copy link
Copy Markdown
Contributor Author

figo commented Jul 30, 2018

@jiayingz @vikaschoudhary16 the code been rebased after #58755 merged, nothing really changed after the last round of review, please review and approve, thanks

Copy link
Copy Markdown
Contributor

@RenaudWasTaken RenaudWasTaken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2018
@jiayingz
Copy link
Copy Markdown
Contributor

/lgtm
/approve

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: figo, frapposelli, jiayingz, RenaudWasTaken, vikaschoudhary16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 30, 2018
@k8s-github-robot
Copy link
Copy Markdown

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link
Copy Markdown

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 742a90c into kubernetes:master Jul 30, 2018
@figo figo deleted the stateless branch July 30, 2018 23:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants