Skip to content

feat: parallelize nodes update#1133

Merged
k8s-ci-robot merged 1 commit into
kubernetes-sigs:masterfrom
AhmedGrati:feat-parallelize-nodes-update
Jun 2, 2023
Merged

feat: parallelize nodes update#1133
k8s-ci-robot merged 1 commit into
kubernetes-sigs:masterfrom
AhmedGrati:feat-parallelize-nodes-update

Conversation

@AhmedGrati

Copy link
Copy Markdown

Resolves #1096.

@k8s-ci-robot k8s-ci-robot requested review from jjacobelli and kad April 14, 2023 09:28
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 14, 2023
@netlify

netlify Bot commented Apr 14, 2023

Copy link
Copy Markdown

Deploy Preview for kubernetes-sigs-nfd ready!

Name Link
🔨 Latest commit b3cfe17
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-nfd/deploys/6479c77565ad3a000871bb62
😎 Deploy Preview https://deploy-preview-1133--kubernetes-sigs-nfd.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 14, 2023

@marquiz marquiz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AhmedGrati for the PR.

I think we need to throttle the max parallelism (i.e. how many concurrent go routines are generated) somehow. Say we have a cluster with 2000 nodes we don't probably want the fire 2000 goroutines at the same time

/milestone v0.14.0

@k8s-ci-robot k8s-ci-robot added this to the v0.14.0 milestone Apr 14, 2023
@PiotrProkop

Copy link
Copy Markdown
Contributor

Can we encorporate workqueue here? Have a flag than configures the number of consumers(which will update specific node) and we will just workqueue.Add nodes when we want to update them.

We can use https://pkg.go.dev/k8s.io/client-go/util/workqueue for example.

@marquiz

marquiz commented Apr 14, 2023

Copy link
Copy Markdown
Contributor

Can we encorporate workqueue here?

Yeah, for example something like this

@AhmedGrati AhmedGrati changed the title feat: parallelize nodes update [WIP] feat: parallelize nodes update Apr 15, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2023
@AhmedGrati

Copy link
Copy Markdown
Author

Hey @PiotrProkop, I've just come up with this design to include workqueue, and to have a goroutines pool responsible for labeling nodes.
NFD drawio
But at the same time @marquiz, suggested that we can use SizedWaitGroup.
I wanted to gather your input here, on what should we do next.

  1. Keep it simple as is now, and add the SizedWaitGroup.
  2. Implement a more sophisticated solution and add the workqueue package.

@AhmedGrati

Copy link
Copy Markdown
Author

ping @marquiz @PiotrProkop

@PiotrProkop

Copy link
Copy Markdown
Contributor

I think decoupling labeling into separate workqueue is better approach for the future. It would move labeling responsibility into separate component and we would have to just Add labeling request from the main loop.

On the other hand I know that implementing it is non trivial and I think SizedWaitGroup will also work but we would just have to make sure that we move all labeling into single SizedWaitGroup or single ErrGroup.

@marquiz WDYT?

@marquiz

marquiz commented Apr 22, 2023

Copy link
Copy Markdown
Contributor

I think decoupling labeling into separate workqueue is better approach for the future. It would move labeling responsibility into separate component and we would have to just Add labeling request from the main loop.

On the other hand I know that implementing it is non trivial and I think SizedWaitGroup will also work but we would just have to make sure that we move all labeling into single SizedWaitGroup or single ErrGroup.

I agree with @PiotrProkop here. Using workqueues would be more future proof and architecturally cleaner for the long run.

However, as @PiotrProkop said it's not trivial and I think lot more work (including tests) than the simple-and-stupid SizedWaitGroup. Using SizedWaitGroup should be safe and simple (currently) as there is only one place (and goroutine) from where we do labeling.

Thus, I'm fine with either approach. You can choose 😉

@AhmedGrati

Copy link
Copy Markdown
Author

Thanks for your input @marquiz @PiotrProkop, I'll implement it using workqueue.

@AhmedGrati AhmedGrati marked this pull request as draft May 4, 2023 14:08
@AhmedGrati AhmedGrati force-pushed the feat-parallelize-nodes-update branch from de0c0cf to d908a70 Compare May 12, 2023 21:15
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 12, 2023
@AhmedGrati AhmedGrati force-pushed the feat-parallelize-nodes-update branch 2 times, most recently from dca45f6 to 64a284e Compare May 12, 2023 21:16
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 12, 2023
@AhmedGrati AhmedGrati force-pushed the feat-parallelize-nodes-update branch 6 times, most recently from bdea8c1 to 2906059 Compare May 13, 2023 20:02
@AhmedGrati AhmedGrati force-pushed the feat-parallelize-nodes-update branch 3 times, most recently from 5f4dc3f to 9bf7037 Compare May 31, 2023 12:16
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 31, 2023
@AhmedGrati AhmedGrati force-pushed the feat-parallelize-nodes-update branch from 9bf7037 to 3a1953d Compare May 31, 2023 15:02
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2023
@marquiz

marquiz commented May 31, 2023

Copy link
Copy Markdown
Contributor

@AhmedGrati you might want to rebase once more to run the logcheck (for structured logging)

@AhmedGrati AhmedGrati force-pushed the feat-parallelize-nodes-update branch from 3a1953d to b62a63c Compare May 31, 2023 22:30
Comment thread pkg/nfd-master/node-updater-pool.go Outdated
var lock = &sync.Mutex{}
var instance *nodeUpdaterPool

func getNodeUpdaterPool(nfdMaster nfdMaster) *nodeUpdaterPool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This singleton approach looks a bit weird as you give nfdMaster as an argument but that argument is totally omitted in subsequent calls.

How about changing this to

Suggested change
func getNodeUpdaterPool(nfdMaster nfdMaster) *nodeUpdaterPool {
func newNodeUpdaterPool(nfdMaster nfdMaster) *nodeUpdaterPool

and then calling this in NewNfdMaster and storing the instance in the nfdMaster struct?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, It would be much more readable, since the nodeUpdaterPool is only used from within nfdMaster.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that result in a circular dependency?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nfdMaster contains nodeUpdaterPool and nodeUpdaterPool contains nfdMaster.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about it like doubly-linked list


defer queue.Done(nodeName)

if err := u.nfdMaster.nfdAPIUpdateOneNode(nodeName.(string)); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and if this is the only place where nfdAPIUpdateOneNode is called you can just remove this func from nfdMaster and make it nodeUpdaterPoll func, this way you don't need to pass nfdMaster to nodeUpdaterPool but prolly just Kubernetes client.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might not be that easy as nfdAPIUpdateOneNode() is very tightly coupled with nfdMaster, dependent on many fields/parameters of nfdMaster down in the call chain. The architecture could be refactored but I think that will probably be a bigger effort and better be a separate PR

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we should create issue to track this work.

@AhmedGrati

Copy link
Copy Markdown
Author

/unhold

Comment thread pkg/nfd-master/node-updater-pool.go
Comment thread pkg/nfd-master/node-updater-pool.go Outdated
Comment thread pkg/nfd-master/node-updater-pool.go Outdated
Comment thread pkg/nfd-master/node-updater-pool.go Outdated
@AhmedGrati

Copy link
Copy Markdown
Author

@marquiz any idea why is this failing?

Comment thread pkg/nfd-master/node-updater-pool.go Outdated
Comment thread pkg/nfd-master/node-updater-pool.go Outdated
Comment thread pkg/nfd-master/nfd-master.go Outdated
@AhmedGrati

Copy link
Copy Markdown
Author

/retest-required

Comment thread pkg/nfd-master/nfd-master.go Outdated
@marquiz

marquiz commented Jun 2, 2023

Copy link
Copy Markdown
Contributor

@AhmedGrati also one with: could you clean up the commit message 😇

Comment thread pkg/nfd-master/node-updater-pool.go
Comment thread pkg/nfd-master/node-updater-pool.go Outdated
This PR aims to optimize the process of updating nodes with
corresponding features. In fact, previously, we were updating nodes
sequentially even though they are independent from each other.
Therefore, we integrated new components: LabelersNodePool which is
responsible for spininng a goroutine whenever there's a request for
updating nodes, and a Workqueue which is responsible for holding nodes names
that should be updated.

Signed-off-by: AhmedGrati <ahmedgrati1999@gmail.com>

@marquiz marquiz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job and thanks for the persistence on this @AhmedGrati. I don't think I have anything to add for now -> approved

@PiotrProkop could you take look, with fresh eyes?

@marquiz marquiz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job and thanks for the persistence on this @AhmedGrati. I don't think I have anything to add for now -> approved

@PiotrProkop could you take look, with fresh eyes?

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AhmedGrati, marquiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AhmedGrati, marquiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented Jun 2, 2023

Copy link
Copy Markdown

Codecov Report

Merging #1133 (b3cfe17) into master (d64398f) will increase coverage by 0.49%.
The diff coverage is 68.83%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1133      +/-   ##
==========================================
+ Coverage   28.72%   29.22%   +0.49%     
==========================================
  Files          51       52       +1     
  Lines        7139     7219      +80     
==========================================
+ Hits         2051     2110      +59     
- Misses       4871     4883      +12     
- Partials      217      226       +9     
Impacted Files Coverage Δ
pkg/nfd-master/node-updater-pool.go 57.69% <57.69%> (ø)
pkg/nfd-master/nfd-master.go 43.81% <92.00%> (+1.92%) ⬆️

@PiotrProkop PiotrProkop left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm


defer queue.Done(nodeName)

if err := u.nfdMaster.nfdAPIUpdateOneNode(nodeName.(string)); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we should create issue to track this work.

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 046a11ad55b9f94015cfdd23dff74551e04792d3

@marquiz marquiz mentioned this pull request Sep 6, 2023
25 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhancement: parallelize nodes update

5 participants