Add random interval to nodeStatusReport interval every time after an actual node status change update or restart#130919
Conversation
|
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/sig node |
|
/hold |
|
/assign @aojea @SergeyKanzhelev |
|
/retest |
977abd3 to
a2863a6
Compare
|
@aojea @SergeyKanzhelev PTAL |
|
we are post code freeze, no? I also prefer we decouple the node status from the certificate manager before we do this again #130362 , that PR is still pending approval |
|
I thought it's tomorrow. |
|
Changelog suggestion (is this correct?) -Add a one-time random duration of up to 50% of kubelet's nodeStatusReportFrequency to help spread the node status update load evenly over time.
+Added a delay to node updates after kubelet startup. A random offset, based on the configured `nodeStatusReportFrequency`, helps spread the traffic and load (due to node status updates) more evenly over time. The initial status update can be up to 50% earlier or 50% later than the regular schedule. |
|
/assign @mrunalp |
…actual node status change update or restart
a2863a6 to
b8a2127
Compare
| // updates to reach the apiserver at the same time. | ||
| // When condition 3 happens, random interval has already been used, and we want to reset the random delay, so that | ||
| // the node updates its status with fixed interval going forward. | ||
| if changed || kl.lastStatusReportTime.IsZero() { |
There was a problem hiding this comment.
Technically we will never get to kl.lastStatusReportTime.IsZero() here because isUpdateStatusPeriodExpired would return true in that case. (We would get kl.clock.Since(0) >= ...)
This is also seen in the first test case.
It still clarifies the logic so from that perspective it may still be good to have here.
There was a problem hiding this comment.
I don't think this comment is accurate. If kubelet just restarted and changed==false, we will have shouldPatchNodeStatus==true and we will use kl.lastStatusReportTime.IsZero() in this if statement to still apply the non-zero delayAfterNodeStatusChange. What am I missing?
Short of resolving this comment I think this PR is lgtm
There was a problem hiding this comment.
Ah you are correct, my bad!
There was a problem hiding this comment.
no worries, thank you for confirming.
|
@aojea @SergeyKanzhelev PTAL |
SergeyKanzhelev
left a comment
There was a problem hiding this comment.
/lgtm
/approve
/hold
keeping hold for @aojea to chime in.
|
LGTM label has been added. DetailsGit tree hash: 83c7574f41c8959a5d47c1dac69af1c13b2af400 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mengqiy, SergeyKanzhelev The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/assign @aojea |
|
/lgtm /hold cancel Thanks |
Adding one time [-50% , +50%] randomness to
nodeStatusReportFrequencyafter node status update or kubelet restart.It helps spread the load from kubelet evenly
#128640 introduced a bug that caused
lastStatusReportTimein kubelet not to be updated after a kubelet restart. As a consequence, the kubelet cert generation depends on this to be populated.See: #128640 (review)
Due to this bug, the previous commit was rolled back in master (1.33) and 1.32.
Zero value in lastStatusReportTime is now treated differently from last attempt.
What type of PR is this?
/kind feature
What this PR does / why we need it:
The node status update traffic from kubelet can be almost synchronized in some scenarios and caused high CPU spikes. e.g. #124202
Which issue(s) this PR fixes:
Fixes #124202
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: