Skip to content

Add random interval to nodeStatusReport interval every time after an actual node status change update or restart#130919

Merged
k8s-ci-robot merged 1 commit into
kubernetes:masterfrom
mengqiy:kubelet_rand_interval
Jun 19, 2025
Merged

Add random interval to nodeStatusReport interval every time after an actual node status change update or restart#130919
k8s-ci-robot merged 1 commit into
kubernetes:masterfrom
mengqiy:kubelet_rand_interval

Conversation

@mengqiy

@mengqiy mengqiy commented Mar 19, 2025

Copy link
Copy Markdown
Member

Adding one time [-50% , +50%] randomness to nodeStatusReportFrequency after node status update or kubelet restart.
It helps spread the load from kubelet evenly

#128640 introduced a bug that caused lastStatusReportTime in kubelet not to be updated after a kubelet restart. As a consequence, the kubelet cert generation depends on this to be populated.
See: #128640 (review)
Due to this bug, the previous commit was rolled back in master (1.33) and 1.32.

Zero value in lastStatusReportTime is now treated differently from last attempt.

What type of PR is this?

/kind feature

What this PR does / why we need it:

The node status update traffic from kubelet can be almost synchronized in some scenarios and caused high CPU spikes. e.g. #124202

Which issue(s) this PR fixes:

Fixes #124202

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added a delay to node updates after kubelet startup. A random offset, based on the configured `nodeStatusReportFrequency`, helps spread the traffic and load (due to node status updates) more evenly over time. The initial status update can be up to 50% earlier or 50% later than the regular schedule.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 19, 2025
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 19, 2025
@mengqiy

mengqiy commented Mar 19, 2025

Copy link
Copy Markdown
Member Author

/sig node

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 19, 2025
@mengqiy

mengqiy commented Mar 19, 2025

Copy link
Copy Markdown
Member Author

/hold
to ensure multiple people can review it.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 19, 2025
@mengqiy

mengqiy commented Mar 19, 2025

Copy link
Copy Markdown
Member Author

/assign @aojea @SergeyKanzhelev
/cc @dims @hakuna-matatah

@mengqiy

mengqiy commented Mar 19, 2025

Copy link
Copy Markdown
Member Author

/retest

Comment thread pkg/kubelet/kubelet_node_status_test.go Outdated
Comment thread pkg/kubelet/kubelet_node_status.go Outdated
Comment thread pkg/kubelet/kubelet_node_status_test.go
@mengqiy mengqiy force-pushed the kubelet_rand_interval branch from 977abd3 to a2863a6 Compare March 21, 2025 04:50
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 21, 2025
@mengqiy

mengqiy commented Mar 21, 2025

Copy link
Copy Markdown
Member Author

@aojea @SergeyKanzhelev PTAL

@aojea

aojea commented Mar 21, 2025

Copy link
Copy Markdown
Member

we are post code freeze, no?

I also prefer we decouple the node status from the certificate manager before we do this again #130362 , that PR is still pending approval

@mengqiy

mengqiy commented Mar 21, 2025

Copy link
Copy Markdown
Member Author

I thought it's tomorrow.
Make sense. I will rebase once #130362 is merged

@sftim

sftim commented Mar 25, 2025

Copy link
Copy Markdown
Contributor

Changelog suggestion (is this correct?)

-Add a one-time random duration of up to 50% of kubelet's nodeStatusReportFrequency to help spread the node status update load evenly over time.
+Added a delay to node updates after kubelet startup. A random offset, based on the configured `nodeStatusReportFrequency`, helps spread the traffic and load (due to node status updates) more evenly over time. The initial status update can be up to 50% earlier or 50% later than the regular schedule.

@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node: code and documentation PRs Apr 7, 2025
@dims

dims commented Apr 30, 2025

Copy link
Copy Markdown
Member

/assign @mrunalp

@mengqiy mengqiy force-pushed the kubelet_rand_interval branch from a2863a6 to b8a2127 Compare June 17, 2025 05:55
@mengqiy

mengqiy commented Jun 17, 2025

Copy link
Copy Markdown
Member Author

Rebased on top of #130362
@sftim Updated the release note as you suggested.

// updates to reach the apiserver at the same time.
// When condition 3 happens, random interval has already been used, and we want to reset the random delay, so that
// the node updates its status with fixed interval going forward.
if changed || kl.lastStatusReportTime.IsZero() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we will never get to kl.lastStatusReportTime.IsZero() here because isUpdateStatusPeriodExpired would return true in that case. (We would get kl.clock.Since(0) >= ...)
This is also seen in the first test case.
It still clarifies the logic so from that perspective it may still be good to have here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is accurate. If kubelet just restarted and changed==false, we will have shouldPatchNodeStatus==true and we will use kl.lastStatusReportTime.IsZero() in this if statement to still apply the non-zero delayAfterNodeStatusChange. What am I missing?

Short of resolving this comment I think this PR is lgtm

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you are correct, my bad!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no worries, thank you for confirming.

@mengqiy

mengqiy commented Jun 17, 2025

Copy link
Copy Markdown
Member Author

@aojea @SergeyKanzhelev PTAL

@SergeyKanzhelev SergeyKanzhelev left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
/hold

keeping hold for @aojea to chime in.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2025
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 83c7574f41c8959a5d47c1dac69af1c13b2af400

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mengqiy, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2025
@mengqiy

mengqiy commented Jun 19, 2025

Copy link
Copy Markdown
Member Author

/assign @aojea

@aojea

aojea commented Jun 19, 2025

Copy link
Copy Markdown
Member

/lgtm

/hold cancel

Thanks

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot merged commit 0f478e5 into kubernetes:master Jun 19, 2025
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 19, 2025
@github-project-automation github-project-automation Bot moved this from Needs Reviewer to Done in SIG Node: code and documentation PRs Jun 19, 2025
@mengqiy mengqiy deleted the kubelet_rand_interval branch June 19, 2025 23:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Development

Successfully merging this pull request may close these issues.

kubelet periodic node status update should not be aligned

10 participants