Add random interval to nodeStatusReport interval every time after an actual node status change by mengqiy · Pull Request #128640 · kubernetes/kubernetes

mengqiy · 2024-11-07T04:38:58Z

Adds a one-time random offset in [-50%, +50%] of nodeStatusReportFrequency after the initial node status update.
This helps spread the node status update load from kubelets evenly over time.

This is the second attempt at #128394. The previous PR was rolled back because it made the TestUpdateNodeStatusWithLease unit test flaky.

This PR contains everything in #128394 plus a one-line change to TestUpdateNodeStatusWithLease.

TestUpdateNodeStatusWithLease is not a new test introduced by #128394.
It started failing after #128394 merged because the test sets nodeStatusReportFrequency to 1m and expects a node status update to happen after 1m. With the random offset from #128394, the chance of that update happening drops to 50%.
This PR fixes it by setting nodeStatusReportFrequency to 30s in the test, so that a node status update due to time passage can always be expected.

What type of PR is this?

/kind feature

What this PR does / why we need it:

The node status update traffic from kubelets can become almost synchronized in some scenarios and cause high CPU spikes, e.g. #124202.

Which issue(s) this PR fixes:

Fixes #124202

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add a one-time random duration of up to 50% of kubelet's nodeStatusReportFrequency to help spread the node status update load evenly over time.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

…actual node status change update TestUpdateNodeStatusWithLease this time to avoid flakiness

k8s-ci-robot · 2024-11-07T04:39:07Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

mengqiy · 2024-11-07T04:39:31Z

/assign @SergeyKanzhelev

mengqiy · 2024-11-07T04:43:16Z

+	// You will add up to 50% of nodeStatusReportFrequency of additional random latency for
+	// kubelet to determine if update node status is needed due to time passage. We need to
+	// take that into consideration to ensure this test pass all time.
+	kubelet.nodeStatusReportFrequency = 30 * time.Second


This is the change to deflake TestUpdateNodeStatusWithLease.

did you run this test like 100 times in a loop locally?

this change makes sense to me. Once you run it many times in a row, please ping

@mengqiy @SergeyKanzhelev for flake proof we ask people to use the stress tool, for reference https://pkg.go.dev/golang.org/x/tools/cmd/stress

go test -c -race ./pkg/kubelet/
stress ./kubelet.test -test.run=TestUpdateNodeStatusWithLease
5s: 145 runs so far, 0 failures
10s: 337 runs so far, 0 failures
15s: 536 runs so far, 0 failures
20s: 738 runs so far, 0 failures
25s: 938 runs so far, 0 failures
30s: 1160 runs so far, 0 failures
35s: 1365 runs so far, 0 failures
40s: 1562 runs so far, 0 failures
45s: 1766 runs so far, 0 failures

This /lgtm

for flake proof we ask people to use the stress tool, for reference https://pkg.go.dev/golang.org/x/tools/cmd/stress

TIL
Thank you!

I ran go test -v -count=1 ./pkg/kubelet/... 100 times locally in a loop. All runs passed.

That is not the same; we should require proof with stress for flakes.

aojea · 2024-11-07T09:15:12Z

/lgtm
seems it does not flake now #128640 (comment)

k8s-ci-robot · 2024-11-07T09:15:19Z

LGTM label has been added.

Details

Git tree hash: c2ed62560b9aadd016479b28db8c0d99748e971c

dims · 2024-11-07T11:56:41Z

seems it does not flake now #128640 (comment)

/approve
/lgtm

k8s-ci-robot · 2024-11-07T11:56:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, mengqiy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [dims]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aojea · 2025-02-21T15:36:27Z

+	if kl.lastStatusReportTime.IsZero() {
+		return false
+	}


This introduced a behavior change that regressed the kubelet: #130305 @mengqiy @SergeyKanzhelev

The devil is in the details. Before:

shouldPatchNodeStatus := changed || kl.clock.Since(kl.lastStatusReportTime) >= kl.nodeStatusReportFrequency

but if kl.lastStatusReportTime is the zero value time.Time{}, then kl.clock.Since(kl.lastStatusReportTime) >= kl.nodeStatusReportFrequency is always true, because time.Since(time.Time{}) is far larger than any report frequency.

This patch changed that meaning: before, the kubelet always updated the status after start, but now, if there are no changes, it will never update the status.

@liggitt recommends reverting and backporting the revert, so we can rework it in the current branch

#130305 (comment)

and I agree with him. After seeing the issue in #130001 and all the indirect dependencies on the status update, we cannot be sure that fixing forward won't surface a new weird bug. It took @lentzi90 a lot of work to figure this out; who could imagine that the certificate creation depended on the node status update? cc: @dims

+1 to revert @aojea

cc @mengqiy

I agree, let's revert it and rework it in 1.33.

Add random interval to nodeStatusReport interval every time after an …

1003d36

…actual node status change update TestUpdateNodeStatusWithLease this time to avoid flakiness

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 7, 2024

k8s-ci-robot requested review from rphillips and yujuhong November 7, 2024 04:39

k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 7, 2024

k8s-ci-robot assigned SergeyKanzhelev Nov 7, 2024

mengqiy commented Nov 7, 2024

This was referenced Nov 7, 2024

deflake TestUpdateNodeStatusWithLease test #128636

Closed

Revert "Add random interval to nodeStatusReport interval every time after an actual node status change #128629

Merged

k8s-ci-robot assigned aojea Nov 7, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2024

k8s-ci-robot assigned dims Nov 7, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2024

k8s-ci-robot merged commit c9024e7 into kubernetes:master Nov 7, 2024

k8s-ci-robot added this to the v1.32 milestone Nov 7, 2024

mengqiy deleted the spreadkubeletlaod branch November 7, 2024 16:42

aramase mentioned this pull request Nov 7, 2024

TestUpdateNodeStatusWithLease failing in pull-kubernetes-unit #128633

Closed

This was referenced Feb 20, 2025

Kubelet serving CSR never created #130001

Closed

Ensure that kubelet updates state on restart #130305

Closed

aojea reviewed Feb 21, 2025

aojea mentioned this pull request Feb 21, 2025

Revert "Add random interval to nodeStatusReport interval every time after an actual node status change" #130348

Merged

mengqiy mentioned this pull request Mar 19, 2025

Add random interval to nodeStatusReport interval every time after an actual node status change update or restart #130919

Merged

This was referenced Oct 29, 2025

Implement STATUS and GC commands for CNI v1.1.0 cybozu-go/coil#292

Open

coild can create a veth pair without node address on the hostLink cybozu-go/coil#350

Closed
