Add random interval to nodeStatusReport interval every time after an actual node status change#128640
Conversation
…actual node status change update TestUpdateNodeStatusWithLease this time to avoid flakiness
|
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/assign @SergeyKanzhelev |
| // You will add up to 50% of nodeStatusReportFrequency of additional random latency for | ||
| // kubelet to determine if update node status is needed due to time passage. We need to | ||
| // take that into consideration to ensure this test pass all time. | ||
| kubelet.nodeStatusReportFrequency = 30 * time.Second |
There was a problem hiding this comment.
This is the change to defalke TestUpdateNodeStatusWithLease .
There was a problem hiding this comment.
did you run this test like 100 times in a loop locally?
There was a problem hiding this comment.
this change makes sense to me. Once you run it many times in a row, please ping
There was a problem hiding this comment.
@mengqiy @SergeyKanzhelev for flake proof we ask people to use the stress tool, for reference https://pkg.go.dev/golang.org/x/tools/cmd/stress
go test -c -race ./pkg/kubelet/
stress ./kubelet.test -test.run=TestUpdateNodeStatusWithLease
5s: 145 runs so far, 0 failures
10s: 337 runs so far, 0 failures
15s: 536 runs so far, 0 failures
20s: 738 runs so far, 0 failures
25s: 938 runs so far, 0 failures
30s: 1160 runs so far, 0 failures
35s: 1365 runs so far, 0 failures
40s: 1562 runs so far, 0 failures
45s: 1766 runs so far, 0 failures
This /lgtm
There was a problem hiding this comment.
for flake proof we ask people to use the stress tool, for reference https://pkg.go.dev/golang.org/x/tools/cmd/stress
TIL
Thank you!
I run go test -v -count=1 ./pkg/kubelet/... 100 times locally in a loop. All are passing.
There was a problem hiding this comment.
is not the same, we should require proof with stress for flakes,
|
/lgtm |
|
LGTM label has been added. DetailsGit tree hash: c2ed62560b9aadd016479b28db8c0d99748e971c |
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dims, mengqiy The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| if kl.lastStatusReportTime.IsZero() { | ||
| return false | ||
| } |
There was a problem hiding this comment.
This added a change behavior that regressed the kubelet #130305 @mengqiy @SergeyKanzhelev
The devil is in the details , before
shouldPatchNodeStatus := changed || kl.clock.Since(kl.lastStatusReportTime) >= kl.nodeStatusReportFrequency
but kl.clock.Since(kl.lastStatusReportTime) >= kl.nodeStatusReportFrequency if kl.lastStatusReportTime is time.Time{} or time.Zero is always true time.Since(time.Time{})
and in this patch we changed this meaning , so before it always updated the status after start but now , if there are no changes, it will never update the status
@liggitt is recommending to revert , and backport the revert so we can rework it in the current branch
and I agree with him, after seeing the issue in #130001 and all the indirect dependencies associated to the status update we can not be sure that fix forwarding we don't find a new weird bug, this took @lentzi90 a lot of work to figure it out, who could imagine that the certificate creation was dependong on the node status update. cc: @dims
There was a problem hiding this comment.
I agree that let's revert it and rework it in 1.33.
Adding one time [-50% , +50%] randomness to
nodeStatusReportFrequencyafter initial node status update.It helps spread the load from kubelet evenly
This is the 2nd attempt of #128394. Previous PR was rolled back due to it caused flakiness to
TestUpdateNodeStatusWithLeaseunit.This PR contains everything in #128394 + one line code change to
TestUpdateNodeStatusWithLease.TestUpdateNodeStatusWithLeaseis not new test that were introduced by #128394.TestUpdateNodeStatusWithLeasewas failing after we merge #128394 because inTestUpdateNodeStatusWithLeasenodeStatusReportFrequencyis set to 1m and the test expect an node status update to happen after 1m. But after #128394, this chance becomes 50%.How I fix it in this PR is to
nodeStatusReportFrequencyto 30s so that we can always expect an node status update to happen due to time passage.What type of PR is this?
/kind feature
What this PR does / why we need it:
The node status update traffic from kubelet can be almost synchronized in some scenarios and caused high CPU spikes. e.g. #124202
Which issue(s) this PR fixes:
Fixes #124202
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: