Ensure that kubelet updates state on restart#130305
Conversation
|
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
| expectExpired bool | ||
| }{ | ||
| { | ||
| // Ensure that everything is properly initialized if the node status is not updated before |
There was a problem hiding this comment.
this is why we have the detection of changes
kubernetes/pkg/kubelet/kubelet_node_status.go
Line 485 in ad4d9b2
we are trying to be conservative with updates, can't we detect that we need to update instead of blindly update the node after the restart?
There was a problem hiding this comment.
But the issue is that there is no change. From what I understand, this updateNode is making sure the Node status is up to date. If we just restart the kubelet there is no change in the status and no reason to update anything. We need to make sure setLastObservedNodeAddresses is called once even if no change has been made. Maybe we can add it to updateNode?
There was a problem hiding this comment.
Just wanted to add that we only ever end up in this situation if the node status has NEVER (since kubelet start) been updated. So it adds a "forced" update on every kubelet (re)start, which is actually the same as the behavior before #128640. It would then update the node status after the status report interval no matter if there was a change or not.
|
Does this PR introduce an improvement that cluster operators could notice? I think it does, and so there should be a changelog entry. |
|
Updated description to add changelog entry. Edit: Nevermind, of course this should show up as a change since the change was already in v1.32.0. I don't know what I was thinking. Somehow I thought it wasn't released. I guess I am too used to testing unreleased things. |
| if kl.lastStatusReportTime.IsZero() { | ||
| return false | ||
| return true | ||
| } |
There was a problem hiding this comment.
so this is the problem, if we remove it , we keep exactly the same logic, the thing is that now is more obvius.
Can you please also expand on the comment to explain the problem in more detail and reference the bug you opened?
If the node.status is never updated after a kubelet restart, then the periodic update loop is never executed
something like that
|
/sig node Thanks for the detailed investigation, reviewers please check the referenced issue for more context, but the bottomline is that we may never do the periodic update after a kubelet restart if there are no changes on the node, as a consequence some race happens with certificates and we don't know if there are more things depending on this |
If the last status report time is zero, we consider the update status period expired to ensure that the status is properly updated on restarts. Otherwise, there is a risk that the kubelets internal state (e.g. addresses and hostname) is never set until there is a change in the node's status. Signed-off-by: Lennart Jern <lennart.jern@est.tech>
c76a52d to
0e38b0a
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: lentzi90 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| if !changed && !kl.isUpdateStatusPeriodExpired() { | ||
| kl.markVolumesFromNode(node) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
if updateNode returns changed==false, meaning there is no change in what we want to report for the node status, why is it important to call patchNodeStatus once on node startup?
There was a problem hiding this comment.
I don't really understand the description of this PR - "there is a risk that the kubelets internal state (e.g. addresses and hostname) is never set until there is a change in the node's status" ... shouldn't changed be true if that is the case?
There was a problem hiding this comment.
Sorry, I have updated the description to clarify. The gist is that the kubelet service certificate CSR cannot be created until getLastObservedNodeAddresses returns an actual address, and that doesn't happen until setLastObservedNodeAddresses has been called, which doesn't happen anywhere else than patchNodeStatus.
Note that the current behavior is also a clear change from how it was before #128640 and that it doesn't respect --node-status-update-frequency until the node status has changed the first time. After that it will happily update it regularly, but before there is any change, it will just wait indefinitely.
There was a problem hiding this comment.
I see. I think we should revert #128640, pick the revert to 1.32, and rework it in 1.33, since that caused the regression... the isUpdateStatusPeriodExpired function is really hard to reason about for zero-value timestamps
|
/hold we are going with the revert #130348 |
|
@lentzi90 if you depend on this behavior I encourage you to create an unit or integration test that ensure this does not regress anymore in the future, it can be tricky but it will be worth it if you depend on it |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Closing, since the revert makes the issue go away. I will try to figure out how to make a test that can catch this |
What type of PR is this?
/kind bug
What this PR does / why we need it:
If the last status report time is zero, we consider the update status period expired to ensure that the status is properly updated on restarts. Otherwise,
setLastObservedNodeAddressesis never called until there is a change in the node's status. The--node-status-update-frequencyflag is also not respected, since the status will not be updated until there is an actual change.After the first update to the nodes status (when
lastStatusReportTimeis updated to a non-zero value), the updates continue as expected, respecting the--node-status-update-frequency. So this is only about the very first update after a kubelet (re)start.The behavior changed in #128640, from
to
kubernetes/pkg/kubelet/kubelet_node_status.go
Lines 605 to 608 in ad4d9b2
Note that the zero case changed. The status used to be forcefully updated if the
lastStatusReportTimewas zero. Now it is not touched in that case.The specific thing that this change breaks for us is described in more detail in #130001. The gist is that the kubelet server certificate CSR cannot be created until
getLastObservedNodeAddressesreturns an actual address. This function relies on the kubelets internal state, NOT the node status, and this internal state is not set untilpatchNodeStatushas been called because that is the only place setLastObservedNodeAddresses is called.This all means that after a kubelet (re)start, as long as the node status hasn't changed, the kubelet cannot create the CSR for the serving certificate. And the
--node-status-update-frequencyflag isn't respected.Which issue(s) this PR fixes:
Fixes #130001
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: