Skip to content

Use node addresses from informer in kubelet certificate manager#130362

Merged
k8s-ci-robot merged 1 commit into
kubernetes:masterfrom
aojea:node_addresses_init
Apr 25, 2025
Merged

Use node addresses from informer in kubelet certificate manager#130362
k8s-ci-robot merged 1 commit into
kubernetes:masterfrom
aojea:node_addresses_init

Conversation

@aojea

@aojea aojea commented Feb 22, 2025

Copy link
Copy Markdown
Member

The kubelet certificate manager was using a closure to get the node addresses, but this closure depended on a static field that was only updated during the node status update. This created a twisted dependency between the node.status reconcile loops and the certificate manager.

This commit fixes this issue by using the node informer to get the node addresses directly. This ensures that the kubelet always requests a certificate with the latest node addresses.

/kind bug
/kind cleanup

Fixes #130001

NONE

/sig node
/sig network

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Feb 22, 2025
@aojea

aojea commented Feb 22, 2025

Copy link
Copy Markdown
Member Author

/assign @liggitt @dims
/cc @lentzi90 @mengqiy

@dims

dims commented Feb 22, 2025

Copy link
Copy Markdown
Member

cc @cartermckinnon @mmerkes @tzneal

Comment thread pkg/kubelet/certificate/kubelet.go Outdated

@lentzi90 lentzi90 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is such a nice improvement! Thanks!
LGTM from my side

Comment thread pkg/kubelet/certificate/kubelet.go
Comment thread pkg/kubelet/certificate/kubelet.go Outdated
@aojea aojea force-pushed the node_addresses_init branch from 37e7c91 to 2b18dca Compare February 25, 2025 08:10
Comment thread pkg/kubelet/kubelet_getters.go Outdated
// wait for the cloud provider to initialize the node addresses
for _, taint := range node.Spec.Taints {
if taint.Key == cloudproviderapi.TaintExternalCloudProvider {
return nil

@cartermckinnon cartermckinnon Feb 25, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So kubelet will no longer request a cert until the taint is removed? I think that makes sense, given the external cloud provider "owns" the address list, but this will lengthen the period when kubelet is ready/accepting pods without having any serving cert. Can you add a log line here for visibility?

Somewhat related to the discussion in: #73047

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, related to this #125813

The node addresses are updated just after the taint is removed when the node is initialized by the cloud controller manager

_, err = cnc.kubeClient.CoreV1().Nodes().Update(ctx, newNode, metav1.UpdateOptions{})
if err != nil {
return err
}
removeCloudProviderTaintDelay.Observe(time.Since(newNode.ObjectMeta.CreationTimestamp.Time).Seconds())
// After adding, call UpdateNodeAddress to set the CloudProvider provided IPAddresses
// So that users do not see any significant delay in IP addresses being filled into the node
cnc.updateNodeAddress(ctx, newNode, instanceMetadata)

Bear in mind that the CSR requires to have at least one IP , that mean there should be one IP in node.Status.Addresses

// by default, require at least one IP before requesting a serving certificate
hasRequiredAddresses := len(ips) > 0

But I'm not sure of what path we want to follow:

  • no external cloud provider, everything is the same
  • external cloud provider
    • no --node-ip specified, effectively the behavior should be the same, as there will be no IPs on node.status.addresses
    • node-ip specified, this is the tricky part per kubelet: cloud-provider external addresses #121028
      • ip matches cloud-provider addresses, only one CSR is sent and we'll regress on node startup
      • ip partially matches, two CSR are sent, but it may regress or not
      • ip does not match, two CSR are sent but only the last one will be the one matching the actual addresses.

We have been so many times back and forth with node.status.address that now I think we should remove this check, and keep the existing behavior, if someone is specifies the --node-ip with the flag, then it may want to send the CSR with that IP ...

@aojea aojea force-pushed the node_addresses_init branch from 2b18dca to 269cf8b Compare February 27, 2025 13:59
@aojea

aojea commented Feb 27, 2025

Copy link
Copy Markdown
Member Author

Pushed last version based on discussion with @cartermckinnon in #130362 (comment), this keeps the same behavior, just removes the dependency on the node status loops by using the local informer. Based on the last year experience better the devil you know than the devil you don't

Comment thread pkg/kubelet/kubelet_getters.go
return nil, err
}
kl.lastStatusReportTime = kl.clock.Now()
kl.setLastObservedNodeAddresses(updatedNode.Status.Addresses)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cartermckinnon this will also improve the latency on cloud providers, since it will not have to wait for the patch of the kubelet,

@bart0sh

bart0sh commented Feb 27, 2025

Copy link
Copy Markdown
Contributor

/triage accepted

@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

5 similar comments
@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-triage-robot

Copy link
Copy Markdown

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@enj

enj commented Mar 30, 2025

Copy link
Copy Markdown
Member

/lgtm cancel

To stop CI spam.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2025
@cartermckinnon

cartermckinnon commented Apr 24, 2025

Copy link
Copy Markdown
Contributor

/retest

Code freeze is over, still want to get this one in @aojea ?

@cartermckinnon

Copy link
Copy Markdown
Contributor

Might need a rebase to fix the tests?

The kubelet certificate manager was using a closure to get the node addresses,
but this closure depended on a static field that was only updated during the node
status update. This created a twisted dependency between the node.status
reconcile loops and the certificate manager.

This commit fixes this issue by using the node informer to get the node addresses directly.
This ensures that the kubelet always requests a certificate with the latest node addresses.
@aojea aojea force-pushed the node_addresses_init branch from 174a9c8 to 1214dc2 Compare April 25, 2025 03:15
@aojea

aojea commented Apr 25, 2025

Copy link
Copy Markdown
Member Author

Might need a rebase to fix the tests?

it shouldn't but it will retrigger the jobs, @cartermckinnon can you lgtm again?

@aojea

aojea commented Apr 25, 2025

Copy link
Copy Markdown
Member Author

/retest

streamtranslator_test.go:249: # HELP apiserver_stream_translator_requests_total [ALPHA] Total number of requests that were handled by the StreamTranslatorProxy, which processes streaming RemoteCommand/V5
# TYPE apiserver_stream_translator_requests_total counter
apiserver_stream_translator_requests_total{code="200"} 1
-apiserver_stream_translator_requests_total{code="400"} 1

the window unit test are panicing and look very unstable https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-kubernetes-unit-windows-master , they should not be informing in current state since create confusion

@k8s-ci-robot

k8s-ci-robot commented Apr 25, 2025

Copy link
Copy Markdown
Contributor

@aojea: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-unit-windows-master 1214dc2 link false /test pull-kubernetes-unit-windows-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@aojea

aojea commented Apr 25, 2025

Copy link
Copy Markdown
Member Author

/lgtm cancel

To stop CI spam.

someone can lgtm again, the windows job does not seem related and is also not very stable

@BenTheElder

Copy link
Copy Markdown
Member

/skip

@BenTheElder

Copy link
Copy Markdown
Member

/lgtm
re-applying

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2025
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 32408a9c0c81fb45bf0a57f10d517c88e77a6521

@k8s-ci-robot k8s-ci-robot merged commit e2ccbd2 into kubernetes:master Apr 25, 2025
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Apr 25, 2025
@github-project-automation github-project-automation Bot moved this from Needs Reviewer to Done in SIG Node: code and documentation PRs Apr 25, 2025
@github-project-automation github-project-automation Bot moved this from In Review to Closed / Done in SIG Auth Apr 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Kubelet serving CSR never created