Skip to content

Check Node serving certificate in Node readiness check#73047

Closed
awly wants to merge 1 commit into
kubernetes:masterfrom
awly:node-serving-cert-health
Closed

Check Node serving certificate in Node readiness check#73047
awly wants to merge 1 commit into
kubernetes:masterfrom
awly:node-serving-cert-health

Conversation

@awly

@awly awly commented Jan 17, 2019

Copy link
Copy Markdown
Contributor

This only triggers when Node serving certificate bootstrap is enabled.
Without successful bootstrap, Node can't serve exec/logs calls from
master.

This PR will mark Node NotReady until certificate is provisioned or when
it expires.

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

When serving certificate bootstrap is enabled in kubelet (serverTLSBootstrap=true in kubelet config), Node will be marked NotReady until certificate is provisioned.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 17, 2019
@awly

awly commented Jan 17, 2019

Copy link
Copy Markdown
Contributor Author

/assign @liggitt
/sig auth

@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 17, 2019
Comment thread pkg/kubelet/nodestatus/setters.go Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only affects readiness if the kubelet has serving cert rotation enabled, right? are there other readiness checks that can go bad after startup like this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only affects readiness if the kubelet has serving cert rotation enabled, right?

Yes, if static serving cert is used this check will never trigger. Ideally we could also watch static cert's expiration. Is it useful for this PR?

are there other readiness checks that can go bad after startup like this?

IIUC, yes. runtimeErrorsFunc and networkErrorsFunc rely on periodic health checks of CRI and CNI. If those start failing, node status flips.

Comment thread pkg/kubelet/kubelet_node_status.go Outdated

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a doubt that how we are going to handle if k1.serverCertificateManager = nil.? Thanks in advance

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When k1.serverCertificateManager == nil, serving certificate rotation is not enabled.
Node status will not be tied to serving certificate status. So (assuming all other checks here pass) Node will come up Ready.

@mourya007

Copy link
Copy Markdown

/test pull-kubernetes-local-e2e-containerized
I believe It should pass normally.

@awly

awly commented Jan 28, 2019

Copy link
Copy Markdown
Contributor Author

Ping @liggitt
/retest

@liggitt

liggitt commented Jan 29, 2019

Copy link
Copy Markdown
Member

/retest

I think this makes sense, but had a few questions:

  • do static pods block until ready?
  • does going notready cause existing pods to get evicted? if so, immediately or after a delay?

also, the help around the rotate server certificates flag and config options needs to be updated

This only triggers when Node serving certificate bootstrap is enabled.
Without successful bootstrap, Node can't serve exec/logs calls from
master.

This PR will mark Node NotReady until certificate is provisioned or when
it expires.
@awly

awly commented Jan 30, 2019

Copy link
Copy Markdown
Contributor Author

do static pods block until ready?

They shouldn't. In fact, static pods should run even if kubelet fails to bootstrap client credentials.

does going notready cause existing pods to get evicted? if so, immediately or after a delay?

Yes, after pod-eviction-timeout (kube-controller-manager flag, default is 5min).

the help around the rotate server certificates flag and config options needs to be updated

Done

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: awly
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: liggitt

If they are not already assigned, you can assign the PR to them by writing /assign @liggitt in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

@awly: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-local-e2e-containerized a45203e link /test pull-kubernetes-local-e2e-containerized

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@liggitt

liggitt commented Jan 30, 2019

Copy link
Copy Markdown
Member

thinking through this some more... all the other readiness checks (to my knowledge) are oriented around the ability of a node to run a workload (e.g. can the network be set up for it, is the cri ready, are storage drivers ready, etc).

This is adding a condition solely oriented around whether calls from the apiserver could be made to the pod, once it was running. Is the ability of the node to accept exec/attach/log calls an inherent part of running a pod? If this is the first of this category of readiness check to be made on a node, would like an ack from @kubernetes/sig-node-pr-reviews that this is coherent with intended use of node readiness

@awly

awly commented Jan 31, 2019

Copy link
Copy Markdown
Contributor Author

This came from conformance tests (

It("should support exec", func() {
) failing on large GKE clusters after we enabled signing on serving kubelet certs.
This seems to indicate that working exec/logs/attach are currently required to mark a cluster "conformant".

Whether or not this is a true requirement or just overzealous tests - I don't know.

@awly

awly commented Feb 14, 2019

Copy link
Copy Markdown
Contributor Author

Ping @liggitt @kubernetes/sig-node-pr-reviews

@smarterclayton

smarterclayton commented Feb 16, 2019

Copy link
Copy Markdown
Contributor

Just because a node is "conformant" doesn't mean we want the node to be marked not ready because the cert is being rotated. I'm kind of uncomfortable making nodes not ready to run workloads just because some workloads that need logs can't run. Others potentially could run without logs.

I'm not opposed to a condition being exposed, but I think it's a bigger step to make it part of ready.

Changing node readiness has a lot of implications as Jordan noted - while this may feel correct, have we thought how it would impact people who have already deployed rotation in production? Do we have any user feedback on it?

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

@awly: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 9, 2019
@awly

awly commented Apr 10, 2019

Copy link
Copy Markdown
Contributor Author

Abandoning this PR.
Looks like it's a more controversial change than I assumed. It's not hurting us too much, only when running conformance tests on large clusters.

If this gets problematic enough, sounds like some retries in the test would be sufficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants