Add healthz check for KMS Providers on kube-apiserver. #78540
k8s-ci-robot merged 1 commit into kubernetes:master
Conversation
/assign @awly

This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.

/test pull-kubernetes-bazel-build

/test pull-kubernetes-bazel-test
Could healthz probes be passed over the plugin Endpoint above?
I assume that you meant to create a dedicated type for healthz probe. If I understood you correctly, the latest commit accomplishes this.
I meant adding a healthz RPC to the KMS plugin itself on the gRPC service.
So you could send those healthz requests over KMSConfiguration.Endpoint (the unix socket) instead of a separate new HTTP endpoint
I think this is a good idea, but my survey of the currently implemented plugins shows that they do not have this functionality yet.
This is something we could add later.
Do currently implemented plugins already expose a HTTP endpoint for healthz?
If not, you could put in a healthz grpc endpoint and just skip polling health if it returns "unimplemented" (or whatever error grpc will send).
Yes, Google CloudKMS Plugin and AWS CloudHSM plugins have implemented healthz. Hence, these plugins could take advantage of this feature right away.
So this only happens on kube-apiserver startup, right?
What if kms plugin becomes unhealthy later on? I assume it will just show up as errors from all Secrets operations?
Yes, this is only to ensure that kube-apiserver comes up only after kms-plugin reports healthz OK.
KMS-Plugin may have its own liveness probe.
And yes, kube-apiserver will log and increment failure metrics when the plugin is down.
Could this only return the error? If arg parsing fails or if healthz polling fails, return a non-nil error
Maybe, but I want to differentiate between these scenarios:
- Healthz could not be checked due to configuration errors - false, error
- Healthz returned a non-200 status - false, nil
- Healthz returned 200 - true, nil
I think this approach fits the signature of wait.Poll.
Hmm, but shouldn't you retry even if healthz endpoint is unreachable?
For example when KMS plugin starts later than kube-apiserver.
If yes, then there's no difference between non-200 status and any request errors
Let me step back.
wait.PollImmediate takes a condition function of the following signature:
type ConditionFunc func() (done bool, err error)
Therefore, getHealthz must match that signature.
I think the reason for that signature is that polling status has three potential states:
- true - we are done, no need to retry further (kms-plugin OK)
- false - poll failed, keep retrying (kms-plugin is not up yet)
- error - got a negative status, no need to poll anymore (ex. missing permissions on the key)
getHealthz doesn't have to conform to what wait.PollImmediate expects, you can always adapt it with a closure.
I understand the purpose of those separate return values, but my point is: getHealthz should never cause wait.PollImmediate to exit early with an error.
Currently you return non-nil error when response code is not 200. I think kube-apiserver should keep retrying in that case, let the plugin flip response code to 200 later on.
Then KMS plugin can say "i'm up, but not ready to serve requests yet" and kube-apiserver won't crashloop because of it.
ACK on the signature, you are correct.
Here's a scenario:
kube-apiserver polls kms-plugin and the plugin responds with 500 because KMS returned "Key Destroyed Error" (or another non-retriable error). Assuming we use the logic you propose, kube-apiserver will continue to retry for the duration of the Poll timeout, which seems unnecessary to me.
So I see a couple of choices here:
- Retry until we get either true or poll timeout expires
- Differentiate between 500 (return an error) and 503 (return false, nil)
WDYT?
Special-casing 500 and causing it to abort retries immediately sounds fine.
I'm not sure what that achieves though.
It'll cause kube-apiserver to crash and be restarted a few seconds later anyway.
Retries will continue indefinitely (indirectly through kube-apiserver restarts) in any case.
@awly PTAL - added a case for 500.
I agree that this may not make a lot of difference. However, I don't want to make any assumptions about how kube-apiserver will be restarted (i.e. its restart policy) or whether or not kube-apiserver should crash if kms-plugin is down.
My objective is to stop polling if KMS-Plugin returned a non-retryable error.
SGTM, as long as there's some way for KMS plugin implementors to discover this expectation from /healthz
/assign @liggitt

/assign @logicalhan

/test pull-kubernetes-dependencies
Force-pushed 39975f8 to 7a0e4d2, then 7a0e4d2 to 16ad233.
Force-pushed b37a5df to 0a01687.
/test pull-kubernetes-e2e-gce-100-performance
this seems more like v5, if at all
we should include the name in the returned error so it is surfaced in healthz details. if we do that, we don't need to separately log the error here. same comment applies to the decrypt call on line 136
Done, included provider's name into the error message.
the h captured here by the closure is incorrect: if you have multiple checks, they will all test the last iterated config
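The bug called out here is the classic shared-variable closure pitfall. A minimal, self-contained reproduction with hypothetical names (`buggyChecks`/`fixedChecks` are not the PR's identifiers):

```go
package main

import "fmt"

// buggyChecks mirrors the review finding: a single variable h is shared
// by every closure, so after the loop finishes each check reports the
// LAST iterated config instead of its own.
func buggyChecks(names []string) []func() string {
	var checks []func() string
	var h string
	for i := 0; i < len(names); i++ {
		h = names[i]
		checks = append(checks, func() string { return h }) // all capture the same h
	}
	return checks
}

// fixedChecks declares a fresh variable per iteration, so each closure
// captures its own copy and tests its own KMS config.
func fixedChecks(names []string) []func() string {
	var checks []func() string
	for i := 0; i < len(names); i++ {
		h := names[i] // per-iteration copy
		checks = append(checks, func() string { return h })
	}
	return checks
}

func main() {
	names := []string{"kms-provider-0", "kms-provider-1"}
	for _, c := range buggyChecks(names) {
		fmt.Println("buggy:", c()) // both print kms-provider-1
	}
	for _, c := range fixedChecks(names) {
		fmt.Println("fixed:", c()) // kms-provider-0, then kms-provider-1
	}
}
```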
add a test with at least two configs and verify the checks exercise the correct KMS
isn't the first KMS in the list used to encrypt, and subsequent ones to decrypt? would we ever encounter a case where a later KMS would only permit decrypting, so the encrypt("ping") check would fail?
Fixed the closure bug.
Added an integration test for the multiple-providers scenario (see kms_transformer_test.go, TestHealthz).
Yes, theoretically, the scenario you described in the third comment is possible: a KMS administrator may remove the "encrypt" IAM privilege from the service account under which the kms-plugin runs once the corresponding provider is moved into second place and used only for decryption.
In practice, this is unlikely, but we should document it.
Also, there is no generic way to test decryption without providing a valid encryption payload.
I think the long term solution here is to move towards gRPC Health Checking Protocol
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
KMS-Plugin developers may have vendor specific way to ascertain the status of the plugin based on the requested operation type. This is something I would like to add after this PR.
I wouldn't expect a log here, or if we do, at v(5)
Force-pushed fec4e6c to e033d95.
/test pull-kubernetes-integration

looks relevant
Force-pushed e033d95 to d395fe1.
/test pull-kubernetes-integration

@liggitt Yes, it was relevant - I was missing a call to clean up the Test API Server - fixed.
liggitt left a comment:
/approve
go ahead and squash commits, and fix a couple import and logging nits while you're at it, then lgtm
do we log on other health check success? wondering when this would be useful
group google.golang.org imports together
Force-pushed 073e4e5 to d9af661, then d9af661 to 05fdbb2.
/lgtm

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: immutableT, liggitt
What type of PR is this?
/kind feature
What this PR does / why we need it:
When kube-apiserver starts before kms-plugin, kube-apiserver is essentially unable to serve secrets until kms-plugin is up. This inability to process secrets just after starting leads to a large number of errors and eventually to a restart of kube-apiserver.
To avoid this scenario, the KMS Provider will install a healthz check on kube-apiserver for the status of kms-plugin.
Therefore, (via a readiness probe) kube-apiserver will be protected from serving requests until kms-plugin becomes available, allowing kube-apiserver to properly handle requests for Secrets and Service Accounts.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: