
Add healthz check for KMS Providers on kube-apiserver.#78540

Merged
k8s-ci-robot merged 1 commit intokubernetes:masterfrom
immutableT:kms-plugin-healthz-check
Jul 3, 2019

Conversation

@immutableT
Contributor

@immutableT immutableT commented May 30, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
When kube-apiserver starts before kms-plugin, kube-apiserver is essentially unable to serve secrets until kms-plugin is up. This inability to process secrets just after startup leads to a large number of errors and eventually to a restart of kube-apiserver.
To avoid this scenario, KMS Providers will install a healthz check on kube-apiserver for the status of kms-plugin.
Therefore (via a readiness probe), kube-apiserver will be protected from serving requests until kms-plugin becomes available, allowing kube-apiserver to properly handle requests for Secrets and Service Accounts.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

KMS Providers will install a healthz check for the status of the kms-plugin in kube-apiserver's encryption config. 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 30, 2019
@k8s-ci-robot k8s-ci-robot requested review from wojtek-t and yujuhong May 30, 2019 18:44
@immutableT
Contributor Author

/assign @awly

@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@immutableT
Contributor Author

/test pull-kubernetes-bazel-build

@immutableT
Contributor Author

/test pull-kubernetes-bazel-test

Contributor

Could healthz probes be passed over the plugin Endpoint above?

Contributor Author

I assume that you meant to create a dedicated type for healthz probe. If I understood you correctly, the latest commit accomplishes this.

Contributor

I meant adding a healthz RPC to the KMS plugin itself on the gRPC service.
So you could send those healthz requests over KMSConfiguration.Endpoint (the unix socket) instead of a separate new HTTP endpoint

Contributor Author

I think this is a good idea, but my survey of the currently implemented plugins shows that they do not have this functionality yet.
This is something we could add later.

Contributor

Do currently implemented plugins already expose an HTTP endpoint for healthz?
If not, you could put in a healthz grpc endpoint and just skip polling health if it returns "unimplemented" (or whatever error grpc will send).

Contributor Author

Yes, the Google CloudKMS and AWS CloudHSM plugins have already implemented healthz. Hence, these plugins could take advantage of this feature right away.

Contributor

Ack, that makes sense then

Contributor

remove empty line

Contributor

So this only happens on kube-apiserver startup, right?
What if kms plugin becomes unhealthy later on? I assume it will just show up as errors from all Secrets operations?

Contributor Author

Yes, this is only to ensure that kube-apiserver comes up after kms-plugin is reporting healthz OK.
KMS-Plugin may have its own liveness probe.
And yes, kube-apiserver will log and increment failure metrics when the plugin is down.

Contributor

Could this only return the error? If arg parsing fails or if healthz polling fails, return a non-nil error

Contributor Author

Maybe, but I want to differentiate between the scenarios:

  1. Healthz could not be checked due to configuration errors - false, error
  2. Healthz returned a non-200 status - false, nil
  3. Healthz returned 200 - true, nil

I think this approach fits the signature of wait.Poll.

Contributor

Hmm, but shouldn't you retry even if healthz endpoint is unreachable?
For example when KMS plugin starts later than kube-apiserver.
If yes, then there's no difference between non-200 status and any request errors

Contributor Author

Let me step back.
wait.PollImmediate takes a condition function of the following signature:
type ConditionFunc func() (done bool, err error)
Therefore, getHealthz must match that signature.

I think the reason for that signature is that polling status has three potential states:

  1. true - we are done, no need to retry further (kms-plugin OK)
  2. false - poll failed, keep retrying (kms-plugin is not up yet)
  3. error - got negative status, no need to poll anymore (e.g. missing permissions on the key)

Contributor

getHealthz doesn't have to conform to what wait.PollImmediate expects, you can always adapt it with a closure.

I understand the purpose of those separate return values, but my point is: getHealthz should never cause wait.PollImmediate to exit early with an error.
Currently you return non-nil error when response code is not 200. I think kube-apiserver should keep retrying in that case, let the plugin flip response code to 200 later on.
Then KMS plugin can say "i'm up, but not ready to serve requests yet" and kube-apiserver won't crashloop because of it.

Contributor Author

ACK on the signature, you are correct.

Here's a scenario:
kube-apiserver polls kms-plugin and the plugin responds with 500 because KMS returned "Key Destroyed Error" (or another non-retriable error). Assuming we use the logic that you propose, kube-apiserver will continue to retry for the duration of the poll timeout, which seems unnecessary to me.

So I see a couple of choices here:

  1. Retry until we either get true or the poll timeout expires
  2. Differentiate between 500 (return an error) and 503 (return false, nil)

WDYT?
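Option 2 can be sketched as a small status-code classifier that maps the plugin's healthz response onto the ConditionFunc return values; `classifyHealthz` is a hypothetical name, not the PR's actual function:

```go
package main

import (
	"fmt"
	"net/http"
)

// classifyHealthz maps an HTTP status code from the plugin's healthz
// endpoint onto ConditionFunc-style (done, err) return values.
func classifyHealthz(statusCode int) (done bool, err error) {
	switch statusCode {
	case http.StatusOK:
		return true, nil // 200: healthy, stop polling
	case http.StatusInternalServerError:
		// 500: treated as non-retriable (e.g. "Key Destroyed"), abort early
		return false, fmt.Errorf("kms-plugin healthz returned %d, not retrying", statusCode)
	default:
		return false, nil // 503 and other codes: not ready yet, keep polling
	}
}

func main() {
	for _, code := range []int{200, 500, 503} {
		done, err := classifyHealthz(code)
		fmt.Println(code, done, err)
	}
}
```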

Contributor

Special-casing 500 and causing it to abort retries immediately sounds fine.
I'm not sure what that achieves though.
It'll cause kube-apiserver to crash and be restarted a few seconds later anyway.

Retries will continue indefinitely (indirectly through kube-apiserver restarts) in any case.

Contributor Author

@awly PTAL - added a case for 500.

I agree that this may not make a lot of difference. However, I don't want to make any assumptions about how kube-apiserver will be restarted (i.e. its restart policy) or whether or not kube-apiserver should crash if kms-plugin is down.
My objective is to stop polling if KMS-Plugin returned a non-retryable error.

Contributor

SGTM, as long as there's some way for KMS plugin implementors to discover this expectation from /healthz

Contributor

c := &http.Client{...

@immutableT
Contributor Author

/assign @liggitt

@fedebongio
Contributor

/assign @logicalhan

@immutableT
Contributor Author

/test pull-kubernetes-dependencies

@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from 39975f8 to 7a0e4d2 Compare June 1, 2019 17:02
@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. area/dependency Issues or PRs related to dependency changes labels Jun 1, 2019
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from 7a0e4d2 to 16ad233 Compare June 1, 2019 17:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jun 1, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 21, 2019
@immutableT immutableT changed the title Allow KMS Provider to install a healthz check on kube-apiserver for the status of kms-plugin. Add healthz check for KMS Providers on kube-apiserver. Jun 21, 2019
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch 3 times, most recently from b37a5df to 0a01687 Compare June 21, 2019 23:55
@immutableT
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

Member

this seems more like v5, if at all

Contributor Author

Removed.

Member

we should include the name in the returned error so it is surfaced in healthz details. if we do that, we don't need to separately log the error here. same comment applies to the decrypt call on line 136

Contributor Author

@immutableT immutableT Jul 1, 2019

Done, included provider's name into the error message.

Member

the h captured here by the closure is incorrect; it means that if you have multiple checks, they will all test the last iterated config
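The bug being pointed at is the classic Go range-variable capture. A self-contained illustration with hypothetical names (`kmsProbe`, `buildChecks`), not the PR's actual types:

```go
package main

import "fmt"

// kmsProbe stands in for one provider's healthz config captured by the
// closures in the PR (illustrative name).
type kmsProbe struct{ name string }

func (p kmsProbe) check() string { return "checked " + p.name }

// buildChecks demonstrates the capture bug and its fix: before Go 1.22, the
// range variable h is a single variable reused across iterations, so
// closures capturing it directly would all probe the last config. Rebinding
// h inside the loop body gives each closure its own copy.
func buildChecks(probes []kmsProbe) []func() string {
	var checks []func() string
	for _, h := range probes {
		h := h // without this (pre Go 1.22), every check would see the last probe
		checks = append(checks, func() string { return h.check() })
	}
	return checks
}

func main() {
	checks := buildChecks([]kmsProbe{{"kms-provider-1"}, {"kms-provider-2"}})
	for _, c := range checks {
		fmt.Println(c())
	}
}
```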

Member

add a test with at least two configs and verify the checks exercise the correct KMS

Member

isn't the first KMS in the list used to encrypt, and subsequent ones to decrypt? would we ever encounter a case where a later KMS would only permit decrypting, so the encrypt("ping") check would fail?

Contributor Author

@immutableT immutableT Jul 1, 2019

Fixed the closure bug.
Added an integration test for the multiple-providers scenario (see kms_transformer_test.go, TestHealthz).

Yes, theoretically, the scenario you described in the third comment is possible. A KMS administrator may remove the "encrypt" IAM privilege from the service account under which the kms-plugin runs once the corresponding provider is moved into second place - used only for decryption.
In practice, this is unlikely, but we should document it.
Also, there is no generic way to test decryption without providing a valid encrypted payload.
I think the long-term solution here is to move towards the gRPC Health Checking Protocol:
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
KMS-plugin developers may have vendor-specific ways to ascertain the status of the plugin based on the requested operation type. This is something I would like to add after this PR.

Member

I wouldn't expect a log here, or if we do, at v(5)

Contributor Author

Removed.

Member

no logging here?

Contributor Author

Removed.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 1, 2019
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from fec4e6c to e033d95 Compare July 1, 2019 22:22
@immutableT
Contributor Author

/test pull-kubernetes-integration

@liggitt
Member

liggitt commented Jul 2, 2019

looks relevant

E0702 00:03:19.695781  107386 grpc_service.go:71] failed to create connection to unix socket: @kms-provider-2.sock, error: dial unix @kms-provider-2.sock: connect: connection refused
W0702 00:03:19.695822  107386 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {@kms-provider-2.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix @kms-provider-2.sock: connect: connection refused". Reconnecting...
panic: Conflicting storage tracking

@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from e033d95 to d395fe1 Compare July 2, 2019 05:35
@immutableT
Contributor Author

/test pull-kubernetes-integration
/test pull-kubernetes-bazel-build

@immutableT
Contributor Author

@liggitt Yes, it was relevant - I was missing a call to clean up the test API server - fixed.

Member

@liggitt liggitt left a comment

/approve

go ahead and squash commits, and fix a couple import and logging nits while you're at it, then lgtm

Member

do we log on other health check success? wondering when this would be useful

Member

group imports together

Member

group google.golang.org imports together

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 3, 2019
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from 073e4e5 to d9af661 Compare July 3, 2019 15:33
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from d9af661 to 05fdbb2 Compare July 3, 2019 17:03
@liggitt
Member

liggitt commented Jul 3, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: immutableT, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
