
Add healthz check for KMS Providers on kube-apiserver.#78540

Merged
k8s-ci-robot merged 1 commit intokubernetes:masterfrom
immutableT:kms-plugin-healthz-check
Jul 3, 2019

Conversation

@immutableT
Contributor

@immutableT immutableT commented May 30, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
When kube-apiserver starts before kms-plugin, kube-apiserver is essentially unable to serve secrets until kms-plugin is up. This inability to process secrets just after startup leads to a large number of errors and eventually to a restart of kube-apiserver.
To avoid this scenario, KMS Providers will install a healthz check on kube-apiserver for the status of kms-plugin.
Therefore (via a readiness probe), kube-apiserver will be protected from serving requests until kms-plugin becomes available, allowing kube-apiserver to properly handle requests for Secrets and Service Accounts.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

KMS Providers will install a healthz check for the status of the kms-plugin in kube-apiserver's encryption config. 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 30, 2019
@k8s-ci-robot k8s-ci-robot requested review from wojtek-t and yujuhong May 30, 2019 18:44
@immutableT
Contributor Author

/assign @awly

@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@immutableT
Contributor Author

/test pull-kubernetes-bazel-build

@immutableT
Contributor Author

/test pull-kubernetes-bazel-test

Contributor

Could healthz probes be passed over the plugin Endpoint above?

Contributor Author

I assume that you meant to create a dedicated type for healthz probe. If I understood you correctly, the latest commit accomplishes this.

Contributor

I meant adding a healthz RPC to the KMS plugin itself on the gRPC service.
So you could send those healthz requests over KMSConfiguration.Endpoint (the unix socket) instead of a separate new HTTP endpoint

Contributor Author

I think this is a good idea, but my survey of the currently implemented plugins shows that they do not have this functionality yet.
This is something we could add later.

Contributor

Do currently implemented plugins already expose an HTTP endpoint for healthz?
If not, you could put in a healthz grpc endpoint and just skip polling health if it returns "unimplemented" (or whatever error grpc will send).

Contributor Author

Yes, the Google CloudKMS and AWS CloudHSM plugins have already implemented healthz. Hence, these plugins could take advantage of this feature right away.

Contributor

Ack, that makes sense then

Contributor

remove empty line

Contributor

So this only happens on kube-apiserver startup, right?
What if kms plugin becomes unhealthy later on? I assume it will just show up as errors from all Secrets operations?

Contributor Author

Yes, this is only to ensure that kube-apiserver comes up after kms-plugin is reporting healthz OK.
KMS-Plugin may have its own liveness probe.
And yes, kube-apiserver will log and increment failure metrics when the plugin is down.

Contributor

Could this only return the error? If arg parsing fails or if healthz polling fails, return a non-nil error

Contributor Author

Maybe, but I want to differentiate between the scenarios:

  1. Healthz could not be checked due to configuration errors - false, error
  2. Healthz returned a non-200 status - false, nil
  3. Healthz returned 200 - true, nil

I think this approach fits the signature of wait.Poll.

Contributor

Hmm, but shouldn't you retry even if healthz endpoint is unreachable?
For example when KMS plugin starts later than kube-apiserver.
If yes, then there's no difference between non-200 status and any request errors

Contributor Author

Let me step back.
wait.PollImmediate takes a condition function of the following signature:
type ConditionFunc func() (done bool, err error)
Therefore, getHealthz must match that signature.

I think the reason for that signature is that polling status has three potential states:

  1. true - we are done, no need to retry further (kms-plugin OK)
  2. false - poll failed, keep retrying (kms-plugin is not up yet)
  3. error - got negative status, no need to poll anymore (e.g. missing permissions on the key)

Contributor

getHealthz doesn't have to conform to what wait.PollImmediate expects, you can always adapt it with a closure.

I understand the purpose of those separate return values, but my point is: getHealthz should never cause wait.PollImmediate to exit early with an error.
Currently you return non-nil error when response code is not 200. I think kube-apiserver should keep retrying in that case, let the plugin flip response code to 200 later on.
Then KMS plugin can say "i'm up, but not ready to serve requests yet" and kube-apiserver won't crashloop because of it.

Contributor Author

ACK on the signature, you are correct.

Here's a scenario:
kube-apiserver polls kms-plugin and the plugin responds with 500 because KMS returned "Key Destroyed Error" (or another non-retriable error). Assuming we use the logic that you propose, kube-apiserver will continue to retry for the duration of the poll timeout, which seems unnecessary to me.

So I see a couple of choices here:

  1. Retry until we either get true or the poll timeout expires
  2. Differentiate between 500 (return an error) and 503 (return false, nil)

WDYT?
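Option 2 can be sketched as a small status-code classifier that maps the plugin's healthz response onto the ConditionFunc return values; `classifyHealthz` is a hypothetical name, not the PR's actual function:

```go
package main

import (
	"fmt"
	"net/http"
)

// classifyHealthz maps an HTTP status code from the plugin's healthz
// endpoint onto ConditionFunc-style (done, err) return values.
func classifyHealthz(statusCode int) (done bool, err error) {
	switch statusCode {
	case http.StatusOK:
		return true, nil // 200: healthy, stop polling
	case http.StatusInternalServerError:
		// 500: treated as non-retriable (e.g. "Key Destroyed"), abort early
		return false, fmt.Errorf("kms-plugin healthz returned %d, not retrying", statusCode)
	default:
		return false, nil // 503 and other codes: not ready yet, keep polling
	}
}

func main() {
	for _, code := range []int{200, 500, 503} {
		done, err := classifyHealthz(code)
		fmt.Println(code, done, err)
	}
}
```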

Contributor

Special-casing 500 and causing it to abort retries immediately sounds fine.
I'm not sure what that achieves though.
It'll cause kube-apiserver to crash and be restarted a few seconds later anyway.

Retries will continue indefinitely (indirectly through kube-apiserver restarts) in any case.

Contributor Author

@awly PTAL - added a case for 500.

I agree that this may not make a lot of difference. However, I don't want to make any assumptions about how kube-apiserver will be restarted (i.e. its restart policy) or whether or not kube-apiserver should crash if kms-plugin is down.
My objective is to stop polling if KMS-Plugin returned a non-retryable error.

Contributor

SGTM, as long as there's some way for KMS plugin implementors to discover this expectation from /healthz

Contributor

c := &http.Client{...

@immutableT
Contributor Author

/assign @liggitt

@fedebongio
Contributor

/assign @logicalhan

@immutableT
Contributor Author

/test pull-kubernetes-dependencies

@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from 39975f8 to 7a0e4d2 Compare June 1, 2019 17:02
@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. area/dependency Issues or PRs related to dependency changes labels Jun 1, 2019
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from 7a0e4d2 to 16ad233 Compare June 1, 2019 17:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jun 1, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 21, 2019
@immutableT immutableT changed the title Allow KMS Provider to install a healthz check on kube-apiserver for the status of kms-plugin. Add healthz check for KMS Providers on kube-apiserver. Jun 21, 2019
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch 3 times, most recently from b37a5df to 0a01687 Compare June 21, 2019 23:55
@immutableT
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

Member

this seems more like v5, if at all

Contributor Author

Removed.

Member

we should include the name in the returned error so it is surfaced in healthz details. if we do that, we don't need to separately log the error here. same comment applies to the decrypt call on line 136

Contributor Author

@immutableT immutableT Jul 1, 2019

Done, included provider's name into the error message.

Member

the h captured here by the closure is incorrect; it means that if you have multiple checks, they will all test the last iterated config
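The bug being pointed at is the classic Go range-variable capture. A self-contained illustration with hypothetical names (`kmsProbe`, `buildChecks`), not the PR's actual types:

```go
package main

import "fmt"

// kmsProbe stands in for one provider's healthz config captured by the
// closures in the PR (illustrative name).
type kmsProbe struct{ name string }

func (p kmsProbe) check() string { return "checked " + p.name }

// buildChecks demonstrates the capture bug and its fix: before Go 1.22, the
// range variable h is a single variable reused across iterations, so
// closures capturing it directly would all probe the last config. Rebinding
// h inside the loop body gives each closure its own copy.
func buildChecks(probes []kmsProbe) []func() string {
	var checks []func() string
	for _, h := range probes {
		h := h // without this (pre Go 1.22), every check would see the last probe
		checks = append(checks, func() string { return h.check() })
	}
	return checks
}

func main() {
	checks := buildChecks([]kmsProbe{{"kms-provider-1"}, {"kms-provider-2"}})
	for _, c := range checks {
		fmt.Println(c())
	}
}
```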

Member

add a test with at least two configs and verify the checks exercise the correct KMS

Member

isn't the first KMS in the list used to encrypt, and subsequent ones to decrypt? would we ever encounter a case where a later KMS would only permit decrypting, so the encrypt("ping") check would fail?

Contributor Author

@immutableT immutableT Jul 1, 2019

Fixed the closure bug.
Added an integration test for the multiple-providers scenario (see kms_transformer_test.go, TestHealthz).

Yes, theoretically, the scenario you described in the third comment is possible. A KMS administrator may remove the "encrypt" IAM privilege from the service account under which the kms-plugin runs once the corresponding provider is moved into second place - used only for decryption.
In practice, this is unlikely, but we should document it.
Also, there is no generic way to test decryption without providing a valid encrypted payload.
I think the long-term solution here is to move towards the gRPC Health Checking Protocol:
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
KMS-plugin developers may have vendor-specific ways to ascertain the status of the plugin based on the requested operation type. This is something I would like to add after this PR.

Member

I wouldn't expect a log here, or if we do, at v(5)

Contributor Author

Removed.

Member

no logging here?

Contributor Author

Removed.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 1, 2019
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from fec4e6c to e033d95 Compare July 1, 2019 22:22
@immutableT
Contributor Author

/test pull-kubernetes-integration

@liggitt
Member

liggitt commented Jul 2, 2019

looks relevant

E0702 00:03:19.695781  107386 grpc_service.go:71] failed to create connection to unix socket: @kms-provider-2.sock, error: dial unix @kms-provider-2.sock: connect: connection refused
W0702 00:03:19.695822  107386 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {@kms-provider-2.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix @kms-provider-2.sock: connect: connection refused". Reconnecting...
panic: Conflicting storage tracking

@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from e033d95 to d395fe1 Compare July 2, 2019 05:35
@immutableT
Contributor Author

/test pull-kubernetes-integration
/test pull-kubernetes-bazel-build

@immutableT
Contributor Author

@liggitt Yes, it was relevant - I was missing a call to clean up the test API server - fixed.

Member

@liggitt liggitt left a comment

/approve

go ahead and squash commits, and fix a couple import and logging nits while you're at it, then lgtm

Member

do we log on other health check success? wondering when this would be useful

Member

group imports together

Member

group google.golang.org imports together

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 3, 2019
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from 073e4e5 to d9af661 Compare July 3, 2019 15:33
@immutableT immutableT force-pushed the kms-plugin-healthz-check branch from d9af661 to 05fdbb2 Compare July 3, 2019 17:03
@liggitt
Member

liggitt commented Jul 3, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: immutableT, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
