Add readiness probe for pilot agent.#7465
Conversation
|
I'm sending out the PR a little early because I'd like to get feedback on the general approach. We still need to determine whether we should allow opt-in/out, what should be the default, and how to configure it (annotations?). We can use this issue for discussion. |
pilot/pkg/kube/inject/inject.go
Outdated
There was a problem hiding this comment.
? Do we need to grab another port? Also, this is probably colliding with the http proxy
There was a problem hiding this comment.
We need a port - and we should start a list documenting all ports used by istio. Yes, 15003 is used by the http proxy ( which we still support and is needed in some environments )
There was a problem hiding this comment.
Actually - we don't have a default port for http_proxy, and 15003 was used in an old version for pilot v1. We can keep it as 15003 - but should still have an 'assigned ports' list somewhere in the wiki.
There was a problem hiding this comment.
Filed #7699 to address the documentation.
pilot/pkg/kube/inject/inject.go
Outdated
There was a problem hiding this comment.
I don't think you need any of this logic. Use an exec based command and run curl localhost: . This will eliminate the need to add another exception port, etc.
There was a problem hiding this comment.
Please, no curl and exec. We already use 15000 and 15003 - adding a third one is not a huge deal.
We need it for metrics as well, and it's more efficient and flexible and works in other environments too. ( raw vm with consul and many others also have liveness checks ).
pilot/pkg/kube/inject/mesh.go
Outdated
There was a problem hiding this comment.
exec based instead of httpget.
rshriram
left a comment
There was a problem hiding this comment.
I strongly suggest removing the httpGet and making the health check exec based, on a local port. Otherwise, we have all sorts of security issues popping up.
Second, I think this PR should go into 1.0, given that the lack of a health check is a common pain most users experience. If the tests pass, i see no reason to not include this in 1.0, as it improves stability. @costinm thoughts?
pilot/cmd/pilot-agent/health.go
Outdated
There was a problem hiding this comment.
Maybe more than log. Exit would probably be right - readiness would fail anyways.
pilot/cmd/pilot-agent/health.go
Outdated
There was a problem hiding this comment.
should this just be /healthz?
There was a problem hiding this comment.
Actually, as described in the health checking doc, it should be /healthz/ready.
pilot/pkg/kube/inject/inject_test.go
Outdated
costinm
left a comment
There was a problem hiding this comment.
Only got trough the first half.
Main concern is that adding a 2 second delay on all pods in the mesh, and a readiness probe by default is far too broad and damaging.
I still need to understand how 'ready' is determined - at least having the inbound listeners is required to be really ready, and this should also work for 'health'.
There was a problem hiding this comment.
A few reasons:
- It helps to simplify the template logic
- Its consistent with some other things that I'm doing where calling a method is doing things that are non-trivial to do in the template.
- It removes the trailing
,which is incorrect and may or may not have undesired affects in the script.
There was a problem hiding this comment.
We also need health probe. And maybe use /readyz for readiness and leave healthz for health ?
There was a problem hiding this comment.
health probes are not covered by this PR. As discussed earlier, readiness will be /healthz/ready.
There was a problem hiding this comment.
Delaying each pod with 2 seconds may be pretty bad for many apps.
Also not sure the global setting evaluated only at install time is going to help ( changing
mesh config after install will have no effect, even with restart of pilot and sidecar injector - it is evaluated at install by helm ).
istioctl/cmd/istioctl/kubeinject.go
Outdated
There was a problem hiding this comment.
No extra flags for istioctl. Keep feature parity with sidecar injection.
Users should add an annotation to the pod to customize per-pod settings.
There was a problem hiding this comment.
Why no extra flags? Can you point to a feature that is configured in a way other than flags that I can use as a reference?
There was a problem hiding this comment.
istioctl is fetching the configmap. We can add those to the configmap if you want - the envoy port is also set there AFAIK.
It is fine to use flags for things users should control - but almost all of those flags will just confuse user.
There are very rare cases when a user will need to touch those - and the annotations are sufficient for that.
There was a problem hiding this comment.
Besides - the readiness path is a const in the first place ( and should be - we should only add options when there is a clear use case).
There was a problem hiding this comment.
Moving to the configmap seems reasonable ... not against it. Seems like all of this should move there, no?
The only issue I see is that wouldn't allow customizing the injection for a single deployment, right?
pilot/cmd/pilot-agent/health.go
Outdated
There was a problem hiding this comment.
Maybe more than log. Exit would probably be right - readiness would fail anyways.
pilot/cmd/pilot-agent/main.go
Outdated
There was a problem hiding this comment.
Please don't add a dependency on the entire package ( which AFAIK will link all symbols) just to get a constant.
It is ok to just duplicate the constant value.
pilot/pkg/kube/inject/inject.go
Outdated
There was a problem hiding this comment.
I like this - will simplify the template.
We should probably add more - in particular around the config options, so we can have a clean 'default to
the current mesh config ( not the install time values ), with annotation-based overrides.
There was a problem hiding this comment.
I'll leave that to follow-on work, as this PR is already huge.
|
And far too risky and too late for 1.0, but this should be a priority for 1.1 |
pilot/pkg/proxy/envoy/readiness.go
Outdated
There was a problem hiding this comment.
This will be linked in agent - we don't want to pull all ads stuff and deps and create confusion. Agent-linked code needs to be separate package ( we also need to untangle it from the other packages ).
pilot/pkg/proxy/envoy/readiness.go
Outdated
There was a problem hiding this comment.
I would rely on stats only if possible.
There was a problem hiding this comment.
What is your concern with config dump?
|
Issues with exec probes:
network probes address all these problems. |
Codecov Report
@@ Coverage Diff @@
## master #7465 +/- ##
=======================================
- Coverage 72% 71% -<1%
=======================================
Files 379 374 -5
Lines 33570 33071 -499
=======================================
- Hits 23930 23285 -645
- Misses 8568 8773 +205
+ Partials 1072 1013 -59
Continue to review full report at Codecov.
|
|
Yes, but the problem is also k8s centric.
I don't see how this is a problem with exec. This problem persists even if we use network probe
?? The container is running code. The exec runs in the context of the container. If curl is an issue, simply add the capability to pilot-agent itself.. /usr/bin/pilot-agent healthcheck
Curl is from the proxy container. This PR is about health checking envoy. Not daisy chaining to application's health check, because not all apps use http health check. They have TCP checks as well. They also have exec based checks when they have to check for multiple things. Net: except for 1, I don't see how network probes solve anything. I would suggest simply making pilot agent listen on a unix domain socket and running the pilot-agent healthcheck command which talks to the UDS to check if the agent is running and healthy. And run this as exec command. Opening a port in Envoy and then establishing a special lsitener to route it back to pilot agent, etc. is making the envoy setup more complicated. If you re-use code from proxy-status command, yuo could even do a smarter check where the agent can ensure the envoy has had atleast one successful ADS request (by checking envoy stats). In fact, that should be the health check instead of curling admin endpoint. |
makes sense.
Discussed offline. The doc you're referring to is here
IIUC, this is how readiness works in k8s already (i.e. traffic won't be received by the pod until all containers are "ready"). So adding just a readiness probe to the proxy container should be sufficient for this PR. Application-level readiness is up to the developer. |
|
@liamawhite good point. In your example, there were already a few inbound clusters present. Just the important inbound cluster was missing. |
|
@mandarjog @liamawhite In this case, are you saying that only a subset of the clusters were received from an ADS update? Or was everything received, but not all clusters/listeners were warmed? |
|
All things have been received as far as Envoy/Pilot are concerned. However, Kubernetes hasn't yet added the endpoints so Pilot doesn't know it needs to add the correct inbound clusters. |
|
@liamawhite got it ... makes sense. And good catch! Should be an easy fix. I'm not sure if I can rely on @PiotrSikora can you confirm that using |
|
@mandarjog @liamawhite @PiotrSikora PTAL. I've added the exposed application ports to the readiness check. |
There was a problem hiding this comment.
Can we keep the logic in istio.io/pilot/pkg? Maybe in the proxy directory?
There was a problem hiding this comment.
I had it there originally, but there was a concern about proxy-agent pulling in too many dependencies. I suppose so long as it's in its own package it should be fine. Do you have a preference of location?
|
@liamawhite k8s should be equally slow for inbound and outbound clusters, what do you mean? There was also a fix to generate listeners for endpoints not ready #6992 |
|
@andraxylia the requested check has already been added to this PR. See https://github.com/istio/istio/pull/7465/files#diff-293528584c1dc78a315e29fb2df2bc1a |
|
@andraxylia In some cases, it takes Kube a while to add an endpoint for the pod. Until Kube adds that endpoint Pilot doesn't know which inbound clusters to add so the pod can't correctly handle incoming requests. |
|
For those interested, I've shared a design doc for Istio Proxy liveness/readiness. |
Overview of the changes: - Adding new flag to the pilot agent "--statusPort". Defaults to 15003. - Adding an HttpGet readiness probe to the pilot-agent on the new status port. The agent (not Envoy) will open this port. When receiving an HTTP GET on the readiness path, the agent will probe Envoy, checking that it has received dynamic listeners and clusters. If so, it's "ready". - Poking a hole in the sidecar iptables for the new status port (via the -d option). - Updated both webhook and manual injection paths to generate the readiness probe. A lot of changes here for the test files.
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: costinm, mandarjog The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@nmittler: The following test failed, say
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/test e2e-mixer-no_auth |
Overview of the changes: - Adding new flag to the pilot agent "--statusPort". Defaults to 15003. - Adding an HttpGet readiness probe to the pilot-agent on the new status port. The agent (not Envoy) will open this port. When receiving an HTTP GET on the readiness path, the agent will probe Envoy, checking that it has received dynamic listeners and clusters. If so, it's "ready". - Poking a hole in the sidecar iptables for the new status port (via the -d option). - Updated both webhook and manual injection paths to generate the readiness probe. A lot of changes here for the test files.
* Cleaning up inject test files (#7188) There are a lot of files and its confusing which files are used by inject_test vs webhook_test. Splitting out the files into two subdirectories under testdata/ to make it clear. Making copies of files that are shared. * Adding helper functions to simplify injection templates (#7737) * Adding helper functions to simplify injection templates * addressing comments * More sidecar injector template cleanup (#7832) * More sidecar injector template cleanup * fixing * Add readiness probe for pilot agent. (#7465) Overview of the changes: - Adding new flag to the pilot agent "--statusPort". Defaults to 15003. - Adding an HttpGet readiness probe to the pilot-agent on the new status port. The agent (not Envoy) will open this port. When receiving an HTTP GET on the readiness path, the agent will probe Envoy, checking that it has received dynamic listeners and clusters. If so, it's "ready". - Poking a hole in the sidecar iptables for the new status port (via the -d option). - Updated both webhook and manual injection paths to generate the readiness probe. A lot of changes here for the test files.
Overview of the changes:
Adding new flag to the pilot agent "--statusPort". Defaults to 15020.
Adding an HttpGet readiness probe to the pilot-agent
on the new status port. The agent (not Envoy) will
open this port. When receiving an HTTP GET on the readiness path,
the agent will probe Envoy, checking that it has received dynamic
listeners and clusters. If so, it's "ready".
Poking a hole in the sidecar iptables for the new status port
(via the -d option).
Updated both webhook and manual injection paths to generate the
readiness probe. A lot of changes here for the test files.