Add readiness probe for pilot agent. by nmittler · Pull Request #7465 · istio/istio

nmittler · 2018-07-28T00:09:38Z

Overview of the changes:

Adding new flag to the pilot agent "--statusPort". Defaults to 15020.
Adding an HttpGet readiness probe to the pilot-agent
on the new status port. The agent (not Envoy) will
open this port. When receiving an HTTP GET on the readiness path,
the agent will probe Envoy, checking that it has received dynamic
listeners and clusters. If so, it's "ready".
Poking a hole in the sidecar iptables for the new status port
(via the -d option).
Updated both webhook and manual injection paths to generate the
readiness probe. A lot of changes here for the test files.

nmittler · 2018-07-28T00:11:07Z

I'm sending out the PR a little early because I'd like to get feedback on the general approach. We still need to determine whether we should allow opt-in/out, what should be the default, and how to configure it (annotations?). We can use this issue for discussion.

rshriram · 2018-07-29T21:24:13Z

pilot/pkg/kube/inject/inject.go

? Do we need to grab another port? Also, this is probably colliding with the http proxy

We need a port - and we should start a list documenting all ports used by istio. Yes, 15003 is used by the http proxy ( which we still support and is needed in some environments )

Actually - we don't have a default port for http_proxy, and 15003 was used in an old version for pilot v1. We can keep it as 15003 - but should still have an 'assigned ports' list somewhere in the wiki.

Filed #7699 to address the documentation.

rshriram · 2018-07-29T21:25:18Z

pilot/pkg/kube/inject/inject.go

I don't think you need any of this logic. Use an exec based command and run curl localhost: . This will eliminate the need to add another exception port, etc.

Please, no curl and exec. We already use 15000 and 15003 - adding a third one is not a huge deal.

We need it for metrics as well, and it's more efficient and flexible and works in other environments too. ( raw vm with consul and many others also have liveness checks ).

rshriram · 2018-07-29T21:26:46Z

pilot/pkg/kube/inject/mesh.go

exec based instead of httpget.

rshriram

I strongly suggest removing the httpGet and making the health check exec based, on a local port. Otherwise, we have all sorts of security issues popping up.

Second, I think this PR should go into 1.0, given that the lack of a health check is a common pain most users experience. If the tests pass, i see no reason to not include this in 1.0, as it improves stability. @costinm thoughts?

ijsnellf · 2018-07-30T13:51:54Z

pilot/cmd/pilot-agent/health.go

Maybe more than log. Exit would probably be right - readiness would fail anyways.

ijsnellf · 2018-07-30T13:52:55Z

pilot/cmd/pilot-agent/health.go

should this just be /healthz?

Actually, as described in the health checking doc, it should be /healthz/ready.

ijsnellf · 2018-07-30T13:54:59Z

pilot/pkg/kube/inject/inject_test.go

costinm

Only got trough the first half.

Main concern is that adding a 2 second delay on all pods in the mesh, and a readiness probe by default is far too broad and damaging.

I still need to understand how 'ready' is determined - at least having the inbound listeners is required to be really ready, and this should also work for 'health'.

costinm · 2018-07-30T15:00:01Z

install/kubernetes/helm/istio/templates/sidecar-injector-configmap.yaml

Why do you need this ?

A few reasons:

It helps to simplify the template logic

Its consistent with some other things that I'm doing where calling a method is doing things that are non-trivial to do in the template.

It removes the trailing , which is incorrect and may or may not have undesired affects in the script.

costinm · 2018-07-30T15:03:38Z

install/kubernetes/helm/istio/templates/sidecar-injector-configmap.yaml

We also need health probe. And maybe use /readyz for readiness and leave healthz for health ?

health probes are not covered by this PR. As discussed earlier, readiness will be /healthz/ready.

costinm · 2018-07-30T15:05:51Z

install/kubernetes/helm/istio/templates/sidecar-injector-configmap.yaml

Delaying each pod with 2 seconds may be pretty bad for many apps.
Also not sure the global setting evaluated only at install time is going to help ( changing
mesh config after install will have no effect, even with restart of pilot and sidecar injector - it is evaluated at install by helm ).

costinm · 2018-07-30T15:07:57Z

istioctl/cmd/istioctl/kubeinject.go

No extra flags for istioctl. Keep feature parity with sidecar injection.

Users should add an annotation to the pod to customize per-pod settings.

Why no extra flags? Can you point to a feature that is configured in a way other than flags that I can use as a reference?

istioctl is fetching the configmap. We can add those to the configmap if you want - the envoy port is also set there AFAIK.

It is fine to use flags for things users should control - but almost all of those flags will just confuse user.
There are very rare cases when a user will need to touch those - and the annotations are sufficient for that.

Besides - the readiness path is a const in the first place ( and should be - we should only add options when there is a clear use case).

Moving to the configmap seems reasonable ... not against it. Seems like all of this should move there, no?

The only issue I see is that wouldn't allow customizing the injection for a single deployment, right?

costinm · 2018-07-30T15:19:14Z

pilot/cmd/pilot-agent/health.go

Maybe more than log. Exit would probably be right - readiness would fail anyways.

costinm · 2018-07-30T15:21:15Z

pilot/cmd/pilot-agent/main.go

Please don't add a dependency on the entire package ( which AFAIK will link all symbols) just to get a constant.

It is ok to just duplicate the constant value.

costinm · 2018-07-30T15:23:04Z

pilot/pkg/kube/inject/inject.go

I like this - will simplify the template.
We should probably add more - in particular around the config options, so we can have a clean 'default to
the current mesh config ( not the install time values ), with annotation-based overrides.

I'll leave that to follow-on work, as this PR is already huge.

costinm · 2018-07-30T15:26:31Z

And far too risky and too late for 1.0, but this should be a priority for 1.1

costinm · 2018-07-30T15:30:34Z

pilot/pkg/proxy/envoy/readiness.go

This will be linked in agent - we don't want to pull all ads stuff and deps and create confusion. Agent-linked code needs to be separate package ( we also need to untangle it from the other packages ).

costinm · 2018-07-30T15:31:19Z

pilot/pkg/proxy/envoy/readiness.go

I would rely on stats only if possible.

What is your concern with config dump?

mandarjog · 2018-07-31T21:14:49Z

Issues with exec probes:

It is very k8s centric.
Running exec curl from the proxy container will mark the proxy container bad even though the app container needs restarting. Every container needs its own dedicated liveness / readiness checks for this reason.
You may not have the ability to execute code in the environment.
There is a work around by mounting a statically linked curl like binary into the application container. Even that may not always work if you are using alpine or similar.

network probes address all these problems.

codecov · 2018-07-31T22:24:22Z

Codecov Report

Merging #7465 into master will decrease coverage by 1%.
The diff coverage is 68%.

@@           Coverage Diff           @@
##           master   #7465    +/-   ##
=======================================
- Coverage      72%     71%   -<1%     
=======================================
  Files         379     374     -5     
  Lines       33570   33071   -499     
=======================================
- Hits        23930   23285   -645     
- Misses       8568    8773   +205     
+ Partials     1072    1013    -59

Impacted Files	Coverage Δ
istioctl/cmd/istioctl/kubeinject.go	`46% <40%> (+2%)`	⬆️
pilot/pkg/kube/inject/inject.go	`81% <90%> (+1%)`	⬆️
...k/components/apps/local/envoy/agent/pilot/agent.go	`2% <0%> (-63%)`	⬇️
pkg/mcp/creds/create.go	`0% <0%> (-55%)`	⬇️
pilot/pkg/proxy/envoy/watcher.go	`57% <0%> (-31%)`	⬇️
pkg/mcp/creds/watcher.go	`68% <0%> (-19%)`	⬇️
...t/pkg/serviceregistry/external/servicediscovery.go	`79% <0%> (-11%)`	⬇️
mixer/pkg/runtime/handler/env.go	`85% <0%> (-10%)`	⬇️
mixer/pkg/attribute/mutableBag.go	`90% <0%> (-9%)`	⬇️
pilot/pkg/serviceregistry/aggregate/controller.go	`69% <0%> (-7%)`	⬇️
... and 117 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15e4901...842c111. Read the comment docs.

mandarjog · 2018-07-31T22:42:52Z

Update /health to /health/liveness and /health/readiness
Per design doc we actually need /health/<container>/liveness, both can be services by agent or proxy.
For readiness, we do want it to be AND of proxy ready and application ready.
We don't want to start receiving traffic when only proxy is ready but application is not.

rshriram · 2018-08-01T17:23:29Z

Issues with exec probes:

It is very k8s centric.

Yes, but the problem is also k8s centric.

Running exec curl from the proxy container will mark the proxy container bad even though the app container needs restarting. Every container needs its own dedicated liveness / readiness checks for this reason.

I don't see how this is a problem with exec. This problem persists even if we use network probe

You may not have the ability to execute code in the environment.

?? The container is running code. The exec runs in the context of the container. If curl is an issue, simply add the capability to pilot-agent itself.. /usr/bin/pilot-agent healthcheck
and let the command talk over a local unix domain socket to the running pilot agent and see if things are working.

There is a work around by mounting a statically linked curl like binary into the application container.
Even that may not always work if you are using alpine or similar.

Curl is from the proxy container. This PR is about health checking envoy. Not daisy chaining to application's health check, because not all apps use http health check. They have TCP checks as well. They also have exec based checks when they have to check for multiple things.

Net: except for 1, I don't see how network probes solve anything. I would suggest simply making pilot agent listen on a unix domain socket and running the pilot-agent healthcheck command which talks to the UDS to check if the agent is running and healthy. And run this as exec command. Opening a port in Envoy and then establishing a special lsitener to route it back to pilot agent, etc. is making the envoy setup more complicated.

If you re-use code from proxy-status command, yuo could even do a smarter check where the agent can ensure the envoy has had atleast one successful ADS request (by checking envoy stats). In fact, that should be the health check instead of curling admin endpoint.

nmittler · 2018-08-01T17:39:26Z

@mandarjog

Update /health to /health/liveness and /health/readiness

makes sense.

Per design doc we actually need /health/<container>/liveness, both can be services by agent or proxy.

Discussed offline. The doc you're referring to is here

For readiness, we do want it to be AND of proxy ready and application ready.
We don't want to start receiving traffic when only proxy is ready but application is not.

IIUC, this is how readiness works in k8s already (i.e. traffic won't be received by the pod until all containers are "ready"). So adding just a readiness probe to the proxy container should be sufficient for this PR. Application-level readiness is up to the developer.

mandarjog · 2018-08-24T12:38:31Z

@liamawhite good point. In your example, there were already a few inbound clusters present. Just the important inbound cluster was missing.
In order to detect this, we should wait for all inbound clusters corresponding to declared container ports to show up.

nmittler · 2018-08-24T15:52:40Z

@mandarjog @liamawhite In this case, are you saying that only a subset of the clusters were received from an ADS update? Or was everything received, but not all clusters/listeners were warmed?

liamawhite · 2018-08-24T16:05:10Z

All things have been received as far as Envoy/Pilot are concerned. However, Kubernetes hasn't yet added the endpoints so Pilot doesn't know it needs to add the correct inbound clusters.

nmittler · 2018-08-24T16:10:28Z

@liamawhite got it ... makes sense. And good catch! Should be an easy fix.

I'm not sure if I can rely on /clusters because the output looks like it's a subset of /stats. There is currently some discussion regarding eliminating clusters from stats to reduce the memory footprint of Envoy. However, the output of /listeners might be an option here.

@PiotrSikora can you confirm that using /listeners will be reasonably future-proof? I suspect that config_dump might be processing intensive, so I'd rather avoid it if possible.

nmittler · 2018-08-24T19:59:34Z

@mandarjog @liamawhite @PiotrSikora PTAL. I've added the exposed application ports to the readiness check.

liamawhite · 2018-08-25T18:34:28Z

pilot/cmd/pilot-agent/status/ready/listeners.go

Can we keep the logic in istio.io/pilot/pkg? Maybe in the proxy directory?

I had it there originally, but there was a concern about proxy-agent pulling in too many dependencies. I suppose so long as it's in its own package it should be fine. Do you have a preference of location?

andraxylia · 2018-08-25T22:31:15Z

@liamawhite k8s should be equally slow for inbound and outbound clusters, what do you mean? There was also a fix to generate listeners for endpoints not ready #6992

nmittler · 2018-08-26T15:36:45Z

@andraxylia the requested check has already been added to this PR. See https://github.com/istio/istio/pull/7465/files#diff-293528584c1dc78a315e29fb2df2bc1a

liamawhite · 2018-08-27T20:29:16Z

@andraxylia In some cases, it takes Kube a while to add an endpoint for the pod. Until Kube adds that endpoint Pilot doesn't know which inbound clusters to add so the pod can't correctly handle incoming requests.

nmittler · 2018-08-28T18:05:39Z

For those interested, I've shared a design doc for Istio Proxy liveness/readiness.

Overview of the changes: - Adding new flag to the pilot agent "--statusPort". Defaults to 15003. - Adding an HttpGet readiness probe to the pilot-agent on the new status port. The agent (not Envoy) will open this port. When receiving an HTTP GET on the readiness path, the agent will probe Envoy, checking that it has received dynamic listeners and clusters. If so, it's "ready". - Poking a hole in the sidecar iptables for the new status port (via the -d option). - Updated both webhook and manual injection paths to generate the readiness probe. A lot of changes here for the test files.

costinm

/lgtm
/approve

mandarjog · 2018-08-31T18:35:02Z

/lgtm

istio-testing · 2018-08-31T18:35:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: costinm, mandarjog

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~install/OWNERS~~ [costinm,mandarjog]
~~istioctl/OWNERS~~ [costinm,mandarjog]
~~pilot/OWNERS~~ [costinm,mandarjog]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

istio-testing · 2018-08-31T18:47:38Z

@nmittler: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
prow/istio-pilot-multicluster-e2e.sh	`842c111`	link	`/test istio-pilot-multicluster-e2e`

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

nmittler · 2018-08-31T19:43:37Z

/test e2e-mixer-no_auth
/test e2e-dashboard

Overview of the changes: - Adding new flag to the pilot agent "--statusPort". Defaults to 15003. - Adding an HttpGet readiness probe to the pilot-agent on the new status port. The agent (not Envoy) will open this port. When receiving an HTTP GET on the readiness path, the agent will probe Envoy, checking that it has received dynamic listeners and clusters. If so, it's "ready". - Poking a hole in the sidecar iptables for the new status port (via the -d option). - Updated both webhook and manual injection paths to generate the readiness probe. A lot of changes here for the test files.

* Cleaning up inject test files (#7188) There are a lot of files and its confusing which files are used by inject_test vs webhook_test. Splitting out the files into two subdirectories under testdata/ to make it clear. Making copies of files that are shared. * Adding helper functions to simplify injection templates (#7737) * Adding helper functions to simplify injection templates * addressing comments * More sidecar injector template cleanup (#7832) * More sidecar injector template cleanup * fixing * Add readiness probe for pilot agent. (#7465) Overview of the changes: - Adding new flag to the pilot agent "--statusPort". Defaults to 15003. - Adding an HttpGet readiness probe to the pilot-agent on the new status port. The agent (not Envoy) will open this port. When receiving an HTTP GET on the readiness path, the agent will probe Envoy, checking that it has received dynamic listeners and clusters. If so, it's "ready". - Poking a hole in the sidecar iptables for the new status port (via the -d option). - Updated both webhook and manual injection paths to generate the readiness probe. A lot of changes here for the test files.

nmittler requested review from ayj, costinm, ostromart and rshriram July 28, 2018 00:09

googlebot added the cla: yes label Jul 28, 2018

istio-testing requested review from andraxylia, esnible and sdake July 28, 2018 00:09

nmittler removed request for andraxylia, esnible and sdake July 28, 2018 00:11

rshriram reviewed Jul 29, 2018

View reviewed changes

pilot/pkg/kube/inject/mesh.go Outdated

Copy link
Copy Markdown

Member

rshriram Jul 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exec based instead of httpget.

rshriram reviewed Jul 29, 2018

View reviewed changes

ijsnellf reviewed Jul 30, 2018

View reviewed changes

costinm reviewed Jul 30, 2018

View reviewed changes

nmittler requested a review from mandarjog July 30, 2018 16:12

nmittler force-pushed the readiness2 branch from 10434f8 to 2b2a930 Compare July 31, 2018 22:24

This was referenced Aug 3, 2018

Document how to query proxy configuration #6642

Closed

Document list of ports assigned to the sidecar proxy #7699

Closed

nmittler force-pushed the readiness2 branch from e9fea37 to 3273af3 Compare August 24, 2018 22:57

liamawhite reviewed Aug 25, 2018

View reviewed changes

nmittler force-pushed the readiness2 branch from 2f5e564 to 9a981aa Compare August 27, 2018 17:54

Stono mentioned this pull request Aug 28, 2018

Signals to detect stale proxies #8320

Closed

nmittler force-pushed the readiness2 branch from 667f4b6 to 842c111 Compare August 31, 2018 17:42

costinm approved these changes Aug 31, 2018

View reviewed changes

istio-testing assigned costinm Aug 31, 2018

istio-testing added lgtm labels Aug 31, 2018

istio-testing assigned mandarjog Aug 31, 2018

PiotrSikora approved these changes Aug 31, 2018

View reviewed changes

sebastienvas merged commit 00abab7 into istio:master Aug 31, 2018

nmittler mentioned this pull request Sep 1, 2018

readinessProbe for Envoy #6145

Closed

nmittler mentioned this pull request Sep 19, 2018

Support startup dependencies between containers on the same Pod kubernetes/kubernetes#65502

Closed

Conversation

nmittler commented Jul 28, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nmittler commented Jul 28, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rshriram left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

costinm left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nmittler Aug 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nmittler Aug 21, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

costinm commented Jul 30, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nmittler commented Jul 28, 2018 •

edited

Loading

nmittler Aug 7, 2018 •

edited

Loading

nmittler Aug 21, 2018 •

edited

Loading

codecov bot commented Jul 31, 2018 •

edited

Loading

rshriram commented Aug 1, 2018 •

edited

Loading

nmittler commented Aug 1, 2018 •

edited

Loading

nmittler commented Aug 24, 2018 •

edited

Loading