Skip to content

Add readiness probe for pilot agent.#7465

Merged
sebastienvas merged 1 commit intoistio:masterfrom
nmittler:readiness2
Aug 31, 2018
Merged

Add readiness probe for pilot agent.#7465
sebastienvas merged 1 commit intoistio:masterfrom
nmittler:readiness2

Conversation

@nmittler
Copy link
Copy Markdown
Contributor

@nmittler nmittler commented Jul 28, 2018

Overview of the changes:

  • Adding new flag to the pilot agent "--statusPort". Defaults to 15020.

  • Adding an HttpGet readiness probe to the pilot-agent
    on the new status port. The agent (not Envoy) will
    open this port. When receiving an HTTP GET on the readiness path,
    the agent will probe Envoy, checking that it has received dynamic
    listeners and clusters. If so, it's "ready".

  • Poking a hole in the sidecar iptables for the new status port
    (via the -d option).

  • Updated both webhook and manual injection paths to generate the
    readiness probe. A lot of changes here for the test files.

@nmittler
Copy link
Copy Markdown
Contributor Author

I'm sending out the PR a little early because I'd like to get feedback on the general approach. We still need to determine whether we should allow opt-in/out, what should be the default, and how to configure it (annotations?). We can use this issue for discussion.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? Do we need to grab another port? Also, this is probably colliding with the http proxy

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a port - and we should start a list documenting all ports used by istio. Yes, 15003 is used by the http proxy ( which we still support and is needed in some environments )

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually - we don't have a default port for http_proxy, and 15003 was used in an old version for pilot v1. We can keep it as 15003 - but should still have an 'assigned ports' list somewhere in the wiki.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #7699 to address the documentation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need any of this logic. Use an exec based command and run curl localhost: . This will eliminate the need to add another exception port, etc.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, no curl and exec. We already use 15000 and 15003 - adding a third one is not a huge deal.

We need it for metrics as well, and it's more efficient and flexible and works in other environments too. ( raw vm with consul and many others also have liveness checks ).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exec based instead of httpget.

Copy link
Copy Markdown
Member

@rshriram rshriram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly suggest removing the httpGet and making the health check exec based, on a local port. Otherwise, we have all sorts of security issues popping up.

Second, I think this PR should go into 1.0, given that the lack of a health check is a common pain most users experience. If the tests pass, i see no reason to not include this in 1.0, as it improves stability. @costinm thoughts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log error?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe more than log. Exit would probably be right - readiness would fail anyways.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this just be /healthz?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, as described in the health checking doc, it should be /healthz/ready.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goimports

Copy link
Copy Markdown
Contributor

@costinm costinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only got trough the first half.

Main concern is that adding a 2 second delay on all pods in the mesh, and a readiness probe by default is far too broad and damaging.

I still need to understand how 'ready' is determined - at least having the inbound listeners is required to be really ready, and this should also work for 'health'.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few reasons:

  • It helps to simplify the template logic
  • Its consistent with some other things that I'm doing where calling a method is doing things that are non-trivial to do in the template.
  • It removes the trailing , which is incorrect and may or may not have undesired affects in the script.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need health probe. And maybe use /readyz for readiness and leave healthz for health ?

Copy link
Copy Markdown
Contributor Author

@nmittler nmittler Aug 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health probes are not covered by this PR. As discussed earlier, readiness will be /healthz/ready.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delaying each pod with 2 seconds may be pretty bad for many apps.
Also not sure the global setting evaluated only at install time is going to help ( changing
mesh config after install will have no effect, even with restart of pilot and sidecar injector - it is evaluated at install by helm ).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No extra flags for istioctl. Keep feature parity with sidecar injection.

Users should add an annotation to the pod to customize per-pod settings.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no extra flags? Can you point to a feature that is configured in a way other than flags that I can use as a reference?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

istioctl is fetching the configmap. We can add those to the configmap if you want - the envoy port is also set there AFAIK.

It is fine to use flags for things users should control - but almost all of those flags will just confuse user.
There are very rare cases when a user will need to touch those - and the annotations are sufficient for that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides - the readiness path is a const in the first place ( and should be - we should only add options when there is a clear use case).

Copy link
Copy Markdown
Contributor Author

@nmittler nmittler Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving to the configmap seems reasonable ... not against it. Seems like all of this should move there, no?

The only issue I see is that wouldn't allow customizing the injection for a single deployment, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe more than log. Exit would probably be right - readiness would fail anyways.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't add a dependency on the entire package ( which AFAIK will link all symbols) just to get a constant.

It is ok to just duplicate the constant value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this - will simplify the template.
We should probably add more - in particular around the config options, so we can have a clean 'default to
the current mesh config ( not the install time values ), with annotation-based overrides.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave that to follow-on work, as this PR is already huge.

@costinm
Copy link
Copy Markdown
Contributor

costinm commented Jul 30, 2018

And far too risky and too late for 1.0, but this should be a priority for 1.1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be linked in agent - we don't want to pull all ads stuff and deps and create confusion. Agent-linked code needs to be separate package ( we also need to untangle it from the other packages ).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rely on stats only if possible.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your concern with config dump?

@nmittler nmittler requested a review from mandarjog July 30, 2018 16:12
@mandarjog
Copy link
Copy Markdown
Contributor

Issues with exec probes:

  1. It is very k8s centric.
  2. Running exec curl from the proxy container will mark the proxy container bad even though the app container needs restarting. Every container needs its own dedicated liveness / readiness checks for this reason.
  3. You may not have the ability to execute code in the environment.
  4. There is a work around by mounting a statically linked curl like binary into the application container. Even that may not always work if you are using alpine or similar.

network probes address all these problems.

@codecov
Copy link
Copy Markdown

codecov bot commented Jul 31, 2018

Codecov Report

Merging #7465 into master will decrease coverage by 1%.
The diff coverage is 68%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #7465    +/-   ##
=======================================
- Coverage      72%     71%   -<1%     
=======================================
  Files         379     374     -5     
  Lines       33570   33071   -499     
=======================================
- Hits        23930   23285   -645     
- Misses       8568    8773   +205     
+ Partials     1072    1013    -59
Impacted Files Coverage Δ
istioctl/cmd/istioctl/kubeinject.go 46% <40%> (+2%) ⬆️
pilot/pkg/kube/inject/inject.go 81% <90%> (+1%) ⬆️
...k/components/apps/local/envoy/agent/pilot/agent.go 2% <0%> (-63%) ⬇️
pkg/mcp/creds/create.go 0% <0%> (-55%) ⬇️
pilot/pkg/proxy/envoy/watcher.go 57% <0%> (-31%) ⬇️
pkg/mcp/creds/watcher.go 68% <0%> (-19%) ⬇️
...t/pkg/serviceregistry/external/servicediscovery.go 79% <0%> (-11%) ⬇️
mixer/pkg/runtime/handler/env.go 85% <0%> (-10%) ⬇️
mixer/pkg/attribute/mutableBag.go 90% <0%> (-9%) ⬇️
pilot/pkg/serviceregistry/aggregate/controller.go 69% <0%> (-7%) ⬇️
... and 117 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 15e4901...842c111. Read the comment docs.

@mandarjog
Copy link
Copy Markdown
Contributor

  1. Update /health to /health/liveness and /health/readiness
  2. Per design doc we actually need /health/<container>/liveness, both can be services by agent or proxy.
  3. For readiness, we do want it to be AND of proxy ready and application ready.
    We don't want to start receiving traffic when only proxy is ready but application is not.

@rshriram
Copy link
Copy Markdown
Member

rshriram commented Aug 1, 2018

Issues with exec probes:

It is very k8s centric.

Yes, but the problem is also k8s centric.

Running exec curl from the proxy container will mark the proxy container bad even though the app container needs restarting. Every container needs its own dedicated liveness / readiness checks for this reason.

I don't see how this is a problem with exec. This problem persists even if we use network probe

You may not have the ability to execute code in the environment.

?? The container is running code. The exec runs in the context of the container. If curl is an issue, simply add the capability to pilot-agent itself.. /usr/bin/pilot-agent healthcheck
and let the command talk over a local unix domain socket to the running pilot agent and see if things are working.

There is a work around by mounting a statically linked curl like binary into the application container.
Even that may not always work if you are using alpine or similar.

Curl is from the proxy container. This PR is about health checking envoy. Not daisy chaining to application's health check, because not all apps use http health check. They have TCP checks as well. They also have exec based checks when they have to check for multiple things.

Net: except for 1, I don't see how network probes solve anything. I would suggest simply making pilot agent listen on a unix domain socket and running the pilot-agent healthcheck command which talks to the UDS to check if the agent is running and healthy. And run this as exec command. Opening a port in Envoy and then establishing a special lsitener to route it back to pilot agent, etc. is making the envoy setup more complicated.

If you re-use code from proxy-status command, yuo could even do a smarter check where the agent can ensure the envoy has had atleast one successful ADS request (by checking envoy stats). In fact, that should be the health check instead of curling admin endpoint.

@nmittler
Copy link
Copy Markdown
Contributor Author

nmittler commented Aug 1, 2018

@mandarjog

  1. Update /health to /health/liveness and /health/readiness

makes sense.

  1. Per design doc we actually need /health/<container>/liveness, both can be services by agent or proxy.

Discussed offline. The doc you're referring to is here

  1. For readiness, we do want it to be AND of proxy ready and application ready.
    We don't want to start receiving traffic when only proxy is ready but application is not.

IIUC, this is how readiness works in k8s already (i.e. traffic won't be received by the pod until all containers are "ready"). So adding just a readiness probe to the proxy container should be sufficient for this PR. Application-level readiness is up to the developer.

@mandarjog
Copy link
Copy Markdown
Contributor

@liamawhite good point. In your example, there were already a few inbound clusters present. Just the important inbound cluster was missing.
In order to detect this, we should wait for all inbound clusters corresponding to declared container ports to show up.

@nmittler
Copy link
Copy Markdown
Contributor Author

@mandarjog @liamawhite In this case, are you saying that only a subset of the clusters were received from an ADS update? Or was everything received, but not all clusters/listeners were warmed?

@liamawhite
Copy link
Copy Markdown
Member

All things have been received as far as Envoy/Pilot are concerned. However, Kubernetes hasn't yet added the endpoints so Pilot doesn't know it needs to add the correct inbound clusters.

@nmittler
Copy link
Copy Markdown
Contributor Author

nmittler commented Aug 24, 2018

@liamawhite got it ... makes sense. And good catch! Should be an easy fix.

I'm not sure if I can rely on /clusters because the output looks like it's a subset of /stats. There is currently some discussion regarding eliminating clusters from stats to reduce the memory footprint of Envoy. However, the output of /listeners might be an option here.

@PiotrSikora can you confirm that using /listeners will be reasonably future-proof? I suspect that config_dump might be processing intensive, so I'd rather avoid it if possible.

@nmittler
Copy link
Copy Markdown
Contributor Author

@mandarjog @liamawhite @PiotrSikora PTAL. I've added the exposed application ports to the readiness check.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep the logic in istio.io/pilot/pkg? Maybe in the proxy directory?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it there originally, but there was a concern about proxy-agent pulling in too many dependencies. I suppose so long as it's in its own package it should be fine. Do you have a preference of location?

@andraxylia
Copy link
Copy Markdown
Contributor

@liamawhite k8s should be equally slow for inbound and outbound clusters, what do you mean? There was also a fix to generate listeners for endpoints not ready #6992

@nmittler
Copy link
Copy Markdown
Contributor Author

nmittler commented Aug 26, 2018

@andraxylia the requested check has already been added to this PR. See https://github.com/istio/istio/pull/7465/files#diff-293528584c1dc78a315e29fb2df2bc1a

@liamawhite
Copy link
Copy Markdown
Member

@andraxylia In some cases, it takes Kube a while to add an endpoint for the pod. Until Kube adds that endpoint Pilot doesn't know which inbound clusters to add so the pod can't correctly handle incoming requests.

@nmittler
Copy link
Copy Markdown
Contributor Author

For those interested, I've shared a design doc for Istio Proxy liveness/readiness.

Overview of the changes:

- Adding new flag to the pilot agent "--statusPort". Defaults to 15003.

- Adding an HttpGet readiness probe to the pilot-agent
on the new status port. The agent (not Envoy) will
open this port. When receiving an HTTP GET on the readiness path,
the agent will probe Envoy, checking that it has received dynamic
listeners and clusters. If so, it's "ready".

- Poking a hole in the sidecar iptables for the new status port
(via the -d option).

- Updated both webhook and manual injection paths to generate the
readiness probe. A lot of changes here for the test files.
Copy link
Copy Markdown
Contributor

@costinm costinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@mandarjog
Copy link
Copy Markdown
Contributor

/lgtm

@istio-testing
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: costinm, mandarjog

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@istio-testing
Copy link
Copy Markdown
Collaborator

istio-testing commented Aug 31, 2018

@nmittler: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
prow/istio-pilot-multicluster-e2e.sh 842c111 link /test istio-pilot-multicluster-e2e
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@nmittler
Copy link
Copy Markdown
Contributor Author

/test e2e-mixer-no_auth
/test e2e-dashboard

@sebastienvas sebastienvas merged commit 00abab7 into istio:master Aug 31, 2018
nmittler pushed a commit to nmittler/istio that referenced this pull request Sep 11, 2018
Overview of the changes:

- Adding new flag to the pilot agent "--statusPort". Defaults to 15003.

- Adding an HttpGet readiness probe to the pilot-agent
on the new status port. The agent (not Envoy) will
open this port. When receiving an HTTP GET on the readiness path,
the agent will probe Envoy, checking that it has received dynamic
listeners and clusters. If so, it's "ready".

- Poking a hole in the sidecar iptables for the new status port
(via the -d option).

- Updated both webhook and manual injection paths to generate the
readiness probe. A lot of changes here for the test files.
mandarjog pushed a commit that referenced this pull request Sep 12, 2018
* Cleaning up inject test files (#7188)

There are a lot of files and its confusing which files are used by
inject_test vs webhook_test. Splitting out the files into two
subdirectories under testdata/ to make it clear.  Making copies of
files that are shared.

* Adding helper functions to simplify injection templates (#7737)

* Adding helper functions to simplify injection templates

* addressing comments

* More sidecar injector template cleanup (#7832)

* More sidecar injector template cleanup

* fixing

* Add readiness probe for pilot agent. (#7465)

Overview of the changes:

- Adding new flag to the pilot agent "--statusPort". Defaults to 15003.

- Adding an HttpGet readiness probe to the pilot-agent
on the new status port. The agent (not Envoy) will
open this port. When receiving an HTTP GET on the readiness path,
the agent will probe Envoy, checking that it has received dynamic
listeners and clusters. If so, it's "ready".

- Poking a hole in the sidecar iptables for the new status port
(via the -d option).

- Updated both webhook and manual injection paths to generate the
readiness probe. A lot of changes here for the test files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.