
Reduce Envoy DNS Requests for STRICT_DNS clusters #13710

@howardjohn

Description


Problem

For STRICT_DNS clusters, Envoy sends a DNS request every X seconds (5 by default). DNS lookups in Kubernetes can potentially involve many requests each, due to the search and ndots (default 5 in Kubernetes) settings in the pod's resolv.conf.
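For reference, the pod resolv.conf driving this typically looks like the following (the nameserver IP and namespace vary by cluster, and nodes may append their own search suffixes, which is why some counts below are higher than the three cluster domains alone would suggest):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```

With ndots:5, any name containing fewer than 5 dots is tried against every search domain before being tried verbatim, and almost no real hostname has 5 dots.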

From my findings, the following numbers of DNS requests are needed:

| Request | Lookups | Reason |
| --- | --- | --- |
| google.com | 6 | Looks through google.com.svc.cluster.local, etc. before google.com. |
| google.com. | 1 | FQDN, so doesn't search |
| istio-pilot.istio-system | 2 | Searches istio-pilot.istio-system.default.svc.cluster.local first (default being the pod's namespace) |
| istio-pilot.istio-system.svc.cluster.local | 6 | Must go through all searches first |
| istio-pilot.istio-system.svc.cluster.local. | 1 | FQDN, so doesn't search |
| istio-zipkin.istio-system (without a zipkin pod) | 6 | Searches all, NXDOMAIN for all |
| istio-zipkin.istio-system.svc.cluster.local. (without a zipkin pod) | 1 | Searches just the FQDN, NXDOMAIN |

In a default installation, we set up clusters for istio-pilot.istio-system and istio-zipkin.istio-system (which does not exist by default). This means we get 2 + 6 = 8 requests every 5 seconds per pod, or 1.6 RPS per Envoy. Every additional ServiceEntry adds another 1.2 RPS (6 requests / 5s) for each host AND each port in the service entry, per Envoy. Consider a cluster with 1000 pods and 10 ServiceEntries with 1 host and 1 port each: (1.6 + 10 × 1.2) × 1000 = 13.6K RPS just to the DNS server, which is very high.

Luckily, these DNS requests are performed asynchronously, so this is unlikely to have much of a negative impact on latency.

The 5s lookup period is configurable, but currently this does not affect the bootstrap config (the zipkin and pilot clusters).
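The knob in question is Envoy's per-cluster dns_refresh_rate. A minimal sketch of what a bootstrap STRICT_DNS cluster carries (values illustrative; `hosts` is the older shorthand for `load_assignment`):

```yaml
- name: zipkin
  type: STRICT_DNS
  connect_timeout: 1s
  dns_refresh_rate: 5s   # the 5s lookup period discussed above
  hosts:
  - socket_address: { address: zipkin.istio-system, port_value: 9411 }
```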

Mitigation

  1. Unambiguous FQDNs (with the trailing .) are considered invalid by Galley and Pilot. This blocks us from using istio-pilot.istio-system.svc.cluster.local. as our address, which would reduce our DNS requests from 2 to 1. It also prevents users from using unambiguous FQDNs in their ServiceEntries, where the impact is even bigger: 6 calls down to 1.
  2. (If we do #1) we should use unambiguous FQDNs for the entries we add in the bootstrap configuration.
  3. We should make the configurable DNS lookup interval affect the bootstrap config.
  4. In all our documentation we recommend using resolution: DNS on ServiceEntry to expose external services. This means that when doing curl google.com, curl resolves google.com and sends the request to Envoy, which then does its own DNS resolution, so we resolve everything twice. We should limit examples of DNS resolution to cases where it is actually needed, and recommend NONE instead (unless there are reasons to use DNS I am missing); see the sketch after this list.
  5. We should raise the default resolution interval from 5s to Xs (15s? 30s?).
  6. We waste a lot of DNS requests on zipkin out of the box. We should consider removing it from the bootstrap config unless it is explicitly enabled. A downside here is that pods would need restarts if the config is changed; however, this restriction already applies to switching to Datadog or Lightstep, and it won't be present if we pass global.proxy.tracer="" rather than the default (zipkin). If we want to keep the default, we should probably add a guard against tracing.enabled, or maybe global.proxy.enableTracing?
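For #4, a minimal sketch of a ServiceEntry that avoids Envoy-side resolution entirely (the hostname and port here are just examples):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: google
spec:
  hosts:
  - google.com
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  # NONE: Envoy forwards to the IP the application already resolved,
  # so no STRICT_DNS cluster (and no recurring DNS traffic) is created.
  resolution: NONE
```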

With these changes we can get the number of requests down to 1, from 8 (when no zipkin pod exists, so its cluster is dropped per #6), or down to 2, from 4 (when zipkin is deployed and kept).

Still left unsolved is the issue of duplicating DNS lookups for each port. Fixing this would likely require a change in Envoy (to cache lookups across clusters), which is probably non-trivial, and possibly not even desirable. Given most ServiceEntries seem to have 1 or 2 ports, this probably isn't too much of an issue.
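To make the duplication concrete: a single host with two ports produces two independent STRICT_DNS clusters in Envoy, each resolving the same name on its own timer (a sketch using Istio's outbound|port||host naming; values illustrative):

```yaml
- name: outbound|80||example.com
  type: STRICT_DNS
  connect_timeout: 1s
  dns_refresh_rate: 5s   # resolves example.com on its own schedule
  hosts:
  - socket_address: { address: example.com, port_value: 80 }
- name: outbound|443||example.com
  type: STRICT_DNS
  connect_timeout: 1s
  dns_refresh_rate: 5s   # same hostname, resolved again independently
  hosts:
  - socket_address: { address: example.com, port_value: 443 }
```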

After applying 2 and 6 locally, I saw DNS requests drop by about 5x.

Related Issues

#12968
#12181
