
Reduce Envoy DNS Requests for STRICT_DNS clusters #13710

@howardjohn

Description


Problem

For STRICT_DNS clusters, Envoy sends a DNS request every X seconds (5 by default). DNS lookups in Kubernetes can potentially involve many requests each, due to the search and ndots (default 5 in Kubernetes) settings in the pod's resolv.conf.
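For reference, the pod resolv.conf driving this typically looks like the following (the nameserver IP and namespace vary by cluster, and nodes may append their own search suffixes, which is why some counts below are higher than the three cluster domains alone would suggest):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```

With ndots:5, any name containing fewer than 5 dots is tried against every search domain before being tried verbatim, and almost no real hostname has 5 dots.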

From my findings, the following numbers of DNS requests are needed:

| Request | Lookups | Reason |
| --- | --- | --- |
| google.com | 6 | Looks through google.com.svc.cluster.local, etc. before google.com. |
| google.com. | 1 | FQDN, so doesn't search |
| istio-pilot.istio-system | 2 | Searches istio-pilot.istio-system.default.svc.cluster.local first (default being the pod's namespace) |
| istio-pilot.istio-system.svc.cluster.local | 6 | Must go through all searches first |
| istio-pilot.istio-system.svc.cluster.local. | 1 | FQDN, so doesn't search |
| istio-zipkin.istio-system (without a zipkin pod) | 6 | Searches all, NXDOMAIN for all |
| istio-zipkin.istio-system.svc.cluster.local. (without a zipkin pod) | 1 | Searches just the FQDN, NXDOMAIN |

In a default installation, we set up clusters for istio-pilot.istio-system and istio-zipkin.istio-system (which does not exist by default). This means we get 2 + 6 = 8 requests every 5 seconds per pod, or 1.6 RPS per Envoy. Every additional ServiceEntry adds another 1.2 RPS (6 requests / 5s) for each host AND each port in the service entry, per Envoy. Consider a cluster with 1000 pods and 10 ServiceEntries with 1 host and 1 port each: (1.6 + 10 × 1.2) × 1000 = 13.6K RPS just to the DNS server, which is very high.

Luckily, these DNS requests are performed asynchronously, so this is unlikely to have much of a negative impact on latency.

The 5s lookup period is configurable, but currently this does not affect the bootstrap config (the zipkin and pilot clusters).
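The knob in question is Envoy's per-cluster dns_refresh_rate. A minimal sketch of what a bootstrap STRICT_DNS cluster carries (values illustrative; `hosts` is the older shorthand for `load_assignment`):

```yaml
- name: zipkin
  type: STRICT_DNS
  connect_timeout: 1s
  dns_refresh_rate: 5s   # the 5s lookup period discussed above
  hosts:
  - socket_address: { address: zipkin.istio-system, port_value: 9411 }
```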

Mitigation

  1. Unambiguous FQDNs (with the trailing .) are considered invalid by Galley and Pilot. This blocks us from using istio-pilot.istio-system.svc.cluster.local. as our address, which would reduce our DNS requests from 2 to 1. It also prevents users from using unambiguous FQDNs in their ServiceEntries, where the impact is even bigger: 6 calls down to 1.
  2. (If we do #1) we should use unambiguous FQDNs for the entries we add in the bootstrap configuration.
  3. We should make the configurable DNS lookup interval affect the bootstrap config.
  4. In all our documentation we recommend using resolution: DNS on ServiceEntry to expose external services. This means that when doing curl google.com, curl resolves google.com and sends the request to Envoy, which then does its own DNS resolution, so we resolve everything twice. We should limit examples of DNS resolution to cases where it is actually needed, and recommend NONE instead (unless there are reasons to use DNS I am missing); see the sketch after this list.
  5. We should raise the default resolution interval from 5s to Xs (15s? 30s?).
  6. We waste a lot of DNS requests on zipkin out of the box. We should consider removing it from the bootstrap config unless it is explicitly enabled. A downside here is that pods would need restarts if the config is changed; however, this restriction already applies to switching to Datadog or Lightstep, and it won't be present if we pass global.proxy.tracer="" rather than the default (zipkin). If we want to keep the default, we should probably add a guard against tracing.enabled, or maybe global.proxy.enableTracing?
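For #4, a minimal sketch of a ServiceEntry that avoids Envoy-side resolution entirely (the hostname and port here are just examples):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: google
spec:
  hosts:
  - google.com
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  # NONE: Envoy forwards to the IP the application already resolved,
  # so no STRICT_DNS cluster (and no recurring DNS traffic) is created.
  resolution: NONE
```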

With these changes we can get the number of requests down to 1, from 8 (when no zipkin pod exists, so its cluster is dropped per #6), or down to 2, from 4 (when zipkin is deployed and kept).

Still left unsolved is the issue of duplicating DNS lookups for each port. Fixing this would likely require a change in Envoy (to cache lookups across clusters), which is probably non-trivial, and possibly not even desirable. Given most ServiceEntries seem to have 1 or 2 ports, this probably isn't too much of an issue.
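To make the duplication concrete: a single host with two ports produces two independent STRICT_DNS clusters in Envoy, each resolving the same name on its own timer (a sketch using Istio's outbound|port||host naming; values illustrative):

```yaml
- name: outbound|80||example.com
  type: STRICT_DNS
  connect_timeout: 1s
  dns_refresh_rate: 5s   # resolves example.com on its own schedule
  hosts:
  - socket_address: { address: example.com, port_value: 80 }
- name: outbound|443||example.com
  type: STRICT_DNS
  connect_timeout: 1s
  dns_refresh_rate: 5s   # same hostname, resolved again independently
  hosts:
  - socket_address: { address: example.com, port_value: 443 }
```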

After applying 2 and 6 locally, I saw DNS requests drop by about 5x.

Related Issues

#12968
#12181
