Title: Envoy briefly fails to route requests during CDS updates
Description:
When a cluster is updated via CDS, we sometimes observe request failures: HTTP 503 "no healthy upstream" with the UH response flag.
The cluster's membership_total gauge remains constant throughout the update.
The membership_healthy gauge drops to 0 shortly after the CDS update begins, then returns to normal shortly after the update completes.
The request failures occur regardless of the configured panic routing threshold.
The docs on cluster warming (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/cluster_manager#cluster-warming) state there should be no traffic disruption during a CDS update.
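For reference, this is roughly how we watch the gauge during updates: a minimal sketch that polls the Envoy admin /stats endpoint (the admin address and port here are assumptions for illustration; the stat name matches the logs below):

```python
import re
import urllib.request


def membership_healthy(stats_text: str, cluster: str) -> int:
    """Extract the membership_healthy gauge for `cluster` from /stats output."""
    m = re.search(
        rf"^cluster\.{re.escape(cluster)}\.membership_healthy: (\d+)$",
        stats_text,
        re.MULTILINE,
    )
    if m is None:
        raise KeyError(f"no membership_healthy stat for cluster {cluster!r}")
    return int(m.group(1))


def poll(admin: str = "http://127.0.0.1:9901", cluster: str = "foo") -> int:
    """Fetch /stats from the admin interface and return the gauge value."""
    with urllib.request.urlopen(f"{admin}/stats") as resp:
        return membership_healthy(resp.read().decode(), cluster)
```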
Repro steps:
We observed and reproduced this behavior on both static and EDS clusters by sending a CDS update that changes the health-checking timeout (e.g. from 5s to 10s and back).
For example, we used an EDS cluster with a 1% panic routing threshold, configured along the lines of:
{
  "@type": "type.googleapis.com/envoy.api.v2.Cluster",
  "name": "foo",
  "type": "EDS",
  "eds_cluster_config": {
    "eds_config": {
      "api_config_source": {
        "api_type": "GRPC",
        "grpc_services": [ { "envoy_grpc": { "cluster_name": "xxx" } } ]
      }
    }
  },
  "connect_timeout": "10s",
  "health_checks": [ {
    "timeout": "5s",
    "interval": "3s",
    "unhealthy_threshold": 1,
    "healthy_threshold": 2,
    "reuse_connection": true,
    "http_health_check": {
      "host": "xxx",
      "path": "/healthcheck"
    },
    "no_traffic_interval": "3s"
  } ],
  "circuit_breakers": {
    "thresholds": [ {
      "max_connections": 4294967295,
      "max_pending_requests": 4294967295,
      "max_requests": 4294967295,
      "retry_budget": {}
    } ]
  },
  "dns_lookup_family": "V4_ONLY",
  "common_lb_config": { "healthy_panic_threshold": { "value": 1 } },
  "transport_socket": {
    "name": "envoy.transport_sockets.tls",
    "typed_config": {
      "@type": "type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext",
      "common_tls_context": {
        "tls_certificates": [ {
          "certificate_chain": { "filename": "/path/to/cert.pem" },
          "private_key": { "filename": "/path/to/key.pem" }
        } ],
        "validation_context": {
          "trusted_ca": { "filename": "/path/to/ca.pem" },
          "match_subject_alt_names": [ { "exact": "xxx" } ]
        }
      }
    }
  },
  "upstream_connection_options": {
    "tcp_keepalive": {
      "keepalive_probes": 9,
      "keepalive_time": 300,
      "keepalive_interval": 60
    }
  }
}
The cluster load assignment included a stable pool of 10 healthy backends.
Envoy does not appear to make any health-check requests during the CDS update itself.
The backends are hard-coded to report healthy, and they run locally on the same host to minimize possible network interference.
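For reference, a minimal sketch of such a hard-coded-healthy backend (an illustration of the setup, not our exact test harness):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthyBackend(BaseHTTPRequestHandler):
    """Always responds 200 OK, mirroring the hard-coded-healthy backends."""

    def do_GET(self):
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the server quiet


def serve(port: int = 6869) -> None:
    """Port 6869 matches the first backend in the logs; any free port works."""
    HTTPServer(("0.0.0.0", port), HealthyBackend).serve_forever()
```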
Logs:
We reproduced this using a build of Envoy 1.14.5-dev with some extra log statements:
2020-09-11T18:05:48 cluster.foo.membership_healthy = 10
[...]
2020-09-11T18:05:49 Traffic generator: 1428 requests, 0 errors
[...]
2020-09-11T18:05:54 Traffic generator: 2590 requests, 0 errors
2020-09-11T18:05:54 Updating foo HC timeout to 5s
2020-09-11T18:05:55 Traffic generator: 2962 requests, 0 errors
[...]
2020-09-11T18:05:57 Traffic generator: 3024 requests, 0 errors
2020-09-11T18:05:57 [backend@0.0.0.0:6878] responding to /healthcheck request from 192.168.0.11:44078 (for [192.168.0.12]) with OK
2020-09-11T18:05:57 [backend@0.0.0.0:6872] responding to /healthcheck request from 192.168.0.5:37884 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [backend@0.0.0.0:6871] responding to /healthcheck request from 192.168.0.4:49690 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [backend@0.0.0.0:6877] responding to /healthcheck request from 192.168.0.10:36026 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [backend@0.0.0.0:6876] responding to /healthcheck request from 192.168.0.9:46476 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [backend@0.0.0.0:6870] responding to /healthcheck request from 192.168.0.3:46406 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [backend@0.0.0.0:6873] responding to /healthcheck request from 192.168.0.6:58104 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [backend@0.0.0.0:6875] responding to /healthcheck request from 192.168.0.8:34440 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [backend@0.0.0.0:6869] responding to /healthcheck request from 192.168.0.2:47934 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [backend@0.0.0.0:6874] responding to /healthcheck request from 192.168.0.7:45602 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 Traffic generator: 3640 requests, 0 errors
[...]
2020-09-11T18:05:58 Traffic generator: 3554 requests, 0 errors
[2020-09-11 18:05:58.897][401][upstream] [source/common/upstream/cluster_manager_impl.cc:582] add/update cluster foo starting warming
[2020-09-11 18:05:58.897][401][upstream] [source/common/upstream/upstream_impl.cc:951] ClusterImplBase::initialize()
[2020-09-11 18:05:58.913][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 0
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:967] initializing Secondary cluster foo completed
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:972] ClusterImplBase::onInitDone()
2020-09-11T18:05:58 cluster.foo.membership_healthy = 0
[2020-09-11 18:05:58.976][401][upstream] [source/common/upstream/upstream_impl.cc:992] ClusterImplBase::finishInitialization()
[2020-09-11 18:05:58.976][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 0 added and 0 removed, updating hosts from 10 to 10, and healthy hosts from 0 to 10
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:594] warming cluster foo complete
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:612] updating TLS cluster foo
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:1176] Init ClusterEntry for foo at 0x563f2e21a340
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:1206] Initializing RoundRobin Load Balancer
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:369] onClusterInit
[2020-09-11 18:05:58.978][665][upstream] [source/common/upstream/cluster_manager_impl.cc:612] updating TLS cluster foo
[2020-09-11 18:05:58.978][665][upstream] [source/common/upstream/cluster_manager_impl.cc:1176] Init ClusterEntry for foo at 0x563f2e4189c0
[2020-09-11 18:05:58.978][665][upstream] [source/common/upstream/cluster_manager_impl.cc:1206] Initializing RoundRobin Load Balancer
[2020-09-11 18:05:58.981][665][misc] [source/common/upstream/load_balancer_impl.cc:808] chooseHostOnce: hosts_to_use.empty with source type AllHosts
[2020-09-11 18:05:58.981][665][upstream] [source/common/upstream/cluster_manager_impl.cc:1244] no healthy host for HTTP connection pool 0x563f2e4189c0
[2020-09-11 18:05:58.981][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 10
[2020-09-11 18:05:58.985][665][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 10
2020-09-11T18:05:58 Error making request: Expected HTTP code 200, but got 503 (with body "no healthy upstream")
2020-09-11T18:05:58 cluster.foo.membership_healthy = 10
2020-09-11T18:05:59 Traffic generator: 3441 requests, 1 errors
Based on the trace above, it looks like the load balancer of the updated cluster is used to route a request before its underlying thread-local host set has been initialized: worker thread 665 initializes its ClusterEntry and immediately serves a request from an empty host set (18:05:58.981), a few milliseconds before it receives the updateHosts call that populates the 10 hosts (18:05:58.985).
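Note that the "source type AllHosts" log line suggests the 1% panic threshold did engage (0 healthy out of 10 is below 1%), but the thread-local host set the load balancer consulted was itself still empty, so panic routing had nothing to fall back to. A simplified model of that decision, under my reading of the panic logic (this is an assumption about the behavior, not Envoy's actual code):

```python
def choose_host(hosts, healthy_hosts, panic_threshold_pct):
    """Simplified panic-routing decision: if the healthy fraction falls below
    the panic threshold, fall back to routing across *all* hosts."""
    healthy_pct = 100.0 * len(healthy_hosts) / len(hosts) if hosts else 0.0
    if healthy_pct < panic_threshold_pct:
        pool = hosts          # panic mode: the "AllHosts" source
    else:
        pool = healthy_hosts
    return pool[0] if pool else None  # None -> "no healthy upstream" 503


# What the worker thread saw at 18:05:58.981: an empty thread-local host set,
# so even panic mode has nothing to route to.
assert choose_host([], [], panic_threshold_pct=1) is None

# A few milliseconds later, after updateHosts propagates the 10 healthy hosts,
# routing succeeds again.
hosts = [f"192.168.0.{i}" for i in range(2, 12)]
assert choose_host(hosts, hosts, panic_threshold_pct=1) is not None
```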
I'm not familiar enough with Envoy internals, so I could use some help understanding whether I'm somehow misconfiguring Envoy or whether something else is happening.