Description: Changing the priority level of one or more localities in EDS can cause requests to fail, because endpoint health checking is done per-priority (as discussed previously in #3327). Essentially, priority changes (e.g. swapping the priorities of two localities) look to the EDS code like endpoints being removed and added, and therefore health-checking state is reset.
This issue is probably only visible when using the `drain_connections_on_host_removal: true` setting, because connections will not be torn down until health checks fail; even without that setting, though, the reset would mean twice the health checking and twice the connections during the transition.
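For reference, a minimal cluster definition with this setting and active health checking might look like the following (the cluster name, timeouts, and health-check path are illustrative, not taken from the original report):

```yaml
clusters:
- name: envoy-test
  type: EDS
  connect_timeout: 1s
  eds_cluster_config:
    eds_config:
      ads: {}
  # The setting discussed above: keep connections to a removed host open
  # until active health checks mark it unhealthy.
  drain_connections_on_host_removal: true
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 2
    http_health_check:
      path: /healthz
```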
Repro steps:
- Create a cluster with `drain_connections_on_host_removal: true` and a cluster load assignment with two localities, one with priority 0 and one with priority 1.
- Generate traffic to this cluster. I was able to reproduce with just `while true; do curl $args; sleep .1; done`.
- In the control plane, swap the priorities of the two localities.
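As a sketch, the two-locality load assignment before the swap might look like this (locality names are placeholders; the addresses match the access log below):

```yaml
cluster_name: envoy-test
endpoints:
- locality:
    region: us-east
    zone: a
  priority: 0  # swap this to 1 in the control plane to reproduce
  lb_endpoints:
  - endpoint:
      address:
        socket_address: { address: 10.1.3.5, port_value: 32351 }
- locality:
    region: us-east
    zone: b
  priority: 1  # ...and this to 0
  lb_endpoints:
  - endpoint:
      address:
        socket_address: { address: 10.1.3.26, port_value: 32351 }
```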
In the Envoy access log, I see a failed request amidst successful ones:
```
[2018-07-23T17:38:40.879Z] "GET /whoareyou HTTP/1.1" 200 - 0 17 2 2 "-" "curl/7.29.0" "178d8607-6924-4fe6-aa4b-335fd5b11949" "envoy-test.slug.gns.square" "10.1.3.5:32351"
[2018-07-23T17:38:40.990Z] "GET /whoareyou HTTP/1.1" 200 - 0 17 2 2 "-" "curl/7.29.0" "943ece9e-1daf-4cba-83d6-276ac4f5cd3c" "envoy-test.slug.gns.square" "10.1.3.5:32351"
[2018-07-23T17:38:41.100Z] "GET /whoareyou HTTP/1.1" 503 UH 0 19 0 - "-" "curl/7.29.0" "c3b4afb0-1821-4931-89c0-d8f2a7f3ab1f" "envoy-test.slug.gns.square" "-"
[2018-07-23T17:38:41.207Z] "GET /whoareyou HTTP/1.1" 200 - 0 18 9 9 "-" "curl/7.29.0" "29c1e2c3-1df1-41cb-b7f5-9499caefb6cf" "envoy-test.slug.gns.square" "10.1.3.26:32351"
[2018-07-23T17:38:41.325Z] "GET /whoareyou HTTP/1.1" 200 - 0 19 9 9 "-" "curl/7.29.0" "1e6974b2-37db-4751-8e4d-824cb76da78c" "envoy-test.slug.gns.square" "10.1.3.45:32351"
```
The client sees no healthy upstream.
This issue arises from a difference between how we model cluster membership and how the code does. The code seems to assume that priorities are static and immutable, whereas we view localities as static and immutable, with priorities that may change. My argument for thinking about it this way is that localities refer to physical locations (region, zone, sub-zone), but you may want to reprioritize routing between regions, for instance during a disaster-recovery exercise or when adding a new region.
Unfortunately this would probably require a pretty significant refactor to fix. With some guidance we might have some bandwidth to look into this.