Description: Changing the priority level of one or more localities in EDS can cause requests to fail, because endpoint health checking is done per-priority (as discussed previously in #3327). Essentially, priority changes (e.g. swapping the priorities of two localities) look to the EDS code like endpoints being removed and added, and therefore health-checking state is reset.
This issue is probably only visible when using the `drain_connections_on_host_removal: true` setting, because connections will not be torn down until health checks fail; even without that setting, though, the reset would mean twice the health checking and twice the connections during the transition.
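For reference, a minimal cluster definition with this setting and active health checking might look like the following (the cluster name, timeouts, and health-check path are illustrative, not taken from the original report):

```yaml
clusters:
- name: envoy-test
  type: EDS
  connect_timeout: 1s
  eds_cluster_config:
    eds_config:
      ads: {}
  # The setting discussed above: keep connections to a removed host open
  # until active health checks mark it unhealthy.
  drain_connections_on_host_removal: true
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 2
    http_health_check:
      path: /healthz
```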
Repro steps:
- Create a cluster with `drain_connections_on_host_removal: true` and a cluster load assignment with two localities, one with priority 0 and one with priority 1.
- Generate traffic to this cluster. I was able to reproduce with just `while true; do curl $args; sleep .1; done`.
- In the control plane, swap the priorities of the two localities.
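As a sketch, the two-locality load assignment before the swap might look like this (locality names are placeholders; the addresses match the access log below):

```yaml
cluster_name: envoy-test
endpoints:
- locality:
    region: us-east
    zone: a
  priority: 0  # swap this to 1 in the control plane to reproduce
  lb_endpoints:
  - endpoint:
      address:
        socket_address: { address: 10.1.3.5, port_value: 32351 }
- locality:
    region: us-east
    zone: b
  priority: 1  # ...and this to 0
  lb_endpoints:
  - endpoint:
      address:
        socket_address: { address: 10.1.3.26, port_value: 32351 }
```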
In the Envoy access log, I see a failed request amidst successful ones:
```
[2018-07-23T17:38:40.879Z] "GET /whoareyou HTTP/1.1" 200 - 0 17 2 2 "-" "curl/7.29.0" "178d8607-6924-4fe6-aa4b-335fd5b11949" "envoy-test.slug.gns.square" "10.1.3.5:32351"
[2018-07-23T17:38:40.990Z] "GET /whoareyou HTTP/1.1" 200 - 0 17 2 2 "-" "curl/7.29.0" "943ece9e-1daf-4cba-83d6-276ac4f5cd3c" "envoy-test.slug.gns.square" "10.1.3.5:32351"
[2018-07-23T17:38:41.100Z] "GET /whoareyou HTTP/1.1" 503 UH 0 19 0 - "-" "curl/7.29.0" "c3b4afb0-1821-4931-89c0-d8f2a7f3ab1f" "envoy-test.slug.gns.square" "-"
[2018-07-23T17:38:41.207Z] "GET /whoareyou HTTP/1.1" 200 - 0 18 9 9 "-" "curl/7.29.0" "29c1e2c3-1df1-41cb-b7f5-9499caefb6cf" "envoy-test.slug.gns.square" "10.1.3.26:32351"
[2018-07-23T17:38:41.325Z] "GET /whoareyou HTTP/1.1" 200 - 0 19 9 9 "-" "curl/7.29.0" "1e6974b2-37db-4751-8e4d-824cb76da78c" "envoy-test.slug.gns.square" "10.1.3.45:32351"
```
The client sees no healthy upstream.
This issue arises from a difference between how we model cluster membership and how the code does. The code seems to assume that priorities are static and immutable, whereas we view localities as static and immutable, with priorities that may change. My argument for thinking about it this way is that localities refer to physical locations (region, zone, sub-zone), but you may want to reprioritize routing between regions, for instance during a disaster-recovery exercise or when adding a new region.
Unfortunately this would probably require a pretty significant refactor to fix. With some guidance we might have some bandwidth to look into this.