http: better handling remote envoy draining #1423

@mattklein123

Description

At Lyft, we are hitting this comment at a very low rate:
https://github.com/lyft/envoy/blob/master/source/common/http/http2/conn_pool.cc#L148

This is roughly the flow:

  1. Remote Envoy has /healthcheck/fail called as part of a drain process.
  2. Active HC interval is set to 15s + 15s jitter.
  3. During this period, local Envoy will keep sending new streams to remote Envoy. These streams will be accepted, but GOAWAY will be sent along with response.
  4. This causes connection churning until local Envoy knows remote Envoy is HC failed.
  5. During this churn period it's possible that we keep cycling http/2 connection pool connections through primary and draining, which may result in active streams getting dropped when the draining connection gets replaced.

There are a few ways we can potentially think about fixing this:

  1. Make the http/2 connection pool a list vs. primary/draining. This is sub-optimal since we would still need to limit the # of outbound connections and can still run into this issue. It doesn't really fix it.
  2. Differentiate between /healthcheck/fail and /healthcheck/fail_no_drain in which the "no drain" option will respond with 503 to HC pings, but no draining will occur. In this case users can rely on a natural drain process and HC fail to do a rolling restart.
  3. Potentially add a new HTTP/2 connection level error code such as GOAWAY_AND_DONT_COME_BACK (@goaway suggests this option. And look at his GH name!). In this case we would treat this as an indication that the host should be immediately health check failed. I like this option at a conceptual level, but am a bit wary of adding a non-standard error code.
  4. Add a new header called x-envoy-upstream-draining which when set tells the downstream caller to immediately treat the host as HC failed. This header would be set when the response is sent followed by the normal GOAWAY. The local envoy could look at this header and then immediately HC fail the host as part of a "fast HC fail" process.

In general I'm in favor of (4) since I think it's the most flexible, could work for HTTP/1, and can be implemented by others if well documented.

@alyssawilk or anyone else, any thoughts on this?

Labels: enhancement (Feature requests. Not bugs or questions.)