http: better handling remote envoy draining #1423

@mattklein123

Description

At Lyft, we are hitting this comment at a very low rate:
https://github.com/lyft/envoy/blob/master/source/common/http/http2/conn_pool.cc#L148

This is roughly the flow:

  1. Remote Envoy has /healthcheck/fail called as part of a drain process.
  2. Active HC interval is set to 15s + 15s jitter.
  3. During this period, local Envoy will keep sending new streams to remote Envoy. These streams will be accepted, but GOAWAY will be sent along with response.
  4. This causes connection churning until local Envoy knows remote Envoy is HC failed.
  5. During this churn period it's possible that we keep cycling http/2 connection pool connections through primary and draining, which may result in active streams getting dropped when the draining connection gets replaced.

There are a few ways we can potentially think about fixing this:

  1. Make the http/2 connection pool a list vs. primary/draining. This is sub-optimal since we would still need to limit the # of outbound connections and can still run into this issue. It doesn't really fix it.
  2. Differentiate between /healthcheck/fail and /healthcheck/fail_no_drain in which the "no drain" option will respond with 503 to HC pings, but no draining will occur. In this case users can rely on a natural drain process and HC fail to do a rolling restart.
  3. Potentially add a new HTTP/2 connection level error code such as GOAWAY_AND_DONT_COME_BACK (@goaway suggests this option. And look at his GH name!). In this case we would treat this as an indication that the host should be immediately health check failed. I like this option at a conceptual level, but am a bit wary of adding a non-standard error code.
  4. Add a new header called x-envoy-upstream-draining which when set tells the downstream caller to immediately treat the host as HC failed. This header would be set when the response is sent followed by the normal GOAWAY. The local envoy could look at this header and then immediately HC fail the host as part of a "fast HC fail" process.

In general I'm in favor of (4) since I think it's the most flexible, could work for HTTP/1, and can be implemented by others if well documented.

@alyssawilk or anyone else, any thoughts on this?

Labels: enhancement (Feature requests. Not bugs or questions.)