-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Closed
Labels
enhancementFeature requests. Not bugs or questions.Feature requests. Not bugs or questions.
Milestone
Description
At Lyft, we are hitting this comment at very low rate:
https://github.com/lyft/envoy/blob/master/source/common/http/http2/conn_pool.cc#L148
This is roughly the flow:
- Remote Envoy has
/healthcheck/failcalled as part of a drain process. - Active HC interval is set to 15s + 15s jitter.
- During this period, local Envoy will keep sending new streams to remote Envoy. These streams will be accepted, but GOAWAY will be sent along with response.
- This causes connection churning until local Envoy knows remote Envoy is HC failed.
- During this churn period it's possible that we keep cycling http/2 connection pool connections through primary and draining, which may result in active streams getting dropped when the draining connection gets replaced.
There are a few ways we can potentially think about fixing this:
- Make the http/2 connection pool a list vs. primary/draining. This is sub-optimal since we would still need to limit the # of outbound connections and can still run into this issue. It doesn't really fix it.
- Differentiate between
/healthcheck/failand/healthcheck/fail_no_drainin which the "no drain" option will respond with 503 to HC pings, but no draining will occur. In this case users can rely on a natural drain process and HC fail to do a rolling restart. - Potentially add a new HTTP/2 connection level error code such as GOAWAY_AND_DONT_COME_BACK (@goaway suggests this option. And look at his GH name!). In this case we would treat this as an indication that the host should be immediately health check failed. I like this option at a conceptual level, but am a bit weary of adding a non-standard error code.
- Add a new header called
x-envoy-upstream-drainingwhich when set tells the downstream caller to immediately treat the host as HC failed. This header would be set when the response is sent followed by the normal GOAWAY. The local envoy could look at this header and then immediately HC fail the host as part of a "fast HC fail" process.
In general I'm in favor of (4) since I think it's the most flexible, could work for HTTP/1, and can be implemented by others if well documented.
@alyssawilk or anyone else, any thoughts on this?
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
enhancementFeature requests. Not bugs or questions.Feature requests. Not bugs or questions.