Title: HTTP 2 connection draining is non-graceful for low-volume listeners
Description:
When a listener enters a draining state, due to either a hot-restart or an LDS update, it should not accept new requests but should let existing requests finish gracefully. There is a drain-time limit that, when reached, terminates any open connections non-gracefully.
For HTTP 2, putting a connection into a draining state should be equivalent to Envoy sending a GOAWAY frame on the connection. This signals that no new streams should be created on the connection, but existing streams are allowed to finish. Assuming a well-behaved client (one that respects the GOAWAY), this should mirror the desired behavior for connection draining. However, Envoy does not send a GOAWAY proactively when the drain-time begins. Instead, Envoy issues a GOAWAY after the next request made on the connection is completed. Represented visually, this would look like:

With an appropriately long drain-time for your traffic and a sufficiently busy listener, this delayed GOAWAY does not generally lead to issues. However, if the listener is processing a low volume of long requests, then it is possible to end up in the following scenario:

In this scenario the request does not begin until near the end of the drain-time window. Because the GOAWAY is not sent until the request ends, the drain-time limit is reached while the request is still in flight. The request is then interrupted as the connection is closed non-gracefully, where non-graceful is defined as 1) without a GOAWAY and 2) with in-flight requests.
An interrupted request is logged in the access log with the DC flag and returns a 503 response. If the downstream is another Envoy instance, then the downstream's access log will show the UC flag.
I would expect Envoy to issue a GOAWAY (NO_ERROR error code) at the beginning of the drain period (without the external trigger of a request), as this matches the desired behavior of Envoy's connection draining. I suspect the current implementation is an artifact of how connections are closed for persistent HTTP 1.
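To make the timing argument concrete, here is a minimal sketch that models the two behaviors for a single connection. It is not Envoy code: the names `DrainOutcome`, `delayed_goaway`, and `proactive_goaway` are hypothetical, times are seconds relative to the start of the drain, and it assumes a well-behaved client and ignores network latency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrainOutcome:
    goaway_at_s: Optional[float]  # when the GOAWAY is sent; None if never sent
    interrupted: bool             # True if the request is cut off at the drain limit


def delayed_goaway(drain_time_s: float, req_start_s: float, req_duration_s: float) -> DrainOutcome:
    """Behavior described in this issue: GOAWAY is sent only after the next request completes."""
    req_end_s = req_start_s + req_duration_s
    if req_end_s <= drain_time_s:
        # The request finishes inside the drain window, so the GOAWAY follows it.
        return DrainOutcome(goaway_at_s=req_end_s, interrupted=False)
    # The drain limit is reached first: no GOAWAY, and the request is in flight.
    return DrainOutcome(goaway_at_s=None, interrupted=True)


def proactive_goaway(drain_time_s: float, req_start_s: float, req_duration_s: float) -> DrainOutcome:
    """Expected behavior: GOAWAY is sent at the start of the drain period."""
    if req_start_s >= 0:
        # Assumption: the client has seen the GOAWAY and sends this request
        # on a new connection, so the drain limit cannot interrupt it.
        return DrainOutcome(goaway_at_s=0.0, interrupted=False)
    # A stream already in flight when the drain began can still outlive the window.
    req_end_s = req_start_s + req_duration_s
    return DrainOutcome(goaway_at_s=0.0, interrupted=req_end_s > drain_time_s)
```

With a 20-second drain-time and a 30-second request that begins 10 seconds into the drain, `delayed_goaway(20, 10, 30)` reports an interrupted request with no GOAWAY, while `proactive_goaway(20, 10, 30)` does not.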
Repro steps:
This has been reproduced in our internal integration-test suite, but it can also be reproduced readily with the following steps:
- Configure drain-time of 20 seconds
- Configure parent-shutdown time of 25 seconds
- Start Envoy
- Create a client (either H1 or H2) and generate some traffic to ensure established connections
- Begin a reload or perform an LDS update
- Issue a request over the same connection that:
  - Begins 10s after the reload/LDS-update was initiated
  - Lasts for 30s (upstream service sleeps 30s before responding)
- Observe non-graceful connection termination
All tests were done using a concurrency of 1 to ensure a single listener/connection.
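For the "upstream service sleeps 30s before responding" step, a stand-in upstream could look like the sketch below. This is not the service used in our tests: the names `make_handler` and `make_server`, the port `8081`, and the response body are assumptions chosen for illustration, and Envoy would need a cluster pointing at whatever address you choose.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(delay_s: float):
    """Build a handler class that sleeps delay_s seconds before answering a GET."""

    class SlowHandler(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"

        def do_GET(self):
            time.sleep(delay_s)  # simulate the long-running request
            body = b"done\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the console quiet during the repro

    return SlowHandler


def make_server(delay_s: float = 30.0, host: str = "127.0.0.1", port: int = 8081):
    """Create (but do not start) the slow upstream; the port is a placeholder."""
    return ThreadingHTTPServer((host, port), make_handler(delay_s))
```

Calling `make_server().serve_forever()` starts the upstream with the 30-second delay from the steps above.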