Set stream_idle_timeout to 0s for xDS listener in pilot sidecar config #19043
istio-testing merged 1 commit into istio:master
Envoy's default value for `stream_idle_timeout` is [5 minutes](https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/filter/network/http_connection_manager/v2/http_connection_manager.proto#envoy-api-field-config-filter-network-http-connection-manager-v2-httpconnectionmanager-stream-idle-timeout). Pilot expects to kill connections every 30 minutes to promote even connection balancing, but the lower default idle timeout can cause connections to be killed more rapidly than expected. This change disables the idle timeout entirely, giving control back to Pilot regarding when the connections are killed.
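As a rough illustration of the setting described above, the sketch below builds a minimal HTTP connection manager filter entry with `stream_idle_timeout` set to `0s` and prints it as JSON. Only the `stream_idle_timeout` field and its `0s` value come from this PR; the helper name `hcm_filter`, the `stat_prefix` value, and the surrounding layout are assumptions for illustration and are not copied from the actual bootstrap template.

```python
import json


def hcm_filter(stream_idle_timeout="0s"):
    """Return a minimal, illustrative HTTP connection manager filter entry.

    Only `stream_idle_timeout` reflects this PR; every other key and value
    here is a placeholder assumption.
    """
    return {
        "name": "envoy.http_connection_manager",
        "config": {
            "stat_prefix": "xds_listener",  # hypothetical name
            # "0s" disables the per-stream idle timeout (Envoy default: 5 minutes).
            "stream_idle_timeout": stream_idle_timeout,
        },
    }


fragment = hcm_filter()
print(json.dumps(fragment, indent=2))
```

A real listener would also need route configuration and HTTP filters, which are omitted here to keep the focus on the one field this change touches.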
|
Hi @joeyb. Thanks for your PR. I'm waiting for an istio member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Do we send keep alive packets? Can/should we? |
|
@howardjohn - afaik, we don't send keep alives right now, but I think that would be another valid way to handle this. I haven't tested it yet for solving this particular problem, but here's some prior art from a related envoy issue: envoyproxy/envoy#5173 (comment). Configuring the keep alives is a bit different because it would be done on the client side (i.e. in the |
|
Seems like this change makes us more vulnerable to half-closed connections, and keep alive may be more useful. Outside of my expertise though, so it would be good if someone else could weigh in |
|
Yup, makes sense. I’ll run some tests to confirm that keep alives resolve this particular issue. If so, I’m fine with either approach. |
|
Even if we set keep alives, which we already do here, https://github.com/istio/istio/blob/master/tools/packaging/common/envoy_bootstrap_v2.json#L403, as mentioned by @joeyb, the server side (ingress listener) runs into the stream idle timeout and resets the stream to the local cluster, which will force pilot to disconnect the Envoy, IIRC. I do not know if adding TCP keepalives to |
|
Is the problem that the keep alive time is 5min and the idle timeout is also 5min, so there is a race? Or does Envoy still consider itself idle even if it is sending the keep alive packets? |
|
Is stream_idle_timeout set to 0s in release 1.4? If so, in my env the period is still 30min. |
|
@hzxuzhonghu the max connection age is 30min, the idle timeout is 5min. So it's only 5min if there is no config sent at all; otherwise it would be 30min |
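The interaction between the two timers described above can be sketched with a small model. This is a simplified illustration, not Envoy or Pilot code: the function name `disconnect_time` is hypothetical, the stream is assumed to open at `t = 0`, the max connection age is treated as exact, and every config push is assumed to reset the idle timer.

```python
def disconnect_time(idle_timeout_s, max_age_s, activity_times_s):
    """Return the time (seconds) at which the stream would be torn down.

    idle_timeout_s   -- stream idle timeout; 0 disables it
    max_age_s        -- max connection age enforced by the server
    activity_times_s -- times at which config is sent on the stream
    """
    last_activity = 0.0
    for t in sorted(activity_times_s):
        if t >= max_age_s:
            break
        # The idle timer expires before this push arrives.
        if idle_timeout_s > 0 and t - last_activity >= idle_timeout_s:
            return last_activity + idle_timeout_s
        last_activity = t
    # No further activity: check whether the idle timer beats the max age.
    if idle_timeout_s > 0 and max_age_s - last_activity >= idle_timeout_s:
        return last_activity + idle_timeout_s
    return max_age_s


# Silent stream with the default 5 minute idle timeout: cut at 5 minutes.
print(disconnect_time(300, 1800, []))
# Config pushed every 4 minutes: the stream lasts the full 30 minutes.
print(disconnect_time(300, 1800, range(240, 1800, 240)))
# Idle timeout disabled: only the max connection age applies.
print(disconnect_time(0, 1800, []))
```

Under these assumptions, a quiet cluster reconnects every 5 minutes while a busy one reconnects every 30 minutes, which matches the behaviour described in this thread.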
|
My cluster is silent, so I believe it is the grpc keep alive that keeps the connection alive. |
|
Was the grpc keep alive added recently? And just to be sure, are you running Envoy in front of Pilot? If not, you may see a different behaviour. Here is what we see - it disconnects every 5 mins |
|
Weird, I installed by |
|
TCP keep alive (at the TCP layer) doesn't reset the gRPC-layer stream timeout, so even if we set TCP keep alive, the Envoy in front of pilot may still kill the stream (not the connection) when the idle timeout expires. /ok-to-test |
|
/retest |
|
@joeyb: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
I ran some tests locally and confirmed that setting the tcp keepalive alone is not enough to reset the timeout. @hzxuzhonghu - like Rama mentioned, I suspect that maybe you don't have envoy running next to your pilot instance? Unless I'm missing it somewhere, I don't think we have gRPC keepalives configured. Setting Assuming we go with this approach, we'll also need a corresponding PR in the installer repo since it configures the pilot envoy bootstrap via a configmap rather than using the template included in the docker image. |
See istio/istio#19043 for prior discussion. Both pilot and galley set max client ages in order to promote rebalancing, but the default `stream_idle_timeout` value of `5m` is lower than the default max client age of `30m`. This is causing reconnects to occur more frequently than intended. This change disables the idle timeout entirely, giving control back to pilot and galley over when the client connections are killed. Mixer does not seem to set the max client age, so the `stream_idle_timeout` is only being set for the static listener for outbound connections to galley.
I did have. I remember we have grpc keep-alive enabled on the pilot side; the period is 30s by default. I think this is related. |
|
@hzxuzhonghu I think this keep alive config is for the connection, right? The connection may still be alive, but the stream (the ADS stream with Envoy) might still time out after 5 mins if it is idle, because Envoy's stream timeouts are per stream: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts#stream-timeouts |
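The connection-versus-stream distinction above can be sketched as a toy model in which only stream data resets the per-stream idle timer, while connection-level keepalives do not. This reflects the behaviour described in this thread rather than Envoy's implementation; the function name `stream_idle_fires` and the event labels are hypothetical.

```python
def stream_idle_fires(events, idle_timeout_s, horizon_s):
    """Return when the per-stream idle timer fires, or None if it never does.

    events -- list of (time_s, kind) with kind 'keepalive' or 'stream_data'.
    Assumption: keepalives act on the connection only, so they never
    reset the stream's idle timer.
    """
    last_stream_activity = 0
    for t, kind in sorted(events):
        if t > horizon_s:
            break
        if t - last_stream_activity >= idle_timeout_s:
            return last_stream_activity + idle_timeout_s
        if kind == "stream_data":
            last_stream_activity = t
    if horizon_s - last_stream_activity >= idle_timeout_s:
        return last_stream_activity + idle_timeout_s
    return None


keepalives = [(t, "keepalive") for t in range(30, 601, 30)]

# Keepalives every 30s but no stream data: the stream still idles out at 300s.
print(stream_idle_fires(keepalives, 300, 600))

# Adding stream data at 200s and 400s keeps the stream open for the full 600s.
data = [(200, "stream_data"), (400, "stream_data")]
print(stream_idle_fires(keepalives + data, 300, 600))
```

In this model the keepalive period is irrelevant to the stream timer, which is consistent with the local test result reported earlier in the thread.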