Set stream_idle_timeout to 0s for xDS listener in pilot sidecar config#19043

Merged
istio-testing merged 1 commit into istio:master from joeyb:pilot-stream_idle_timeout
Nov 20, 2019

Conversation

@joeyb
Contributor

@joeyb joeyb commented Nov 18, 2019


Envoy's default value for `stream_idle_timeout` is [5 minutes](https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/filter/network/http_connection_manager/v2/http_connection_manager.proto#envoy-api-field-config-filter-network-http-connection-manager-v2-httpconnectionmanager-stream-idle-timeout). Pilot expects to kill connections every 30 minutes to promote even connection balancing, but the lower default idle timeout can cause connections to be killed more rapidly than expected. This change disables the idle timeout entirely, giving control back to Pilot regarding when the connections are killed.
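
In concrete terms, the change amounts to disabling the per-stream idle timeout on the HTTP connection manager serving the xDS listener. A minimal sketch of the kind of fragment involved (field names per the Envoy v2 HTTP connection manager API; the actual Istio template has more surrounding config, and the `stat_prefix` value here is illustrative):

```yaml
- name: envoy.http_connection_manager
  typed_config:
    "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
    stat_prefix: xds            # illustrative value
    stream_idle_timeout: 0s     # 0s disables Envoy's 5m default per-stream idle timeout
```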
@joeyb joeyb requested a review from a team as a code owner November 18, 2019 16:14
@googlebot googlebot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Nov 18, 2019
@istio-testing istio-testing added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 18, 2019
@istio-testing
Collaborator

Hi @joeyb. Thanks for your PR.

I'm waiting for an Istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@howardjohn
Member

Do we send keep alive packets? Can/should we?

@joeyb
Contributor Author

joeyb commented Nov 18, 2019

@howardjohn - afaik, we don't send keep alives right now, but I think that would be another valid way to handle this. I haven't tested it yet for solving this particular problem, but here's some prior art from a related envoy issue: envoyproxy/envoy#5173 (comment). Configuring the keep alives is a bit different because it would be done on the client side (i.e. in the xds-grpc cluster config) rather than on the server side listener.
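
For reference, a client-side TCP keepalive on the xDS cluster would look roughly like this (a sketch against the Envoy v2 Cluster API; the cluster name matches the xds-grpc cluster mentioned above, and all timing values are illustrative, not recommendations):

```yaml
clusters:
- name: xds-grpc
  upstream_connection_options:
    tcp_keepalive:
      keepalive_time: 300      # seconds of idle before the first probe (illustrative)
      keepalive_interval: 30   # seconds between probes (illustrative)
      keepalive_probes: 3      # failed probes before the connection is declared dead (illustrative)
```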

@howardjohn
Member

Seems like this change makes us more vulnerable to half-closed connections, and keep-alives may be more useful. Outside of my expertise though, so it would be good if someone else could weigh in.

@joeyb
Contributor Author

joeyb commented Nov 18, 2019

Yup, makes sense. I’ll run some tests to confirm that keep alives resolve this particular issue. If so, I’m fine with either approach.

@ramaraochavali
Contributor

ramaraochavali commented Nov 19, 2019

Even if we set keep-alives, which we already do here (https://github.com/istio/istio/blob/master/tools/packaging/common/envoy_bootstrap_v2.json#L403), as mentioned by @joeyb, the server side (ingress listener) runs into the stream idle timeout and resets the stream to the local cluster, which will force Pilot to disconnect the Envoy, IIRC.

I do not know if adding TCP keepalives to the in.15010 cluster (local cluster) will help, though - we can try, maybe?

@howardjohn
Member

Is the problem that the keep-alive time is 5min and the idle timeout is also 5min, so there is a race? Or, if Envoy is sending the keep-alive packets, will it still consider itself idle?

@hzxuzhonghu
Member

Is `stream_idle_timeout` set to 0s in release 1.4? If so, in my env the period is still 30min.

[Envoy (Epoch 0)] [2019-11-19 16:57:16.647][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 17:25:38.528][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 17:55:49.454][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 18:28:14.198][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 18:57:37.335][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 19:26:16.387][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 19:53:51.741][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 20:23:17.094][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 20:54:30.689][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 21:23:03.173][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 21:52:25.804][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 22:24:36.710][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 22:55:11.721][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 23:28:14.761][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-19 23:58:38.330][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-20 00:27:55.067][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-20 00:57:19.065][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-20 01:26:51.212][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2019-11-20 01:55:36.659][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,

@howardjohn
Member

@hzxuzhonghu the max connection age is 30min, and the idle timeout is 5m. So it's only 5min if there is no config sent at all; otherwise it would be 30min.

@hzxuzhonghu
Member

My cluster is silent; I believe it is the gRPC keep-alive that keeps it alive.

@ramaraochavali
Contributor

Was the gRPC keep-alive added recently? And just to be sure, are you running Envoy in front of Pilot? If not, you may see a different behaviour. Here is what we see - it disconnects every 5 mins:

"message":"[2019-11-19 16:09:31.676][138][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
"message":"[2019-11-19 16:04:31.584][138][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
"message":"[2019-11-19 15:59:31.252][138][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
"message":"[2019-11-19 15:54:30.854][138][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,

@hzxuzhonghu
Member

Weird, I installed with `kubectl apply -f istio-demo.yaml` from release-1.4.

@lizan
Contributor

lizan commented Nov 20, 2019

The TCP keepalive (at the TCP layer) doesn't reset the gRPC-layer stream timeout, so even if we set a TCP keepalive, the Envoy in front of Pilot may kill the stream (not the connection) on the idle timeout.

/ok-to-test

@istio-testing istio-testing added ok-to-test Set this label allow normal testing to take place for a PR not submitted by an Istio org member. and removed needs-ok-to-test labels Nov 20, 2019
@ramaraochavali
Contributor

/retest

@istio-testing
Collaborator

istio-testing commented Nov 20, 2019

@joeyb: The following test failed; say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| integ-mixer-k8s-tests_istio | fb6dd12 | link | /test integ-mixer-k8s-tests_istio |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@joeyb
Contributor Author

joeyb commented Nov 20, 2019

I ran some tests locally and confirmed that setting the tcp keepalive alone is not enough to reset the timeout. @hzxuzhonghu - like Rama mentioned, I suspect that maybe you don't have envoy running next to your pilot instance? Unless I'm missing it somewhere, I don't think we have gRPC keepalives configured.

Setting stream_idle_timeout=0 seems like the most straightforward way to solve this for now, but it seems reasonable to create a follow-up issue to look into whether or not gRPC keepalives do a better job of solving the problem.

Assuming we go with this approach, we'll also need a corresponding PR in the installer repo since it configures the pilot envoy bootstrap via a configmap rather than using the template included in the docker image.

@istio-testing istio-testing merged commit d0d20be into istio:master Nov 20, 2019
@joeyb joeyb deleted the pilot-stream_idle_timeout branch November 20, 2019 17:03
joeyb added a commit to joeyb/installer that referenced this pull request Nov 20, 2019
See istio/istio#19043 for prior discussion.

Both pilot and galley set max client ages in order to promote rebalancing, but the default `stream_idle_timeout` value of `5m` is lower than the default max client age of `30m`. This is causing reconnects to occur more frequently than intended. This change disables the idle timeout entirely, giving control back to pilot and galley over when the client connections are killed.

Mixer does not seem to set the max client age, so the `stream_idle_timeout` is only being set for the static listener for outbound connections to galley.
@hzxuzhonghu
Member

```shell
root@beast-02:~/istio-1.4.0# kubectl get pod -n istio-system |grep pilot
istio-pilot-5dc796c7cc-mbkm5              2/2     Running     2          42h
```

I did have. I remember we have gRPC keep-alive enabled on the Pilot side; the period is 30s by default. I think this is related.

```go
func (s *Server) grpcServerOptions(options *istiokeepalive.Options) []grpc.ServerOption {
	interceptors := []grpc.UnaryServerInterceptor{
		// setup server prometheus monitoring (as final interceptor in chain)
		prometheus.UnaryServerInterceptor,
	}

	// Temp setting, default should be enough for most supported environments. Can be used for testing
	// envoy with lower values.
	maxStreams := features.MaxConcurrentStreams

	grpcOptions := []grpc.ServerOption{
		grpc.UnaryInterceptor(middleware.ChainUnaryServer(interceptors...)),
		grpc.MaxConcurrentStreams(uint32(maxStreams)),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:                  options.Time,
			Timeout:               options.Timeout,
			MaxConnectionAge:      options.MaxServerConnectionAge,
			MaxConnectionAgeGrace: options.MaxServerConnectionAgeGrace,
		}),
	}

	return grpcOptions
}
```

@ramaraochavali
Contributor

@hzxuzhonghu I think this keep-alive config is for the connection, right? The connection may still be alive, but the stream (the ADS stream with Envoy) might still get timed out after 5 mins if it is idle, because Envoy's stream timeouts are per stream: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts#stream-timeouts

istio-testing pushed a commit to istio/installer that referenced this pull request Nov 22, 2019
See istio/istio#19043 for prior discussion.

Both pilot and galley set max client ages in order to promote rebalancing, but the default `stream_idle_timeout` value of `5m` is lower than the default max client age of `30m`. This is causing reconnects to occur more frequently than intended. This change disables the idle timeout entirely, giving control back to pilot and galley over when the client connections are killed.

Mixer does not seem to set the max client age, so the `stream_idle_timeout` is only being set for the static listener for outbound connections to galley.
sdake pushed a commit to sdake/istio that referenced this pull request Dec 1, 2019
Set stream_idle_timeout to 0s for xDS listener in pilot sidecar config (istio#19043)

Envoy's default value for `stream_idle_timeout` is [5 minutes](https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/filter/network/http_connection_manager/v2/http_connection_manager.proto#envoy-api-field-config-filter-network-http-connection-manager-v2-httpconnectionmanager-stream-idle-timeout). Pilot expects to kill connections every 30 minutes to promote even connection balancing, but the lower default idle timeout can cause connections to be killed more rapidly than expected. This change disables the idle timeout entirely, giving control back to Pilot regarding when the connections are killed.