pkg/kvstore: add gRPC keep alives for etcd connectivity by aanm · Pull Request #12947 · cilium/cilium

aanm · 2020-08-21T11:21:52Z

If the client does not receive a keep alive from the server, that
connection should be closed so that the etcd client library can
round-robin to the other available endpoints.

This might be a little aggressive in a larger environment if all
clients send keep alive requests to the etcd servers. Some testing
could be done to verify whether these keep alive requests add a
large overhead.

Signed-off-by: André Martins <andre@cilium.io>

Fixes #12945

TODO: Perhaps it would also be a good idea to have a hidden flag to configure these timeouts, or even to disable them?

Improved reliability of etcd connectivity by adding gRPC keep alives

aanm · 2020-08-21T11:22:10Z

test-me-please

tgraf · 2020-08-21T15:37:44Z

Green CI

Signed-off-by: Thomas Graf <thomas@cilium.io>

joestringer · 2020-08-21T16:43:32Z

+	config.DialKeepAliveTime = clientOptions.KeepAliveHeartbeat
+	// Timeout if the server does not reply within 15 seconds and close the
+	// connection. Ideally it should be lower than staleLockTimeout
+	config.DialKeepAliveTimeout = clientOptions.KeepAliveTimeout


Relating to the large-scale cluster concern, we could consider adjusting this based on the cluster size like we do for some other time-based checks in Cilium?

Something like ClusterSizeDependantInterval.

christarazi · 2020-08-21T17:26:53Z

test-me-please

joestringer

One more question came up during discussion with @christarazi: Should we only enable this for etcd client connections to the local etcd instance? I.e., in clustermesh, avoid configuring this option?

tgraf · 2020-08-24T09:39:14Z

> One more question came up during discussion with @christarazi: Should we only enable this for etcd client connections to the local etcd instance? I.e., in clustermesh, avoid configuring this option?

I would treat them the same way in general; the failover/stale connection problem is exactly the same.

aanm added kind/bug, priority/high, release-note/bug, and needs-backport/1.7 labels Aug 21, 2020

aanm requested a review from a team as a code owner August 21, 2020 11:21

etcd: Make keepalive interval and timeout configurable

c64e8f6

Signed-off-by: Thomas Graf <thomas@cilium.io>

joestringer approved these changes Aug 21, 2020

joestringer reviewed Aug 21, 2020

christarazi approved these changes Aug 21, 2020

joestringer reviewed Aug 21, 2020

tgraf merged commit a4a1df0 into master Aug 24, 2020

tgraf deleted the pr/fix-etcd-stale-requests branch August 24, 2020 09:39

brb mentioned this pull request Aug 25, 2020

v1.8 backports 2020-08-25 #12963

Merged

brb added backport-pending/1.8 and removed needs-backport/1.8 labels Aug 25, 2020

joestringer added backport-done/1.8 and removed backport-pending/1.8 labels Aug 26, 2020

christarazi mentioned this pull request Aug 26, 2020

allocator: Fatal on timeout & unavailable backend #12935

Closed

jrfastab added backport-pending/1.7 and removed needs-backport/1.7 labels Aug 27, 2020

This was referenced Aug 27, 2020

v1.7 backports 2020-08-27 #12991

Closed

v1.7 backports 2020-08-27 #12992

Merged

joestringer added backport-done/1.7 and removed backport-pending/1.7 labels Aug 28, 2020

tgraf merged 2 commits into master from pr/fix-etcd-stale-requests
