Skip to content

pkg/kvstore: add gRPC keep alives for etcd connectivity#12947

Merged
tgraf merged 2 commits intomasterfrom
pr/fix-etcd-stale-requests
Aug 24, 2020
Merged

pkg/kvstore: add gRPC keep alives for etcd connectivity#12947
tgraf merged 2 commits intomasterfrom
pr/fix-etcd-stale-requests

Conversation

@aanm
Copy link
Copy Markdown
Member

@aanm aanm commented Aug 21, 2020

If the client does not receive a keep alive from the server, that
connection should be closed so the etcd client library does proper
round robin for the other available endpoints.

This might be a little bit aggressive in a larger environment if all
clients perform a keep alive requests to the etcd servers. Some
testing could be done to verify if there is a large overhead of
doing these keep alive requests.

Signed-off-by: André Martins andre@cilium.io

Fixes #12945

TODO Perhaps it would also be a good idea to do have a hidden flag to configure this timeouts? Or even disable them?

Improved reliability of etcd connectivity by adding gRPC keep alives

If the client does not receive a keep alive from the server, that
connection should be closed so the etcd client library does proper
round robin for the other available endpoints.

This might be a little bit aggressive in a larger environment if all
clients perform a keep alive requests to the etcd servers. Some
testing could be done to verify if there is a large overhead of
doing these keep alive requests.

Signed-off-by: André Martins <andre@cilium.io>
@aanm aanm added kind/bug This is a bug in the Cilium logic. priority/high This is considered vital to an upcoming release. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.7 labels Aug 21, 2020
@aanm aanm requested a review from a team as a code owner August 21, 2020 11:21
@aanm
Copy link
Copy Markdown
Member Author

aanm commented Aug 21, 2020

test-me-please

@tgraf
Copy link
Copy Markdown
Contributor

tgraf commented Aug 21, 2020

Green CI

Signed-off-by: Thomas Graf <thomas@cilium.io>
Comment thread pkg/kvstore/etcd.go
Comment on lines +684 to +687
config.DialKeepAliveTime = clientOptions.KeepAliveHeartbeat
// Timeout if the server does not reply within 15 seconds and close the
// connection. Ideally it should be lower than staleLockTimeout
config.DialKeepAliveTimeout = clientOptions.KeepAliveTimeout
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relating to the large-scale cluster concern, we could consider adjusting this based on the cluster size like we do for some other time-based checks in Cilium?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like ClusterSizeDependantInterval.

@christarazi
Copy link
Copy Markdown
Member

test-me-please

Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question came up during discussion with @christarazi: Should we only enable this for etcd client connections to the local etcd instance? Ie in clustermesh, avoid configuring this option?

@tgraf
Copy link
Copy Markdown
Contributor

tgraf commented Aug 24, 2020

One more question came up during discussion with @christarazi: Should we only enable this for etcd client connections to the local etcd instance? Ie in clustermesh, avoid configuring this option?

I would treat them in the same way in general, the failover/stale connection problematic is exactly the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug This is a bug in the Cilium logic. priority/high This is considered vital to an upcoming release. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pod can be stuck in ContainerCreating state for ~15m with identity-allocation-mode=kvstore

6 participants