pkg/kvstore: add gRPC keep alives for etcd connectivity#12947
Conversation
If the client does not receive a keep alive from the server, that connection should be closed so the etcd client library does proper round robin for the other available endpoints. This might be a little bit aggressive in a larger environment if all clients perform a keep alive requests to the etcd servers. Some testing could be done to verify if there is a large overhead of doing these keep alive requests. Signed-off-by: André Martins <andre@cilium.io>
|
test-me-please |
|
Green CI |
Signed-off-by: Thomas Graf <thomas@cilium.io>
| config.DialKeepAliveTime = clientOptions.KeepAliveHeartbeat | ||
| // Timeout if the server does not reply within 15 seconds and close the | ||
| // connection. Ideally it should be lower than staleLockTimeout | ||
| config.DialKeepAliveTimeout = clientOptions.KeepAliveTimeout |
There was a problem hiding this comment.
Relating to the large-scale cluster concern, we could consider adjusting this based on the cluster size like we do for some other time-based checks in Cilium?
There was a problem hiding this comment.
Something like ClusterSizeDependantInterval.
|
test-me-please |
joestringer
left a comment
There was a problem hiding this comment.
One more question came up during discussion with @christarazi: Should we only enable this for etcd client connections to the local etcd instance? Ie in clustermesh, avoid configuring this option?
I would treat them in the same way in general, the failover/stale connection problematic is exactly the same. |
If the client does not receive a keep alive from the server, that
connection should be closed so the etcd client library does proper
round robin for the other available endpoints.
This might be a little bit aggressive in a larger environment if all
clients perform a keep alive requests to the etcd servers. Some
testing could be done to verify if there is a large overhead of
doing these keep alive requests.
Signed-off-by: André Martins andre@cilium.io
Fixes #12945
TODO Perhaps it would also be a good idea to do have a hidden flag to configure this timeouts? Or even disable them?