Conversation
Force-pushed from 75d39f7 to ff1cae1
Force-pushed from bd6dbd1 to 37e76cb
Force-pushed from 29cfbbd to 60a9692
test-me-please
joestringer left a comment:
Metrics side looks good. I have a few questions around introspection of the new state and a few around docs enhancements.
AFAIK in the event of a complete etcd cluster loss, Cilium still retains a 15-minute timeout on startup where it attempts to establish a healthy connection to an etcd cluster with quorum before exiting. This aspect is not really covered in this PR given that it's more focused on partial outages, but just wondering whether you had been looking at "complete etcd outage"-type scenarios with the new logic in mind? At a glance I think it should be orthogonal; no specific concerns come to mind about how that case might be impacted by this new logic.
Force-pushed from 1e8cf0e to e08f7a6
Force-pushed from e08f7a6 to 1cf55de
joestringer left a comment:
One more main note on the controller change and visibility below; the rest is minor nits that could be addressed here or as a follow-up.
Force-pushed from 1cf55de to 694ff58
Force-pushed from 694ff58 to 1648301
test-me-please
Force-pushed from 1648301 to f99a67d
Signed-off-by: Thomas Graf <thomas@cilium.io>
Adds a heartbeat written to a key at an interval (1 min) by the operator. Each etcd client installs a watcher on the heartbeat key. When the heartbeat has not been updated within 2*interval, the quorum check starts failing:

```
KVStore: Ok etcd: 1/1 connected, lease-ID=29c6732d5d580cb5, lock lease-ID=29c6732d5d580cb7, has-quorum=2m2.778966915s since last heartbeat update has been received, consecutive-errors=1: https://192.168.33.11:2379 - 3.4.9 (Leader)
```

When enough consecutive errors have accumulated, the kvstore subsystem starts failing:

```
KVStore: Failure Err: quorum check failed 8 times in a row: 4m28.446600949s since last heartbeat update has been received
```

Signed-off-by: Thomas Graf <thomas@cilium.io>
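For illustration, here is a minimal Go sketch of this mechanism, not the actual Cilium implementation: the key name `cilium/.heartbeat`, the constants, and the `heartbeatMonitor` type are assumptions, and the import path corresponds to the etcd v3.5 client.

```go
package heartbeat

import (
	"context"
	"fmt"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const (
	// Illustrative values; not the actual Cilium constants.
	heartbeatKey      = "cilium/.heartbeat"
	heartbeatInterval = time.Minute
)

// writeHeartbeat runs on the operator: it updates the heartbeat key once per
// interval so that every agent can observe that the kvstore is writable.
func writeHeartbeat(ctx context.Context, cli *clientv3.Client) {
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			putCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
			_, err := cli.Put(putCtx, heartbeatKey, time.Now().Format(time.RFC3339Nano))
			cancel()
			if err != nil {
				fmt.Println("failed to update heartbeat:", err)
			}
		}
	}
}

// heartbeatMonitor runs in every etcd client: it watches the heartbeat key
// and remembers when the last update was received.
type heartbeatMonitor struct {
	mu         sync.Mutex
	lastUpdate time.Time
}

func (m *heartbeatMonitor) watch(ctx context.Context, cli *clientv3.Client) {
	m.mu.Lock()
	m.lastUpdate = time.Now()
	m.mu.Unlock()
	for resp := range cli.Watch(ctx, heartbeatKey) {
		if resp.Err() == nil && len(resp.Events) > 0 {
			m.mu.Lock()
			m.lastUpdate = time.Now()
			m.mu.Unlock()
		}
	}
}

// checkQuorum fails once no heartbeat update has been seen for 2*interval,
// matching the status message quoted in the commit message above.
func (m *heartbeatMonitor) checkQuorum() error {
	m.mu.Lock()
	since := time.Since(m.lastUpdate)
	m.mu.Unlock()
	if since > 2*heartbeatInterval {
		return fmt.Errorf("%s since last heartbeat update has been received", since)
	}
	return nil
}
```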
Clustermesh never performs write operations, so the lock-based quorum check only adds contention on the remote etcds.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Watch the status of the etcd connection and restart the connection if quorum loss is detected. Given that lock acquisition is disabled for clustermesh, the quorum check is equivalent to the ability to receive updates on the heartbeat key.

Signed-off-by: Thomas Graf <thomas@cilium.io>
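Continuing the sketch above (same package and imports), a hypothetical restart loop for a clustermesh connection could look like the following; the function name, thresholds, and the `newClient` callback are illustrative, not the actual clustermesh code.

```go
// restartOnQuorumLoss periodically runs the heartbeat-based quorum check for
// a remote cluster and re-establishes the etcd connection once quorum loss is
// detected. newClient stands in for whatever recreates the remote backend.
func restartOnQuorumLoss(ctx context.Context, mon *heartbeatMonitor,
	cli *clientv3.Client, newClient func() (*clientv3.Client, error)) {

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	consecutiveErrors := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := mon.checkQuorum(); err == nil {
				consecutiveErrors = 0
				continue
			}
			consecutiveErrors++
			// Only restart after several failed checks in a row to avoid
			// flapping on a single missed heartbeat.
			if consecutiveErrors < 3 {
				continue
			}
			cli.Close()
			if fresh, err := newClient(); err == nil {
				cli = fresh
				consecutiveErrors = 0
			}
		}
	}
}
```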
Signed-off-by: Thomas Graf <thomas@cilium.io>
"Backend not initialized" does not mean much to users. Signed-off-by: Thomas Graf <thomas@cilium.io>
Reported-by: @sayboras
Signed-off-by: Thomas Graf <thomas@cilium.io>
The initial status message of the etcd subsystem is:

```
KVStore: Ok No connection to etcd
```

This can be misleading as it does not indicate whether the etcd session was ever established or not. Clarify this:

```
KVStore: Ok Waiting for initial connection to be established
```

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
When releasing the etcd connection, Cilium attempts to revoke the sessions. If the etcd connection is unhealthy, the operation fails and times out, which can take a long time. Instead of blocking, release the resources in the background.

Signed-off-by: Thomas Graf <thomas@cilium.io>
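As a sketch of the non-blocking release (same package and imports as above), lease revocation can be moved into a goroutine with its own timeout; the helper name and the 10-second timeout are assumptions, not taken from the Cilium code base.

```go
// releaseInBackground revokes the client's leases without blocking the
// caller. If etcd is unhealthy, the revoke simply times out in the background
// instead of stalling the teardown path; the lease will still expire on its
// own once its TTL elapses.
func releaseInBackground(cli *clientv3.Client, leaseIDs ...clientv3.LeaseID) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		for _, id := range leaseIDs {
			if _, err := cli.Revoke(ctx, id); err != nil {
				fmt.Println("lease revoke failed:", err)
			}
		}
		cli.Close()
	}()
}
```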
Force-pushed from f38e585 to 2e1e97a
Good condition:

```
cluster2: ready, 4 nodes, 3 identities, 1 services, 0 failures (last: never)
```

Bad condition:

```
cluster2: not-ready, 0 nodes, 0 identities, 0 services, 1 failures (last: 9s ago)
```

Signed-off-by: Thomas Graf <thomas@cilium.io>
Force-pushed from 2e1e97a to 0aa046e
test-me-please
@tgraf by default, how long would this take from an etcd outage (or cilium-operator node becoming unavailable) before Cilium agents begin restarting?
Depends on #12427