Add heartbeat to etcd quorum check by tgraf · Pull Request #12453 · cilium/cilium

tgraf · 2020-07-08T08:01:50Z

Depends on #12427

Adds a heartbeat that the operator writes to a key at a fixed interval (1min). Each etcd client installs a watcher on the heartbeat key. If the heartbeat is not updated within 2*interval, the quorum check starts failing:

KVStore:                Ok   etcd: 1/1 connected, lease-ID=29c6732d5d580cb5, lock lease-ID=29c6732d5d580cb7, has-quorum=2m2.778966915s since last heartbeat update has been received, consecutive-errors=1: https://192.168.33.11:2379 - 3.4.9 (Leader)

When enough consecutive errors have accumulated, the kvstore subsystem will start failing:

KVStore:                Failure   Err: quorum check failed 8 times in a row: 4m28.446600949s since last heartbeat update has been received

coveralls · 2020-07-08T09:01:01Z

Coverage increased (+0.03%) to 37.028% when pulling 0aa046e on pr/tgraf/etcd-heartbeat into 38d9bf5 on master.

tgraf · 2020-07-10T08:31:09Z

test-me-please

joestringer

Metrics side looks good. I have a few questions around introspection of the new state and a few around docs enhancements.

AFAIK in the event of a complete etcd cluster loss, Cilium still retains a 15-minute timeout on startup where it attempts to establish a healthy connection to an etcd cluster with quorum before exiting. This aspect is not really covered in this PR given that it's more focused on partial outages, but just wondering whether you had been looking at "complete etcd outage" type scenarios with the new logic in mind? At a glance I think it should be orthogonal, no specific concerns come to mind about how that case might be impacted by this new logic.

joestringer

One more main note on the controller change and visibility below, the rest is minor nits that could be addressed here or as a follow up.

tgraf · 2020-07-14T13:06:24Z

test-me-please

Watch the status of the etcd connection and restart the connection if quorum loss is detected. Given that lock acquisition is disabled for clustermesh, the quorum check is equivalent to the ability to receive updates on the heartbeat key. Signed-off-by: Thomas Graf <thomas@cilium.io>

The initial status message of the etcd subsystem is:

```
KVStore: Ok No connection to etcd
```

This can be misleading as it does not indicate whether the etcd session was ever established or not. Clarify this:

```
KVStore: Ok Waiting for initial connection to be established
```

Signed-off-by: Thomas Graf <thomas@cilium.io>

When releasing the etcd connection, an attempt is made to revoke the sessions. If the etcd connection is unhealthy, the revocation fails, but only after a long timeout. Instead of blocking on it, release the resources in the background. Signed-off-by: Thomas Graf <thomas@cilium.io>

Good condition:

```
cluster2: ready, 4 nodes, 3 identities, 1 services, 0 failures (last: never)
```

Bad condition:

```
cluster2: not-ready, 0 nodes, 0 identities, 0 services, 1 failures (last: 9s ago)
```

Signed-off-by: Thomas Graf <thomas@cilium.io>

tgraf · 2020-07-15T12:25:43Z

test-me-please

joestringer · 2020-07-31T21:53:09Z

@tgraf by default, how long would this take from an etcd outage (or cilium-operator node becoming unavailable) before Cilium agents begin restarting?

tgraf added the release-note/minor label Jul 8, 2020

tgraf requested review from a team as code owners July 8, 2020 08:01

tgraf marked this pull request as draft July 8, 2020 08:01

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 75d39f7 to ff1cae1 Compare July 8, 2020 08:03

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch 2 times, most recently from bd6dbd1 to 37e76cb Compare July 8, 2020 16:05

tgraf added area/documentation needs-backport/1.8 labels Jul 8, 2020

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch 2 times, most recently from 29cfbbd to 60a9692 Compare July 10, 2020 08:02

tgraf marked this pull request as ready for review July 10, 2020 08:02

tgraf requested review from a team as code owners July 10, 2020 08:02

qmonnet requested changes Jul 10, 2020

Comment thread pkg/kvstore/etcd.go Outdated

Comment thread Documentation/troubleshooting.rst Outdated

Comment thread Documentation/troubleshooting.rst Outdated

Comment thread Documentation/troubleshooting.rst Outdated

sayboras reviewed Jul 10, 2020

Comment thread pkg/clustermesh/remote_cluster.go Outdated

Comment thread pkg/kvstore/backend.go Outdated

Comment thread pkg/clustermesh/remote_cluster.go Outdated

joestringer requested changes Jul 10, 2020

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch 2 times, most recently from 1e8cf0e to e08f7a6 Compare July 13, 2020 14:31

qmonnet approved these changes Jul 13, 2020

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from e08f7a6 to 1cf55de Compare July 13, 2020 15:29

joestringer requested changes Jul 13, 2020

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 1cf55de to 694ff58 Compare July 14, 2020 12:35

tgraf requested a review from a team as a code owner July 14, 2020 12:35

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 694ff58 to 1648301 Compare July 14, 2020 12:51

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 1648301 to f99a67d Compare July 14, 2020 16:36

tgraf added 10 commits July 15, 2020 10:07

kvstore: Add metric to count quorum errors

ff7b0d4

Signed-off-by: Thomas Graf <thomas@cilium.io>

clustermesh: Disable initlock quorum check

aeabfb9

Clustermesh is never performing write operations so the lock-based quorum check is only adding contention to remote etcds. Signed-off-by: Thomas Graf <thomas@cilium.io>

doc: Document etcd failure behavior

7b4acc0

Signed-off-by: Thomas Graf <thomas@cilium.io>

clustermesh: Improve error message for initial connection attempt

b6d5874

"Backend not initialized" does not mean much to users. Signed-off-by: Thomas Graf <thomas@cilium.io>

clustermesh: Fix comments on scope of rc.mutex

f534dd4

Reported-by: @sayboras Signed-off-by: Thomas Graf <thomas@cilium.io>

kvstore: Log errors while closing etcd client

86a1972

Signed-off-by: Thomas Graf <thomas@cilium.io>

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch 2 times, most recently from f38e585 to 2e1e97a Compare July 15, 2020 11:32

tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 2e1e97a to 0aa046e Compare July 15, 2020 12:13

maintainer-s-little-helper Bot requested a review from joestringer July 15, 2020 14:04

tgraf merged commit 3e951e1 into master Jul 15, 2020

tgraf deleted the pr/tgraf/etcd-heartbeat branch July 15, 2020 14:11

tgraf added the needs-backport/1.7 label Jul 15, 2020

tgraf mentioned this pull request Jul 15, 2020

[1.7] etcd/clustermesh related backports #12534

Merged

tgraf added backport-pending/1.7 and removed needs-backport/1.7 labels Jul 15, 2020

brb mentioned this pull request Jul 15, 2020

v1.8 backports 2020-07-15 #12536

Merged

brb added backport-pending/1.8 and removed needs-backport/1.8 labels Jul 15, 2020

joestringer added backport-done/1.7 and removed backport-pending/1.7 labels Jul 15, 2020

christarazi added backport-done/1.8 and removed backport-pending/1.8 labels Jul 20, 2020
