Skip to content

Add heartbeat to etcd quorum check#12453

Merged
tgraf merged 11 commits intomasterfrom
pr/tgraf/etcd-heartbeat
Jul 15, 2020
Merged

Add heartbeat to etcd quorum check#12453
tgraf merged 11 commits intomasterfrom
pr/tgraf/etcd-heartbeat

Conversation

@tgraf
Copy link
Copy Markdown
Contributor

@tgraf tgraf commented Jul 8, 2020

Depends on #12427

Adds a heartbeat written to a key in an interval (1min) by the operator. Each etcd client installs a watcher to watch the heartbeat key. When the heartbeat is not updated in 2*interval, the quorum check will start failing:

KVStore:                Ok   etcd: 1/1 connected, lease-ID=29c6732d5d580cb5, lock lease-ID=29c6732d5d580cb7, has-quorum=2m2.778966915s since last heartbeat update has been received, consecutive-errors=1: https://192.168.33.11:2379 - 3.4.9 (Leader)

When enough consecutive errors have accumulated, the kvstore subsystem will start failing:

KVStore:                Failure   Err: quorum check failed 8 times in a row: 4m28.446600949s since last heartbeat update has been received

@tgraf tgraf added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jul 8, 2020
@tgraf tgraf requested review from a team as code owners July 8, 2020 08:01
@tgraf tgraf marked this pull request as draft July 8, 2020 08:01
@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 75d39f7 to ff1cae1 Compare July 8, 2020 08:03
@coveralls
Copy link
Copy Markdown

coveralls commented Jul 8, 2020

Coverage Status

Coverage increased (+0.03%) to 37.028% when pulling 0aa046e on pr/tgraf/etcd-heartbeat into 38d9bf5 on master.

@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch 2 times, most recently from bd6dbd1 to 37e76cb Compare July 8, 2020 16:05
@tgraf tgraf added area/documentation Impacts the documentation, including textual changes, sphinx, or other doc generation code. needs-backport/1.8 labels Jul 8, 2020
@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch 2 times, most recently from 29cfbbd to 60a9692 Compare July 10, 2020 08:02
@tgraf tgraf marked this pull request as ready for review July 10, 2020 08:02
@tgraf tgraf requested review from a team as code owners July 10, 2020 08:02
@tgraf
Copy link
Copy Markdown
Contributor Author

tgraf commented Jul 10, 2020

test-me-please

Comment thread pkg/kvstore/etcd.go Outdated
Comment thread Documentation/troubleshooting.rst Outdated
Comment thread Documentation/troubleshooting.rst Outdated
Comment thread Documentation/troubleshooting.rst Outdated
Comment thread pkg/clustermesh/remote_cluster.go Outdated
Comment thread pkg/kvstore/backend.go Outdated
Comment thread pkg/clustermesh/remote_cluster.go Outdated
Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics side looks good. I have a few questions around introspection of the new state and a few around docs enhancements.

AFAIK in the event of a complete etcd cluster loss, Cilium still retains a 15-minute timeout on startup where it attempts to establish a healthy connection to an etcd cluster with quorum before exiting. This aspect is not really covered in this PR given that it's more focused on partial outages, but just wondering whether you had been looking at "complete etcd outage" type scenarios with the new logic in mind? At a glance I think it should be orthogonal, no specific concerns come to mind about how that case might be impacted by this new logic.

Comment thread operator/kvstore_watchdog.go Outdated
Comment thread operator/kvstore_watchdog.go
Comment thread pkg/kvstore/kvstore.go
Comment thread pkg/kvstore/etcd.go
Comment thread pkg/clustermesh/remote_cluster.go
Comment thread Documentation/troubleshooting.rst
Comment thread Documentation/troubleshooting.rst Outdated
Comment thread Documentation/troubleshooting.rst
Comment thread Documentation/troubleshooting.rst
@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch 2 times, most recently from 1e8cf0e to e08f7a6 Compare July 13, 2020 14:31
@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from e08f7a6 to 1cf55de Compare July 13, 2020 15:29
Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more main note on the controller change and visibility below, the rest is minor nits that could be addressed here or as a follow up.

Comment thread Documentation/kvstore.rst
Comment thread Documentation/troubleshooting.rst Outdated
Comment thread Documentation/troubleshooting.rst Outdated
Comment thread Documentation/troubleshooting.rst
Comment thread Documentation/troubleshooting.rst
Comment thread pkg/clustermesh/remote_cluster.go Outdated
@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 1cf55de to 694ff58 Compare July 14, 2020 12:35
@tgraf tgraf requested a review from a team as a code owner July 14, 2020 12:35
@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 694ff58 to 1648301 Compare July 14, 2020 12:51
@tgraf
Copy link
Copy Markdown
Contributor Author

tgraf commented Jul 14, 2020

test-me-please

@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 1648301 to f99a67d Compare July 14, 2020 16:36
tgraf added 10 commits July 15, 2020 10:07
Signed-off-by: Thomas Graf <thomas@cilium.io>
Adds a heartbeat written to a key in an interval (1min) by the operator.
Each etcd client installs a watcher to watch the heartbeat key. When the
heartbeat is not updated in 2*interval, the quorum check will start
failing:

```
KVStore:                Ok   etcd: 1/1 connected, lease-ID=29c6732d5d580cb5, lock lease-ID=29c6732d5d580cb7, has-quorum=2m2.778966915s since last heartbeat update has been received, consecutive-errors=1: https://192.168.33.11:2379 - 3.4.9 (Leader)
```

When enough consecutive errors have accumulated, the kvstore subsystem
will start failing:
```
KVStore:                Failure   Err: quorum check failed 8 times in a row: 4m28.446600949s since last heartbeat update has been received
```

Signed-off-by: Thomas Graf <thomas@cilium.io>
Clustermesh is never performing write operations so the lock-based
quorum check is only adding contention to remote etcds.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Watch the status of the etcd conection and restart the connection if
quorum loss is detected. Given that lock acquisition is disabled for
clustermesh, the quorum check equals to the ability to receive updates
on the heartbat key.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
"Backend not initialized" does not mean much to users.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Reported-by: @sayboras
Signed-off-by: Thomas Graf <thomas@cilium.io>
The initial status message of the etcd subsystem is:

```
KVStore:                Ok   No connection to etcd
```

This can be misleading as it does not indicate whether the etcd session
was ever established or not. Clarify this:

```
KVStore:                Ok   Waiting for initial connection to be established
```

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
When releasing the etcd connection, sessions are attempted to be
revoked. In the event of an unhealthy etcd connection, the operation
will fail and time out. This operation will take a long time though.
Instead of blocking, release the resources in the background.

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch 2 times, most recently from f38e585 to 2e1e97a Compare July 15, 2020 11:32
Good condition:
```
   cluster2: ready, 4 nodes, 3 identities, 1 services, 0 failures (last: never)
```

Bad condition:
```
   cluster2: not-ready, 0 nodes, 0 identities, 0 services, 1 failures (last: 9s ago)
```

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf force-pushed the pr/tgraf/etcd-heartbeat branch from 2e1e97a to 0aa046e Compare July 15, 2020 12:13
@tgraf
Copy link
Copy Markdown
Contributor Author

tgraf commented Jul 15, 2020

test-me-please

@joestringer
Copy link
Copy Markdown
Member

@tgraf by default, how long would this take from an etcd outage (or cilium-operator node becoming unavailable) before Cilium agents begin restarting?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/documentation Impacts the documentation, including textual changes, sphinx, or other doc generation code. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants