Skip to content

Improve etcd fail-over scenarios#12427

Merged
tgraf merged 4 commits intomasterfrom
pr/tgraf/etcd-failover
Jul 8, 2020
Merged

Improve etcd fail-over scenarios#12427
tgraf merged 4 commits intomasterfrom
pr/tgraf/etcd-failover

Conversation

@tgraf
Copy link
Copy Markdown
Contributor

@tgraf tgraf commented Jul 6, 2020

The main change introduced by this PR Is to fail Cilium status if etcd quorum cannot be detected for 3 consecutive status intervals. By triggering a restart of the agent, the situation can hopefully be resolved. In addition, the endpoint list is shuffled on bootstrap to further improve the chances to resolve the issue.

tgraf added 2 commits July 6, 2020 14:47
Instead of just a boolean, return the reason of failure in
isConnectedAndHasQuorum()

Signed-off-by: Thomas Graf <thomas@cilium.io>
In general, the etcd client library should fail over to a healthy etcd
endpoint and quorum errors should automatically resolve. If this does
not happen for some reason, report unhealthy status to trigger a restart
of the agent.

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jul 6, 2020
@tgraf tgraf requested review from a team as code owners July 6, 2020 12:48
@tgraf tgraf marked this pull request as draft July 6, 2020 12:48
@tgraf tgraf force-pushed the pr/tgraf/etcd-failover branch from 965f374 to 0c3c493 Compare July 6, 2020 12:53
@tgraf
Copy link
Copy Markdown
Contributor Author

tgraf commented Jul 6, 2020

test-me-please

@tgraf tgraf marked this pull request as ready for review July 6, 2020 13:53
Copy link
Copy Markdown
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern on the tests, looks good to me otherwise.

Comment thread pkg/kvstore/etcd_test.go Outdated
Comment thread pkg/rand/safe_rand_test.go Outdated
tgraf added 2 commits July 8, 2020 09:58
Shuffling the list of etcd endpoints on each bootstrap has two positive
effects:
 * Agents in the cluster will connect to different etcd members.
 * In case etcd fails to fail-over to due to a bug, a Cilium agent
   restart has a chance to fail-over to a healthy etcd member.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf force-pushed the pr/tgraf/etcd-failover branch from 0c3c493 to 6316e8b Compare July 8, 2020 07:59
@tgraf tgraf requested a review from qmonnet July 8, 2020 07:59
@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.002%) to 36.933% when pulling 6316e8b on pr/tgraf/etcd-failover into e8d7307 on master.

Copy link
Copy Markdown
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@tgraf
Copy link
Copy Markdown
Contributor Author

tgraf commented Jul 8, 2020

test-me-please

@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 8, 2020
@tgraf tgraf merged commit 53c17cd into master Jul 8, 2020
@tgraf tgraf deleted the pr/tgraf/etcd-failover branch July 8, 2020 14:47
@brb brb mentioned this pull request Jul 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants