Skip to content

etcd: Fix several etcd related issues#12605

Merged
rolinh merged 7 commits intomasterfrom
pr/tgraf/etcd-bugs2
Jul 23, 2020
Merged

etcd: Fix several etcd related issues#12605
rolinh merged 7 commits intomasterfrom
pr/tgraf/etcd-bugs2

Conversation

@tgraf
Copy link
Copy Markdown
Contributor

@tgraf tgraf commented Jul 21, 2020

See individual commits. The most important fix is in the last commit.

@tgraf tgraf added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.7 labels Jul 21, 2020
@tgraf tgraf requested a review from a team July 21, 2020 12:55
@tgraf tgraf requested a review from a team as a code owner July 21, 2020 12:55
@tgraf tgraf force-pushed the pr/tgraf/etcd-bugs2 branch from 849d7f3 to e446a2a Compare July 21, 2020 13:03
@tgraf tgraf marked this pull request as draft July 21, 2020 13:17
Comment thread pkg/kvstore/etcd.go Outdated
Comment thread pkg/kvstore/etcd.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be return ctx.Err()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still returning nil after all changes unlike the previous case. The code is choosing to not return an error to not trigger a retry when the controller stops or the etcd client shuts down.

Comment thread pkg/kvstore/etcd.go Outdated
Comment thread pkg/k8s/cnp.go Outdated
@aanm aanm requested a review from a team July 21, 2020 13:22
@coveralls
Copy link
Copy Markdown

coveralls commented Jul 21, 2020

Coverage Status

Coverage increased (+0.007%) to 37.189% when pulling f8e5a5c on pr/tgraf/etcd-bugs2 into 6783d32 on master.

@tgraf tgraf force-pushed the pr/tgraf/etcd-bugs2 branch 2 times, most recently from cc462a6 to fb1bbeb Compare July 21, 2020 14:30
@tgraf
Copy link
Copy Markdown
Contributor Author

tgraf commented Jul 21, 2020

test-me-please

@tgraf tgraf force-pushed the pr/tgraf/etcd-bugs2 branch from fb1bbeb to d58c811 Compare July 22, 2020 13:42
@tgraf tgraf changed the title etcd: Fix etcd operations blocking on firstSession forever etcd: Fix several etcd related issues Jul 22, 2020
@tgraf tgraf force-pushed the pr/tgraf/etcd-bugs2 branch 3 times, most recently from a7fc092 to 7db5d5b Compare July 22, 2020 14:01
@tgraf
Copy link
Copy Markdown
Contributor Author

tgraf commented Jul 22, 2020

test-me-please

@tgraf tgraf marked this pull request as ready for review July 22, 2020 14:46
@tgraf tgraf requested a review from a team as a code owner July 22, 2020 14:46
Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question below on errChan vs. errorChan that needs to be checked.

The rest are misc. nits. Not sure whether we've collectively decided on errors.Wrap() but I know some parts of the code have started using it already.

Comment thread pkg/k8s/cnp.go Outdated
Comment thread pkg/kvstore/etcd.go Outdated
Comment thread pkg/kvstore/etcd.go Outdated
Comment thread pkg/contexthelpers/context_test.go Outdated
Comment thread pkg/kvstore/etcd.go Outdated
@tgraf
Copy link
Copy Markdown
Contributor Author

tgraf commented Jul 22, 2020

retest-4.19

Copy link
Copy Markdown
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM at the surface; not too familiar with this code so I'll rely on others' reviews on that. Reviewed changes from a semantic perspective mostly.

Comment thread pkg/kvstore/etcd.go Outdated
Comment thread pkg/contexthelpers/context_test.go Outdated
Comment thread pkg/kvstore/etcd.go Outdated
Comment thread pkg/kvstore/etcd.go Outdated
Comment thread pkg/kvstore/etcd.go Outdated
@maintainer-s-little-helper maintainer-s-little-helper Bot requested review from aanm and removed request for a team July 22, 2020 20:44
Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions asked & answered. Minor nits can be addressed now or as followup.

tgraf added 2 commits July 22, 2020 23:58
Same as ba9031b ("etcd: Ensure that session renewal controller exit") but
for the lock session.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Commit 19ad311 incorrectly changed the return value to nil instead of
returning an error.

Fixes: 19ad311 ("kvstore: Fix Watch() to return when client is closed or context is cancelled")
Signed-off-by: Thomas Graf <thomas@cilium.io>
tgraf added 5 commits July 23, 2020 13:21
The firstSession channel has been used to allow blocking until the initial
session has been established. However, because callers have not been handling
errors, it was not possible to close the channel if the etcd client was shut
down before the session was ever established and thus the channel was leaked.

This was even more problematic for etcd operations performed without a context
with a timeout as most operations currently block on firstSession or context.

Consolidate the errChan and firstSession as they effectively served the same
purpose. The only slight change is that the firstSession channel is closed
*after* the etcd version has been verified which is the right behavior anyway.

Having the error condition returned via the firstSession channel allows to
return it to owning controllers as well when renewing sessions for better
visibility.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
The context passed into NewSession() was supposed to enforce a timeout on the
NewSession() operation which is triggering a Grant() and KeepAlive()
instruction. However, the context passed into NewSession() will also be
associated with the resulting lease. As per etcd documentation, this will
result in:

> If the context is canceled before Close() completes, the session's lease will
> be abandoned and left to expire instead of being revoked.

Because of this, any session renewal triggering a new session would create a
session that is immediately closed again due to the context passed into
NewSession being cancelled when the controller run ends successfully. This
resulted in any renewed session to have an effective lifetime of 10
milliseconds before requiring renewal again.

Fixes: #12619
Fixes: 8524fca ("kvstore: Add session renew backoff")

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf force-pushed the pr/tgraf/etcd-bugs2 branch from 43670ff to f8e5a5c Compare July 23, 2020 11:22
@pchaigno
Copy link
Copy Markdown
Member

test-me-please

@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 23, 2020
@rolinh rolinh merged commit 0262854 into master Jul 23, 2020
@rolinh rolinh deleted the pr/tgraf/etcd-bugs2 branch July 23, 2020 14:06
@qmonnet
Copy link
Copy Markdown
Member

qmonnet commented Aug 3, 2020

Partial backport to v1.6 in #12748.

@qmonnet
Copy link
Copy Markdown
Member

qmonnet commented Aug 3, 2020

I'm labelling as backport-done/1.6. However, note that only 3 out of 7 commits have been backported in the aforementioned PR:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants