Fix bug where etcd session renew would block indefinitely, causing endpoint provision to fail#12292
Merged
aanm merged 2 commits intocilium:masterfrom Jun 26, 2020
Merged
Conversation
A user reported an issue where Cilium would become unable to provision
new endpoints, with the following messages recurring in the agent logs:
[T=0s] ... level=info msg="New endpoint"
[T=0s] ... level=info msg="Resolving identity labels (blocking)
[T=43s]... level=error msg="Forcefully unlocking local kvstore lock"
[T=89s]... level=info msg="Removed endpoint"
Upon further investigation of "gops stack" output from a sysdump in the
environment, the following stack trace was observed on Cilium 1.6.8:
goroutine 589 [select, 14 minutes]:
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00214d500, 0x8, 0xc000740d10)
/go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:245 +0xcc
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
/go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:256
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc000740d10, 0x32ec340, 0xc001caf890, 0x0, 0xc001e52650, 0x0)
/go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:850 +0x70a
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc000740d10, 0x0, 0x0)
/go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:715 +0x46
...
github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb.(*leaseClient).LeaseGrant(0xc00042a238, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0xc001fbaea0, 0x4
/go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb/rpc.pb.go:3792 +0xd3
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*retryLeaseClient).LeaseGrant(0xc000ec0510, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0x53bcba0, 0x3, 0x3, 0x1319
/go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/retry.go:144 +0xeb
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*lessor).Grant(0xc0006a8640, 0x38e6ae0, 0xc00005cb40, 0x19, 0x0, 0xc000abd680, 0xc000abd708)
/go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/lease.go:216 +0x98
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency.NewSession(0xc000c15860, 0xc001e52f50, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency/session.go:46 +0x308
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).renewLockSession(0xc000b00790, 0xc0227a56a6, 0x14dffe3633def)
/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:467 +0xde
github.com/cilium/cilium/pkg/kvstore.connectEtcdClient.func4(0x38e6ae0, 0xc00005d640, 0x53bd3e0, 0x2)
/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:583 +0x2a
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc000c06ff0)
/go/src/github.com/cilium/cilium/pkg/controller/controller.go:203 +0x9e1
created by github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController
/go/src/github.com/cilium/cilium/pkg/controller/manager.go:112 +0xae0
In particular, it's useful to note:
* The goroutine has been blocking for 14 minutes attempting to make progress
on re-establishing the etcd session.
* This path is hit by a controller responsible for handling etcd session
renewal in the event of etcd connectivity issues. It would not be triggered
by initial etcd connectivity failures, only when etcd connectivity was
successfully established then subsequently lost.
* NewSession() attempts to pick one of the configured etcd peers to connect to,
and will block until connectivity to that peer is restored. It will not by
itself back off and re-attempt connectivity to another node in the etcd
cluster.
* This path is in the critical section for writelock of `etcdClient.RWMutex`
so will block out all etcd client reads.
The above is consistent with the agent logs: Effectively, a new endpoint is
scheduled and Cilium attempts to allocate an identity for it. This process
requires the kvstore and will block on the lock. Given that it makes no
progress, first we see the lock liveness check kick to forcefully unlock
the path lock (`Forcefully unlocking local kvstore lock`), then later we see
cilium-cni / kubelet give up on creating the endpoint and remove it again.
This patch fixes the issue by introducing a timeout when attempting to renew
the session. Each time `NewSession()` is called, the etcd client library will
pick an etcd peer to connect to. The intent behind the timeout is to provide
a way for Cilium to detect when it is attempting to connect to an unhealthy
peer, and try to reconnect to one of the other peers in the etcd cluster in
the hopes that that other peer will be healthy.
In the event that there is an etcd connectivity outage where one of
three etcd peers is unhealthy, we expect that the remaining two can
retain chorum and continue to operate despite the outage of the third
peer. In this case if Cilium was attempting to connect to the third
(unhealthy) peer, Cilium would previously block indefinitely. With this
patch Cilium will time out after statusCheckTimeout (10s) and
re-establish a session to the remaining etcd peers, thereby unblocking
subsequent endpoint provisioning.
Signed-off-by: Joe Stringer <joe@cilium.io>
Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
When renewSession() attempts to re-establish a session with an etcd peer, it may block for some period of time, currently 10s. At the moment, this timeout is added to the LockPath() timeout because the LockPath() context timeout is created *after* acquiring the lock. Move the context creation before the lock acquisition to allow these timers to overlap to provide more accurate bounding on the timeouts here. Signed-off-by: Joe Stringer <joe@cilium.io>
aanm
approved these changes
Jun 25, 2020
ianvernon
reviewed
Jun 25, 2020
ianvernon
approved these changes
Jun 25, 2020
Member
ianvernon
left a comment
There was a problem hiding this comment.
Thanks for providing all of the context in this PR :)
Member
Author
|
test-me-please |
joestringer
commented
Jun 25, 2020
| // integrate better with any code on the etcd | ||
| // client side. | ||
| parentCtx := ec.client.Ctx() | ||
| timedCtx, cancel := context.WithTimeout(parentCtx, statusCheckTimeout) |
Member
Author
There was a problem hiding this comment.
We could potentially consider using something like ClusterSizeDependantInterval() to scale this based on the cluster size.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A user reported an issue where after an etcd cluster outage, Cilium would become unable to provision new endpoints, with the following messages recurring in the agent logs:
Upon further investigation of "gops stack" output from a sysdump in the
environment, the following stack trace was observed on Cilium 1.6.8:
In particular, it's useful to note:
on re-establishing the etcd session.
renewal in the event of etcd connectivity issues. It would not be triggered
by initial etcd connectivity failures, only when etcd connectivity was
successfully established then subsequently lost.
and will block until connectivity to that peer is restored. It will not by
itself back off and re-attempt connectivity to another node in the etcd
cluster.
etcdClient.RWMutexso will block out all etcd client reads.
The above is consistent with the agent logs: Effectively, a new endpoint is
scheduled and Cilium attempts to allocate an identity for it. This process
requires the kvstore and will block on the lock. Given that it makes no
progress, first we see the lock liveness check kick to forcefully unlock
the path lock (
Forcefully unlocking local kvstore lock), then later we seecilium-cni / kubelet give up on creating the endpoint and remove it again.
The gops output also contained multiple goroutines blocked on this
lock with a call frame originating in an endpoint create AP call.
This patch fixes the issue by introducing a timeout when attempting to renew
the session. Each time
NewSession()is called, the etcd client library willpick an etcd peer to connect to. The intent behind the timeout is to provide
a way for Cilium to detect when it is attempting to connect to an unhealthy
peer, and try to reconnect to one of the other peers in the etcd cluster in
the hopes that that other peer will be healthy.
In the event that there is an etcd connectivity outage where one of
three etcd peers is unhealthy, we expect that the remaining two can
retain chorum and continue to operate despite the outage of the third
peer. In this case if Cilium was attempting to connect to the third
(unhealthy) peer, Cilium would previously block indefinitely. With this
patch Cilium will time out after statusCheckTimeout (10s) and
re-establish a session to the remaining etcd peers, thereby unblocking
subsequent endpoint provisioning.
This PR also aligns etcd path locking timeouts with potential session renew timeouts from the above logic.