-
Notifications
You must be signed in to change notification settings - Fork 4.1k
rpc: circuit breaker livelock if 2nd check comes too fast after the 1st #68419
Description
investigated by @nvanbenschoten and @erikgrinaker.
When a network connection drops, we also break its associated circuit breaker. For it to recover, to breaker needs to enter a "half open" state, where it lets an occasional (once per second) request through to try to re-establish the connection. If that succeeds, the breaker will be moved to closed and considered recovered. What we found is that the Raft transport was checking the circuit breaker associated with this (destination, rpc class) pair twice in order to establish a connection:
First, before creating a new RaftTransport queue:
cockroach/pkg/kv/kvserver/raft_transport.go
Line 553 in 9f15510
| if !t.dialer.GetCircuitBreaker(toNodeID, class).Ready() { |
Second, when the RaftTransport queue started up and dialed the destination node:
cockroach/pkg/rpc/nodedialer/nodedialer.go
Line 154 in 9f15510
| if breaker != nil && !breaker.Ready() { |
cockroach/pkg/kv/kvserver/raft_transport.go
Line 621 in e1d01d0
| conn, err := t.dialer.Dial(ctx, toNodeID, class) |
So the theory is that once a second, a request will make it through the first call to Breaker.Ready. However, when it does, it launches a new RaftTransport queue that immediately checks the breaker again. And since we haven't waited a second between calls to Breaker.Ready, this second call will always return false. So even in the cases where we pass the first breaker check, we always immediately fail the second. And since we're not passing the second check and successfully dialing, we never mark the breaker as closed here. Instead, we shut down the RaftTransport queue and start over again.
This is a fascinating pathology. In some sense, breakers are not reentrant. This patch to https://github.com/cockroachdb/circuitbreaker demonstrates that:
func TestTrippableBreakerState(t *testing.T) {
c := clock.NewMock()
cb := NewBreaker()
cb.Clock = c
if !cb.Ready() {
t.Fatal("expected breaker to be ready")
}
cb.Trip()
if cb.Ready() {
t.Fatal("expected breaker to not be ready")
}
c.Add(cb.nextBackOff + 1)
if !cb.Ready() {
t.Fatal("expected breaker to be ready after reset timeout")
}
+ if !cb.Ready() {
+ // !!! This fails !!!
+ // There's no breaker affinity.
+ t.Fatal("expected breaker to be ready after reset timeout")
+ }
cb.Fail(nil)
c.Add(cb.nextBackOff + 1)
if !cb.Ready() {
t.Fatal("expected breaker to be ready after reset timeout, post failure")
}
}
So any code that requires two consecutive calls to a breaker's Ready() function in order to reset the breaker is bound to be starved forever.
It's not yet clear what the best fix is for this. One solution is to expose an option from nodedialer.Dialer.Dial to skip the breaker check. Another is to do something more clever around breaker affinity.