rpc: circuit breaker livelock if 2nd check comes too fast after the 1st

investigated by @nvanbenschoten and @erikgrinaker. 

When a network connection drops, we also break its associated circuit breaker. For it to recover, to breaker needs to enter a "half open" state, where it lets an occasional (once per second) request through to try to re-establish the connection. If that succeeds, the breaker will be moved to closed and considered recovered. What we found is that the Raft transport was checking the circuit breaker associated with this `(destination, rpc class)` pair twice in order to establish a connection:

First, before creating a new RaftTransport queue: https://github.com/cockroachdb/cockroach/blob/9f1551030d68352d5fca5b3184c7b2edab69b1c7/pkg/kv/kvserver/raft_transport.go#L553

Second, when the RaftTransport queue started up and dialed the destination node: https://github.com/cockroachdb/cockroach/blob/9f1551030d68352d5fca5b3184c7b2edab69b1c7/pkg/rpc/nodedialer/nodedialer.go#L154 from https://github.com/cockroachdb/cockroach/blob/e1d01d047e833da659ee0c2bd570e5c4143c784e/pkg/kv/kvserver/raft_transport.go#L621

So the theory is that once a second, a request will make it through the first call to `Breaker.Ready`. However, when it does, it launches a new `RaftTransport` queue that immediately checks the breaker again. And since we haven't waited a second between calls to `Breaker.Ready`, this second call will always return false. So even in the cases where we pass the first breaker check, we always immediately fail the second. And since we're not passing the second check and successfully dialing, we never mark the breaker as closed [here](https://github.com/cockroachdb/cockroach/blob/9f1551030d68352d5fca5b3184c7b2edab69b1c7/pkg/rpc/nodedialer/nodedialer.go#L194). Instead, we shut down the `RaftTransport` queue and start over again.

This is a fascinating pathology. In some sense, breakers are not reentrant. This patch to https://github.com/cockroachdb/circuitbreaker demonstrates that:
```
func TestTrippableBreakerState(t *testing.T) {
	c := clock.NewMock()
	cb := NewBreaker()
	cb.Clock = c
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready")
	}
	cb.Trip()
	if cb.Ready() {
		t.Fatal("expected breaker to not be ready")
	}
	c.Add(cb.nextBackOff + 1)
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready after reset timeout")
	}
+	if !cb.Ready() {
+		// !!! This fails !!!
+		// There's no breaker affinity.
+		t.Fatal("expected breaker to be ready after reset timeout")
+	}
	cb.Fail(nil)
	c.Add(cb.nextBackOff + 1)
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready after reset timeout, post failure")
	}
}
```

So any code that requires two consecutive calls to a breaker's `Ready()` function in order to reset the breaker is bound to be starved forever.

It's not yet clear what the best fix is for this. One solution is to expose an option from `nodedialer.Dialer.Dial` to skip the breaker check. Another is to do something more clever around breaker affinity.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rpc: circuit breaker livelock if 2nd check comes too fast after the 1st #68419

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

rpc: circuit breaker livelock if 2nd check comes too fast after the 1st #68419

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions