Skip to content

rpc: circuit breaker livelock if 2nd check comes too fast after the 1st #68419

@knz

Description

@knz

investigated by @nvanbenschoten and @erikgrinaker.

When a network connection drops, we also break its associated circuit breaker. For it to recover, to breaker needs to enter a "half open" state, where it lets an occasional (once per second) request through to try to re-establish the connection. If that succeeds, the breaker will be moved to closed and considered recovered. What we found is that the Raft transport was checking the circuit breaker associated with this (destination, rpc class) pair twice in order to establish a connection:

First, before creating a new RaftTransport queue:

if !t.dialer.GetCircuitBreaker(toNodeID, class).Ready() {

Second, when the RaftTransport queue started up and dialed the destination node:

if breaker != nil && !breaker.Ready() {
from
conn, err := t.dialer.Dial(ctx, toNodeID, class)

So the theory is that once a second, a request will make it through the first call to Breaker.Ready. However, when it does, it launches a new RaftTransport queue that immediately checks the breaker again. And since we haven't waited a second between calls to Breaker.Ready, this second call will always return false. So even in the cases where we pass the first breaker check, we always immediately fail the second. And since we're not passing the second check and successfully dialing, we never mark the breaker as closed here. Instead, we shut down the RaftTransport queue and start over again.

This is a fascinating pathology. In some sense, breakers are not reentrant. This patch to https://github.com/cockroachdb/circuitbreaker demonstrates that:

func TestTrippableBreakerState(t *testing.T) {
	c := clock.NewMock()
	cb := NewBreaker()
	cb.Clock = c
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready")
	}
	cb.Trip()
	if cb.Ready() {
		t.Fatal("expected breaker to not be ready")
	}
	c.Add(cb.nextBackOff + 1)
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready after reset timeout")
	}
+	if !cb.Ready() {
+		// !!! This fails !!!
+		// There's no breaker affinity.
+		t.Fatal("expected breaker to be ready after reset timeout")
+	}
	cb.Fail(nil)
	c.Add(cb.nextBackOff + 1)
	if !cb.Ready() {
		t.Fatal("expected breaker to be ready after reset timeout, post failure")
	}
}

So any code that requires two consecutive calls to a breaker's Ready() function in order to reset the breaker is bound to be starved forever.

It's not yet clear what the best fix is for this. One solution is to expose an option from nodedialer.Dialer.Dial to skip the breaker check. Another is to do something more clever around breaker affinity.

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-kv-serverRelating to the KV-level RPC serverA-server-networkingPertains to network addressing,routing,initializationC-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.T-server-and-securityDB Server & Security

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions