
Unavailability due to lease transfer onto replica waiting for Raft snapshot #68539

@erikgrinaker

Description

We've seen two customers hit issues with leases being transferred to nodes that are still waiting for Raft snapshots. If there are many snapshots queued up (as when a new node joins a cluster with lots of ranges), this can leave the range unavailable for a significant period.

@nvanbenschoten writes in an internal ticket:

The high-level overview of what went wrong from the customer's perspective was that during a period of concurrent upreplication to an incoming node and decommissioning of an outgoing node, they saw two periods of high SQL latency, each lasting for a few minutes. We found that during each of these outages, one range was effectively unavailable.

We poked around in the graphs and found that during these periods, we had a high frequency of NotLeaseHolderErrors. Our hunch was that we had a lease transfer to a replica that was behind on its Raft log, so we entered a redirection loop (not a hot one, since we back off exponentially) where the outgoing leaseholder redirected to the incoming leaseholder, and vice versa. Since the new leaseholder needs to apply the Raft log up to the lease transfer before it can take over as the leaseholder, we essentially had no leaseholder for a few minutes. We confirmed that this was the case and were able to grab logs from the problem range in each instance.
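The redirection loop described above can be sketched roughly as follows. This is a hypothetical illustration, not CockroachDB's actual DistSender code; `errNotLeaseHolder`, `sendWithBackoff`, and the backoff constants are invented for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotLeaseHolder stands in for kvpb.NotLeaseHolderError: the replica we
// contacted does not hold the lease and redirects us to another replica.
var errNotLeaseHolder = errors.New("not lease holder")

// sendWithBackoff is a hypothetical sketch of the retry loop described
// above: on a not-leaseholder redirect it retries, backing off
// exponentially so the loop never becomes hot. It returns the number of
// attempts made and the final error.
func sendWithBackoff(send func() error, maxAttempts int) (int, error) {
	backoff := 1 * time.Millisecond // illustrative base backoff
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := send(); !errors.Is(err, errNotLeaseHolder) {
			return attempt, err // success, or a non-redirect error
		}
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff between redirects
	}
	return maxAttempts, errNotLeaseHolder
}

func main() {
	// Simulate an incoming leaseholder that only catches up (applies the
	// lease transfer) after a few redirects.
	remaining := 3
	attempts, err := sendWithBackoff(func() error {
		if remaining > 0 {
			remaining--
			return errNotLeaseHolder
		}
		return nil
	}, 10)
	fmt.Println(attempts, err) // 4 <nil>
}
```

In the incident above, the loop itself behaved correctly; the problem was that the "catch up" step on the new leaseholder took minutes because it depended on a queued snapshot.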

The logs show that in each case, the lease was transferred to a replica shortly after it was added to the range and promoted from a LEARNER to a VOTER. The incoming leaseholder then waited a few minutes to apply a VIA_SNAPSHOT_QUEUE snapshot before it was able to start serving traffic. So presumably, the snapshot contained the lease transfer.

So what does this mean? It means that the outgoing leaseholder transferred its lease and then the leader (same replica? unclear) truncated the Raft log shortly afterward. This prevented the incoming leaseholder from applying the lease transfer through standard log application; instead, the leader needed to send it a Raft snapshot to catch it up. And this snapshot was delayed for a few minutes because there were so many other snapshots in the system at the time. We see queueing on the receiver (queued: 46.68s and queued: 47.36s), and there was presumably also queueing on the sender in the raftsnapshot queue.
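The safeguard that failed here can be pictured as a simple invariant: the leader should not truncate its Raft log past the lowest index any follower has durably matched, since a follower behind the truncation point can only be caught up via a snapshot. A minimal sketch of that guard, with invented names (this is not CockroachDB's actual raftLogQueue logic):

```go
package main

import "fmt"

// truncationIndex clamps a proposed Raft log truncation index so it never
// passes any follower's matched index. A follower whose match index is
// below the truncation point can no longer catch up through ordinary log
// replication and must receive a full Raft snapshot instead.
func truncationIndex(proposed uint64, followerMatch []uint64) uint64 {
	idx := proposed
	for _, m := range followerMatch {
		if m < idx {
			idx = m // hold truncation back for the slowest follower
		}
	}
	return idx
}

func main() {
	// The leader wants to truncate up to index 100, but an incoming
	// leaseholder has only matched index 40: truncation must stop at 40.
	fmt.Println(truncationIndex(100, []uint64{150, 120, 40})) // 40
}
```

In the incident, something let truncation proceed past the incoming leaseholder's position anyway, which is what point 1 below asks us to track down.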

Obviously, it's not a good idea to truncate a Raft log ahead of a leaseholder. It's especially bad to do so ahead of an incoming leaseholder that hasn't yet found out that it has the lease. How can we improve this? I think there are two ways to approach this and we should explore both:

  1. don't truncate our Raft log in front of an incoming leaseholder. We do our best not to truncate the Raft log in front of any follower, so something must have gone wrong here. We should page this back in and determine whether there's a missing safeguard. Is the INITIAL snapshot sent during upreplication subject to the same protection (see snapshotLogTruncationConstraints) against immediate log truncation that the raft snapshot queue is? It appears to be here.
  2. prioritize Raft snapshots sent to leaseholder replicas at the various levels that a snapshot can queue (i.e. at sender and receiver). We don't seem to do this anywhere, but it seems like a good idea. If a leaseholder needs a snapshot, it needs it ASAP because the range will be unavailable (to writes in the general case, and to reads+writes in cases like this) until it applies the snapshot.
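Approach 2 could be as simple as a stable reordering of a snapshot queue that floats leaseholder-bound snapshots to the front while preserving FIFO order among peers. A hypothetical sketch (the type and field names are invented for illustration, not CockroachDB's actual queue):

```go
package main

import (
	"fmt"
	"sort"
)

// snapshotRequest is a hypothetical queued snapshot; isLeaseholder marks a
// recipient that holds (or is about to hold) the range lease.
type snapshotRequest struct {
	rangeID       int
	isLeaseholder bool
}

// prioritize reorders the queue so snapshots destined for leaseholder
// replicas are sent first, since the range is unavailable until such a
// snapshot applies. The stable sort keeps FIFO order within each class.
func prioritize(queue []snapshotRequest) []snapshotRequest {
	out := append([]snapshotRequest(nil), queue...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].isLeaseholder && !out[j].isLeaseholder
	})
	return out
}

func main() {
	q := []snapshotRequest{{1, false}, {2, false}, {3, true}, {4, false}}
	var ids []int
	for _, r := range prioritize(q) {
		ids = append(ids, r.rangeID)
	}
	fmt.Println(ids) // [3 1 2 4]
}
```

A real implementation would need this ordering at every queueing point the text mentions (the sender's raftsnapshot queue and the receiver's admission queue), not just one.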

/cc @cockroachdb/kv

gz#9425

Metadata


Assignees

No one assigned

    Labels

    A-kv-replication: Relating to Raft, consensus, and coordination.
    C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.
    T-kv: KV Team
