kv: understand, prevent, and recover quickly from a leaseholder needing a Raft snapshot #81561
Description
A common form of prolonged availability loss arises when a Range's leaseholder is in need of a Raft snapshot. During these situations, the Range's leaseholder (as defined by the replicated state machine) is not caught up far enough on its log to recognize that it holds the lease. As a result, every KV request to the leaseholder is met with a redirection to an earlier leaseholder, which in turn redirects back to the replica in need of a snapshot. However, even though the leaseholder does not recognize itself as such, it continues to heartbeat its liveness record, indirectly extending its lease so that it does not expire. The consequence of this situation is availability loss on the range until the leaseholder replica receives a snapshot.
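The redirect cycle described above can be sketched with a toy model. All names here (`replica`, `route`, etc.) are hypothetical illustrations, not CockroachDB's actual request-routing code:

```go
package main

import "fmt"

// Toy model of the failure mode: replica 1 holds the lease per the
// replicated state machine but is behind on its log, so it redirects to
// the prior leaseholder 2, which redirects back to 1.
type replica struct {
	id         int
	caughtUp   bool // has applied the log entry installing its lease
	believedLH int  // who this replica thinks the leaseholder is
}

// route returns the replica a request settles on, or -1 if the
// redirects cycle without ever reaching a replica that recognizes
// itself as the leaseholder.
func route(replicas map[int]*replica, start, maxHops int) int {
	cur := start
	for hop := 0; hop < maxHops; hop++ {
		r := replicas[cur]
		if r.caughtUp && r.believedLH == r.id {
			return r.id // replica recognizes itself as leaseholder
		}
		cur = r.believedLH // NotLeaseHolderError: follow the redirect hint
	}
	return -1 // availability loss: requests bounce until a snapshot arrives
}

func main() {
	replicas := map[int]*replica{
		1: {id: 1, caughtUp: false, believedLH: 2}, // real leaseholder, needs snapshot
		2: {id: 2, caughtUp: true, believedLH: 1},  // old leaseholder, redirects back
	}
	fmt.Println(route(replicas, 2, 10)) // prints -1
}
```

Because the behind replica keeps extending its lease via liveness heartbeats, the cycle never resolves on its own; only a snapshot breaks it.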
Understand
How does this situation happen? There are two ways that a replica can become the leaseholder. The first is through a non-cooperative RequestLease, where a replica acquires a lease for a range that does not currently have a leaseholder. The second is through a cooperative LeaseTransfer, where the current leaseholder passes the lease to another replica in its range.
RequestLease requests work by proposing a request through Raft that, when committed, performs a compare-and-swap on the previous, expired lease. When issued by a replica that is behind on its log, this lease acquisition request is bound to fail (because the lease it is based on is stale), but until the behind replica finds out that it failed, local requests are blocked. In the past (#37906), this was observed to cause outages, as a replica that was behind on its log could propose a RequestLease and then block until it heard the result of the request, which required a snapshot. We've since resolved these issues in #55148, which prevented follower replicas from proposing RequestLease requests.
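The compare-and-swap semantics can be illustrated with a simplified sketch. The types and function here are hypothetical stand-ins for the real lease-request evaluation logic:

```go
package main

import "fmt"

// Simplified sketch of RequestLease's compare-and-swap: the proposal
// carries the lease it was evaluated against, and application succeeds
// only if that lease still matches the current replicated lease. A
// replica that is behind on its log bases the request on a stale lease,
// so the CAS fails once the proposal applies.
type lease struct {
	sequence int
	holder   int
}

// applyRequestLease applies a RequestLease proposal against the current
// replicated lease state, returning the resulting lease and whether the
// swap succeeded.
func applyRequestLease(current, basedOn, proposed lease) (lease, bool) {
	if current != basedOn {
		return current, false // stale basis: reject, keep current lease
	}
	return proposed, true
}

func main() {
	current := lease{sequence: 5, holder: 2}  // lease as replicated
	stale := lease{sequence: 4, holder: 2}    // what the behind replica last applied
	proposed := lease{sequence: 6, holder: 1} // the acquisition attempt
	_, ok := applyRequestLease(current, stale, proposed)
	fmt.Println(ok) // prints false
}
```

The danger described in #37906 was never that the stale acquisition would succeed; it was that the behind replica blocked local requests until it learned of the rejection, which required the very snapshot it was waiting for.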
An important invariant in Raft is that a leader at the time of election is never behind on its log and in need of a snapshot. However, it is possible for a replica to think that it is the leader after it has already been replaced. This leaves a small margin where a leader could propose a RequestLease after it has been voted out. This would appear to be a problem, but in practice it is not, because the locking in propBuf.FlushLockedWithRaftGroup ensures that the raftGroup is never stepped between the leadership check and the proposal. This means that the raftGroup will always try to propose the RequestLease itself instead of forwarding it to the new leader. In such cases, the proposal must be rejected by the outgoing leader's peers. So the protection in #55148 is sound, and RequestLease should never create the leaseholder-in-need-of-snapshot scenario.
LeaseTransfer requests are more like normal Raft log proposals. They are proposed "under" a current lease with a lease sequence and max lease applied index. This means that they can only be proposed by the current leaseholder and will be rejected if committed out of order (e.g. after the leaseholder has been replaced). Below Raft, LeaseTransfer requests can target any replica to assume the lease. However, the outgoing leaseholder contains a series of checks in its allocator and during command evaluation that attempt to ensure that this is a "good" lease target.
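The below-Raft "proposed under a lease" rejection described above can be sketched as follows. The names (`proposal`, `checkProposal`) and the exact comparison are a simplified model, not the real apply-time code:

```go
package main

import "fmt"

// Sketch of the checks a proposal made "under" a lease must pass at
// application time: it is rejected if the lease sequence has changed
// (the leaseholder was replaced) or if it would apply out of order
// relative to its maximum lease applied index.
type proposal struct {
	leaseSeq      int // lease sequence the proposal was evaluated under
	maxLeaseIndex int // latest lease applied index at which it may legally apply
}

// checkProposal reports whether a proposal may apply given the current
// replicated lease sequence and lease applied index.
func checkProposal(p proposal, curLeaseSeq, leaseAppliedIndex int) bool {
	if p.leaseSeq != curLeaseSeq {
		return false // leaseholder replaced since evaluation
	}
	if leaseAppliedIndex >= p.maxLeaseIndex {
		return false // would apply out of order
	}
	return true
}

func main() {
	p := proposal{leaseSeq: 7, maxLeaseIndex: 40}
	fmt.Println(checkProposal(p, 7, 39)) // true: in order, same lease
	fmt.Println(checkProposal(p, 8, 39)) // false: leaseholder replaced
	fmt.Println(checkProposal(p, 7, 40)) // false: out of order
}
```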
These allocator and evaluation-time checks are flawed and do not guarantee the protection we would like, for three reasons:
- they are incomplete. The check that the sender of the lease transfer is the leader takes place on some code paths, but not on others. This improved in a6a8d5c, but there are still gaps in this protection. For instance, an `AdminTransferLeaseRequest` (used by some higher-level rebalancing code paths) bypasses this protection.
- they are racy. While there are checks that consult the local raft status to determine whether the replica is the raft leader and whether the lease transfer target can catch up from this leader's log, this information may be stale by the time the lease transfer is proposed. Raft leadership may have moved by this point. Similarly, the raft log may have been truncated.
- they rely on a flawed assumption that a replica that can be caught up from the log by one raft leader could be caught up from the log by another raft leader, should leadership change. This has not been true since #35701 (storage: don't send historical Raft log with snapshots).
A third potential avenue that could create this situation is if the raft log is truncated while already in a split leaseholder/leader situation. However, this is not possible in practice, as the leaseholder is always the replica that decides on the log truncation index, so it will never truncate above its own raft log. For added protection, we also disable Raft log truncation by leaseholders that are not also leaders.
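The truncation guard described above can be sketched as a single predicate. The helper name and signature are hypothetical; the real decision is made by the Raft log queue:

```go
package main

import "fmt"

// Sketch: a replica only proposes truncation of its own range's log,
// and (as added protection) only while it is both leaseholder and Raft
// leader, so it can never truncate above an index it has not itself
// applied.
func canTruncate(isLeaseholder, isLeader bool, truncIndex, ownAppliedIndex int) bool {
	if !isLeaseholder || !isLeader {
		return false // split leaseholder/leader: truncation disabled
	}
	return truncIndex <= ownAppliedIndex // never truncate above own log
}

func main() {
	fmt.Println(canTruncate(true, true, 100, 120))  // prints true
	fmt.Println(canTruncate(true, false, 100, 120)) // prints false
}
```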
Out of the three potential cases that could lead to a leaseholder needing a snapshot, two are guaranteed today to not create such a situation. However, lease transfers have insufficient protection against creating this situation. We should fix them.
Prevent
To prevent this situation, we need firmer guarantees during lease transfers. #55148 provides a blueprint through which to think about protections that are both complete and non-racy. The protection added in #55148 is built into the raft propBuf and runs within the Raft state machine loop. This ensures that it applies to all Raft proposals (and re-proposals) and has an accurate understanding of Raft leadership (or else the proposal will be rejected).
We should do something similar for lease transfers. We should add a check to propBuf.FlushLockedWithRaftGroup that only allows the Raft leader to propose lease transfers, and only to replicas that are 1) in StateReplicate and 2) have a Match index that is greater than the leaseholder's understanding of the Raft log's truncated index. Latching on the leaseholder will ensure that log truncation and lease transfers are properly synchronized, so that any log truncation request immediately before a lease transfer is accounted for in the protection.
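The proposed flush-time check can be sketched as below. All names here (`progress`, `allowLeaseTransfer`) are hypothetical; the real hook would live in propBuf.FlushLockedWithRaftGroup and consult the Raft group's follower-progress tracking:

```go
package main

import "fmt"

// progress models the leader's view of a follower, mirroring the two
// facts the proposed check relies on: the replication state and the
// highest log index known to be replicated to the peer.
type progress struct {
	state string // "StateReplicate", "StateProbe", or "StateSnapshot"
	match int    // highest log index acked by the peer
}

// allowLeaseTransfer reports whether a lease transfer to target may be
// proposed: only the Raft leader proposes, only to a target that is
// actively replicating and has acked entries past the truncated index,
// so the target can catch up from the leader's log alone.
func allowLeaseTransfer(isLeader bool, target progress, truncatedIndex int) bool {
	if !isLeader {
		return false // only the leader has authoritative follower progress
	}
	if target.state != "StateReplicate" {
		return false // target may already need a snapshot
	}
	return target.match > truncatedIndex // target can catch up from the log
}

func main() {
	fmt.Println(allowLeaseTransfer(true, progress{"StateReplicate", 50}, 40))  // prints true
	fmt.Println(allowLeaseTransfer(true, progress{"StateSnapshot", 50}, 40))   // prints false
	fmt.Println(allowLeaseTransfer(true, progress{"StateReplicate", 30}, 40))  // prints false
}
```

Running the check inside the Raft state machine loop, rather than at evaluation time, is what closes the races listed earlier: leadership and log-truncation state cannot change between the check and the proposal.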
We should also consider reverting #35701, or otherwise finding a way to ensure that a leaseholder's view of the log truncation index is an accurate upper bound on the truncated log index of any future Raft leader. Otherwise, we are still susceptible to a leadership change re-introducing the need to snapshot the leaseholder.
Recover
Even with what we believe to be firm guarantees against a leaseholder getting into this state, we should still optimize for recovery from it, given the severity of any mistake. This kind of improved recovery likely includes sender-side prioritization of snapshots to leaseholders, which is underway in #80817.
It may also include:
- receiver-side prioritization of snapshots to leaseholders
- higher snapshot rate limits in these cases
Action Items
- kv: harden lease transfer protection to prevent a leaseholder from needing a Raft snapshot #81763
- kv: always transfer expiration-based leases during lease transfers #81764
Jira issue: CRDB-15077
Epic CRDB-16160