kvserver: support non-voting replicas

Currently, CRDB only supports voting replicas: replicas that participate in quorum. This means that a high replication factor comes at the cost of high write latencies, since more replicas need to achieve consensus on every write. This trade-off is undesirable if the use case only needs these replicas to serve follower reads, and not to provide fault tolerance. It is especially undesirable when the latency between these voting replicas is high (i.e. when they are spread out across multiple distant regions).

Non-voting replicas would follow the raft log (and thus be able to serve follower reads), but would not participate in quorum. This means that they have essentially no impact on write latencies, at the cost of providing no fault tolerance. They unlock use cases where, for instance, a quorum of voting replicas lives in North America, but we'd still like to serve low-latency follower reads in the EU, Asia and Australia without crippling write latencies.
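
To make the latency argument concrete, here is a minimal sketch of how quorum size would be computed if only voters count. The `ReplicaType` values and `quorumSize` are hypothetical names for illustration, not CRDB's actual types.

```go
package main

import "fmt"

// ReplicaType is a simplified, hypothetical stand-in for the replica
// types discussed in this issue.
type ReplicaType int

const (
	Voter ReplicaType = iota
	NonVoter
)

// quorumSize returns the number of acknowledgements needed to commit a
// write. Only voters are counted, so adding non-voters never raises it.
func quorumSize(replicas []ReplicaType) int {
	voters := 0
	for _, t := range replicas {
		if t == Voter {
			voters++
		}
	}
	return voters/2 + 1
}

func main() {
	threeVoters := []ReplicaType{Voter, Voter, Voter}
	withReaders := append(append([]ReplicaType{}, threeVoters...), NonVoter, NonVoter, NonVoter)
	// Both configurations need the same number of acknowledgements.
	fmt.Println(quorumSize(threeVoters), quorumSize(withReaders))
}
```

Under this model, three voters plus three non-voters commit with the same two acknowledgements as three voters alone, whereas six voters would need four.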

This is a (rough) tracking issue for the major changes required to introduce non-voting replicas (also referred to as long-lived learners due to some implementation details). They are listed roughly in the order in which (I believe) they should be implemented. This issue will be updated as new considerations are discovered.

### New Replica type for non-voting replicas (done)
The desire for this stems from assumptions, made in a couple of different places (covered below), that learners are short-lived. There are valid reasons for some of those assumptions to stay in place for short-lived learners, but they don’t apply to long-lived learners.
Pull requests: #53715

### Zone config syntax to enable creation of non-voters
Pull requests: #57184 

### Compatibility with range merges (done)
Currently, we [don’t merge a range](https://github.com/cockroachdb/cockroach/blob/1ba2821900b7cb0b24f333b471b8858730b28a5d/pkg/kv/kvserver/replica_command.go#L649) if it has a learner replica. In the `AdminMerge` transaction, if the LHS leaseholder detects learner replicas on either of the involved ranges, it simply returns an error. It seems this was done to avoid working out potential edge cases, since learners were only supposed to be ephemeral. We’ll need to make merges work with (at least) long-lived learners.
Pull requests: #56197 
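
As a rough illustration of the relaxed precondition, the sketch below rejects a merge only when a short-lived learner is present on either side. The type names and `checkMergeable` are hypothetical and are not taken from the linked code or pull request; a real implementation has more to handle than this check.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaType is a simplified, hypothetical model of the replica types
// discussed in this issue.
type ReplicaType int

const (
	Voter ReplicaType = iota
	ShortLivedLearner
	LongLivedLearner
)

// checkMergeable returns an error if either range still has a
// short-lived learner. Long-lived learners no longer block the merge.
func checkMergeable(lhs, rhs []ReplicaType) error {
	for _, replicas := range [][]ReplicaType{lhs, rhs} {
		for _, t := range replicas {
			if t == ShortLivedLearner {
				return errors.New("cannot merge ranges while a short-lived learner is present")
			}
		}
	}
	return nil
}

func main() {
	lhs := []ReplicaType{Voter, Voter, Voter, LongLivedLearner}
	rhs := []ReplicaType{Voter, Voter, Voter, ShortLivedLearner}
	fmt.Println(checkMergeable(lhs, lhs)) // no error expected
	fmt.Println(checkMergeable(lhs, rhs)) // error expected
}
```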


### Interactions with raft snapshot queue and replicate queue (done)
The atomic replication change logic “manually” sends a snapshot to the short-lived learner that it instantiates in the first phase of the change. The raft snapshot queue therefore entirely ignores the learners it comes across. Once long-lived learners are added, they'll need to be able to receive snapshots from the raft snapshot queue just like `VOTER` replicas.

Replicas are added by the atomic replication change logic in two phases: the first phase creates a short-lived learner and the second promotes it to a voter. If the coordinating node dies before the second phase, we’re left with an orphaned learner replica. The replicate queue therefore removes any learners it sees. It will need to tell long-lived learners apart from orphaned short-lived ones and leave the former in place.
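
The intended behaviour of the two queues can be sketched as a pair of predicates. This is a simplified model with hypothetical names, assuming that the replicate queue should keep removing orphaned short-lived learners while leaving long-lived learners alone; it does not reflect the actual queue code.

```go
package main

import "fmt"

// ReplicaType is a simplified, hypothetical model of the replica types
// discussed in this issue.
type ReplicaType int

const (
	Voter ReplicaType = iota
	ShortLivedLearner
	LongLivedLearner
)

// raftSnapshotQueueHandles reports whether the raft snapshot queue
// should send snapshots to a replica of the given type. Short-lived
// learners are skipped because the replication change sends their
// snapshot itself.
func raftSnapshotQueueHandles(t ReplicaType) bool {
	return t == Voter || t == LongLivedLearner
}

// replicateQueueRemoves reports whether the replicate queue should
// treat a replica of the given type as an orphan and remove it.
func replicateQueueRemoves(t ReplicaType) bool {
	return t == ShortLivedLearner
}

func main() {
	names := map[ReplicaType]string{
		Voter:             "voter",
		ShortLivedLearner: "short-lived learner",
		LongLivedLearner:  "long-lived learner",
	}
	for _, t := range []ReplicaType{Voter, ShortLivedLearner, LongLivedLearner} {
		fmt.Printf("%-20s snapshot queue: %-5v remove: %v\n",
			names[t], raftSnapshotQueueHandles(t), replicateQueueRemoves(t))
	}
}
```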

### Allocator level changes
We need to teach the allocator to obey the specified constraints so that long-lived learners are placed in the correct localities. Here is a preliminary, non-exhaustive list of considerations to be aware of while making this change:
- One heuristic we care a lot about when placing `VOTER` replicas across a cluster is diversity. The primary motivation for this is fault tolerance, which is not a consideration for learner replicas. We probably want to place learners primarily based on the localities of the incoming traffic (“follow the workload”).
- We currently do not include traffic served via follower reads in stats for load based splitting. We'll probably want to change that as we start seeing topologies that make heavy use of learner replicas.
- If all the nodes that qualify under the constraints specified for learners are ones that already have `VOTER` replicas on them, do we just silently omit these learners?
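
One possible reading of “follow the workload” is sketched below: rank candidate stores by the read traffic originating in their locality, skipping stores that already hold a replica of the range. The `store` type, the per-locality QPS map and `rankForLearner` are all hypothetical; the actual allocator scoring is not described in this issue.

```go
package main

import (
	"fmt"
	"sort"
)

// store is a hypothetical, minimal description of a candidate store.
type store struct {
	ID         int
	Locality   string
	HasReplica bool // already holds a replica of this range
}

// rankForLearner returns the stores eligible for a long-lived learner,
// ordered by the read traffic seen from their locality (highest first).
// Diversity is deliberately not part of the score.
func rankForLearner(stores []store, readQPSByLocality map[string]float64) []store {
	var eligible []store
	for _, s := range stores {
		if !s.HasReplica {
			eligible = append(eligible, s)
		}
	}
	sort.SliceStable(eligible, func(i, j int) bool {
		return readQPSByLocality[eligible[i].Locality] > readQPSByLocality[eligible[j].Locality]
	})
	return eligible
}

func main() {
	stores := []store{
		{ID: 1, Locality: "us", HasReplica: true},
		{ID: 2, Locality: "eu"},
		{ID: 3, Locality: "asia"},
	}
	qps := map[string]float64{"us": 100, "eu": 20, "asia": 50}
	for _, s := range rankForLearner(stores, qps) {
		fmt.Println(s.ID, s.Locality)
	}
}
```

An empty result here corresponds to the open question in the last bullet: every store that satisfies the constraints already holds a replica, and the allocator has to decide what to do about it.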

### Allow distsender to route follower read requests to learners
DistSender currently [ignores](https://github.com/cockroachdb/cockroach/blob/f2ec747c7ee8dfa737c1aff63aa7a9803c26b7ba/pkg/kv/kvclient/kvcoord/replica_slice.go#L69) learners when routing queries. We’ll want to start including long-lived learners here when the request can be served via follower reads.
Pull requests: [#58772](https://github.com/cockroachdb/cockroach/pull/58772)
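
The routing rule can be sketched as a filter over a range's replicas. The types and `routableReplicas` are hypothetical simplifications and do not mirror the code in `replica_slice.go`.

```go
package main

import "fmt"

// ReplicaType is a simplified, hypothetical model of the replica types
// discussed in this issue.
type ReplicaType int

const (
	Voter ReplicaType = iota
	ShortLivedLearner
	LongLivedLearner
)

type replica struct {
	NodeID int
	Type   ReplicaType
}

// routableReplicas returns the replicas a request may be sent to.
// Voters are always candidates, long-lived learners only when the
// request can be served as a follower read, and short-lived learners
// never.
func routableReplicas(replicas []replica, canUseFollowerRead bool) []replica {
	var out []replica
	for _, r := range replicas {
		switch r.Type {
		case Voter:
			out = append(out, r)
		case LongLivedLearner:
			if canUseFollowerRead {
				out = append(out, r)
			}
		}
	}
	return out
}

func main() {
	replicas := []replica{
		{NodeID: 1, Type: Voter},
		{NodeID: 2, Type: ShortLivedLearner},
		{NodeID: 3, Type: LongLivedLearner},
	}
	fmt.Println(len(routableReplicas(replicas, false)), len(routableReplicas(replicas, true)))
}
```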

### Roachtest that exercises & asserts use of follower reads
Currently, we lack large-scale integration tests that check for “proper” use of follower reads (“are we correctly serving follower reads for a given compatible workload?”). We’ll want to write a roachtest that asserts, for a predetermined workload, that at least a minimum threshold of requests is served via follower reads.
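
The core assertion of such a test might look like the sketch below. It only shows the threshold check; `checkFollowerReadFraction` and its inputs are hypothetical, and how the counts would be collected from the cluster is left open.

```go
package main

import "fmt"

// checkFollowerReadFraction returns an error if fewer than minFraction
// of the observed requests were served via follower reads.
func checkFollowerReadFraction(followerReads, total int, minFraction float64) error {
	if total <= 0 {
		return fmt.Errorf("no requests observed")
	}
	got := float64(followerReads) / float64(total)
	if got < minFraction {
		return fmt.Errorf("follower reads served %.2f of requests, want at least %.2f", got, minFraction)
	}
	return nil
}

func main() {
	fmt.Println(checkFollowerReadFraction(90, 100, 0.8)) // passes
	fmt.Println(checkFollowerReadFraction(50, 100, 0.8)) // fails the threshold
}
```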

### Quota pool prevents raft leaders from getting too far ahead of followers
The quota pool prevents raft leaders from getting too far ahead of followers, which means a slow learner can slow down regular raft traffic. We’ll want to consider whether long-lived learners should be allowed to trail further behind than other types of replicas.
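
The trade-off can be illustrated with a toy model in which the leader's progress is gated on the lowest log index acknowledged by the followers it tracks. The names and the `gateOnNonVoters` switch are hypothetical; this is not how the actual quota pool is implemented.

```go
package main

import "fmt"

// ReplicaType is a simplified, hypothetical model of the replica types
// discussed in this issue.
type ReplicaType int

const (
	Voter ReplicaType = iota
	NonVoter
)

type follower struct {
	Type       ReplicaType
	AckedIndex uint64
}

// lowestAckedIndex returns the lowest acknowledged log index among the
// followers that gate the leader. When gateOnNonVoters is false,
// non-voters are ignored and may trail arbitrarily far behind. The
// second return value is false if no follower was considered.
func lowestAckedIndex(followers []follower, gateOnNonVoters bool) (uint64, bool) {
	var lowest uint64
	found := false
	for _, f := range followers {
		if f.Type == NonVoter && !gateOnNonVoters {
			continue
		}
		if !found || f.AckedIndex < lowest {
			lowest = f.AckedIndex
			found = true
		}
	}
	return lowest, found
}

func main() {
	followers := []follower{
		{Type: Voter, AckedIndex: 100},
		{Type: Voter, AckedIndex: 98},
		{Type: NonVoter, AckedIndex: 40}, // slow learner in a distant region
	}
	gated, _ := lowestAckedIndex(followers, true)
	relaxed, _ := lowestAckedIndex(followers, false)
	fmt.Println(gated, relaxed)
}
```

With the slow learner included, the leader is held back to index 40; ignoring it lets the voters set the pace at index 98.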

kvserver: support non-voting replicas #51943
