Skip to content

kvserver: support non-voting replicas #51943

@aayushshah15

Description

@aayushshah15

Currently, CRDB only supports voting replicas: replicas that participate in quorum. This means that a high replication factor comes with the cost of high write latencies, since more replicas need to achieve consensus on every write. This trade-off is undesirable if the use-case only cares about using these replicas to serve follower reads, and not for fault tolerance. It is especially undesirable when the inter-replica latency between these voting replicas is high (i.e. when they are spread out across multiple distant regions).

Non-voting replicas would follow the raft log (thus be able to serve follower reads), but would not participate in quorum. This means that they have, essentially, no impact on write latencies at the cost of not being able to be used for fault-tolerance. They unlock use-cases where, for instance, a quorum of voting replicas lives in North America, but we'd still like to serve low-latency follower reads in EU, Asia and Australia, without having this cripple write latencies.

This is a (rough) tracking issue for the major changes required to introduce non-voting replicas (also referred to as long-lived learners due to some implementation details). They are listed roughly in the order in which (I believe) they should be implemented. This issue will be updated as new considerations are discovered.

New Replica type for non-voting replicas (done)

The desire for this stems from the assumptions made about learners being short lived in a couple of different places (covered below). There are valid reasons for some of those assumptions to stay in place for the short lived learners, but they don’t apply to long lived learners.
Pull requests: #53715

Zone config syntax to enable creation of non-voters

Pull requests: #57184

Compatibility with range merges (done)

Currently, we don’t merge a range if it has a learner replica. In the AdminMerge transaction, if the LHS leaseholder detects learner replicas for either of the involved ranges, it just returns an error. It seems like we did it this way in order to avoid working out any potential edge cases since the learners were only supposed to be ephemeral. We’ll need to make merges work with (at least) long lived learners.
Pull requests: #56197

Interactions with raft snapshot queue and replicate queue (done)

The atomic replication change logic “manually” sends a snapshot to the short lived learner that it instantiates in the first phase of the change. The raft snapshot queue, thus, entirely ignores the learners it comes across. Once the long lived learners are added, they'll need to be able to receive snapshots from raft snapshot queue just like VOTER replicas.

Replicas are added by the atomic replication change logic in two phases, the first phase creates a short lived learner and the second promotes it to a voter. If the co-ordinating node dies before the second phase, we’re left with an orphaned learner replica. The replicate queue, thus, removes any learners it sees.

Allocator level changes

We need to teach the allocator to obey the specified constraints to place long lived learners in the correct localities. Here is a preliminary, non-exhaustive list of considerations we need to be aware of while making this change:

  • One heuristic we care a lot about when placing VOTER replicas across a cluster is diversity. The primary motivation for this is fault tolerance, which is not a consideration for learner replicas. We probably want to place learners primarily based on the localities of the incoming traffic (“follow the workload”).
  • We currently do not include traffic served via follower reads in stats for load based splitting. We'll probably want to change that as we start seeing topologies that make heavy use of learner replicas.
  • If all the nodes that qualify under the constraints specified for learners are ones that already have VOTER replicas on them, do we just silently omit these learners?

Allow distsender to route follower read requests to learners

DistSender currently ignores learners when routing queries. We’ll want to start including long lived learners here if the request is such that it can be served via follower-reads.
Pull requests: #58772

Roachtest that exercises & asserts use of follower reads

Currently, we lack large scale integration tests that test for “proper” use of follower reads (“are we correctly serving follower reads for a given compatible workload?”). We’ll want to write a roachtest that asserts, for a pre-determined workload, a minimum threshold of requests that need to be served via follower reads.

Quota pool prevents raft leaders from getting too far ahead of followers

This means a slow learner can slow down regular raft traffic. We’ll want to consider whether we should allow long lived learners to potentially trail further behind than other types of replicas are allowed to.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions